On Wednesday, Microsoft Research introduced Magma, an integrated AI foundation model that combines visual and language processing to control software interfaces and robotic systems. If the results hold up outside of Microsoft’s internal testing, it could mark a meaningful step toward an all-purpose multimodal AI that can operate interactively in both real and digital spaces.
Microsoft claims that Magma is the first AI model that not only processes multimodal data (like text, images, and video) but can also natively act upon it—whether that’s navigating a user interface or manipulating physical objects. The project is a collaboration between researchers at Microsoft, KAIST, the University of Maryland, the University of Wisconsin-Madison, and the University of Washington.
We’ve seen other large language model-based robotics projects, like Google’s PaLM-E and RT-2 or Microsoft’s ChatGPT for Robotics, that use LLMs as an interface. However, unlike many prior multimodal AI systems that require separate models for perception and control, Magma integrates these abilities into a single foundation model.
Microsoft is positioning Magma as a step toward agentic AI, meaning a system that can autonomously craft plans and perform multistep tasks on a human’s behalf rather than just answering questions about what it sees.
“Given a described goal,” Microsoft writes in its research paper, “Magma is able to formulate plans and execute actions to achieve it. By effectively transferring knowledge from freely available visual and language data, Magma bridges verbal, spatial, and temporal intelligence to navigate complex tasks and settings.”
Microsoft is not alone in its pursuit of agentic AI. OpenAI has been experimenting with AI agents through projects like Operator, which can perform UI tasks in a web browser, and Google has explored multiple agentic projects with Gemini 2.0.
Spatial intelligence
While Magma builds on Transformer-based LLM technology that feeds training tokens into a neural network, it differs from traditional vision-language models (such as GPT-4V) by going beyond what the researchers call “verbal intelligence” to also include “spatial intelligence” (planning and action execution). By training on a mix of images, videos, robotics data, and UI interactions, Microsoft claims that Magma is a true multimodal agent rather than just a perceptual model.