

In a nutshell
- Researchers have created ELLMER, a robot system that combines language AI (like GPT-4) with physical sensors, allowing robots to understand verbal commands and complete complex tasks like making coffee and drawing pictures.
- Unlike traditional AI that only processes information, this “embodied AI” can feel physical forces and see its environment, enabling it to adapt to real-world changes like moved cups or different ingredient amounts.
- This breakthrough suggests that true machine intelligence might require physical interaction with the world, opening possibilities for more helpful robots in homes, hospitals, and workplaces.
EDINBURGH, Scotland — In the classic cartoon “The Jetsons,” George would often enjoy a hot cup of coffee poured by his robot, Rosie, while he ate breakfast. Fans have dreamed of that lifestyle for decades, and it may become reality sooner rather than later. Researchers have successfully taught robots to make coffee and draw pictures by combining powerful language models with physical machines that can feel and see their surroundings.
The research team, based at the University of Edinburgh, developed a system they call ELLMER (Embodied Large-Language-Model-Enabled Robot), which connects GPT-4’s reasoning abilities to a robot’s hands and sensors. The robot doesn’t just understand words; it can feel pressure when pouring water and see when objects move.
“We are glimpsing a future where robots with increasingly advanced intelligence become commonplace,” says Ruaridh Mon-Williams, from Edinburgh’s School of Informatics, in a statement. “Human intelligence stems from the integration of reasoning, movement and perception, yet AI and robotics have often advanced separately. Our work demonstrates the power of combining these approaches and underscores the growing need to discuss their societal implications.”
The system was put to the test when given the request, “I’m tired, with friends coming over for cake soon. Can you make me a hot drink, and draw a random animal on a plate?” Incredibly, it responded by figuring out that coffee would help with tiredness, finding ingredients, and completing all necessary steps — from opening drawers to scooping coffee to pouring water. It even sketched a bird on a plate for serving.
“If Deep Blue (the first computer to win a chess match against a reigning world champion) was truly intelligent, then should it not be able to move its own pieces when playing chess?” the researchers ask in their paper. This question cuts to the heart of their work: intelligence isn’t just about calculating moves, but also about physically engaging with the world.


Merging Mind and Body in Machines
The researchers built on the concept of “embodied cognition,” the idea that human thinking doesn’t happen only in our brains but is deeply connected to how our bodies interact with our surroundings. Walking, touching, seeing, and manipulating objects aren’t separate from thinking – they’re part of it.
ELLMER differs from other robot systems in how it combines multiple technologies. The GPT-4 language model processes requests and breaks them into steps. But rather than just sending commands, the system pulls from a library of movement examples stored in a knowledge base. Using a method called retrieval-augmented generation (RAG), it finds relevant examples for tasks like “open drawer” or “pour liquid” and adapts them to the current situation.
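For readers curious about the mechanics, the retrieval step can be pictured as matching a plan step from GPT-4 against a small library of stored movement templates. The Python sketch below is purely illustrative: the skill names, templates, and word-overlap matching are stand-ins for the paper’s embedding-based retrieval, not the team’s actual code.

```python
# Hypothetical skill library: each entry pairs a task name with a stored
# movement template. Real entries would hold trajectories and force profiles.
SKILL_LIBRARY = {
    "open drawer": "approach handle, grasp, pull outward with a force limit",
    "pour liquid": "align spout over target, tilt until the weight change matches the request",
    "scoop coffee": "insert spoon, rotate wrist, lift, move over the cup, tip",
}

def similarity(a: str, b: str) -> float:
    """Word-overlap score standing in for embedding similarity (assumption)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def retrieve_skill(step: str) -> str:
    """Return the stored movement template that best matches a plan step."""
    best = max(SKILL_LIBRARY, key=lambda name: similarity(step, name))
    return SKILL_LIBRARY[best]

# A plan step such as "pour hot water into the mug" matches "pour liquid";
# the retrieved template is then adapted to the cup's current position.
print(retrieve_skill("pour hot water into the mug"))
```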
This approach helps the robot handle real-world messiness. In tests, the system could track a cup even when someone moved it, pour a precise amount of water (to within roughly 5.4 grams per 100 grams poured), and adjust pen pressure when drawing on plates to maintain consistent lines.
The robot also showed creativity. For the drawing tasks, it used DALL-E to create silhouettes based on prompts like “random bird” or “random plant,” then translated those images into physical drawings using a pen on plates.
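One way to picture that last step: the silhouette image is reduced to an outline, and the outline becomes a series of pen waypoints on the plate. The sketch below shows the general idea using OpenCV; the drawn ellipse stands in for a DALL-E silhouette, and the 150 mm plate size is an assumption for illustration, not the authors’ pipeline.

```python
import cv2
import numpy as np

# Stand-in for a DALL-E-generated silhouette: a white ellipse on black.
silhouette = np.zeros((256, 256), dtype=np.uint8)
cv2.ellipse(silhouette, (128, 128), (80, 50), 0, 0, 360, 255, thickness=-1)

# Extract the outer contour of the shape.
contours, _ = cv2.findContours(silhouette, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
outline_px = contours[0].reshape(-1, 2)  # (N, 2) pixel coordinates

# Map pixel coordinates onto an assumed 150 mm drawing area centred on the plate.
PLATE_MM = 150.0
waypoints_mm = (outline_px / 256.0 - 0.5) * PLATE_MM

print(f"{len(waypoints_mm)} pen waypoints, first: {waypoints_mm[0]}")
```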
Beyond Theory: Why This Matters for Everyday Life
Previous attempts to pair robots with language models often stumbled because the robots couldn’t respond to physical feedback in real time or lacked the ability to sense forces. ELLMER bridges this gap by continuously monitoring both what it sees through a camera and what it feels through force sensors attached to its gripper.
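As a concrete (and entirely simulated) illustration of why force sensing matters, the sketch below pours to a target weight by watching the sensed load on the gripper fall as water leaves the kettle. The toy kettle physics and all numbers are assumptions made for the example, not values or code from the study.

```python
G = 9.81  # m/s^2, converts grams of water to newtons of sensed weight

class SimulatedKettle:
    """Toy stand-in for the arm and kettle: water flows out faster at larger
    tilt angles, and the wrist force sensor reads the remaining weight."""
    def __init__(self, water_g: float = 500.0):
        self.water_g = water_g
        self.tilt_deg = 0.0

    def set_tilt(self, angle_deg: float) -> None:
        self.tilt_deg = angle_deg

    def step(self, dt: float) -> None:
        flow_g_per_s = max(0.0, self.tilt_deg - 20.0) * 2.0  # toy flow model
        self.water_g = max(0.0, self.water_g - flow_g_per_s * dt)

    def wrist_force_n(self) -> float:
        return self.water_g / 1000.0 * G  # kettle body weight omitted

def pour(kettle: SimulatedKettle, target_g: float, dt: float = 0.02) -> float:
    """Tilt while monitoring the force sensor; ease off near the target."""
    baseline_n = kettle.wrist_force_n()
    tilt, poured = 0.0, 0.0
    while poured < target_g:
        remaining = target_g - poured
        # Tilt faster when far from the target, slower as it gets close.
        tilt = min(60.0, tilt + 5.0 * dt * max(0.1, remaining / target_g))
        kettle.set_tilt(tilt)
        kettle.step(dt)
        poured = (baseline_n - kettle.wrist_force_n()) * 1000.0 / G
    kettle.set_tilt(0.0)
    return poured

print(f"poured {pour(SimulatedKettle(), 100.0):.1f} g")
```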
The researchers compared their approach against another system, VoxPoser, which does not use their knowledge-retrieval method or force sensing. ELLMER produced more accurate and reliable results, with fewer of the errors in which the AI hallucinates incorrect information about the physical world.
This research, published in Nature Machine Intelligence, opens possibilities for robots that can work in messy, unpredictable places like homes or hospitals. Rather than needing perfectly controlled environments, these robots could adapt to changing conditions – moving cups, different coffee amounts, or unexpected obstacles.


Challenges and Future Directions
The system still has limitations. The vision system sometimes struggled with cluttered spaces, and while the robot could adjust to changes as they happened, it couldn’t proactively switch tasks without being prompted. Future versions might query the language model more frequently to handle shifting priorities.
Digital assistants like Siri or Alexa can answer questions about the weather, but they can’t hand you an umbrella. The gap between knowing and doing remains wide in artificial intelligence. Bridging this gap – connecting abstract reasoning with physical action – might be key to building machines with something closer to human-like intelligence.
After all, if Deep Blue only exists inside a computer, did it really play chess? Or did it just calculate chess moves? The difference might seem philosophical, but for anyone who’s ever wanted a robot to actually make them coffee when they’re tired, it’s also deeply practical.
Paper Notes
The study “Embodied large language models enable robots to complete complex tasks in unpredictable environments” was written by Ruaridh Mon-Williams, Gen Li, Ran Long, Wenqian Du, and Christopher G. Lucas from the University of Edinburgh, with additional affiliations at MIT, Princeton University, and the Alan Turing Institute. It appeared in Nature Machine Intelligence on March 19, 2025, after being received on June 22, 2024, and accepted on January 31, 2025.