Humanoid robots learn to cook by watching YouTube videos
Researchers demonstrate a new approach to robot learning that allows humanoid robots to acquire complex manipulation skills from online video demonstrations.
© 2026 AW3 Technology, Inc. All Rights Reserved.


Researchers at Stanford’s IRIS Lab have demonstrated a humanoid robot that learned to prepare a complete meal—chopping vegetables, sautéing ingredients, and plating the dish—by watching cooking videos on YouTube. The robot had never been explicitly programmed to cook. It learned by observing human demonstrations and translating visual observations into physical actions.
The breakthrough represents a fundamental shift in how robots learn manipulation skills. Traditional approaches require painstaking manual programming of every motion, or expensive teleoperation sessions where a human operator guides the robot through each task. The new approach, called Video-to-Action Transfer (V2AT), allows robots to learn complex skills from the vast library of instructional videos already available on the internet.
Video-to-Action Transfer works in three stages. First, a vision model processes cooking videos and extracts a structured representation of the actions being performed: what objects are being manipulated, how they are being grasped, and what forces are being applied. Second, a planning model translates this representation into a sequence of robot-executable actions, accounting for differences in morphology between human hands and robot grippers. Third, a control model executes the plan on the physical robot, using real-time visual feedback to adjust for errors and unexpected situations.
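The three stages described above can be sketched as a simple pipeline. This is an illustrative mock-up, not the researchers' actual system: the `ActionSegment` type, the grasp names, and all function signatures are assumptions invented for clarity.

```python
from dataclasses import dataclass

# Hypothetical structured representation extracted by the vision model.
@dataclass
class ActionSegment:
    object_name: str      # what object is being manipulated
    grasp_type: str       # how the human grasps it
    force_newtons: float  # approximate force being applied

def extract_actions(video_frames):
    """Stage 1: a vision model turns raw video into action segments.
    Placeholder output stands in for a real video-understanding model."""
    return [ActionSegment("carrot", "pinch", 2.0),
            ActionSegment("knife", "power", 5.0)]

def translate_to_robot(segments):
    """Stage 2: cross-embodiment transfer, here reduced to a lookup that
    maps human grasp types to commands a two-jaw gripper can execute."""
    remap = {"pinch": "jaw_close_narrow", "power": "jaw_close_wide"}
    return [(s.object_name, remap.get(s.grasp_type, "jaw_close_wide"),
             s.force_newtons) for s in segments]

def execute(plan, visual_feedback_ok):
    """Stage 3: run each planned action, retrying on failure using
    a visual-feedback check (stubbed out here)."""
    for obj, grasp, force in plan:
        attempts = 0
        while not visual_feedback_ok(obj, grasp, force):
            attempts += 1  # a real controller would adjust and retry

# Usage with a stubbed feedback signal that always reports success:
plan = translate_to_robot(extract_actions(video_frames=[]))
execute(plan, visual_feedback_ok=lambda *a: True)
```

The point of the sketch is the interface between stages: stage 2 never sees pixels, only the structured representation, which is what makes the human-to-robot translation tractable.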
The key innovation is the second stage—the cross-embodiment transfer. Human hands and robot grippers are fundamentally different: different numbers of fingers, different ranges of motion, different force capabilities. V2AT uses a learned “embodiment translation” model that maps human manipulation strategies to robot-feasible equivalents, trained on a relatively small dataset of paired human-robot demonstrations.
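If the embodiment-translation model can be framed as supervised regression from human hand poses to robot gripper commands, trained on paired demonstrations, a minimal version looks like the following. The linear model, the pose dimensions (21 human joint values, 7 robot command values), and the synthetic data are all illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "paired human-robot demonstrations": each human hand pose
# is paired with the robot command that achieves the same effect.
human_poses = rng.normal(size=(200, 21))   # 200 demos, 21 joint values
W_true = rng.normal(size=(21, 7))          # unknown ground-truth mapping
robot_cmds = human_poses @ W_true          # paired robot commands (7-DoF)

# Fit the embodiment-translation map by least squares on the paired data.
W_learned, *_ = np.linalg.lstsq(human_poses, robot_cmds, rcond=None)

# At deployment, a new human strategy (from a video) is translated
# into a robot-feasible command via the learned map.
new_pose = rng.normal(size=(1, 21))
cmd = new_pose @ W_learned
```

A linear map recovers the translation exactly here only because the synthetic data is linear and noise-free; the appeal of learning the map from a small paired dataset is that the same recipe extends to nonlinear models when human and robot strategies diverge more sharply.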
The demonstration that captured public attention involved a humanoid robot preparing pad thai from scratch. The robot watched three YouTube cooking videos, identified the key steps—preparing ingredients, heating the wok, cooking noodles, combining ingredients, and plating—and executed each one on a real kitchen setup.
The robot was not perfect. It occasionally dropped ingredients, overcooked the noodles on one attempt, and took roughly three times as long as a human cook. But it completed the task without any human intervention, adapting to unexpected situations like a spatula that slipped out of its gripper and a burner that failed to ignite on the first try.
The significance of V2AT goes far beyond cooking. If robots can learn complex manipulation skills from online videos, the bottleneck for robot capabilities shifts from expensive manual programming to the availability of demonstration videos—and the internet has billions of them.
Traditional robot programming is bespoke: each task requires its own carefully engineered controller. V2AT opens the door to a world where robots can continuously learn new skills from the ever-growing library of online video content. A robot could learn to assemble furniture by watching IKEA tutorials, perform first aid by watching medical training videos, or tend a garden by watching horticulture channels.

[Figure: A humanoid robot executes a cooking task learned entirely from online video demonstrations]
If a robot can learn from a YouTube video, then anyone who can create a video can teach a robot. This has profound implications for accessibility and customization. A factory worker could teach a robot a new assembly procedure by filming themselves performing it. A caregiver could teach a home robot to perform specific tasks for a patient with unique needs.
"The internet is the largest repository of human knowledge ever created. We just taught robots how to read it." — Dr. Chelsea Finn, Stanford IRIS Lab
V2AT is impressive, but significant challenges remain. The system currently works best with manipulation tasks that involve rigid objects and predictable physics. Handling deformable objects—folding laundry, kneading dough—remains difficult. The cross-embodiment transfer is imperfect, and the robot often needs several attempts to find a viable strategy for a given task.
Safety is also a concern. A robot learning from videos has no inherent understanding of what is dangerous. It does not know that a hot pan can cause burns, that a sharp knife requires careful handling, or that certain ingredients are allergens. Adding safety constraints without undermining the flexibility of the learning system is an active area of research.
V2AT is one piece of a larger puzzle: building robots that can learn and adapt as flexibly as humans do. Combined with advances in language-guided planning, real-time perception, and dexterous manipulation hardware, it points toward a future where robots are not pre-programmed tools but learning systems that continuously expand their capabilities.
The researchers at Stanford are already working on the next version of V2AT, which will incorporate audio and language understanding to learn not just from what people do in videos but from what they say. When robots can learn from both showing and telling, the pace of robot skill acquisition could accelerate dramatically.