Humanoid robots learn to cook by watching YouTube videos
Researchers demonstrate a new approach to robot learning that allows humanoid robots to acquire complex manipulation skills from online video demonstrations.
© 2026 AW3 Technology, Inc. All Rights Reserved.


Researchers at Stanford’s IRIS Lab have demonstrated a humanoid robot that learned to prepare a complete meal—chopping vegetables, sautéing ingredients, and plating the dish—by watching cooking videos on YouTube. The robot had never been explicitly programmed to cook. It learned by observing human demonstrations and translating visual observations into physical actions.
The breakthrough represents a fundamental shift in how robots learn manipulation skills. Traditional approaches require painstaking manual programming of every motion, or expensive teleoperation sessions where a human operator guides the robot through each task. The new approach, called Video-to-Action Transfer (V2AT), allows robots to learn complex skills from the vast library of instructional videos already available on the internet.
Video-to-Action Transfer works in three stages. First, a vision model processes cooking videos and extracts a structured representation of the actions being performed: what objects are being manipulated, how they are being grasped, and what forces are being applied. Second, a planning model translates this representation into a sequence of robot-executable actions, accounting for differences in morphology between human hands and robot grippers. Third, a control model executes the plan on the physical robot, using real-time visual feedback to adjust for errors and unexpected situations.
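The three stages described above can be sketched as a simple pipeline. This is an illustrative mock-up, not the researchers' actual system: the `ActionSegment` type, the grasp names, and all function signatures are assumptions invented for clarity.

```python
from dataclasses import dataclass

# Hypothetical structured representation extracted by the vision model.
@dataclass
class ActionSegment:
    object_name: str      # what object is being manipulated
    grasp_type: str       # how the human grasps it
    force_newtons: float  # approximate force being applied

def extract_actions(video_frames):
    """Stage 1: a vision model turns raw video into action segments.
    Placeholder output stands in for a real video-understanding model."""
    return [ActionSegment("carrot", "pinch", 2.0),
            ActionSegment("knife", "power", 5.0)]

def translate_to_robot(segments):
    """Stage 2: cross-embodiment transfer, here reduced to a lookup that
    maps human grasp types to commands a two-jaw gripper can execute."""
    remap = {"pinch": "jaw_close_narrow", "power": "jaw_close_wide"}
    return [(s.object_name, remap.get(s.grasp_type, "jaw_close_wide"),
             s.force_newtons) for s in segments]

def execute(plan, visual_feedback_ok):
    """Stage 3: run each planned action, retrying on failure using
    a visual-feedback check (stubbed out here)."""
    for obj, grasp, force in plan:
        attempts = 0
        while not visual_feedback_ok(obj, grasp, force):
            attempts += 1  # a real controller would adjust and retry

# Usage with a stubbed feedback signal that always reports success:
plan = translate_to_robot(extract_actions(video_frames=[]))
execute(plan, visual_feedback_ok=lambda *a: True)
```

The point of the sketch is the interface between stages: stage 2 never sees pixels, only the structured representation, which is what makes the human-to-robot translation tractable.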
The key innovation is the second stage—the cross-embodiment transfer. Human hands and robot grippers are fundamentally different: different numbers of fingers, different ranges of motion, different force capabilities. V2AT uses a learned “embodiment translation” model that maps human manipulation strategies to robot-feasible equivalents, trained on a relatively small dataset of paired human-robot demonstrations.
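If the embodiment-translation model can be framed as supervised regression from human hand poses to robot gripper commands, trained on paired demonstrations, a minimal version looks like the following. The linear model, the pose dimensions (21 human joint values, 7 robot command values), and the synthetic data are all illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "paired human-robot demonstrations": each human hand pose
# is paired with the robot command that achieves the same effect.
human_poses = rng.normal(size=(200, 21))   # 200 demos, 21 joint values
W_true = rng.normal(size=(21, 7))          # unknown ground-truth mapping
robot_cmds = human_poses @ W_true          # paired robot commands (7-DoF)

# Fit the embodiment-translation map by least squares on the paired data.
W_learned, *_ = np.linalg.lstsq(human_poses, robot_cmds, rcond=None)

# At deployment, a new human strategy (from a video) is translated
# into a robot-feasible command via the learned map.
new_pose = rng.normal(size=(1, 21))
cmd = new_pose @ W_learned
```

A linear map recovers the translation exactly here only because the synthetic data is linear and noise-free; the appeal of learning the map from a small paired dataset is that the same recipe extends to nonlinear models when human and robot strategies diverge more sharply.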
The demonstration that captured public attention involved a humanoid robot preparing pad thai from scratch. The robot watched three YouTube cooking videos, identified the key steps—preparing ingredients, heating the wok, cooking noodles, combining ingredients, and plating—and executed each one on a real kitchen setup.
The robot was not perfect. It occasionally dropped ingredients, overcooked the noodles on one attempt, and took roughly three times as long as a human cook. But it completed the task without any human intervention, adapting to unexpected situations like a spatula that slipped out of its gripper and a burner that failed to ignite on the first try.
The significance of V2AT goes far beyond cooking. If robots can learn complex manipulation skills from online videos, the bottleneck for robot capabilities shifts from expensive manual programming to the availability of demonstration videos—and the internet has billions of them.
Traditional robot programming is bespoke: each task requires its own carefully engineered controller. V2AT opens the door to a world where robots can continuously learn new skills from the ever-growing library of online video content. A robot could learn to assemble furniture by watching IKEA tutorials, perform first aid by watching medical training videos, or tend a garden by watching horticulture channels.

[Figure: A humanoid robot executes a cooking task learned entirely from online video demonstrations]
If a robot can learn from a YouTube video, then anyone who can create a video can teach a robot. This has profound implications for accessibility and customization. A factory worker could teach a robot a new assembly procedure by filming themselves performing it. A caregiver could teach a home robot to perform specific tasks for a patient with unique needs.
"The internet is the largest repository of human knowledge ever created. We just taught robots how to read it." — Dr. Chelsea Finn, Stanford IRIS Lab
V2AT is impressive, but significant challenges remain. The system currently works best with manipulation tasks that involve rigid objects and predictable physics. Handling deformable objects—folding laundry, kneading dough—remains difficult. The cross-embodiment transfer is imperfect, and the robot often needs several attempts to find a viable strategy for a given task.
Safety is also a concern. A robot learning from videos has no inherent understanding of what is dangerous. It does not know that a hot pan can cause burns, that a sharp knife requires careful handling, or that certain ingredients are allergens. Adding safety constraints without undermining the flexibility of the learning system is an active area of research.
V2AT is one piece of a larger puzzle: building robots that can learn and adapt as flexibly as humans do. Combined with advances in language-guided planning, real-time perception, and dexterous manipulation hardware, it points toward a future where robots are not pre-programmed tools but learning systems that continuously expand their capabilities.
The researchers at Stanford are already working on the next version of V2AT, which will incorporate audio and language understanding to learn not just from what people do in videos but from what they say. When robots can learn from both showing and telling, the pace of robot skill acquisition could accelerate dramatically.