mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs

¹mimic robotics · ²Microsoft Zurich · ³ETH Zurich · ⁴ETH AI Center · ⁵UC Berkeley
*Core Contributors · Co-advising

mimic-video is a new class of Video-Action Model (VAM) that grounds robotic policies in pretrained video models.

Abstract

Prevailing Vision-Language-Action Models (VLAs) for robotic manipulation are built upon vision-language backbones pretrained on large-scale but disconnected, static web data. As a result, despite improved semantic generalization, the policy must implicitly infer complex physical dynamics and temporal dependencies solely from robot trajectories. This reliance creates an unsustainable data burden, necessitating continuous, large-scale expert data collection to compensate for the lack of innate physical understanding. We contend that while vision-language pretraining effectively captures semantic priors, it remains blind to physical causality. A more effective paradigm leverages video to jointly capture semantics and visual dynamics during pretraining, thereby isolating the remaining task of low-level control. To this end, we introduce mimic-video, a novel Video-Action Model (VAM) that pairs a pretrained Internet-scale video model with a flow-matching action decoder conditioned on its latent representations. The decoder serves as an Inverse Dynamics Model (IDM), generating low-level robot actions from the latent representation of video-space action plans. Our extensive evaluation shows that our approach achieves state-of-the-art performance on simulated and real-world robotic manipulation tasks, improving sample efficiency by 10x and convergence speed by 2x compared to traditional VLA architectures.

mimic-video Overview

We introduce mimic-video, a new class of Video-Action Model (VAM) that grounds robotic policies in pretrained video models. Unlike standard VLAs that must learn physical dynamics from scratch, mimic-video leverages the inherent visual dynamics of a video backbone to isolate the control problem. This enables state-of-the-art performance on dexterous manipulation tasks, while achieving 10x greater sample efficiency compared to VLAs.

The Model


We instantiate our framework with a pretrained video generation backbone (NVIDIA Cosmos-Predict2), which provides rich physical dynamics priors learned from large-scale video data. We adapt this model for control via a partial denoising strategy, in which the video backbone follows the flow only up to an intermediate flow time τv to extract latent visual plans. These representations condition a smaller action decoder, which processes proprioceptive states and predicts action trajectories. The video and action components operate on independent flow schedules (τv and τa), allowing us to design the learning problem separately for each modality.
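To make the two-schedule design concrete, the sketch below shows partial denoising followed by flow-matching action decoding in plain PyTorch. It is a minimal illustration under assumptions: VideoBackbone and ActionDecoder are toy stand-ins rather than the Cosmos-Predict2 or mimic-video APIs, and the intermediate flow time of 0.6, the latent shapes, the 8-step action horizon, and the 32-dimensional proprioception are hypothetical values chosen only for the example.

import torch
import torch.nn as nn

class VideoBackbone(nn.Module):
    # Toy stand-in for a pretrained latent video model (e.g. Cosmos-Predict2).
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def velocity(self, z, tau_v):
        # Flow-matching velocity field for video latents z at flow time tau_v.
        return self.net(z)

class ActionDecoder(nn.Module):
    # Flow-matching inverse dynamics model conditioned on video latents.
    def __init__(self, latent_dim=256, proprio_dim=32, action_dim=16, horizon=8):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        in_dim = latent_dim + proprio_dim + horizon * action_dim + 1
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.GELU(),
                                 nn.Linear(512, horizon * action_dim))

    def velocity(self, a, tau_a, video_latent, proprio):
        feats = torch.cat([video_latent.mean(dim=1), proprio,
                           a.flatten(1), tau_a.unsqueeze(-1)], dim=-1)
        return self.net(feats).view(-1, self.horizon, self.action_dim)

@torch.no_grad()
def predict_action_chunk(backbone, decoder, z_noise, proprio,
                         tau_v_target=0.6, n_video_steps=6, n_action_steps=10):
    # 1) Partial denoising: integrate the video flow only up to the
    #    intermediate flow time tau_v_target instead of a clean video.
    z, tau_v = z_noise, 0.0
    dt_v = tau_v_target / n_video_steps
    for _ in range(n_video_steps):
        z = z + dt_v * backbone.velocity(z, tau_v)
        tau_v += dt_v

    # 2) Treat the intermediate latents as the visual plan and run the action
    #    decoder on its own, independent flow schedule tau_a from 0 to 1.
    a = torch.randn(z.shape[0], decoder.horizon, decoder.action_dim)
    dt_a = 1.0 / n_action_steps
    for k in range(n_action_steps):
        tau_a = torch.full((z.shape[0],), k * dt_a)
        a = a + dt_a * decoder.velocity(a, tau_a, z, proprio)
    return a  # predicted low-level action trajectory

backbone, decoder = VideoBackbone(), ActionDecoder()
actions = predict_action_chunk(backbone, decoder,
                               z_noise=torch.randn(1, 16, 256),  # noisy video latents
                               proprio=torch.randn(1, 32))       # robot proprioception
print(actions.shape)  # torch.Size([1, 8, 16])

Because the two schedules are decoupled, the video backbone never has to be integrated all the way to clean pixels at control time; the action decoder only consumes the partially denoised latents.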

Real World Experiment on Bimanual mimic Setup

We train and evaluate mimic-video on a real-world bimanual robot setup with Franka Emika Panda robot arms and mimic 16-DoF dexterous humanoid hands. We evaluate on this setup with two tasks: package sorting and tape stowing (pick-and-place of a measuring tape into a box).

Video Generation and Autonomous Execution

For each action chunk, mimic-video generates a latent video plan only up to an intermediate noise level and then executes the decoded actions on the real robot; for this visualization, we additionally denoise the predicted video fully.
We compare mimic-video against a standard single-task Diffusion Policy (DP) baseline on the real robot. While mimic-video finetunes the video model backbone on task videos, both methods train their action decoders from scratch on the same small dataset of low-dimensional robot actions.
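For readers who prefer code, a hedged sketch of this chunk-by-chunk execution loop follows, reusing predict_action_chunk and the placeholder modules from the sketch under The Model. The Camera and Robot stubs, encode_obs, and all shapes are hypothetical placeholders for the real bimanual hardware stack and the video model's latent encoder; only the receding-horizon structure (observe, partially denoise a latent plan, decode an action chunk, execute, replan) reflects the description above.

import torch

class Camera:
    def read(self):
        return torch.rand(3, 224, 224)   # dummy RGB frame in place of the real camera

class Robot:
    def joint_state(self):
        return [0.0] * 32                # dummy proprioception vector
    def send_action(self, action):
        pass                             # would stream the command to the controller

def encode_obs(frame):
    # Placeholder for the video model's latent encoder (e.g. its VAE tokenizer).
    return frame.flatten()[: 16 * 256].view(1, 16, 256)

def rollout(backbone, decoder, camera, robot, n_chunks=10):
    for _ in range(n_chunks):
        proprio = torch.tensor(robot.joint_state()).unsqueeze(0)
        # Start from noise around the encoded observation, denoise the video
        # latents only up to the intermediate flow time, then decode actions.
        z0 = torch.randn(1, 16, 256) + encode_obs(camera.read())
        chunk = predict_action_chunk(backbone, decoder, z0, proprio)
        for action in chunk[0]:          # execute the decoded chunk open-loop
            robot.send_action(action.tolist())

rollout(backbone, decoder, Camera(), Robot())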

Package Sorting (mimic-video)

Tape Stowing (mimic-video)

Package Sorting (DP Baseline)

Tape Stowing (DP Baseline)

Sample Efficiency and Convergence Results on Simulated Benchmarks

We additionally evaluate mimic-video on the SIMPLER and LIBERO simulated benchmarks, comparing it against a traditional VLA baseline that uses FAST pretraining and Knowledge Insulation for action decoder training.

Case Study: How Does Video Generation Quality Affect Robot Policy Performance?

We compare success rates when conditioning our action decoder on different visual inputs: video latents obtained either from the model's own predictions or from ground-truth (expert) videos, using both a standard pretrained video model and a video model finetuned on video data from the robot dataset.
The near-perfect performance with ground truth inputs confirms that control effectively reduces to visual prediction, implying policy performance scales directly with video model quality.
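This ablation is straightforward to express in code: the action decoder is kept fixed and only the source of its conditioning latents changes. The sketch below is an illustration under assumptions, reusing the placeholder backbone, encode_obs, and flow-time convention from the earlier sketches; expert_video stands in for a ground-truth demonstration clip, and the way the expert video is brought to the intermediate flow time is our guess, not the exact evaluation pipeline.

import torch

def condition_latents(source, backbone, expert_video, tau_v_target=0.6):
    if source == "ground_truth":
        # Oracle condition: interpolate the encoded expert video with noise so the
        # decoder sees latents at the same intermediate flow time as usual.
        z_clean = encode_obs(expert_video)
        return (1 - tau_v_target) * torch.randn_like(z_clean) + tau_v_target * z_clean
    elif source == "predicted":
        # Standard condition: partially denoise from pure noise with the video model,
        # either the off-the-shelf pretrained backbone or the robot-finetuned one.
        z, tau_v, n_steps = torch.randn(1, 16, 256), 0.0, 6
        dt = tau_v_target / n_steps
        for _ in range(n_steps):
            z = z + dt * backbone.velocity(z, tau_v)
            tau_v += dt
        return z
    raise ValueError(source)

z_oracle = condition_latents("ground_truth", backbone, torch.rand(3, 224, 224))
z_pred   = condition_latents("predicted",    backbone, None)
# Both latents feed the same action decoder; only the conditioning source differs.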

Sample Efficiency and Convergence Speed

mimic-video sample efficiency and convergence plots
Sample efficiency (left) and convergence speed (right) on the LIBERO benchmark. mimic-video achieves 10x better sample efficiency compared to a comparable traditional VLA. Decreasing the dataset size to only one episode per task (2% of action data) still yields a 77% success rate. Furthermore, mimic-video converges twice as fast as the VLA baseline, and to a higher asymptotic success rate, despite the VLA baseline having been exposed to task-specific action data during FAST pretraining.

BibTeX

@article{pai2025mimicvideo,
  author    = {Jonas Pai and Liam Achenbach and Victoriano Montesinos and Benedek Forrai and Oier Mees and Elvis Nava},
  title     = {mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs},
  journal   = {arXiv preprint arXiv:2512.15692},
  year      = {2025},
}