Meet Qwen-Robot-Suite: Three Embodied AI Models for VLA Manipulation, Video World Modeling, and Navigation

The Qwen team has released three embodied AI models, grouped as Qwen-Robot-Suite. The three are Qwen-RobotManip, Qwen-RobotWorld, and Qwen-RobotNav. Each is built on a Qwen vision-language backbone and targets a different robotics problem.

Qwen-RobotManip is a Vision-Language-Action model for manipulation, built on Qwen3.5-4B. Qwen-RobotWorld is a language-conditioned video world model with a 60-layer MMDiT and a frozen Qwen2.5-VL encoder. Qwen-RobotNav is a navigation model built on Qwen3-VL, available at 2B, 4B, and 8B sizes.

Qwen-Robot-Suite

Qwen-Robot-Suite is not a single model. It is a suite of three independent foundation models. Two of them, RobotManip and RobotNav, ship with public GitHub repositories.

Robotics data is fragmented across hardware and tasks. Different robots use incompatible observation and action formats. A policy trained on one arm rarely transfers to another.

The three research reports address this fragmentation in different ways. RobotManip aligns action representations so manipulation data scales. RobotWorld uses language as a unified action interface for video prediction. RobotNav exposes a controllable observation interface for navigation tasks.

Here is the core split between the three releases:

| Model | Problem | Backbone | Output |
| --- | --- | --- | --- |
| Qwen-RobotManip | Robotic manipulation | Qwen3.5-4B (Qwen-VL) | Continuous robot actions |
| Qwen-RobotWorld | Embodied world modeling | Frozen Qwen2.5-VL | Predicted future video |
| Qwen-RobotNav | Mobile navigation | Qwen3-VL (2B/4B/8B) | Waypoint trajectories |

Qwen-RobotManip: Alignment Unlocks Scale for Manipulation

Qwen-RobotManip is a Vision-Language-Action (VLA) foundation model. It is built on the Qwen3.5-4B vision-language backbone and predicts continuous robot actions.

A VLA model takes camera views and a language instruction. It then outputs low-level robot actions. The challenge is that manipulation data is heterogeneous by nature.

Different robots record states and actions in incompatible formats. When demonstrations arrive with mismatched representations, scaling data produces interference. RobotManip solves this with a unified alignment framework.

The Unified Alignment Framework

The framework has three complementary mechanisms. First is a canonical state-action representation. It is an 80-dimensional vector with per-dimension binary masking.

This vector holds two 29-dimensional per-arm blocks plus 22 reserved dimensions. Each block stores joint positions, end-effector pose, gripper state, and dexterous hand joints. Robots populate only the dimensions they have.
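As a rough illustration of this layout, the sketch below builds the 80-dimensional binary mask for two embodiments. The per-arm group sizes (7 joint positions, 9 end-effector pose values, 1 gripper value, 12 hand joints) come from the article's own widget. The ordering of groups inside a block, the ordering of blocks inside the vector, and the helper name `build_mask` are assumptions made for this example, not details from the report.

```python
# Per-arm group sizes as listed in the article's widget (7 + 9 + 1 + 12 = 29).
# ASSUMPTION: the order of groups within a block and of blocks within the vector.
ARM_GROUPS = [
    ("joint_positions", 7),
    ("eef_pose", 9),
    ("gripper", 1),
    ("dexterous_hand", 12),
]
ARM_DIM = sum(size for _, size in ARM_GROUPS)  # 29
RESERVED_DIM = 22
TOTAL_DIM = 2 * ARM_DIM + RESERVED_DIM  # 80


def build_mask(arms, groups, use_reserved=False):
    """Return an 80-element 0/1 mask: 1 = populated, 0 = zero-padded.

    `build_mask` is a hypothetical helper written for this illustration.
    """
    mask = []
    for arm_index in range(2):
        arm_active = arm_index < arms
        for name, size in ARM_GROUPS:
            populated = arm_active and name in groups
            mask.extend([1 if populated else 0] * size)
    # ASSUMPTION: a robot either uses the whole reserved block or none of it.
    mask.extend([1 if use_reserved else 0] * RESERVED_DIM)
    return mask


gripper_groups = {"joint_positions", "eef_pose", "gripper"}
single_arm = build_mask(arms=1, groups=gripper_groups)
dual_dex = build_mask(arms=2, groups=gripper_groups | {"dexterous_hand"})

print(len(single_arm), sum(single_arm))  # 80 17
print(len(dual_dex), sum(dual_dex))      # 80 58
```

A single-arm gripper robot populates only 17 of the 80 dimensions here, which is why the mask matters: most of the vector is padding for any one embodiment.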

Second is a camera-frame delta pose parameterization. End-effector actions are expressed as deltas in the camera frame. This makes visually similar motions numerically proximate across embodiments.
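To see why a camera-frame parameterization brings embodiments closer together, consider a minimal sketch that covers only the translation part of the delta. Two hypothetical robots are mounted with different base orientations relative to the same camera. The rotation matrices, the helper names, and the 5 cm motion are all assumptions chosen for the example.

```python
import math


def mat_vec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[r][c] * v[c] for c in range(3)) for r in range(3)]


def transpose(m):
    return [[m[c][r] for c in range(3)] for r in range(3)]


def rot_z(theta):
    """Rotation about the z axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def camera_frame_delta(p_prev, p_next, r_base_to_cam):
    """Express the translation p_next - p_prev (base frame) in the camera frame."""
    delta_base = [b - a for a, b in zip(p_prev, p_next)]
    return mat_vec(r_base_to_cam, delta_base)


# The same visual motion: 5 cm along the camera's x axis.
cam_motion = [0.05, 0.0, 0.0]

# ASSUMPTION: robot A's base is aligned with the camera, robot B's is rotated 90 degrees.
r_a = rot_z(0.0)
r_b = rot_z(math.pi / 2)

# What each robot would log in its own base frame for that motion.
base_delta_a = mat_vec(transpose(r_a), cam_motion)
base_delta_b = mat_vec(transpose(r_b), cam_motion)

origin = [0.0, 0.0, 0.0]
cam_a = camera_frame_delta(origin, base_delta_a, r_a)
cam_b = camera_frame_delta(origin, base_delta_b, r_b)

print([round(x, 6) for x in base_delta_a], [round(x, 6) for x in base_delta_b])
print([round(x, 6) for x in cam_a], [round(x, 6) for x in cam_b])
```

The base-frame deltas differ between the two robots, while the camera-frame deltas agree. The rotational part of a pose delta would need an analogous change of frame, which this sketch leaves out.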

Third is an in-context policy adaptation mechanism. It reads recent execution history as an implicit embodiment identifier. The policy adjusts behavior at deployment time without parameter updates.
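The article does not describe how the history is encoded, so the following is only a sketch of the general idea: a fixed-length sliding window of recent (state, action) pairs is kept at deployment time and handed to the policy alongside the current observation. The class name `HistoryContext`, the window length, and the input format are hypothetical.

```python
from collections import deque


class HistoryContext:
    """Sliding window of recent (state, action) pairs (hypothetical helper)."""

    def __init__(self, max_steps=4):
        # A deque with maxlen drops the oldest entry automatically.
        self.buffer = deque(maxlen=max_steps)

    def record(self, state, action):
        self.buffer.append((list(state), list(action)))

    def as_policy_input(self, current_state):
        """Bundle the history with the current state; no weights are updated."""
        return {"history": list(self.buffer), "state": list(current_state)}


ctx = HistoryContext(max_steps=2)
ctx.record([0.0, 0.0], [0.1, 0.0])
ctx.record([0.1, 0.0], [0.1, 0.0])
ctx.record([0.2, 0.0], [0.0, 0.1])  # the first pair is evicted here

policy_input = ctx.as_policy_input([0.2, 0.1])
print(len(policy_input["history"]))  # 2
```

The point of the design is that how a robot actually responded to past commands carries information about its embodiment, so the context can stand in for an explicit robot ID.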

A dual-stream co-training strategy runs alongside this. It jointly optimizes manipulation data and a vision-language stream. This prevents the backbone’s perception and reasoning from eroding.

The Data Engine

RobotManip assembles roughly 38,100 hours of manipulation data. It uses only open-source datasets and human videos. No proprietary data collection was used.

A human-to-robot synthesis pipeline produces most of this scale. It converts egocentric hand demonstrations into robot trajectories. The pipeline renders across 15 robot platforms.

This synthesis alone yields about 24,808 hours of demonstrations. The egocentric source data is about 1,933 hours. Open-source robot datasets contribute over 11,000 hours.
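A quick arithmetic check of these figures is sketched below. It assumes that the three quoted components (synthesized demonstrations, egocentric source data, and open-source robot datasets) are what make up the roughly 38,100-hour total, which the article implies but does not state outright.

```python
synthesized_hours = 24_808
egocentric_hours = 1_933
open_source_min_hours = 11_000  # "over 11,000", so this is a lower bound
reported_total_hours = 38_100   # "roughly 38,100"

# ASSUMPTION: the three components above add up to the reported total.
lower_bound = synthesized_hours + egocentric_hours + open_source_min_hours
synth_share = synthesized_hours / reported_total_hours
hours_per_source_hour = synthesized_hours / egocentric_hours

print(lower_bound)                       # 37741
print(f"{synth_share:.1%}")              # 65.1%
print(f"{hours_per_source_hour:.1f}x")   # 12.8x
```

Under that assumption, the lower bound of 37,741 hours is consistent with the rounded total, synthesis accounts for about two thirds of the corpus, and each hour of egocentric video yields roughly 12.8 hours of rendered robot demonstrations across platforms.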

The pipeline separates action alignment from visual alignment. Action alignment retargets hand keypoints to gripper poses. Visual alignment uses SAM3 masking, ProPainter inpainting, and MuJoCo inverse kinematics.

A five-stage curation pipeline then filters the combined corpus. It catches sudden changes, temporal misalignment, and extreme values. One check found 81% of episodes in a subset failed state-action alignment.
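The three checks named above can be sketched for a single one-dimensional signal as follows. The thresholds, the function names, and the assumption that actions are per-step deltas are all illustrative. The report's actual five stages and their criteria are not given in this article.

```python
def has_sudden_change(series, max_step):
    """True if any consecutive pair of samples differs by more than max_step."""
    return any(abs(b - a) > max_step for a, b in zip(series, series[1:]))


def has_extreme_value(series, limit):
    """True if any sample lies outside [-limit, limit]."""
    return any(abs(x) > limit for x in series)


def alignment_error(states, delta_actions):
    """Mean |s[t+1] - (s[t] + a[t])|, assuming actions are per-step deltas."""
    errors = [
        abs(states[t + 1] - (states[t] + delta_actions[t]))
        for t in range(min(len(states) - 1, len(delta_actions)))
    ]
    return sum(errors) / len(errors) if errors else 0.0


def passes_curation(states, delta_actions, max_step=0.5, limit=10.0, max_align_err=0.05):
    # All thresholds are hypothetical values picked for this example.
    return (
        not has_sudden_change(states, max_step)
        and not has_extreme_value(states, limit)
        and alignment_error(states, delta_actions) <= max_align_err
    )


clean = passes_curation([0.0, 0.1, 0.2, 0.3], [0.1, 0.1, 0.1])
spiky = passes_curation([0.0, 0.1, 5.0, 0.3], [0.1, 0.1, 0.1])
misaligned = passes_curation([0.0, 0.1, 0.2, 0.3], [0.3, 0.0, -0.2])

print(clean, spiky, misaligned)  # True False False
```

The third episode looks smooth and well-bounded but still fails, because its actions do not explain its state transitions. That is the kind of state-action misalignment the 81% figure refers to.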

Benchmark Results

The research report argues that standard benchmarks fail to measure generalization, because models without robot pretraining match pretrained ones on in-distribution tests. RobotManip therefore focuses on out-of-distribution (OOD) settings.

| Benchmark (OOD) | Prev. SOTA (π0.5) | Qwen-RobotManip |
| --- | --- | --- |
| LIBERO-Plus | 84.4 | 91.4 |
| RoboTwin-C2R Hard | 47.9 | 69.4 |
| EBench | 27.1 | 45.6 |
| RoboCasa365 | 16.9 | 35.9 |
| RoboTwin-IF | 49.6 | 72.2 |

The largest reported relative gap is on cross-embodiment transfer. RobotManip reaches 23.9% using camera-frame EEF actions. That is about 3.2× the 7.5% achieved by π0.5.
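The gaps can be recomputed directly from the reported numbers. The short script below uses only figures quoted in this section and shows both absolute point gains and ratios, since the two rankings differ.

```python
# (previous SOTA pi0.5, Qwen-RobotManip), as reported in this section.
results = {
    "LIBERO-Plus": (84.4, 91.4),
    "RoboTwin-C2R Hard": (47.9, 69.4),
    "EBench": (27.1, 45.6),
    "RoboCasa365": (16.9, 35.9),
    "RoboTwin-IF": (49.6, 72.2),
    "Cross-embodiment transfer": (7.5, 23.9),
}

gains = {name: round(new - old, 1) for name, (old, new) in results.items()}
ratios = {name: round(new / old, 2) for name, (old, new) in results.items()}

for name in results:
    print(f"{name}: +{gains[name]:.1f} points, {ratios[name]:.2f}x")

largest_absolute = max(gains, key=gains.get)
largest_relative = max(ratios, key=ratios.get)
print(largest_absolute)  # RoboTwin-IF
print(largest_relative)  # Cross-embodiment transfer
```

In absolute points the biggest jump is on RoboTwin-IF (+22.6), while cross-embodiment transfer shows the biggest ratio (about 3.19×, which rounds to the 3.2× quoted).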

The model also ranks 1st on the RoboChallenge Table30-v1 generalist track. It scores a 20% relative improvement over the prior best. Real-robot validation covers AgileX ALOHA, Franka, UR, and ARX platforms.

RobotManip: Canonical Action Vector Explorer
RobotManip maps every robot into one 80-dimensional vector: two 29-dim per-arm blocks plus 22 reserved dimensions. Pick an embodiment to see which dimensions it populates. Unlit cells are zero-padded and masked from the training loss.
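One simple way to mask padded dimensions from a loss is sketched below as a masked mean squared error. Normalizing by the number of populated dimensions and the function name `masked_mse` are assumptions for this example. The article does not say which loss or normalization the model uses.

```python
def masked_mse(pred, target, mask):
    """Mean squared error over populated dimensions only (mask value 1)."""
    active = sum(mask)
    if active == 0:
        return 0.0
    squared = sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))
    return squared / active


TOTAL_DIM = 80
POPULATED = 17  # e.g. joints (7) + end-effector pose (9) + gripper (1) for one arm

mask = [1] * POPULATED + [0] * (TOTAL_DIM - POPULATED)
pred = [0.0] * TOTAL_DIM
# Populated targets are off by 1.0; padded targets hold junk that must not matter.
target = [1.0] * POPULATED + [99.0] * (TOTAL_DIM - POPULATED)

print(masked_mse(pred, target, mask))  # 1.0
```

Because every padded term is multiplied by zero, the junk values contribute nothing to the loss, and no gradient would flow through those dimensions in a training setup.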

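[Interactive widget: a grid of the 80 dimensions, split into a left-arm block (29 dims), a right-arm block (29 dims), and a reserved block (22 dims, shared, e.g. mobile-base velocity). Each cell is shown as populated (supervised), zero-padded (masked), or reserved, with counters for populated dimensions, masked dimensions, and the percentage of the vector used.]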

Built from the Qwen-RobotManip technical report · 80-dim canonical state-action representation

(function(){
// Per-arm semantic groups: joints(7) eef_pose(9) gripper(1) hand(12) = 29
var GROUPS=[
  {name:"Joint positions",size:7},
  {name:"End-effector pose",size:9},
  {name:"Gripper",size:1},
  {name:"Dexterous hand",size:12}
];
// Embodiments: which groups each arm uses. arms = number of arms used.
var ROBOTS=[
  {id:"franka",label:"Franka Panda (7-DOF single-arm)",arms:1,groups:["Joint positions","End-effector pose","Gripper"],
   note:"A 7-DOF single-arm gripper fills the joint, end-effector, and gripper fields of one arm. The opposite arm and all hand dimensions stay zero, so gradients never flow through them."},
  {id:"ur5",label:"UR5e (single-arm gripper)",arms:1,groups:["Joint positions","End-effector pose","Gripper"],
   note:"UR5e uses the same single-arm layout as Franka. The canonical template lets both share one model despite different kinematics."},
  {id:"aloha",label:"AgileX ALOHA (dual-arm)",arms:2,groups:["Joint positions","End-effector pose","Gripper"],
   note:"A dual-arm ALOHA system fills both per-arm blocks. Two grippers, two joint sets, and two end-effector poses are all supervised."},
  {id:"dex",label:"Dual-arm + dexterous hands",arms:2,groups:["Joint positions","End-effector pose","Gripper","Dexterous hand"],
   note:"A robot with dexterous hands additionally populates the 12 hand-joint dimensions per arm. This is the densest configuration in the template."},
  {id:"mobile",label:"Mobile dual-arm (uses reserved)",arms:2,groups:["Joint positions","End-effector pose","Gripper"],reserved:true,
   note:"A mobile dual-arm robot fills both blocks and taps the 22 reserved dimensions for extra degrees of freedom such as mobile-base velocity."}
];

function activeDims(robot){
// returns Set of populated indices in left(0-28) and right(29-57), plus reserved flag
var left=new Set(),right=new Set();
var idx=0;
GROUPS.forEach(function(g){
var lit = robot.groups.indexOf(g.name)>-1;
for(var i=0;i