PUBLISHER: ResearchInChina | PRODUCT CODE: 2007699
Research on Robot Large Models: World Models Are About to Become Standard, and OEMs Enter the Field to Accelerate Mass Production and Application
ResearchInChina has released the Embodied AI Robot Large Model (Including VLA) Research Report, 2026, which focuses on the research, analysis, and summary of the following content:
The basic concepts, industrial ecosystem map, multi-dimensional classification (application scope, capability modality, architecture), industry development drivers, key technology development directions, and commercialization models of Embodied AI robot large models;
The layout planning, team building, core talents, large model products and their applications, detailed introduction and implementation status of Embodied AI robot large model products, Embodied AI ecosystem partners, and recent key dynamics of 11 tech giants in the Embodied AI robot field, including Alibaba Group, NVIDIA, Google DeepMind, OpenAI, Microsoft, Huawei, Tencent RoboticsX, Baidu, ByteDance, iFlytek, and SenseTime;
The profile, development history and planning, robot products and large model installation, detailed introduction of self-developed large models, large model ecosystem cooperation, and recent key dynamics of 10 well-known robot enterprises, including UBTECH Robotics, Unitree Robotics, AgiBot, Leju Robotics, Galbot, RobotEra, FigureAI, Sanctuary AI, 1X Technologies, and Neura Robotics;
The layout planning, team building, core talents, robot products and large model installation, summary of large model products, detailed introduction of Embodied AI robot large model products, Embodied AI ecosystem partners, and recent key dynamics of 11 OEMs in the Embodied AI robot field, including Tesla, Toyota, Honda, Hyundai, Xiaomi, XPeng, GAC Group, Chery, Leapmotor, BYD, and Dongfeng Motor. In addition, this report summarizes the layout of 13 other global OEMs in the Embodied AI robot field.
Compared with traditional robot control algorithms, Embodied AI robot large models ("robot large models" for short) can make end-to-end or hierarchical decisions without precise modeling and can operate in unstructured, open environments (homes, outdoors, cluttered desktops). Compared with general large models, they pay more attention to fusing and understanding multi-modal information (vision, LiDAR, touch, text, etc.), aiming to complete closed-loop actions in the physical world and output motion commands such as joint angles, speeds, and grasping forces.
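As a purely illustrative aid (not drawn from the report), the minimal Python sketch below shows what such an interface might look like: a fused multi-modal observation goes in, and a low-level motion command comes out. All class, field, and method names here are assumptions.

```python
# Minimal sketch (assumed, not from the report) of the input/output interface
# an Embodied AI robot large model typically exposes: multi-modal observations
# in, low-level motion commands out. All names and fields are illustrative.
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class MultiModalObservation:
    """Fused sensor inputs plus a natural-language instruction."""
    rgb_image: bytes                             # camera frame (encoded)
    lidar_points: Optional[List[Tuple[float, float, float]]]  # (x, y, z) points, if a LiDAR is fitted
    tactile_readings: Optional[List[float]]      # fingertip pressure sensors
    instruction: str                             # e.g. "pick up the red cup"


@dataclass
class MotionCommand:
    """Closed-loop action output in the physical world."""
    joint_angles_rad: List[float]    # target joint positions
    joint_speeds_rad_s: List[float]  # target joint velocities
    grasp_force_n: float             # gripper/finger force in newtons


def act(model, obs: MultiModalObservation) -> MotionCommand:
    """One perception-decision-execution step: the model maps fused
    observations directly to a motion command, with no hand-built
    dynamics model in between."""
    return model.predict(obs)  # hypothetical inference call
```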
In recent years, the Embodied AI robot large model field has shown the following development trends:
Currently, robot large models represented by Vision-Language-Action (VLA) models have made significant progress in the "perception-decision-execution" closed loop, enabling robots to understand instructions and generate actions. However, such models still face bottlenecks in coping with the high diversity and uncertainty of the physical world. In essence, they "imitate" patterns in their training data, lacking foresight into the consequences of actions and an understanding of physical logic.
The introduction of world models is intended precisely to break this limitation. The core of a world model is to give robots the ability to "imagine the future". Trained on multi-modal data, it builds an internal dynamic representation of the physical environment and can predict state changes over multiple future steps from the current state and planned actions. This means robots can shift from passive instruction followers to active decision-makers capable of mental rehearsal. For example, when performing a "pouring water" task, a robot equipped with a world model can not only identify the cup and kettle but also predict the water flow trajectory, the cup's tilt angle, and possible spills before acting, thereby planning a safer and more accurate action sequence.
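To make this "imagine the future" loop concrete, the sketch below rolls candidate action plans forward through a learned one-step predictor and keeps the least risky plan. It is a conceptual illustration under assumed interfaces, not any vendor's actual API; the state/action dictionaries, predictor, and risk function are all placeholders.

```python
# A minimal conceptual sketch (assumed, not any vendor's API) of how a world
# model lets a robot rehearse plans: roll candidate action sequences forward
# through a learned dynamics predictor and keep the safest one.
from typing import Callable, List, Sequence

State = dict    # e.g. object poses, liquid level, cup tilt angle
Action = dict   # e.g. end-effector pose deltas


def rollout(predict_next: Callable[[State, Action], State],
            state: State,
            plan: Sequence[Action]) -> List[State]:
    """Predict the state after each step of a planned action sequence."""
    trajectory = []
    for action in plan:
        state = predict_next(state, action)  # learned dynamics, one step ahead
        trajectory.append(state)
    return trajectory


def choose_plan(predict_next: Callable[[State, Action], State],
                risk: Callable[[State], float],
                state: State,
                candidate_plans: Sequence[Sequence[Action]]) -> Sequence[Action]:
    """Mentally rehearse each candidate plan and pick the one whose imagined
    trajectory carries the least risk (e.g. predicted spills while pouring)."""
    def plan_risk(plan: Sequence[Action]) -> float:
        return max(risk(s) for s in rollout(predict_next, state, plan))
    return min(candidate_plans, key=plan_risk)
```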
Driving forces for the application of world models mainly come from three aspects:
Solving the data bottleneck: Collecting high-quality real robot data is extremely costly and limited in scale, and has become a core constraint on capability upgrades. World models can serve as powerful "data generators" and "simulation engines", producing massive, controllable, high-fidelity synthetic training scenarios and greatly reducing reliance on expensive real robot data.
Improving decision-making and generalization capabilities: Through prediction and deduction, world models give robots a degree of causal reasoning and physical intuition, so they can handle new scenarios and new objects not seen in training and achieve "learning by analogy".
Realizing the collaborative evolution of the "cerebrum" and the "cerebellum": The industry consensus is that future robot intelligence will result from the collaborative evolution of the "cerebrum" (high-level cognition and planning) and the "cerebellum" (low-level motion control). As a key component of the high-level "cerebrum", the world model complements execution-oriented models such as VLA, and together they constitute a complete intelligent system.
Many enterprises have developed their own world models, such as Alibaba's WorldVLA, NVIDIA's WAM, Tencent's Hunyuan 3D World Model, Unitree Robotics' UnifoLM-WMA-0, and AgiBot's GE-1. Among them, Unitree Robotics' UnifoLM-WMA-0, released and open-sourced around September 2025, is designed specifically for general robot learning and has been adapted to the company's humanoid and quadruped robots, offering two modes: decision and simulation. The decision mode can predict future physical interactions (such as stacking stability and collision risks), correct actions, and improve robustness on complex tasks. The simulation mode can generate high-fidelity synthetic data to address the scarcity of real robot training data.
AgiBot's world model GE-1, released in August 2025, is a video-generative world model for robot control. With a closed-loop architecture of "video generation + policy learning + simulation evaluation", it realizes end-to-end reasoning from "seeing" to "thinking" to "acting". GE-1 works alongside AgiBot's GO-1 series base models: GO-1 focuses on general task planning and common-sense knowledge support, while GE-1 specializes in spatiotemporal prediction and action rehearsal, improving the task success rate and stability of the G2 robot in complex scenarios.
GE-1 was officially deployed on G2, AgiBot's industrial-grade interactive embodied manipulation robot, in October 2025, and AgiBot announced that it had won an order worth hundreds of millions of yuan from Longcheer Technology. The robot has performed tasks such as making sandwiches, pouring tea, and wiping a desktop.
In the traditional robot development mode, each robot's software and algorithms must be specially developed and optimized for its unique hardware configuration (sensors, actuators, form factor), leading to high R&D costs, long cycles, and non-reusable capabilities. Cross-platform application of robot large models can overcome this drawback. By building a powerful end-to-end multi-modal foundation model, it gives robots transferable general intelligence, enabling capabilities to generalize and deploy rapidly across different embodiments (such as humanoids, quadrupeds, and robotic arms), different tasks, and different environments.
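One common way to realize such cross-embodiment reuse is to keep a single shared foundation model and attach a thin, per-robot adaptation layer. The sketch below is an assumption about that general pattern, not any specific product's architecture; the protocols and method names are illustrative.

```python
# Minimal sketch (an assumption, not any specific product's architecture) of
# cross-platform deployment: one shared foundation model produces an
# embodiment-agnostic action, and a thin per-robot adapter maps it onto that
# body's joints. New bodies only need a new adapter, not a new model.
from typing import Dict, List, Protocol


class FoundationModel(Protocol):
    def plan(self, observation: dict, instruction: str) -> dict:
        """Return an embodiment-agnostic action, e.g. a target end-effector pose."""
        ...


class EmbodimentAdapter(Protocol):
    def to_joint_commands(self, generic_action: dict) -> List[float]:
        """Map the generic action onto this robot's specific joint layout."""
        ...


def control_step(model: FoundationModel,
                 adapters: Dict[str, EmbodimentAdapter],
                 robot_type: str,
                 observation: dict,
                 instruction: str) -> List[float]:
    """Run the same foundation model on a humanoid, quadruped, or arm by
    swapping only the adapter registered for that embodiment."""
    generic_action = model.plan(observation, instruction)
    return adapters[robot_type].to_joint_commands(generic_action)
```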
Since 2025, robot large models such as NVIDIA's GR00T series, Google DeepMind's Gemini Robotics, Microsoft's Rho-alpha, Huawei's CloudRobo, and RobotEra's ERA-42 have all supported cross-robot-platform development and cross-scenario applications.
In Q3 2025, NVIDIA released the GR00T N1.6 large model, positioned as a general humanoid robot VLA model. Through a unified multi-modal interface, a modular adaptation layer, a simulation-reality collaborative pipeline, and a hierarchical deployment architecture, it achieves the cross-platform goal of "train once, adapt to many machines". It supports dual-arm humanoids, mobile robotic arms, warehouse AGVs, medical assistive robots, scientific research robots, and more. It can execute tasks on new objects and in new scenarios without large amounts of data, and can be flexibly adapted to application scenarios such as industrial manufacturing, logistics and warehousing, household and commercial services, medical care and health, and scientific R&D.
RobotEra's end-to-end VLA embodied large model ERA-42 was released in December 2024 and initially adapted to its dexterous hand XHAND1. In mid-2025, the model was successively applied across platforms to the wheeled service robot Q5 and the bipedal humanoid robot L7, enabling rapid adaptation to new tasks without pre-programming.
The open-sourcing of large models is not simple technical sharing. Open-source models gather the wisdom of global developers and can quickly overcome complex "long-tail problems" in the physical world. At the same time, open-sourcing breaks the traditional closed-source business model, allowing small and medium-sized enterprises to develop quickly on top of open-source models, focus their resources on hardware innovation and scenario implementation, and form an industrial pattern in which "giants build the platform, and hundreds of enterprises perform on it".
The core of open-sourcing is to lower the R&D threshold, accelerate technological iteration, build ecosystem barriers, promote large-scale implementation, and form a positive flywheel of "open source - ecosystem - data - more powerful models".
Xiaomi's VLA large model for Embodied AI robots, Xiaomi-Robotics-0, was officially open-sourced on February 12, 2026, under the Apache License 2.0 (which allows commercial use, modification, and distribution without copyleft "contagion"), with full-stack, unreserved open-sourcing of the complete code, pre-trained weights, technical documents, papers, deployment solutions, and more. Xiaomi-Robotics-0 reuses Xiaomi's autonomous driving perception and decision technology to realize technology interoperability between robots and automobiles. It adopts a Mixture of Experts (MoE) architecture that separates the "cerebrum" (vision-language understanding) from the "cerebellum" (action execution). This design alleviates the inference latency problem that traditional VLA models may have, making it better suited to consumer robot products that require real-time response.
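To illustrate why separating a "cerebrum" from a "cerebellum" can reduce the latency seen by the actuators, the sketch below runs a slow vision-language planning loop alongside a fast control loop. This is a conceptual sketch only, not Xiaomi's released code; all classes, rates, and placeholder behaviors are assumptions.

```python
# Conceptual sketch only (not Xiaomi's released code): a "cerebrum/cerebellum"
# split in which a slow vision-language module refreshes the task plan at a low
# rate while a fast action module executes at control rate, so actuators never
# wait on the heavy vision-language inference.
import time


class Cerebrum:
    """Slow loop: vision-language understanding, refreshed every few hundred ms."""
    def understand(self, image, instruction: str) -> dict:
        return {"subgoal": "reach target", "target": instruction}  # placeholder plan


class Cerebellum:
    """Fast loop: turns the latest plan plus fresh sensor data into joint commands."""
    def execute(self, plan: dict, joint_state: list) -> list:
        return [q + 0.01 for q in joint_state]  # placeholder incremental command


def control_loop(cerebrum: Cerebrum, cerebellum: Cerebellum,
                 get_image, get_joints, send_command,
                 instruction: str, plan_period_s: float = 0.5,
                 control_period_s: float = 0.02) -> None:
    """Replan slowly, act quickly: the cerebellum reuses the most recent plan
    between cerebrum updates instead of blocking on them."""
    plan, last_plan_time = None, 0.0
    while True:
        now = time.monotonic()
        if plan is None or now - last_plan_time > plan_period_s:
            plan, last_plan_time = cerebrum.understand(get_image(), instruction), now
        send_command(cerebellum.execute(plan, get_joints()))
        time.sleep(control_period_s)
```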
The entry of multiple OEMs into the Embodied AI and humanoid robot track brings massive industrial scenario data, automotive-grade sensor data, and a mature autonomous driving technology stack to Embodied AI large models (VLA, world models, etc.). Algorithms such as BEV perception, multi-modal fusion, and end-to-end decision-making can be migrated directly to robots to train and improve the models' environmental understanding, task planning, and motion control capabilities. OEMs' production line scenarios can verify the reliability and success rate of robot large models, expose model defects, provide highly reliable real robot interaction data for subsequent model correction, and effectively narrow the large gap between simulation and reality.
In addition, OEMs introduce automotive-grade safety standards and collaborative hardware design into robots, greatly improving the inference latency, reliability, and deployment efficiency of large models. The core supply chains of automobiles and robots (batteries, motors, sensors, domain controllers, etc.) overlap heavily; some institutions estimate that the overlap rate exceeds 50%. The resulting scale effect greatly reduces the cost of core hardware, and model deployment costs decrease in step.
For example, to solve the data problem, GAC Group borrows from its autonomous driving data collection experience: it sends robots into real scenarios to collect real data while carrying out in-depth adaptation and field verification of core functions, forming a closed-loop data growth model of "learning by using, using by learning". In terms of cost reduction, its robots reuse vehicle components (such as chips and LiDAR) and achieve 100% localization of key components. GAC has clearly planned to mass-produce its fourth-generation product GoMate Mini in 2027, taking the security scenario as the first commercial application field for its robots.