Xiaomi, known primarily for its smartphones, smart home devices, and electric vehicle development, has announced a new step into robotics with the introduction of Xiaomi Robotics 0, its first large-scale robot model.

It is an open vision-language-action (VLA) model with 4.7 billion parameters that combines computer vision, natural language understanding, and real-time physical action. The company says this combination underpins its concept of “physical intelligence” and claims that the model already achieves state-of-the-art results in both simulation and real-world tests.

Robotic systems of this kind generally run a closed loop of perception, decision-making, and execution. The robot must first see its environment, understand the task at hand, formulate an action plan, and then carry it out correctly. According to Xiaomi, Robotics 0 was developed to balance broad contextual understanding with precise motor control.
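To make the perception, decision, and execution cycle concrete, here is a minimal toy sketch in Python. It is not Xiaomi's code: the one-dimensional `World`, the clamped-step `decide` rule, and all function names are hypothetical stand-ins chosen only to show how the three stages feed into one another.

```python
from dataclasses import dataclass


@dataclass
class World:
    gripper: float  # toy 1-D gripper position (stand-in for a real scene)
    target: float   # toy 1-D position of the object to reach


def perceive(world: World) -> float:
    # Perception: observe how far the gripper is from the target.
    return world.target - world.gripper


def decide(offset: float, max_step: float = 0.25) -> float:
    # Decision: move toward the target, but never more than max_step at once.
    return max(-max_step, min(max_step, offset))


def execute(world: World, action: float) -> None:
    # Execution: apply the chosen motion to the (simulated) world.
    world.gripper += action


def run_loop(world: World, tolerance: float = 1e-3, max_iters: int = 100) -> int:
    """Repeat perceive -> decide -> execute until the target is reached."""
    steps = 0
    for _ in range(max_iters):
        offset = perceive(world)
        if abs(offset) <= tolerance:
            break
        execute(world, decide(offset))
        steps += 1
    return steps
```

Because each pass starts with a fresh observation, an error made in one step can be corrected in the next, which is the point of closing the loop.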

The model's architecture follows the Mixture of Transformers (MoT) approach, in which functions are split between two main modules. The first is the vision-language model (VLM), which acts as the central “brain” of the system. It is trained to interpret human instructions, including loosely worded requests such as “fold a towel,” and to analyze spatial relationships in high-resolution images. This module is responsible for recognizing objects, answering visual questions, and drawing logical conclusions.

The second element is the so-called Action Expert. It is built on a multi-layer Diffusion Transformer (DiT) and is responsible for the physical execution of movements. Instead of generating individual commands one at a time, it produces an “action chunk”, a sequence of movements generated with flow matching so that execution stays smooth and accurate.
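The idea of generating an action chunk with flow matching can be illustrated with a small Python sketch. It rests on clearly stated assumptions: the chunk is a flat list of scalar actions, and `toy_velocity` is a hypothetical oracle that already knows the desired chunk. In a real system a learned network such as the Action Expert would predict this velocity from the VLM's features. Only the sampling step is shown: start from noise and integrate the velocity field from t = 0 to t = 1 with Euler steps.

```python
import random


def toy_velocity(chunk, t, target_chunk):
    # Hypothetical stand-in for a learned velocity network.
    # For the straight-line path a_t = (1 - t) * noise + t * target,
    # the velocity at point a_t is (target - a_t) / (1 - t).
    return [(g - a) / (1.0 - t) for a, g in zip(chunk, target_chunk)]


def sample_action_chunk(target_chunk, num_steps=10, seed=0):
    """Turn Gaussian noise into an action chunk by Euler integration."""
    rng = random.Random(seed)
    chunk = [rng.gauss(0.0, 1.0) for _ in target_chunk]  # start from pure noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k / num_steps  # t stays strictly below 1, so no division by zero
        velocity = toy_velocity(chunk, t, target_chunk)
        chunk = [a + dt * v for a, v in zip(chunk, velocity)]
    return chunk
```

For example, `sample_action_chunk([0.0, 0.1, 0.2, 0.3])` returns four values that closely match the target sequence. The whole chunk is refined together at every integration step, rather than one command at a time, which is what allows the resulting motion to be smooth.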

A common problem with VLA models is that they lose some of their vision-language understanding once they are trained on physical actions. Xiaomi claims to have avoided this by training jointly on multimodal and motion data. As a result, the system, at least according to the company, retains its analytical abilities while still interacting effectively with the physical environment.
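One simple way to picture joint training is a data loader that puts both kinds of examples into every batch. The Python sketch below is only an illustration under assumptions: the function name, the 50/50 `vl_fraction` default, and the batch size are hypothetical, since the mixing ratio and training recipe are not described here.

```python
import random


def mixed_batches(vl_examples, action_examples, vl_fraction=0.5,
                  batch_size=4, num_batches=3, seed=0):
    """Yield batches that mix vision-language and robot-motion examples."""
    rng = random.Random(seed)
    n_vl = round(batch_size * vl_fraction)  # vision-language items per batch
    n_action = batch_size - n_vl            # robot-motion items per batch
    for _ in range(num_batches):
        batch = [("vl", x) for x in rng.sample(vl_examples, n_vl)]
        batch += [("action", x) for x in rng.sample(action_examples, n_action)]
        rng.shuffle(batch)  # avoid a fixed ordering inside the batch
        yield batch
```

Because every batch contains vision-language data, the model keeps receiving a training signal for its reasoning skills while it learns motor control.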

Source: Gizmochina


Xiaomi is entering the robotics industry: New AI model for robots combines vision, speech and action
