PUBLISHER: ResearchInChina | PRODUCT CODE: 1892141
Research on Automotive Multimodal Interaction: The Interaction Evolution of L1~L4 Cockpits
ResearchInChina has released the "China Automotive Multimodal Interaction Development Research Report, 2025". The report reviews the installation of interaction modalities in automotive cockpits, multimodal interaction patents, mainstream cockpit interaction modes, the application of interaction modes in key vehicle models launched in 2025, the cockpit interaction solutions of automakers and suppliers, and integration trends in multimodal interaction.
According to the "White Paper on Automotive Intelligent Cockpit Levels and Comprehensive Evaluation" jointly released by the China Society of Automotive Engineers (China-SAE), five levels of intelligent cockpits are defined: L0-L4.
Multimodal interaction is a key driver of cockpit intelligence. It relies on AI large models working together with multiple hardware components to fuse interaction data from multiple sources. On that basis, the cockpit accurately understands the intentions of drivers and passengers and provides scenario-based feedback, ultimately achieving natural, safe, and personalized human-machine interaction. The automotive intelligent cockpit industry is currently at the L2 stage overall, with some leading manufacturers exploring and moving toward L3.
The core feature of L2 intelligent cockpits is "strong perception, weak cognition". At the L2 stage, the cockpit's multimodal interaction achieves signal-level fusion. Built on multimodal large model technology, it can "understand users' ambiguous intentions" and "simultaneously process multiple commands", but what it executes is still the user's immediate, explicit commands. Most mass-produced intelligent cockpits already offer this capability.
The Li i6 is one example. It is equipped with MindGPT-4o, the latest multimodal model, which offers understanding and response capabilities with ultra-long memory and ultra-low latency, along with more natural language generation. It supports multimodal "see and speak" interaction (voice + vision fusion search), which lets children who cannot yet read pick the cartoon they want to watch by describing what is on the video cover. It also supports multimodal referential interaction (voice + gesture) in two forms. 1. Voice reference to objects: while issuing a command, the user extends the index finger; pointing left, for example, controls the window and completes the vehicle control action. 2. Voice reference to people: passengers in the same row can direct a voice command at a designated person by combining gesture and speech, e.g., pointing right and saying "Turn on the seat heating for him".
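The voice + gesture reference described above can be illustrated with a minimal Python sketch. It is not Li Auto's implementation: the seat names, the `resolve_pointed_seat` and `resolve_command` helpers, and the keyword match standing in for the large model are all hypothetical, and the sketch only covers the "reference to people" case for passengers in the same row.

```python
from typing import Dict, Optional

# Hypothetical seat layout: each row is a (left seat, right seat) pair.
ROW_SEATS = {
    "front": ("front_left", "front_right"),
    "rear": ("rear_left", "rear_right"),
}


def resolve_pointed_seat(speaker_seat: str, direction: str) -> Optional[str]:
    """Return the same-row seat the speaker points at, or None if there is none."""
    for left, right in ROW_SEATS.values():
        if speaker_seat == left and direction == "right":
            return right
        if speaker_seat == right and direction == "left":
            return left
    return None


def resolve_command(utterance: str, speaker_seat: str,
                    direction: str) -> Optional[Dict[str, str]]:
    """Fuse the spoken command with the pointing gesture into one action."""
    target = resolve_pointed_seat(speaker_seat, direction)
    if target is None:
        return None
    # Assumption: a keyword match stands in for large-model intent parsing.
    if "seat heating" in utterance.lower():
        return {"action": "seat_heating_on", "seat": target}
    return None
```

For instance, a rear-left passenger pointing right and saying "Turn on the seat heating for him" would resolve to the rear-right seat under this assumed layout.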
The core feature of L3 intelligent cockpits is "strong perception, strong cognition". In the L3 stage, the multimodal interaction function of cockpits achieves cognitive-level fusion. Relying on large model capabilities, the cockpit system can comprehensively understand the complete current scenario and actively initiate reasonable services or suggestions without the user issuing explicit commands.
The core feature of L4 intelligent cockpits is "full-domain cognition and autonomous evolution", creating a "full-domain intelligent manager" for users. At the L4 stage, the intelligent cockpit will be far more than a tool: it becomes a "digital twin partner" that can predict users' unspoken needs, share memories with them, and dispatch all available resources on their behalf. Its core experience is that, before the user clearly perceives or expresses a need, the system has already completed prediction and planning and entered the execution state.
The AI Agent can be regarded as the core execution unit and the key technical architecture through which functions are implemented as intelligent cockpits evolve from L2 to L4. By integrating voice, vision, touch, and situational information, an AI Agent can not only "understand" commands but also "see" the environment and "perceive" the state, thereby combining previously discrete cockpit functions into a coherent, proactive, and personalized service process.
Agent applications under L2 can be regarded as "enhanced command execution", which is the ultimate extension of L2 cockpit interaction capabilities. Based on large model technology, the cockpit system decomposes a user's complex command into multiple steps and then calls different Agent tools to execute them. For example, a passenger says: "I'm tired, help me buy a cup of coffee." The large model of the L2 cockpit system will understand this complex command and then call in sequence:
1. Voice Agent: parses user needs in real time;
2. Food Ordering Agent: recommends the best options according to user preferences, real-time location, and restaurant business status;
3. Payment Agent: automatically completes seamless payment with no action required from the user;
4. Delivery Agent: dynamically plans the food delivery time using vehicle navigation data (e.g., "food arrives when the car arrives", so that the food is delivered just as the user reaches the destination).
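The four-step call sequence above can be sketched as a simple sequential pipeline. This is a minimal illustration under stated assumptions, not any vendor's architecture: `TaskContext`, `run_pipeline`, and the four agent functions are hypothetical names, the keyword match stands in for the large model, and each agent only writes placeholder values rather than calling a real ordering, payment, or navigation service.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class TaskContext:
    """Shared state handed from one agent to the next."""
    utterance: str
    state: Dict[str, str] = field(default_factory=dict)
    trace: List[str] = field(default_factory=list)


def voice_agent(ctx: TaskContext) -> bool:
    # Assumption: a keyword match stands in for large-model intent parsing.
    ctx.trace.append("voice")
    if "coffee" in ctx.utterance.lower():
        ctx.state["intent"] = "order_coffee"
        return True
    ctx.state["intent"] = "unknown"
    return False


def food_ordering_agent(ctx: TaskContext) -> bool:
    # Placeholder: a real agent would rank options by preference and location.
    ctx.trace.append("food_ordering")
    ctx.state["order"] = "coffee (placeholder choice)"
    return True


def payment_agent(ctx: TaskContext) -> bool:
    # Placeholder: no real payment service is called here.
    ctx.trace.append("payment")
    ctx.state["payment"] = "authorized"
    return True


def delivery_agent(ctx: TaskContext) -> bool:
    # Placeholder: a real agent would align delivery with the navigation ETA.
    ctx.trace.append("delivery")
    ctx.state["delivery"] = "timed to vehicle arrival"
    return True


PIPELINE: List[Callable[[TaskContext], bool]] = [
    voice_agent,
    food_ordering_agent,
    payment_agent,
    delivery_agent,
]


def run_pipeline(utterance: str) -> TaskContext:
    """Call each agent in order; stop as soon as one reports failure."""
    ctx = TaskContext(utterance)
    for agent in PIPELINE:
        if not agent(ctx):
            break
    return ctx
```

Calling `run_pipeline("I'm tired, help me buy a cup of coffee.")` walks through all four agents in order, while an utterance the voice step cannot parse stops after the first agent. The pipeline only runs in response to an explicit user command, which matches the L2 behavior described here.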
At this stage, Agent applications are essentially responses to, and executions of, a user's explicit and complex commands. The cockpit system does nothing on its own initiative; it simply completes the tasks assigned by the user more intelligently.
Case (1): IM Motors released the "IM AIOS Ecological Cockpit", jointly developed with Banma Zhixing. This cockpit is the first to deliver Alibaba's ecosystem services in the form of AI Agents, creating a "No Touch & No App" human-vehicle interaction mode. Its "AI Food Ordering Agent" and "AI Ticketing Agent" functions let users complete food selection or ticketing, and the payment, through voice interaction alone, with no manual operation.
Case (2): On August 4, 2025, Denza officially launched the "Car Life Agent" intelligent service system at its brand press conference. The system debuts on two flagship models, the Denza Z9 and Z9GT. The "Car Life Agent" supports food ordering by voice and payment by face recognition. Once the order is placed, the system automatically plans the navigation route, forming a seamless closed loop from demand to service.
At the next level of intelligent cockpits, Agent applications will change from "you say, I do" to "I watch, I guess, I suggest, let's do it together". Users will not need to issue any explicit command. If the user simply sighs and rubs their temples, the large model can combine data from the camera (tired micro-expressions), biological sensors (heart rate changes), navigation (2 hours of continuous driving), and the clock (3 pm, the afternoon sleepiness period) to conclude that "the user is in the tired period of long-distance driving and has the need to rest and refresh". On that basis, the system proactively initiates the interaction: "You seem to need a rest. There is a service zone * kilometers ahead with your favorite ** coffee. Do you need me to turn on the navigation? At the same time, I can play refreshing music for you." Only after the user agrees does the system call the navigation, entertainment, and other Agent tools.
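The proactive flow above can be illustrated with a small rule-based sketch. The report attributes this judgment to a large model; the code below swaps in a simple additive score purely for illustration. `CabinSnapshot`, `fatigue_score`, `propose_rest`, and every threshold (a 10 bpm heart rate change, 120 minutes of driving, a 13:00 to 15:00 window, a score of 3) are assumptions, not values from the report.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CabinSnapshot:
    """One fused reading of the signals named in the text (hypothetical fields)."""
    tired_expression: bool        # from the camera
    heart_rate_delta_bpm: float   # from biological sensors, change vs. baseline
    continuous_driving_min: int   # from navigation data
    hour_of_day: int              # 0-23, from the clock


def fatigue_score(s: CabinSnapshot) -> int:
    """Count how many independent signals point to fatigue (0 to 4)."""
    score = 0
    if s.tired_expression:
        score += 1
    if abs(s.heart_rate_delta_bpm) >= 10:  # assumed threshold
        score += 1
    if s.continuous_driving_min >= 120:    # 2 hours, as in the example
        score += 1
    if 13 <= s.hour_of_day <= 15:          # assumed afternoon window
        score += 1
    return score


def propose_rest(s: CabinSnapshot, threshold: int = 3) -> Optional[str]:
    """Return a suggestion for the user to confirm, or None to stay silent."""
    if fatigue_score(s) >= threshold:
        return "You seem to need a rest. Shall I navigate to the next service zone?"
    return None
```

The function only returns a suggestion and never triggers navigation or music itself, mirroring the point that Agent tools are called after the user agrees.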
Foreword
Related Definitions