
Xiaomi releases open‑source voice AI, MiDashengLM‑7B

News, 5 August 2025

Xiaomi has just taken a bold leap into the voice AI arena, releasing MiDashengLM‑7B, a powerful new 7‑billion‑parameter audio model that is fully open‑sourced under the Apache 2.0 license. Developed to rival proprietary offerings from giants like Google and OpenAI, it promises industry‑leading speed, accuracy, and commercial flexibility.

Why This Is a Game‑Changer

Imagine voice AI that understands not just speech, but music, ambient noise, and emotional tone without breaking the bank for developers or companies. That’s exactly what MiDashengLM offers:

  • Lightning‑fast response: First-token latency is just 25% that of leading models.
  • Massive throughput: Delivers up to 20× more concurrent processing and supports batch sizes up to 512 on an 80 GB GPU.
  • End-to-end audio intelligence: Trained via caption-based learning, it captures speech, sound, music, and environment cues—not just text.

The Human and Industry Impact

Developers often face trade-offs between performance, flexibility, and licensing. Xiaomi flips the script:

  • Commercial freedom: Apache 2.0 licensing enables usage across products—from smart home devices to in-car systems—without licensing fees or restrictions.
  • Multi-modal understanding: The model thrives in real-world audio, recognizing contexts like a child laughing in a song or distant construction noise—improving automation, accessibility, and emotion-aware AI.
  • Edge-ready efficiency: Optimized for both speed and reduced compute cost, making it ideal for voice-first applications in smart devices and vehicles.

How It Stacks Up Against Rivals

Feature                | MiDashengLM‑7B                                   | Qwen2.5‑Omni‑7B & Kimi‑Audio‑Instruct
Audio Captioning       | Top scores on MusicCaps, AutoACD, ClothoV2       | Significantly lower FENSE scores
VGGSound Accuracy      | ~52.11%                                          | <1%
Speaker & Language ID  | ~92–96% accuracy on VoxCeleb1 & VoxLingua107     | ~50–80% accuracy
Speed Efficiency       | 3.2× throughput speedup; 4× faster first token   | Baseline performance
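The speed figures quoted by Xiaomi are two views of the same measurement: a first-token delay that is only 25% of a rival's is the same thing as a 4× first-token speedup. A minimal sanity check of that arithmetic (the normalized baseline value is illustrative, not a published number):

```python
# Sanity check on the reported speed figures: a first-token delay that is
# "just 25% of leading models" implies a 4x first-token speedup.
baseline_delay = 1.0                      # rival model's delay, normalized to 1
midasheng_delay = 0.25 * baseline_delay   # 25% of the baseline, per the article

speedup = baseline_delay / midasheng_delay
print(speedup)  # 4.0 — matches the "4x faster first-token" table entry
```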

ASR results on some English benchmarks such as LibriSpeech do lag slightly behind specialized models, but this is by design: Xiaomi's broader caption- and context-centric training trades a little pure transcription accuracy for holistic audio intelligence.

Behind the Scenes: Novel Architecture & Massive Dataset

  • Why captions over ASR? Traditional ASR discards non-speech sounds and emotional cues. Xiaomi’s caption-driven training ensures context-rich comprehension of entire audio scenes.
  • ACAVCaps Dataset: Over 38,000 hours of curated captions derived from the open ACAV100M repository—filtered through expert models and LLM reasoning to ensure audio-text consistency.
  • Architecture: Combines Xiaomi’s open-source Dasheng audio encoder with the Qwen2.5‑Omni‑7B Thinker decoder for unified, multimodal audio understanding.