Xiaomi releases open‑source voice AI, MiDashengLM‑7B
News, 5 August 2025
Xiaomi has just taken a bold leap into the voice AI arena, releasing MiDashengLM‑7B, a powerful new 7‑billion‑parameter audio model that is fully open‑sourced under an Apache 2.0 license. Developed to rival proprietary offerings from giants like Google and OpenAI, it promises industry‑leading speed, accuracy, and commercial flexibility.
Why This Is a Game‑Changer
Imagine voice AI that understands not just speech but also music, ambient noise, and emotional tone, without breaking the bank for developers or companies. That’s exactly what MiDashengLM offers:
- Lightning‑fast response: First-token latency is just 25% that of leading models.
- Massive throughput: Delivers up to 20× higher concurrency and supports batch sizes up to 512 on a single 80 GB GPU.
- End-to-end audio intelligence: Trained via caption-based learning, it captures speech, sound, music, and environmental cues, not just transcribed text (see the usage sketch below).
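Because the weights are openly licensed, anyone can pull the model down and experiment. The sketch below shows a typical Hugging Face loading pattern for open audio-LLMs; the repository id, processor arguments, and prompt are illustrative assumptions rather than confirmed details of Xiaomi's release, so consult the official model card for the exact API.

```python
# Minimal sketch of loading an open audio-language model via Hugging Face.
# The repo id, processor signature, and prompt below are assumptions for
# illustration, not confirmed details of Xiaomi's release.
import torch
import soundfile as sf
from transformers import AutoProcessor, AutoModelForCausalLM

MODEL_ID = "mispeech/midashenglm-7b"  # hypothetical repository id

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # half-precision to cut memory use
    device_map="auto",           # place weights on available GPU(s)
    trust_remote_code=True,      # custom audio-LLM architectures require this
)

# Ask for a scene description rather than a plain transcript.
audio, sr = sf.read("clip.wav")
inputs = processor(
    text="Describe everything you hear in this clip.",
    audio=audio,
    sampling_rate=sr,
    return_tensors="pt",
).to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```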
The Human and Industry Impact
Developers often face trade-offs between performance, flexibility, and licensing. Xiaomi flips the script:
- Commercial freedom: Apache 2.0 licensing permits use across products—from smart home devices to in-car systems—without licensing fees or copyleft restrictions.
- Multi-modal understanding: The model thrives in real-world audio, recognizing contexts like a child laughing in a song or distant construction noise—improving automation, accessibility, and emotion-aware AI.
- Edge-ready efficiency: Optimized for both speed and reduced compute cost, making it ideal for voice-first applications in smart devices and vehicles.
How It Stacks Up Against Rivals
| Feature | MiDashengLM‑7B | Qwen2.5‑Omni‑7B & Kimi‑Audio‑Instruct |
|---|---|---|
| Audio Captioning | Top scores on MusicCaps, AutoACD, ClothoV2 | Significantly lower FENSE scores |
| VGGSound Accuracy | ~52.11% | <1% |
| Speaker & Language ID | ~92–96% accuracy on VoxCeleb1 & VoxLingua107 | ~50–80% accuracy |
| Speed Efficiency | 3.2× throughput speedup; 4× faster first-token latency | Baseline (1×) |
While ASR results on some English benchmarks such as LibriSpeech lag slightly behind those of specialized models, Xiaomi’s broader caption- and context-centric training deliberately trades a little pure transcription accuracy for holistic audio intelligence.
Behind the Scenes: Novel Architecture & Massive Dataset
- Why captions over ASR? Traditional ASR transcripts discard non-speech sounds and emotional cues. Xiaomi’s caption-driven training instead targets context-rich comprehension of entire audio scenes.
- ACAVCaps Dataset: Over 38,000 hours of curated captions derived from the open ACAV100M repository, filtered through expert models and LLM reasoning to ensure audio-text consistency.
- Architecture: Combines Xiaomi’s open-source Dasheng audio encoder with the Qwen2.5‑Omni‑7B Thinker decoder for unified, multimodal audio understanding (see the illustrative sketch below).
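Pairing a pretrained audio encoder with an LLM decoder through a learned projection layer is a common recipe for this kind of model. The toy PyTorch sketch below illustrates that general pattern, with made-up dimensions and a Hugging Face-style decoder interface assumed for brevity; it is a conceptual simplification, not Xiaomi’s actual code.

```python
# Conceptual sketch of the encoder -> projector -> decoder pattern described
# above. Dimensions, names, and the HF-style decoder interface are assumed;
# this is not Xiaomi's implementation.
import torch
import torch.nn as nn

IGNORE = -100  # label value that HF-style decoders exclude from the loss

class AudioCaptionLM(nn.Module):
    def __init__(self, audio_encoder: nn.Module, llm: nn.Module,
                 enc_dim: int = 768, llm_dim: int = 3584):
        super().__init__()
        self.audio_encoder = audio_encoder            # stand-in for Dasheng
        self.projector = nn.Linear(enc_dim, llm_dim)  # audio feats -> LLM space
        self.llm = llm                                # stand-in for the Thinker decoder

    def forward(self, audio: torch.Tensor, caption_ids: torch.Tensor):
        audio_feats = self.audio_encoder(audio)       # (B, T_a, enc_dim)
        audio_tokens = self.projector(audio_feats)    # (B, T_a, llm_dim)
        text_embeds = self.llm.get_input_embeddings()(caption_ids)
        inputs = torch.cat([audio_tokens, text_embeds], dim=1)
        # Compute cross-entropy only on caption tokens; audio positions
        # are masked out with the ignore index.
        pad = torch.full(audio_tokens.shape[:2], IGNORE,
                         dtype=caption_ids.dtype, device=caption_ids.device)
        labels = torch.cat([pad, caption_ids], dim=1)
        return self.llm(inputs_embeds=inputs, labels=labels)
```

Training on whole-scene captions with this kind of objective, rather than on transcripts alone, is what pushes the decoder to attend to music, ambient sound, and speaker cues in the same token stream.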