NVIDIA has introduced Nemotron Speech ASR, an open-source streaming transcription model designed for low-latency applications like voice agents and live captioning. Utilizing a cache-aware FastConformer encoder and RNNT decoder, the model processes 16 kHz mono audio with configurable chunk sizes ranging from 80 ms to 1.12 s, allowing developers to balance latency and accuracy without retraining. This innovative approach avoids overlapping window recomputation, enhancing concurrency and efficiency on modern NVIDIA GPUs. With a word error rate (WER) between 7.16% and 7.84% across various benchmarks, Nemotron Speech ASR offers a scalable solution for real-time speech applications. This matters because it enables more efficient and accurate real-time speech processing, crucial for applications like voice assistants and live transcription services.
Read Full Article: NVIDIA’s Nemotron Speech ASR: Low-Latency Transcription