low-latency

  • Accelerating LLM and VLM Inference with TensorRT Edge-LLM


    Accelerating LLM and VLM Inference for Automotive and Robotics with NVIDIA TensorRT Edge-LLM

    NVIDIA TensorRT Edge-LLM is a new open-source C++ framework designed to accelerate large language model (LLM) and vision language model (VLM) inference for real-time applications in automotive and robotics. It addresses the need for low-latency, reliable, offline operation directly on embedded platforms like NVIDIA DRIVE AGX Thor and NVIDIA Jetson Thor. The framework is optimized for minimal resource use and includes advanced features such as EAGLE-3 speculative decoding and NVFP4 quantization support, making it suitable for demanding edge use cases. Companies like Bosch, ThunderSoft, and MediaTek are already integrating TensorRT Edge-LLM into their AI solutions, showcasing its potential for enhancing on-device AI capabilities. This matters because it enables more efficient and capable AI systems in vehicles and robots, paving the way for smarter, real-time interactions without relying on cloud-based processing.
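
    The framework itself is C++ and the announcement does not show its API, so the toy Python sketch below only illustrates the speculative-decoding idea that EAGLE-3 builds on: a cheap draft model proposes several tokens, the target model verifies them in a single pass, and every agreed token lands without a separate target-model step. The draft_next and target_next functions are stand-ins, not TensorRT Edge-LLM calls.

    ```python
    # Toy illustration of speculative decoding (the mechanism EAGLE-3 builds on).
    # draft_next/target_next are stand-ins, NOT TensorRT Edge-LLM APIs.
    import random

    random.seed(0)
    VOCAB = list(range(32))

    def draft_next(ctx):
        # Cheap draft model: proposes the next token quickly.
        return (sum(ctx) * 7 + len(ctx)) % len(VOCAB)

    def target_next(ctx):
        # Expensive target model: agrees with the draft most of the time here.
        if random.random() < 0.8:
            return (sum(ctx) * 7 + len(ctx)) % len(VOCAB)
        return random.choice(VOCAB)

    def speculative_step(ctx, k=4):
        # 1) Draft k tokens autoregressively with the cheap model.
        proposal, tmp = [], list(ctx)
        for _ in range(k):
            tok = draft_next(tmp)
            proposal.append(tok)
            tmp.append(tok)
        # 2) Verify the proposals with the target model, keeping the longest
        #    agreeing prefix; the first disagreement yields the target's token.
        accepted, tmp = [], list(ctx)
        for tok in proposal:
            target_tok = target_next(tmp)
            accepted.append(target_tok)
            tmp.append(target_tok)
            if target_tok != tok:
                break
        return accepted

    ctx = [1, 2, 3]
    for _ in range(4):
        step = speculative_step(ctx)
        ctx.extend(step)
        print(f"accepted {len(step)} token(s): {step}")
    ```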

    Read Full Article: Accelerating LLM and VLM Inference with TensorRT Edge-LLM

  • Sonya TTS: Fast, Expressive Neural Voice Anywhere


    Sonya TTS — A Small Expressive Neural Voice That Runs Anywhere!

    Sonya TTS is a newly released, small, and fast text-to-speech model that offers an expressive single-speaker English voice, built on the VITS framework and trained on an expressive voice dataset. It is designed to run efficiently on various devices, including GPUs, CPUs, laptops, and edge hardware, delivering natural-sounding speech with emotion, rhythm, and prosody. The model generates speech with low latency, suitable for real-time applications, and includes an audiobook mode that handles long-form text with natural pauses. Users can adjust emotion, rhythm, and speed at inference time, making it versatile across use cases. This matters because it democratizes access to high-quality, expressive TTS on a wide range of devices without requiring specialized hardware.
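
    Sonya TTS's own loading API is not shown in the announcement, but because it is VITS-based, the inference pattern likely resembles transformers' VITS support. The sketch below uses the known facebook/mms-tts-eng VITS checkpoint as a stand-in; the speaking_rate and noise_scale knobs approximate the speed and prosody controls described, and Sonya's actual distribution format may differ.

    ```python
    # Hedged sketch of VITS-family TTS inference via Hugging Face transformers.
    # "facebook/mms-tts-eng" is a known VITS checkpoint used as a stand-in here;
    # Sonya TTS's actual distribution format and control knobs may differ.
    import torch
    import scipy.io.wavfile
    from transformers import VitsModel, AutoTokenizer

    model = VitsModel.from_pretrained("facebook/mms-tts-eng")
    tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-eng")

    inputs = tokenizer("Expressive speech with natural rhythm and prosody.",
                       return_tensors="pt")

    # VITS exposes sampling knobs that roughly map onto speed/prosody controls:
    model.speaking_rate = 1.0   # <1.0 slower, >1.0 faster
    model.noise_scale = 0.667   # higher values give more prosodic variation

    with torch.no_grad():
        waveform = model(**inputs).waveform[0]

    scipy.io.wavfile.write("out.wav", rate=model.config.sampling_rate,
                           data=waveform.numpy())
    ```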

    Read Full Article: Sonya TTS: Fast, Expressive Neural Voice Anywhere

  • Liquid AI’s LFM2-2.6B-Transcript: Fast On-Device AI Model


    Liquid AI releases LFM2-2.6B-Transcript, an incredibly fast open-weight meeting transcribing AI model on par with closed-source giants

    Liquid AI has introduced LFM2-2.6B-Transcript, a highly efficient model for summarizing meeting transcripts, which operates entirely on-device using the AMD Ryzen™ AI platform. The model delivers cloud-level summarization quality while significantly reducing latency, energy consumption, and memory usage, making it practical on devices with as little as 3 GB of RAM. It can summarize a 60-minute meeting in just 16 seconds, offering enterprise-grade accuracy without the security and compliance risks of cloud processing. This matters for businesses seeking secure, fast, and cost-effective ways to handle sensitive meeting data.
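
    Assuming the open weights land on Hugging Face under a repo id like LiquidAI/LFM2-2.6B-Transcript (an assumption; check Liquid AI's model card for the exact id and prompt format), a minimal local run would follow the standard transformers generation pattern:

    ```python
    # Hedged sketch: summarizing a meeting transcript with an LFM2-family model.
    # The repo id below is assumed from the announcement, not confirmed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "LiquidAI/LFM2-2.6B-Transcript"  # assumed repo id
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    transcript = (
        "[00:00] Alice: Let's review the Q3 roadmap.\n"
        "[00:45] Bob: The launch slips two weeks; QA found a blocker.\n"
    )
    messages = [{"role": "user",
                 "content": f"Summarize this meeting:\n\n{transcript}"}]

    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=256)
    print(tokenizer.decode(output[0][inputs.shape[-1]:],
                           skip_special_tokens=True))
    ```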

    Read Full Article: Liquid AI’s LFM2-2.6B-Transcript: Fast On-Device AI Model

  • NVIDIA’s Nemotron Speech ASR: Low-Latency Transcription


    NVIDIA AI Released Nemotron Speech ASR: A New Open Source Transcription Model Designed from the Ground Up for Low-Latency Use Cases like Voice Agents

    NVIDIA has introduced Nemotron Speech ASR, an open-source streaming transcription model designed for low-latency applications like voice agents and live captioning. Built on a cache-aware FastConformer encoder and an RNNT decoder, the model processes 16 kHz mono audio with configurable chunk sizes ranging from 80 ms to 1.12 s, letting developers trade latency against accuracy without retraining. Because the cache carries encoder state across chunks, the model avoids overlapping-window recomputation, improving concurrency and efficiency on modern NVIDIA GPUs. With a word error rate (WER) between 7.16% and 7.84% across various benchmarks, Nemotron Speech ASR offers a scalable solution for real-time speech applications such as voice assistants and live transcription services.
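
    NeMo is the usual home for NVIDIA's FastConformer/RNNT checkpoints, so a minimal load-and-transcribe sketch would plausibly look like the following; the checkpoint name is a placeholder, since the article does not give the published id.

    ```python
    # Hedged sketch: loading a cache-aware FastConformer-RNNT streaming model
    # with NVIDIA NeMo. The checkpoint name is a placeholder; NVIDIA's model
    # card gives the published Nemotron Speech ASR identifier.
    import nemo.collections.asr as nemo_asr

    asr_model = nemo_asr.models.ASRModel.from_pretrained(
        model_name="nvidia/nemotron-speech-asr"  # placeholder id
    )

    # Offline sanity check on a 16 kHz mono WAV file (the expected input format).
    print(asr_model.transcribe(["meeting_clip.wav"]))

    # For live use, cache-aware models keep encoder state between chunks, so
    # each 80 ms - 1.12 s chunk is processed exactly once (no overlapping-window
    # recomputation); NeMo's cache-aware streaming examples show the chunked loop.
    ```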

    Read Full Article: NVIDIA’s Nemotron Speech ASR: Low-Latency Transcription

  • Plano-Orchestrator: Fast Open Source LLMs for Multi-Agent Systems


    I built Plano (A3B) - fastest open source LLMs for agent orchestration that beat GPT-5.1

    Plano-Orchestrator is a new family of open-source large language models (LLMs) designed for fast multi-agent orchestration, developed by the Katanemo research team. The models prioritize privacy, speed, and performance, acting as a supervisory agent in complex multi-agent systems: they decide which agents should handle a user request and in what order. Suitable for domains including general chat, coding tasks, and extended multi-turn conversations, Plano-Orchestrator is optimized for low-latency production environments. This aims to improve the real-world performance and efficiency of multi-agent systems, offering a practical tool for developers integrating diverse agent functionalities.
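
    Concretely, a supervisory routing call could look like the sketch below, assuming the model is served behind an OpenAI-compatible endpoint (e.g., via vLLM); the model id, endpoint, and JSON routing schema are illustrative assumptions, not Plano's documented interface.

    ```python
    # Hedged sketch: asking an orchestrator model which agents should handle a
    # request and in what order. Endpoint, model id, and routing schema are
    # assumptions for illustration, not Plano's documented interface.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

    agents = [
        {"name": "coder", "description": "writes and edits code"},
        {"name": "researcher", "description": "answers factual questions"},
        {"name": "chat", "description": "general conversation"},
    ]

    response = client.chat.completions.create(
        model="katanemo/plano-orchestrator",  # assumed model id
        messages=[
            {"role": "system",
             "content": f"Route the request. Available agents: {agents}. "
                        "Reply with a JSON list of agent names in execution order."},
            {"role": "user",
             "content": "Fix the failing test in utils.py and explain the root cause."},
        ],
    )
    print(response.choices[0].message.content)  # e.g. ["coder", "chat"]
    ```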

    Read Full Article: Plano-Orchestrator: Fast Open Source LLMs for Multi-Agent Systems

  • Plano-Orchestrator: Fast Multi-Agent Orchestration


    I built Plano (A3B) - 200 ms latency for multi-agent systems with frontier performance

    Plano-Orchestrator is a newly launched family of large language models (LLMs) designed for fast and efficient multi-agent orchestration, developed by the Katanemo research team. It acts as a supervisory agent, determining which agents should handle a user request and in what order, making it well suited to multi-domain scenarios such as general chat, coding tasks, and extended conversations. The system is optimized for low-latency production deployments, delivering agent tasks safely and efficiently, and it is integrated into Plano, a models-native proxy and dataplane for agents, where it aims to absorb the "glue work" that multi-agent systems usually require.

    Read Full Article: Plano-Orchestrator: Fast Multi-Agent Orchestration

  • Google’s FunctionGemma: AI for Edge Function Calling


    From Gemma 3 270M to FunctionGemma: How Google AI Built a Compact Function Calling Specialist for Edge Workloads

    Google has introduced FunctionGemma, a specialized version of the Gemma 3 270M model designed for function calling and optimized for edge workloads. FunctionGemma retains the Gemma 3 architecture but focuses on translating natural language into executable API actions rather than general chat. It uses a structured conversation format with control tokens to manage tool definitions and function calls, ensuring reliable tool use in production. The model, trained on 6 trillion tokens, supports a 256K vocabulary optimized for JSON and multilingual text, improving token efficiency. Its primary deployment targets are edge devices like phones and laptops, where its compact size and quantization support allow low-latency, low-memory inference. Demonstrations such as Mobile Actions and Tiny Garden show it performing complex tasks on-device without server calls, reaching up to 85% accuracy after fine-tuning. This is a step toward efficient, localized AI that operates independently of cloud infrastructure, which matters for privacy and real-time applications.
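
    The structured tool-call flow could be exercised through transformers' chat-template tool support, as in the hedged sketch below; the repo id and the assumption that the checkpoint ships a tool-aware chat template are unconfirmed.

    ```python
    # Hedged sketch: function calling with a compact Gemma-family model via
    # transformers. The repo id is assumed, and the tools= path relies on the
    # checkpoint shipping a chat template that serializes tool definitions
    # into the control-token format described in the article.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "google/functiongemma-270m"  # assumed repo id
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    set_timer = {
        "type": "function",
        "function": {
            "name": "set_timer",
            "description": "Start a countdown timer.",
            "parameters": {
                "type": "object",
                "properties": {
                    "minutes": {"type": "integer",
                                "description": "Timer length in minutes."}
                },
                "required": ["minutes"],
            },
        },
    }

    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": "Set a timer for 10 minutes."}],
        tools=[set_timer],
        add_generation_prompt=True,
        return_tensors="pt",
    )
    output = model.generate(prompt, max_new_tokens=64)
    # Expect a structured call such as set_timer(minutes=10) between the
    # model's function-call control tokens rather than free-form chat text.
    print(tokenizer.decode(output[0][prompt.shape[-1]:],
                           skip_special_tokens=False))
    ```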

    Read Full Article: Google’s FunctionGemma: AI for Edge Function Calling