CPU inference

  • Benchmarking SLMs on Modest Hardware


    Benchmarking of SLMs (Small Language Models) was conducted on a modest hardware setup: an Intel N97 CPU, 32GB of DDR4 RAM, and a 512GB NVMe drive, running Debian with llama.cpp for CPU inference. A test suite of five questions was used, with ChatGPT grading the results and providing comments. The usability score was calculated by raising the test score to the fifth power, multiplying by the average tokens per second, and applying a 10% penalty if the model used reasoning. The penalty rests on the premise that a non-reasoning model performing as well as a reasoning one is the more efficient choice. This matters because it highlights the efficiency and performance trade-offs in evaluating language models on limited hardware.
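
    A minimal sketch of the scoring formula described above, assuming a normalized test score between 0 and 1; the function and variable names are illustrative, not taken from the article.

        def usability_score(test_score, avg_tokens_per_sec, uses_reasoning):
            """Usability score: test score raised to the fifth power, scaled by
            throughput, with a 10% penalty for models that use reasoning."""
            score = (test_score ** 5) * avg_tokens_per_sec
            if uses_reasoning:
                # Premise: a non-reasoning model that performs equally well is preferred.
                score *= 0.9
            return score

        # Hypothetical example: a model scoring 0.8 at 12 tokens/s without reasoning.
        print(usability_score(0.8, 12.0, uses_reasoning=False))  # 3.93216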

    Read Full Article: Benchmarking SLMs on Modest Hardware

  • Boosting Inference with XNNPack’s Dynamic Quantization


    XNNPack, TensorFlow Lite's CPU backend, now supports dynamic range quantization for Fully Connected and Convolution 2D operators, significantly enhancing inference performance on CPUs. This advancement quadruples performance compared to single-precision baselines, making AI features more accessible on older and lower-tier devices. Dynamic range quantization converts floating-point layer activations to 8-bit integers during inference, calculating quantization parameters dynamically to maximize accuracy. Unlike full quantization, it retains 32-bit floating-point outputs, combining performance gains with higher accuracy. This method is also more accessible, requiring no representative dataset, and is optimized for various architectures, including ARM and x86. Dynamic range quantization can be combined with half-precision inference for further performance improvements on devices with hardware fp16 support. Benchmarks reveal that dynamic range quantization can match or exceed the performance of full integer quantization, offering substantial speed-ups for models like Stable Diffusion. This approach is now integrated into products like Google Meet and Chrome OS audio denoising, and available for open source use, providing a practical solution for efficient on-device inference. This matters because it democratizes AI deployment, enabling advanced features on a wider range of devices without sacrificing performance or accuracy.
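
    A minimal sketch of producing a dynamic-range-quantized model with the TensorFlow Lite converter, the post-training mode described above that needs no representative dataset; the file paths are placeholders, and whether a given operator actually runs through XNNPack depends on the device and TensorFlow Lite build.

        import tensorflow as tf

        # Post-training dynamic range quantization: weights are stored as 8-bit
        # integers, while activations are quantized on the fly at inference time,
        # so no representative dataset is required.
        converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")  # placeholder path
        converter.optimizations = [tf.lite.Optimize.DEFAULT]
        tflite_model = converter.convert()

        with open("model_dynamic_range.tflite", "wb") as f:
            f.write(tflite_model)

        # At inference time, TensorFlow Lite's CPU path can execute the quantized
        # Fully Connected and Conv2D operators through XNNPack where supported.
        interpreter = tf.lite.Interpreter(model_path="model_dynamic_range.tflite")
        interpreter.allocate_tensors()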

    Read Full Article: Boosting Inference with XNNPack’s Dynamic Quantization