Marlin

Benchmarking 4-bit Quantization in vLLM

A benchmark of 4-bit quantization methods in vLLM shows large differences between techniques. Marlin reached the highest throughput at 712 tokens per second (tok/s), well above the FP16 baseline's 461 tok/s, while GPTQ without the Marlin kernel trailed at 276 tok/s. The bitsandbytes method showed the smallest quality drop and required no pre-quantized weights, whereas GGUF had the worst perplexity but scored well on HumanEval. AWQ was unexpectedly slow in vLLM, at only 67 tok/s. These trade-offs between speed, quality, and convenience matter when choosing a quantization method for serving a model.
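To put the throughput numbers in perspective, here is a minimal sketch that expresses each quoted figure as a multiple of the FP16 baseline. The tok/s values are the ones from the summary; the dictionary layout and the `relative_to_baseline` helper are illustrative names, not part of the original benchmark. Methods without a quoted throughput (bitsandbytes, GGUF) are left out.

```python
# Throughput figures (tokens per second) as quoted in the summary.
throughput_tok_s = {
    "FP16 (baseline)": 461,
    "Marlin": 712,
    "GPTQ (no Marlin kernel)": 276,
    "AWQ": 67,
}


def relative_to_baseline(results, baseline_key="FP16 (baseline)"):
    """Return each method's throughput as a multiple of the baseline's.

    Hypothetical helper for illustration only.
    """
    baseline = results[baseline_key]
    return {name: tok_s / baseline for name, tok_s in results.items()}


speedups = relative_to_baseline(throughput_tok_s)

# Print fastest first, with the ratio rounded to two decimals.
for name, ratio in sorted(speedups.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<26} {throughput_tok_s[name]:>4} tok/s  {ratio:.2f}x")
```

By this arithmetic, Marlin runs at roughly 1.5x the FP16 baseline, GPTQ without the Marlin kernel at about 0.6x, and AWQ at about 0.15x.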

Posted on Jan 8, 2026 by AIGeekery in Benchmarking, Deep Dives

Topics: machine learning, model efficiency, quantization