  • Optimizing GLM-4.7 on 2015 CPU-Only Hardware


    Running GLM-4.7 (355B MoE) in Q8 at ~5 Tokens/s on 2015 CPU-Only Hardware – Full Optimization Guide

    Running the massive 355B-parameter GLM-4.7 Mixture of Experts model on a 2015 Lenovo System x3950 X6 with eight Xeon E7-8880 v3 CPUs showcases the potential of older hardware for local large language models. With Q8_0 quantization, the model maintains high-quality output with minimal degradation, reaching roughly 5-6 tokens per second without a GPU. Key optimizations include BIOS tweaks, distributing the model across NUMA nodes, llama.cpp forks tuned for the MoE architecture, and Linux kernel adjustments, though the setup is power-hungry, drawing about 1300 W AC. The approach suits homelab enthusiasts and anyone without a modern GPU, offering a viable path to running large models locally. It matters because it shows that older hardware can still be leveraged for advanced AI workloads, broadening access to powerful models without cutting-edge equipment.
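
    The heart of the NUMA work is spreading the model's weights across all eight sockets instead of letting a single node own them. As a rough sketch, assuming stock llama.cpp's llama-cli (the article uses a MoE-tuned fork), a hypothetical model path, and a thread count sized for eight 18-core CPUs, a launch script might look like this; the --numa distribute flag and the numa_balancing/drop_caches knobs come from llama.cpp's own documentation:

      # numa_launch.py -- hypothetical launcher, not taken from the article
      import subprocess
      from pathlib import Path

      MODEL = Path("/models/glm-4.7-q8_0.gguf")  # assumed path and filename
      THREADS = 144                              # 8 sockets x 18 cores

      # llama.cpp's docs suggest disabling automatic NUMA balancing and
      # dropping the page cache before a --numa run; both require root.
      Path("/proc/sys/kernel/numa_balancing").write_text("0")
      Path("/proc/sys/vm/drop_caches").write_text("3")

      # --numa distribute spreads the weights evenly across NUMA nodes so
      # each socket reads mostly from local memory during inference.
      subprocess.run(
          ["llama-cli", "-m", str(MODEL), "-t", str(THREADS),
           "--numa", "distribute", "-p", "Hello", "-n", "128"],
          check=True,
      )

    A similar effect can also be achieved externally by launching the binary under numactl --interleave=all, which is useful when testing forks that predate the built-in --numa flag.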

    Read Full Article: Optimizing GLM-4.7 on 2015 CPU-Only Hardware