-
Multi-GPU Breakthrough with ik_llama.cpp
Read Full Article: Multi-GPU Breakthrough with ik_llama.cpp
The ik_llama.cpp project has achieved a significant advancement in local LLM inference on multi-GPU setups: a 3x to 4x performance improvement. The gain comes from a new execution mode called split mode graph, which drives multiple GPUs concurrently at full utilization. Previously, multi-GPU configurations either merely pooled VRAM or scaled performance poorly; the new mode makes far more efficient use of the available hardware. This matters because it lets several low-cost GPUs stand in for a single expensive high-end enterprise card, making capable setups more accessible in homelabs, server rooms, and cloud environments.
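The article does not show the exact invocation, but ik_llama.cpp follows llama.cpp's CLI conventions, so enabling the new mode plausibly looks like the sketch below. Treat the flag values as assumptions to verify against your build (the model path is a placeholder, and `-sm graph` is the assumed name of the new split mode):

```shell
# Hypothetical invocation -- flag names assume ik_llama.cpp keeps
# llama.cpp's CLI conventions; verify against `--help` on your build.
./llama-server \
  -m ./model.gguf \
  -sm graph \
  -ngl 99
# -sm graph : new split mode that runs the compute graph across all
#             visible GPUs concurrently, instead of layer/row splitting
# -ngl 99   : offload all layers to GPU
```

Compared with the older `layer` and `row` split modes, the reported benefit of `graph` is that all GPUs do useful work at the same time rather than waiting on each other.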
-
Software FP8 for GPUs: 3x Speedup on Memory Operations
Read Full Article: Software FP8 for GPUs: 3x Speedup on Memory Operations
A workaround has been developed that enables FP8 on GPUs lacking native hardware support, such as the RTX 3050. The method packs lower-precision values into FP32 registers using bitwise operations and Triton kernels, yielding a roughly threefold speedup on memory-bound operations such as GEMV and FlashAttention. It works across a wide range of GPUs, including the RTX 30 and 20 series and older models. Although still early-stage, it is functional and open for community feedback. This matters because it delivers a substantial performance boost on older or lower-end GPUs, extending their usefulness without a hardware upgrade.
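The article's Triton kernels are not reproduced here, but the core packing trick can be sketched in plain Python: four 8-bit values share one 32-bit word, so a memory-bound kernel moves a quarter of the bytes per element. This is a minimal illustration of the bit-packing idea only; the function names are hypothetical and the real implementation also encodes/decodes the FP8 bit format on the fly inside the kernel:

```python
def pack4(vals):
    """Pack four 8-bit values into one 32-bit word (byte 0 in the low lane)."""
    assert len(vals) == 4 and all(0 <= v < 256 for v in vals)
    word = 0
    for i, v in enumerate(vals):
        word |= v << (8 * i)  # shift each byte into its own 8-bit lane
    return word

def unpack4(word):
    """Recover the four 8-bit lanes from a packed 32-bit word."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]

packed = pack4([1, 2, 3, 4])
print(hex(packed))          # 0x4030201
print(unpack4(packed))      # [1, 2, 3, 4]
```

On the GPU the same idea means loading one FP32-sized word and unpacking it in registers, which is why the win shows up on memory-bound operations rather than compute-bound ones.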
