low-bit inference

  • Exploring Ternary LLM Core with BitNet Inspiration


    Exploring a 1.58-bit / ternary LLM core inspired by BitNet (CUDA attention, GTX 1050 tests)

    An experimental project explores low-bit large language model (LLM) inference using ternary weights, inspired by the BitNet b1.58 paper. The project builds a custom LLM core that replaces FP16-heavy matrix multiplication layers with ternary linear layers, trains them with a straight-through estimator, and pairs them with a custom CUDA attention kernel that avoids softmax to improve compute efficiency and numerical stability. Initial tests on a GTX 1050 show successful end-to-end training, a reduced memory footprint, and coherent output on character-level Shakespeare, although the model is not yet competitive with larger FP16/INT8 models and requires careful tuning. This matters because efficient, low-power LLM inference on consumer GPUs could make AI more broadly accessible.
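    The summary does not include the project's code, but a minimal PyTorch sketch illustrates what a ternary linear layer with a straight-through estimator might look like. It follows the absmean quantizer described in the BitNet b1.58 paper; the class name `TernaryLinear` is illustrative, not taken from the project.

    ```python
    import math
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TernaryLinear(nn.Module):
        """Linear layer whose weights are ternarized to {-1, 0, +1} in the
        forward pass. Full-precision shadow weights are kept for the optimizer;
        a straight-through estimator (STE) lets gradients bypass the rounding."""

        def __init__(self, in_features: int, out_features: int):
            super().__init__()
            self.weight = nn.Parameter(torch.empty(out_features, in_features))
            nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            w = self.weight
            # Per-tensor absmean scale, as in BitNet b1.58 (assumption: the
            # project uses a similar scheme; the article does not specify).
            scale = w.abs().mean().clamp(min=1e-8)
            # Round-to-nearest ternary quantization of the scaled weights.
            w_q = (w / scale).round().clamp(-1, 1) * scale
            # STE: use w_q in the forward pass, but make the backward pass
            # treat quantization as the identity so w still receives gradients.
            w_ste = w + (w_q - w).detach()
            return F.linear(x, w_ste)
    ```

    In a transformer, such a layer would replace the FP16 `nn.Linear` projections (attention Q/K/V/output and the MLP), which is where the reduced memory footprint comes from.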
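    The summary also does not say what replaces softmax in the custom CUDA attention kernel. As one plausible reading, a common softmax-free scheme applies ReLU to the scaled scores and normalizes by sequence length, avoiding the exponentials and row-wise reductions of softmax; the sketch below shows that scheme in PyTorch, not the project's actual kernel.

    ```python
    import torch

    def relu_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        """Softmax-free attention: ReLU on scaled scores, normalized by
        sequence length instead of a row-wise softmax. An assumed stand-in,
        not the article's kernel. Shapes: (batch, heads, seq, head_dim)."""
        d = q.size(-1)
        n = k.size(-2)
        scores = q @ k.transpose(-2, -1) / d ** 0.5
        weights = torch.relu(scores) / n  # no exp and no per-row reduction
        return weights @ v
    ```

    Dividing by the sequence length keeps the output magnitude bounded as context grows, consistent with the stability goal the summary mentions.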

    Read Full Article: Exploring Ternary LLM Core with BitNet Inspiration