Attention Mechanism

  • R-GQA: Enhancing Long-Context Model Efficiency


    [Research] I implemented a routed attention mechanism (R-GQA) for faster long-context models. Then wrote a paper on it.

    Routed Grouped-Query Attention (R-GQA) is a mechanism designed to improve the efficiency of long-context models: a learned router selects the most relevant query heads, cutting redundant computation. Unlike standard Grouped-Query Attention (GQA), R-GQA encourages head specialization by enforcing orthogonality among query heads, and it improves training throughput by up to 40%. On quality, however, it trails comparable models such as SwitchHead, particularly at larger scales where aggressive sparsity limits capacity. The work is not yet state of the art, but it offers useful insight into efficiency and specialization and points toward architectures that balance the two. A minimal sketch of the routing idea follows the link below.

    Read Full Article: R-GQA: Enhancing Long-Context Model Efficiency
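
    A minimal PyTorch sketch of the routing idea, assuming the router scores query heads from the mean-pooled sequence and keeps only the top-k per example; the names and defaults (RoutedGQA, top_k=4) are illustrative assumptions rather than the paper's implementation, and the orthogonality regularizer and causal masking are omitted for brevity.

      import torch
      import torch.nn as nn

      class RoutedGQA(nn.Module):
          """Grouped-query attention with a learned router that keeps only the
          top-k query heads per sequence (illustrative sketch, not the paper's code)."""
          def __init__(self, d_model=512, n_q_heads=8, n_kv_heads=2, top_k=4):
              super().__init__()
              assert n_q_heads % n_kv_heads == 0
              self.h, self.kv, self.top_k = n_q_heads, n_kv_heads, top_k
              self.d_head = d_model // n_q_heads
              self.q_proj = nn.Linear(d_model, n_q_heads * self.d_head)
              self.k_proj = nn.Linear(d_model, n_kv_heads * self.d_head)
              self.v_proj = nn.Linear(d_model, n_kv_heads * self.d_head)
              self.router = nn.Linear(d_model, n_q_heads)  # one logit per query head
              self.o_proj = nn.Linear(n_q_heads * self.d_head, d_model)

          def forward(self, x):  # x: (B, T, d_model)
              B, T, _ = x.shape
              q = self.q_proj(x).view(B, T, self.h, self.d_head).transpose(1, 2)   # (B, h, T, d)
              k = self.k_proj(x).view(B, T, self.kv, self.d_head).transpose(1, 2)  # (B, kv, T, d)
              v = self.v_proj(x).view(B, T, self.kv, self.d_head).transpose(1, 2)

              # Route: score every query head from the mean-pooled sequence, keep the top-k.
              gate = self.router(x.mean(dim=1)).softmax(-1)      # (B, h)
              top_val, top_idx = gate.topk(self.top_k, dim=-1)   # (B, top_k)

              # Standard GQA: each KV head is shared by a group of query heads.
              group = self.h // self.kv
              k = k.repeat_interleave(group, dim=1)              # (B, h, T, d)
              v = v.repeat_interleave(group, dim=1)

              out = torch.zeros_like(q)
              scale = self.d_head ** -0.5
              for b in range(B):
                  idx = top_idx[b]                               # selected heads only
                  att = ((q[b, idx] @ k[b, idx].transpose(-1, -2)) * scale).softmax(-1)
                  out[b, idx] = top_val[b, :, None, None] * (att @ v[b, idx])
              return self.o_proj(out.transpose(1, 2).reshape(B, T, -1))

    Skipping the attention computation for the unselected heads is where a throughput gain of this kind would come from.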

  • Open Sourced Loop Attention for Qwen3-0.6B


    [D] Open sourced Loop Attention for Qwen3-0.6B: two-pass global + local attention with a learnable gate (code + weights + training script)

    Loop Attention augments small Qwen-style language models with a two-pass attention mechanism: a global attention pass followed by a local sliding-window pass, with a learnable gate that blends the two so the model can adaptively weight global versus local information. The approach reduced validation loss and perplexity relative to baseline models. The release includes the model weights, the attention code, and the training scripts, inviting replication and further experimentation with a lightweight way to improve small language models. A minimal sketch of the gated two-pass structure follows the link below.

    Read Full Article: Open Sourced Loop Attention for Qwen3-0.6B
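
    A minimal PyTorch sketch of the gated two-pass idea, assuming both passes attend over the same input and are blended per head by a sigmoid gate; the names and defaults (LoopAttention, window=128, a zero-initialized gate) are illustrative assumptions and may differ from the released code and weights.

      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      def sliding_window_mask(T, window, device):
          # True where attention is allowed: each token sees itself and the
          # previous window-1 tokens (causal local attention).
          i = torch.arange(T, device=device)
          rel = i[None, :] - i[:, None]        # key index minus query index
          return (rel <= 0) & (rel > -window)

      class LoopAttention(nn.Module):
          """Two causal passes over the same input, one global and one local,
          blended per head by a learnable gate (illustrative sketch)."""
          def __init__(self, d_model=1024, n_heads=16, window=128):
              super().__init__()
              self.h, self.d, self.window = n_heads, d_model // n_heads, window
              self.qkv = nn.Linear(d_model, 3 * d_model)
              self.gate = nn.Parameter(torch.zeros(n_heads))  # sigmoid(0) = 0.5 blend
              self.out = nn.Linear(d_model, d_model)

          def forward(self, x):  # x: (B, T, d_model)
              B, T, _ = x.shape
              q, k, v = self.qkv(x).view(B, T, 3, self.h, self.d).permute(2, 0, 3, 1, 4)
              causal = torch.tril(torch.ones(T, T, device=x.device)).bool()
              local = sliding_window_mask(T, self.window, x.device)

              # Pass 1: global causal attention.  Pass 2: local sliding-window attention.
              y_global = F.scaled_dot_product_attention(q, k, v, attn_mask=causal)
              y_local = F.scaled_dot_product_attention(q, k, v, attn_mask=local)

              # Learnable per-head gate blends the two passes.
              g = torch.sigmoid(self.gate).view(1, self.h, 1, 1)
              y = g * y_global + (1 - g) * y_local
              return self.out(y.transpose(1, 2).reshape(B, T, -1))

    With the gate starting at 0.5, the block begins as an even blend and training can push each head toward global or local context.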

  • Expanding Attention Mechanism for Faster LLM Training


    Tuneable Attention: How expanding (not compressing) the attention mechanism dramatically accelerated my model's learning speed

    Expanding the attention mechanism, rather than compressing it, was found to significantly accelerate learning. The standard attention computation is modified with a learned projection matrix U whose rank exceeds the head dimension d_k, and the model converges faster even though each step costs more compute. The effect was discovered accidentally through hyperparameter drift and allowed a smaller model to acquire coherent English grammar quickly. The key insight is that attention routing benefits from expanded "scratch space" while value aggregation should stay at full dimensionality, which runs counter to the compression focus of much of the existing literature. A hedged sketch of the expanded score computation follows the link below.

    Read Full Article: Expanding Attention Mechanism for Faster LLM Training
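
    A single-head PyTorch sketch of the expanded score computation, assuming one learned matrix U shared by queries and keys and a 1/sqrt(r) score scale; the names and sizes (ExpandedScoreAttention, expand_r=256) are illustrative assumptions, not the post's actual code. Only the query-key routing path is widened; value aggregation stays at the original width, matching the post's "scratch space" observation.

      import torch
      import torch.nn as nn

      class ExpandedScoreAttention(nn.Module):
          """Single-head sketch: queries and keys are lifted into a wider space
          through a learned matrix U (r > d_k) before the dot product, while
          value aggregation keeps the original width."""
          def __init__(self, d_model=512, d_k=64, expand_r=256):
              super().__init__()
              assert expand_r > d_k, "expansion requires r > d_k"
              self.q = nn.Linear(d_model, d_k)
              self.k = nn.Linear(d_model, d_k)
              self.v = nn.Linear(d_model, d_model)  # values stay at full dimensionality
              self.U = nn.Parameter(torch.randn(d_k, expand_r) / d_k ** 0.5)
              self.scale = expand_r ** -0.5

          def forward(self, x):  # x: (B, T, d_model)
              B, T, _ = x.shape
              q = self.q(x) @ self.U                  # (B, T, r): expanded "scratch space"
              k = self.k(x) @ self.U                  # (B, T, r)
              scores = (q @ k.transpose(-1, -2)) * self.scale
              mask = torch.triu(torch.ones(T, T, device=x.device), 1).bool()
              scores = scores.masked_fill(mask, float("-inf"))  # causal masking
              return scores.softmax(-1) @ self.v(x)   # (B, T, d_model)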