DeepSeek researchers have addressed instability issues in large language model training by applying a matrix normalization algorithm from 1967 to hyper connections. Hyper connections, which enhance the expressivity of models by widening the residual stream, were found to cause instability at scale due to excessive amplification of signals. The new method, Manifold Constrained Hyper Connections (mHC), projects the residual mixing matrices onto the manifold of doubly stochastic matrices using the Sinkhorn-Knopp algorithm, which keeps signal propagation controlled and the optimization numerically stable. The result is a sharp reduction in amplification, improved performance and stability, and only a modest increase in training time, pointing to a new axis along which large language models can be scaled. This matters because it offers a practical way to make large AI models more stable and performant, paving the way for more efficient and reliable AI systems.
DeepSeek’s research tackles a significant challenge in the training of large language models: the instability introduced by hyper connections. Hyper connections, an evolution of residual connections, improve the expressivity of models without a significant increase in computational cost. However, they also introduce instability when scaled up, manifesting as unchecked amplification of signals across layers. This instability is problematic because it can lead to loss spikes and unstable gradient norms, making it difficult to train models effectively at scale. Manifold Constrained Hyper Connections (mHC) resolve this by constraining the mixing behavior of hyper connections to a well-defined manifold, thereby maintaining numerical stability.
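To make the failure mode concrete, here is a minimal hyper-connection-style block in PyTorch. The class name, stream count, and shapes are illustrative assumptions rather than DeepSeek's implementation; the point is simply that an unconstrained mixing matrix is free to amplify the residual norm layer after layer.

```python
# Minimal sketch of a hyper-connection-style block (names and shapes are
# illustrative, not DeepSeek's code). The residual stream is widened into
# n parallel streams that a learned n x n matrix recombines at every layer.
import torch
import torch.nn as nn

class HyperConnectionBlock(nn.Module):
    def __init__(self, d_model: int, n_streams: int = 4):
        super().__init__()
        # Unconstrained mixing matrix: nothing stops its row sums from
        # growing, which is the amplification the article describes.
        self.mix = nn.Parameter(torch.eye(n_streams))
        self.layer = nn.Linear(d_model, d_model)   # stand-in for attention/MLP

    def forward(self, streams: torch.Tensor) -> torch.Tensor:
        # streams: (n_streams, batch, d_model)
        mixed = torch.einsum("ij,jbd->ibd", self.mix, streams)  # mix residual streams
        update = self.layer(mixed[0])                           # layer reads one mixed stream
        return mixed + update.unsqueeze(0)                      # write the update back

x = torch.randn(4, 2, 64)            # (n_streams, batch, d_model)
block = HyperConnectionBlock(64)
print(block(x).shape)                # torch.Size([4, 2, 64])
```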
The significance of mHC lies in its ability to preserve the benefits of hyper connections while mitigating their drawbacks. By projecting the residual mixing matrix onto the manifold of doubly stochastic matrices using the Sinkhorn-Knopp algorithm, mHC ensures that the residual streams remain norm-controlled. This reduces the Amax Gain Magnitude from peaks near 3000 to about 1.6, eliminating the explosive growth that previously hindered training. Because the fix is a principled mathematical constraint rather than ad-hoc tuning, it provides a robust solution to the instability problem and a noteworthy advance for the field.
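The projection step can be sketched with a plain Sinkhorn-Knopp iteration; the function name and iteration count below are assumptions for illustration, not code from the paper. Alternately normalizing rows and columns drives the mixing matrix toward the doubly stochastic manifold, so each residual stream becomes a convex combination of streams and repeated mixing cannot blow up norms.

```python
# Sketch of projecting a mixing matrix (approximately) onto the set of
# doubly stochastic matrices with Sinkhorn-Knopp iterations.
import torch

def sinkhorn_knopp(logits: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
    """Alternately normalize the rows and columns of exp(logits)."""
    m = torch.exp(logits)                       # strictly positive entries
    for _ in range(n_iters):
        m = m / m.sum(dim=1, keepdim=True)      # rows sum to 1
        m = m / m.sum(dim=0, keepdim=True)      # columns sum to 1
    return m

mix = sinkhorn_knopp(torch.randn(4, 4))
print(mix.sum(dim=1))   # ~[1, 1, 1, 1]: each row is a convex combination,
print(mix.sum(dim=0))   # ~[1, 1, 1, 1]: so stacked layers cannot amplify norms
```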
While the introduction of mHC does add some computational overhead, the research team has implemented several optimizations to manage this. By using fused kernels, recompute-based activation checkpointing, and pipeline-aware scheduling, the additional training time is kept to a manageable 6.7 percent. This is a small price to pay for the stability and performance gains achieved. The empirical results demonstrate that mHC not only stabilizes the training process but also enhances performance across various benchmarks. This improvement is consistent across different model sizes and persists throughout the training trajectory, indicating that mHC is a viable approach for future large language model designs.
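As one illustration of the recompute idea only (not DeepSeek's fused kernels or pipeline-aware schedule), PyTorch's built-in torch.utils.checkpoint can wrap a block so that its intermediate activations are recomputed during the backward pass instead of being stored, trading extra compute for memory.

```python
# Recompute-based activation checkpointing around a stand-in sub-network;
# in practice this would wrap the mHC block plus its attention/MLP layers.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
x = torch.randn(8, 64, requires_grad=True)

# Activations inside `block` are not kept after the forward pass; they are
# recomputed in backward, which lowers memory at the cost of extra compute.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```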
The development of mHC introduces a new scaling axis for large language models, emphasizing the importance of designing the topology and manifold constraints of the residual stream. This approach offers a practical way to enhance performance and stability beyond simply scaling parameters or context length. As the field of artificial intelligence continues to evolve, innovations like mHC highlight the potential for thoughtful design choices to drive significant advancements. This matters because it opens up new possibilities for building more powerful and stable models, ultimately contributing to the broader goal of harnessing AI for social good.