Genesis-152M-Instruct: Exploring Hybrid Architectures

Genesis-152M-Instruct — Hybrid GLA + FoX + Test-Time Training at small scale

Genesis-152M-Instruct is an experimental small-scale language model built to study how recent architectural innovations interact under tight data constraints: 152 million parameters trained on approximately 2 billion tokens. It combines hybrid GLA and FoX attention mechanisms, test-time training (TTT) during inference, selective activation via sparse feed-forward networks, and µP-scaled training. Despite its small scale, Genesis posts notable results on benchmarks such as ARC-Easy, BoolQ, and SciQ, suggesting that architectural strategies can partially compensate for limited data. The model is fully open source, and feedback is invited, particularly from readers interested in linear attention, hybrid architectures, or test-time adaptation. The exercise is worthwhile because it shows how architectural choices can lift performance even when data is constrained.

Genesis-152M-Instruct is an experiment in how far combined architectural innovations can go under a tight data budget. The goal is not state-of-the-art performance but understanding how architectural choices shape the behavior of a model with only 152 million parameters trained on roughly 2 billion tokens, in stark contrast to models like SmolLM2, which are trained on trillions of tokens. By prioritizing architecture over data scaling, Genesis-152M-Instruct offers useful evidence on the efficiency and potential of hybrid architectures.

The model integrates several contemporary architectural ideas: hybrid GLA and FoX attention, plus Test-Time Training (TTT) during inference. The GLA component targets long-range efficiency, while the FoX attention layers handle precise retrieval, and combining the two lets a model with a small parameter count cover both regimes. Test-Time Training is the most unusual piece: instead of running a static model at inference, parts of the model adapt on the fly, which could make behavior more robust in real-world settings where the data distribution shifts over time.
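To make the contrast between these mechanisms concrete, the sketch below shows a single gated-linear-attention recurrence step, a forgetting-gate softmax attention in the spirit of FoX, and a one-step test-time-training update. It is a minimal illustration only, assuming PyTorch, single-head tensors, and simplified shapes; the function names (gla_step, fox_attention, ttt_adapt) and the reconstruction loss used for TTT are assumptions for illustration, not the Genesis implementation.

```python
# Illustrative sketches only, assuming PyTorch; names, shapes, and losses are
# assumptions, not taken from the Genesis codebase.
import torch
import torch.nn.functional as F


def gla_step(q_t, k_t, v_t, alpha_t, state):
    """One recurrent step of gated linear attention.

    q_t, k_t: (batch, d_k); v_t: (batch, d_v); alpha_t: (batch, d_k) decay
    gate in (0, 1); state: (batch, d_k, d_v) running outer-product memory.
    """
    state = alpha_t.unsqueeze(-1) * state + k_t.unsqueeze(-1) * v_t.unsqueeze(-2)
    out = torch.einsum("bk,bkv->bv", q_t, state)
    return out, state


def fox_attention(q, k, v, log_f):
    """Causal softmax attention with a forgetting gate (FoX-style).

    q, k, v: (batch, seq, d); log_f: (batch, seq) log forget-gate values.
    Logits are penalized by the cumulative log-forget between the query and
    key positions, so distant keys are progressively downweighted.
    """
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    cum = log_f.cumsum(dim=-1)
    scores = scores + (cum.unsqueeze(-1) - cum.unsqueeze(-2))
    causal = torch.ones_like(scores, dtype=torch.bool).tril()
    scores = scores.masked_fill(~causal, float("-inf"))
    return F.softmax(scores, dim=-1) @ v


def ttt_adapt(W, x_t, lr=1e-2):
    """One test-time-training step on a layer's fast weights W of shape (d, d).

    Takes a single gradient step on a self-supervised reconstruction loss
    ||x_t @ W - x_t||^2 before W is used for the current token; the choice of
    loss and of which weights adapt is an assumption here.
    """
    err = x_t @ W - x_t                                   # (batch, d)
    grad = 2 * x_t.transpose(-1, -2) @ err / x_t.shape[0]  # manual gradient
    return W - lr * grad
```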

Genesis-152M-Instruct also employs selective activation through sparse feed-forward networks (FFNs) and µP-scaled training, aimed at efficiency and stability. The selective activation acts as a form of regularization, maintaining performance while reducing computational overhead. µP rules together with Zero-Centered RMSNorm are intended to keep training stable despite the limited data. These choices show how architectural decisions can partly substitute for a larger training corpus, which matters when data is scarce or expensive to obtain.
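To ground "selective activation" and the normalization choice, here is a minimal sketch of a zero-centered RMSNorm (gain parameterized as 1 + gamma with gamma initialized to zero) and a top-k sparse FFN that zeroes all but the k largest hidden activations per token. Both are plausible readings of the description above rather than the actual Genesis modules, and the µP-specific scaling of initialization and learning rates with width is omitted.

```python
import torch
import torch.nn.functional as F


class ZeroCenteredRMSNorm(torch.nn.Module):
    """RMSNorm with gain parameterized as (1 + gamma), gamma initialized to 0,
    so weight decay pulls the effective gain toward 1 instead of 0."""

    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.gamma = torch.nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * inv_rms * (1.0 + self.gamma)


class SparseFFN(torch.nn.Module):
    """Feed-forward block that keeps only the top-k hidden activations per
    token (one possible form of 'selective activation'; the exact gating rule
    and hyperparameters are assumptions)."""

    def __init__(self, dim, hidden_dim, k):
        super().__init__()
        self.up = torch.nn.Linear(dim, hidden_dim)
        self.down = torch.nn.Linear(hidden_dim, dim)
        self.k = k

    def forward(self, x):
        h = F.gelu(self.up(x))
        idx = h.topk(self.k, dim=-1).indices
        mask = torch.zeros_like(h).scatter_(-1, idx, 1.0)
        return self.down(h * mask)
```

In this reading, the top-k mask plays the regularization role described above: only a fraction of the hidden units contribute to each token's output.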

The model has clear limitations, including the small training corpus and the extra inference overhead introduced by TTT, but it is a useful probe of the interplay between architecture and data. It pushes back on the assumption that better performance always requires larger datasets, suggesting that careful architectural design can deliver substantial gains on its own. As the field evolves, insights from experiments like Genesis-152M-Instruct could pave the way for more efficient and adaptable models, making advanced capabilities accessible even with limited resources.

Read the original article here