test-time training

  • PonderTTT: Adaptive Compute for LLMs


    My first ML paper - PonderTTT: Adaptive compute for LLMs

    PonderTTT introduces an approach to adaptive compute for large language models (LLMs): using Test-Time Training, it decides when to allocate extra computation to difficult inputs. With nothing more than a simple threshold and an exponential moving average (EMA), the method reaches 82-89% of optimal performance without requiring any additional training. The project was developed by a self-taught high school student from Korea, showing what independent machine-learning research can produce. This matters because it offers an efficient way to improve LLM performance while keeping compute costs down, making advanced AI more accessible and sustainable. A minimal sketch of such an EMA-threshold gate follows the link below.

    Read Full Article: PonderTTT: Adaptive Compute for LLMs
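
    The gating idea is easy to illustrate in a few lines. The sketch below is an assumption-laden illustration, not the paper's implementation: it treats per-chunk loss as the difficulty signal, tracks its EMA, and only triggers a test-time training step when the current chunk looks unusually hard relative to that running average; the signal choice, decay, and threshold values are all placeholders.

    ```python
    # Minimal sketch of an EMA-threshold gate for adaptive test-time training.
    # The gating signal (per-chunk loss) and the update rule are assumptions
    # for illustration; they are not PonderTTT's exact formulation.

    class EMAGate:
        def __init__(self, decay: float = 0.99, threshold: float = 1.1):
            self.decay = decay          # EMA smoothing factor
            self.threshold = threshold  # relative "surprise" needed to trigger TTT
            self.ema = None             # running estimate of typical chunk loss

        def should_update(self, chunk_loss: float) -> bool:
            """Return True if this chunk looks hard enough to warrant a TTT step."""
            if self.ema is None:
                self.ema = chunk_loss
                return False
            triggered = chunk_loss > self.threshold * self.ema
            self.ema = self.decay * self.ema + (1.0 - self.decay) * chunk_loss
            return triggered


    # Usage: only spend extra compute (e.g. a gradient step on the chunk) when gated in.
    gate = EMAGate(decay=0.99, threshold=1.1)
    for chunk_loss in [2.1, 2.0, 2.2, 3.5, 2.1]:   # toy per-chunk losses
        if gate.should_update(chunk_loss):
            print(f"loss={chunk_loss:.1f}: run a test-time training step")
        else:
            print(f"loss={chunk_loss:.1f}: skip, use the model as-is")
    ```

    The appeal of a rule like this is that it adds essentially no overhead: the EMA is one scalar, and the decision is a single comparison per chunk.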

  • TOPAS-DSPL: Dual-Stream Transformer for Reasoning


    [P] TOPAS-DSPL: A 15M param Dual-Stream Recursive Transformer achieving 24% on ARC-2

    TOPAS-DSPL is a neuro-symbolic model built on a dual-stream recursive transformer architecture aimed at small-scale reasoning tasks. A "Bicameral" latent space separates algorithmic planning from execution state, which reduces "Compositional Drift" relative to monolithic models. At roughly 15 million parameters, it reaches 24% accuracy on the ARC-AGI-2 evaluation set, a clear improvement over standard Tiny Recursive Models. The architecture addresses the "forgetting" problem in recursive loops by decoupling rule generation from state updates, and the open-sourced training pipeline allows independent verification and further development. This matters because it shows that small reasoning models can be made markedly more capable and accessible for complex problem solving. A minimal dual-stream sketch follows the link below.

    Read Full Article: TOPAS-DSPL: Dual-Stream Transformer for Reasoning
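
    To make the "decoupled rule generation vs. state update" idea concrete, here is a minimal sketch of a dual-stream recursive cell. The module choice (GRU cells), the dimensions, and the way the plan stream conditions the state stream are illustrative assumptions, not the released TOPAS-DSPL architecture.

    ```python
    import torch
    import torch.nn as nn

    # Minimal sketch of a dual-stream recursive update in the spirit of TOPAS-DSPL.
    # Module names, dimensions, and the exact wiring are assumptions, not the
    # released architecture.

    class DualStreamCell(nn.Module):
        def __init__(self, dim: int = 128):
            super().__init__()
            # Plan stream: refines an algorithmic "rule" representation.
            self.plan_update = nn.GRUCell(dim, dim)
            # State stream: applies the current plan to the execution state.
            self.state_update = nn.GRUCell(2 * dim, dim)

        def forward(self, x, plan, state, steps: int = 4):
            for _ in range(steps):
                # Rule generation is decoupled from execution: the plan is refined
                # from the input only, then the state reads the refined plan.
                plan = self.plan_update(x, plan)
                state = self.state_update(torch.cat([x, plan], dim=-1), state)
            return plan, state


    cell = DualStreamCell(dim=128)
    x = torch.randn(2, 128)                       # toy task embedding (batch of 2)
    plan = torch.zeros(2, 128)
    state = torch.zeros(2, 128)
    plan, state = cell(x, plan, state, steps=4)   # recursive refinement loop
    print(plan.shape, state.shape)
    ```

    The point of the two streams is that the recursive loop can keep rewriting the execution state without overwriting the representation of the rule it is executing.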

  • End-to-End Test-Time Training for Long Context


    [R] End-to-End Test-Time Training for Long Context

    Long-context language modeling is framed as a continual learning problem, using a standard Transformer with sliding-window attention. The model keeps learning at test time by predicting the next token of the given context, effectively compressing that context into its weights. Meta-learning during training improves the initialization so that the model learns well at test time. The resulting End-to-End Test-Time Training (TTT-E2E) method scales like a full-attention Transformer while keeping inference latency constant, which yields a significant speed advantage. This matters because it offers a more efficient way to handle long-context tasks, improving both quality and speed. A simplified version of the test-time loop follows the link below.

    Read Full Article: End-to-End Test-Time Training for Long Context
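
    The core loop can be sketched as follows: keep doing next-token prediction on the incoming context and take gradient steps, so earlier context ends up in the weights rather than in a growing KV cache. The toy model, chunk size, and single inner step per chunk are assumptions for illustration; TTT-E2E additionally uses sliding-window attention and meta-learns the initialization, which this sketch omits.

    ```python
    import torch
    import torch.nn.functional as F

    # Simplified test-time training loop: gradient steps on next-token prediction
    # over the context, so the context is compressed into the weights.
    # The toy model and hyperparameters below are placeholders.

    vocab, dim = 1000, 64
    model = torch.nn.Sequential(
        torch.nn.Embedding(vocab, dim),   # stand-in for a sliding-window Transformer
        torch.nn.Linear(dim, vocab),
    )
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

    long_context = torch.randint(0, vocab, (1, 4096))   # toy long input
    window = 512

    for start in range(0, long_context.size(1) - window, window):
        chunk = long_context[:, start : start + window + 1]
        inputs, targets = chunk[:, :-1], chunk[:, 1:]

        # One inner-loop step of next-token prediction on this chunk.
        logits = model(inputs)
        loss = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # After the loop, the weights carry information about the processed context,
    # so generation only needs attention over the most recent window.
    ```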

  • Genesis-152M-Instruct: Exploring Hybrid Architectures


    Genesis-152M-Instruct: Hybrid GLA + FoX + Test-Time Training at small scale

    Genesis-152M-Instruct is an experimental small-scale language model that explores how recent architectural ideas interact under tight data constraints: 152 million parameters trained on roughly 2 billion tokens. It combines hybrid GLA and FoX attention, test-time training (TTT) during inference, selective activation via sparse feedforward networks, and µP-scaled training. Despite its size, Genesis posts notable results on benchmarks such as ARC-Easy, BoolQ, and SciQ, suggesting that architectural choices can partly compensate for limited data. The model is fully open source and the author invites feedback, particularly from people interested in linear attention, hybrid architectures, or test-time adaptation. This matters because it offers insight into how architectural advances can lift model performance even with constrained data. A sketch of the sparse-feedforward idea follows the link below.

    Read Full Article: Genesis-152M-Instruct: Exploring Hybrid Architectures
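
    Of the listed ingredients, the sparse ("selective activation") feedforward is the easiest to sketch: a router scores a set of experts per token and only the top-k fire, so most FFN parameters stay idle on any given forward pass. The expert count, k, and sizes below are illustrative assumptions, not the Genesis-152M-Instruct configuration.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Minimal sketch of a sparse ("selective activation") feedforward block.
    # Expert count, k, and dimensions are placeholders for illustration.

    class SparseFFN(nn.Module):
        def __init__(self, dim=256, hidden=512, n_experts=8, k=2):
            super().__init__()
            self.k = k
            self.router = nn.Linear(dim, n_experts)
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
                for _ in range(n_experts)
            ])

        def forward(self, x):                      # x: (tokens, dim)
            scores = self.router(x)                # (tokens, n_experts)
            weights, idx = scores.topk(self.k, dim=-1)
            weights = F.softmax(weights, dim=-1)   # renormalize over the chosen k
            out = torch.zeros_like(x)
            for slot in range(self.k):
                for e in range(len(self.experts)):
                    mask = idx[:, slot] == e       # tokens sending this slot to expert e
                    if mask.any():
                        out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
            return out


    ffn = SparseFFN()
    tokens = torch.randn(16, 256)   # toy batch of token embeddings
    print(ffn(tokens).shape)        # torch.Size([16, 256])
    ```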