InstaDeep has introduced Nucleotide Transformer v3 (NTv3), a multi-species genomics foundation model for genomic prediction and design that connects local motifs with megabase-scale regulatory context. NTv3 operates at single-nucleotide resolution over 1 Mb contexts and integrates representation learning, functional track prediction, genome annotation, and controllable sequence generation in a single framework. It builds on previous versions by extending sequence-only pretraining to longer contexts and by adding explicit functional supervision and a generative mode, so one model can handle a wide range of genomic tasks across multiple species.
NTv3 employs a U-Net style architecture for very long genomic windows: a convolutional downsampling tower, a transformer stack for long-range dependencies, and a deconvolution tower that restores base-level resolution. It tokenizes input sequences at the character level, with a vocabulary of 11 tokens. The model is pretrained on 9 trillion base pairs from the OpenGenome2 resource and post-trained with a joint objective that combines self-supervision with supervised learning on functional tracks and annotation labels from 24 animal and plant species. With this training, NTv3 achieves state-of-the-art accuracy in functional track prediction and genome annotation, outperforming existing genomic foundation models.
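To make character-level tokenization concrete, here is a minimal sketch of a tokenizer that maps each nucleotide to exactly one token. The article gives only the vocabulary size (11), not its contents, so the split used here (six special tokens plus A, C, G, T, N) and the names `encode` and `decode` are assumptions for illustration, not NTv3's actual vocabulary or API.

```python
from typing import Dict, List

# Hypothetical 11-token vocabulary. Only the size (11) comes from the article;
# the specific tokens below are an assumption made for illustration.
SPECIAL_TOKENS = ["<pad>", "<unk>", "<mask>", "<cls>", "<bos>", "<eos>"]
NUCLEOTIDES = ["A", "C", "G", "T", "N"]

VOCAB: Dict[str, int] = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + NUCLEOTIDES)}
ID_TO_TOKEN: Dict[int, str] = {i: tok for tok, i in VOCAB.items()}


def encode(sequence: str) -> List[int]:
    """Map each character to one token id; unknown characters become <unk>."""
    unk = VOCAB["<unk>"]
    return [VOCAB.get(base, unk) for base in sequence.upper()]


def decode(ids: List[int]) -> str:
    """Map token ids back to their string form and join them."""
    return "".join(ID_TO_TOKEN[i] for i in ids)


ids = encode("acgtnACGT")
print(len(ids), decode(ids))
```

Because one character becomes one token, the token sequence has the same length as the DNA sequence, which is what lets predictions line up with individual bases.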
Beyond prediction, NTv3 can be fine-tuned as a controllable generative model using masked diffusion language modeling, enabling the design of enhancer sequences with specified activity levels and promoter selectivity. These designs have been validated experimentally, demonstrating improved promoter specificity and the intended ordering of activity. By unifying these tasks and supporting long-range, cross-species genome-to-function inference, NTv3 gives researchers and practitioners a single tool for both prediction and design. This matters because a model that both interprets and designs regulatory sequence could support advances in fields such as medicine and biotechnology.
The introduction of Nucleotide Transformer v3 (NTv3) is a notable step for genomic prediction and design. The model is built to connect local motifs with megabase-scale regulatory contexts across multiple species. By unifying representation learning, functional track and genome annotation prediction, and controllable sequence generation, it offers a single, comprehensive approach to genomic analysis. This matters because it improves our ability to predict molecular phenotypes and to understand the regulatory grammar shared across organisms, both of which underpin genetic research and biotechnology.
NTv3’s architecture is notable for processing 1 Mb genomic windows at single-nucleotide resolution. Using a U-Net style design, it compresses the input through a convolutional downsampling tower, models long-range dependencies with a transformer stack, and restores base-level resolution via a deconvolution tower. This lets NTv3 handle extremely long genomic contexts while keeping the per-base outputs that accurate functional prediction and annotation require. Character-level tokenization, with one token per nucleotide, is what keeps inputs and outputs aligned base by base.
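A short back-of-the-envelope calculation shows why this shape helps. The sketch below tracks the sequence length through a downsampling tower and compares the number of pairwise attention interactions at full resolution with the number at the bottleneck. The article does not state the number of stages or their stride, so `NUM_STAGES = 7`, `stride = 2`, and reading "1 Mb" as 2**20 bases are assumptions chosen only to make the arithmetic concrete.

```python
from typing import List


def unet_lengths(context_bp: int, num_stages: int, stride: int = 2) -> List[int]:
    """Sequence length after each downsampling stage (index 0 is the input)."""
    lengths = [context_bp]
    for _ in range(num_stages):
        if lengths[-1] % stride != 0:
            raise ValueError("length must be divisible by the stride at every stage")
        lengths.append(lengths[-1] // stride)
    return lengths


CONTEXT_BP = 2 ** 20  # assumption: "1 Mb" read as 1,048,576 bases
NUM_STAGES = 7        # assumption: the article does not give the number of stages

down = unet_lengths(CONTEXT_BP, NUM_STAGES)
bottleneck = down[-1]       # length seen by the transformer stack
up = list(reversed(down))   # the deconvolution tower mirrors the downsampling path

# Self-attention compares every position with every other position,
# so its cost grows with the square of the sequence length.
pairs_full = CONTEXT_BP ** 2
pairs_bottleneck = bottleneck ** 2
reduction = pairs_full // pairs_bottleneck

print("lengths down:", down)
print("bottleneck length:", bottleneck)
print("attention pair reduction factor:", reduction)
```

Under these assumed values the transformer operates on 8,192 positions instead of 1,048,576, and the upsampling path returns to the original length so that each base still receives its own prediction.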
Training is another highlight. NTv3 is pretrained on 9 trillion base pairs from the OpenGenome2 resource, then post-trained with a joint objective that combines self-supervised learning with supervised learning on functional tracks and annotation labels from 24 species. The resulting model achieves state-of-the-art accuracy in functional track prediction and genome annotation across species. That accuracy matters to researchers in areas such as disease research, agriculture, and evolutionary biology, where discoveries depend on reliable genomic predictions.
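One common way to build a joint objective of this kind is a weighted sum of a self-supervised term and a supervised term. The sketch below is illustrative only: the article does not specify the loss functions or their weighting, so the masked cross-entropy term, the mean-squared-error stand-in for track supervision, and the `supervised_weight` parameter are all assumptions.

```python
import math
from typing import List, Sequence


def masked_cross_entropy(probs: Sequence[Sequence[float]],
                         targets: Sequence[int],
                         mask: Sequence[bool]) -> float:
    """Mean negative log-likelihood of the true token, over masked positions only."""
    losses = [-math.log(p[t]) for p, t, m in zip(probs, targets, mask) if m]
    return sum(losses) / len(losses) if losses else 0.0


def mean_squared_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Stand-in supervised loss for a functional track (an assumption, not NTv3's loss)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(probs, targets, mask, track_pred, track_true,
               supervised_weight: float = 1.0) -> float:
    """Self-supervised term plus a weighted supervised term."""
    return (masked_cross_entropy(probs, targets, mask)
            + supervised_weight * mean_squared_error(track_pred, track_true))


# Toy inputs: two positions over a 4-letter alphabet, only the first is masked.
probs = [[0.7, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25]]
targets = [0, 2]
mask = [True, False]
track_pred: List[float] = [1.0, 2.0]
track_true: List[float] = [1.0, 4.0]

loss = joint_loss(probs, targets, mask, track_pred, track_true, supervised_weight=0.5)
print(loss)
```

The weight controls how strongly the labeled tracks pull on the shared representation relative to the sequence-only signal.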
Beyond prediction, NTv3 can be fine-tuned into a controllable generative model. Using masked diffusion language modeling, it can design enhancer sequences with specified activity levels and promoter selectivity, and these designs have been validated experimentally. Generating sequences with desired properties has clear implications for synthetic biology and gene editing: it opens possibilities for designing targeted genetic interventions, optimizing crop traits, and developing novel therapeutics. Overall, NTv3 illustrates the growing intersection of artificial intelligence and genomics, with potential impact on both scientific research and practical applications.
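The general idea behind masked diffusion sampling can be sketched as an iterative unmasking loop: start from a fully masked sequence and fill in a share of the remaining positions at each step. This is a generic illustration, not NTv3's sampling procedure. The `toy_denoiser` below is a hypothetical stand-in that returns uniform probabilities; a trained model would instead predict bases from the partially unmasked sequence and from the requested activity level, and the unmasking schedule shown is also an assumption.

```python
import math
import random
from typing import Callable, List

MASK = "_"
BASES = ["A", "C", "G", "T"]


def toy_denoiser(seq: List[str], position: int) -> List[float]:
    """Hypothetical stand-in for a trained network: uniform probabilities over bases."""
    return [0.25, 0.25, 0.25, 0.25]


def sample_masked_diffusion(length: int,
                            num_steps: int,
                            denoiser: Callable[[List[str], int], List[float]] = toy_denoiser,
                            seed: int = 0) -> str:
    """Start fully masked and unmask a share of the remaining positions at each step."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    rng = random.Random(seed)
    seq = [MASK] * length
    for step in range(num_steps):
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        if not masked:
            break
        # Spread the remaining masked positions evenly over the remaining steps,
        # so the final step fills in whatever is left.
        steps_left = num_steps - step
        k = math.ceil(len(masked) / steps_left)
        for i in rng.sample(masked, k):
            probs = denoiser(seq, i)
            seq[i] = rng.choices(BASES, weights=probs, k=1)[0]
    return "".join(seq)


design = sample_masked_diffusion(length=50, num_steps=8)
print(design)
```

Swapping `toy_denoiser` for a conditioned model is where control would enter: the same loop runs, but the per-position probabilities depend on the target property.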

