Genomes of eukaryotic organisms such as yeast were used to train the Evo-2 model. Credit: Thomas Deerinck, NCMIR/Science Photo Library
Scientists today released what they say is the biggest-ever artificial-intelligence (AI) model for biology.
The model — which was trained on 128,000 genomes spanning the tree of life, from humans to single-celled bacteria and archaea — can write whole chromosomes and small genomes from scratch. It can also make sense of existing DNA, including hard-to-interpret ‘non-coding’ gene variants that are linked to disease.
Evo-2, co-developed by researchers at the Arc Institute and Stanford University, both in Palo Alto, California, and by the chip maker NVIDIA, is available to scientists through web interfaces. They can also download its freely available software code, training data and the model parameters needed to replicate it.
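For those who want to experiment, the released code can be run locally. Below is a minimal sketch in Python of prompting the model to continue a DNA sequence; the package name `evo2`, the `Evo2` loader class, the `evo2_7b` checkpoint label and the `generate` call are assumptions modelled on the team's open-source release, so the repository's documentation should be checked for the exact interface.

```python
# Minimal sketch: prompting Evo-2 to write new DNA.
# Assumptions (not confirmed here): the open-source release installs as an
# `evo2` package exposing an `Evo2` loader and a `generate` method, and
# ships a checkpoint named 'evo2_7b'. Check the repository docs.
from evo2 import Evo2

model = Evo2('evo2_7b')  # loads pretrained weights

# Give the model a short DNA prompt and sample 500 further bases
# (A, C, G, T) autoregressively, as with a text language model.
output = model.generate(
    prompt_seqs=['ATGGCGACCCTGGAAAAG'],
    n_tokens=500,
    temperature=1.0,
)

print(output)  # inspect the returned object for the generated sequence(s)
```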
The developers see Evo-2 as a platform that others can adapt to their own uses. “We’re really looking forward to how scientists and engineers build this ‘app store’ for biology,” Patrick Hsu, a bioengineer at the Arc Institute and the University of California, Berkeley, said at a press briefing announcing Evo-2’s launch.
Other scientists are impressed with what they’ve read about the model — which is described in a paper posted to the Arc Institute website and submitted to the bioRxiv preprint server. But they say they will need to kick the tyres before coming to firm conclusions.
“We’ll have to see how it holds up in independent benchmarks after the preprint is out,” says Anshul Kundaje, a computational genomicist at Stanford University in Palo Alto. So far, he is impressed by the engineering that underpins the model.
Trillions of letters
In the past few years, researchers have developed increasingly powerful ‘protein language models’, such as ESM-3, built by former Meta employees. After training on millions of protein sequences, these models have been used to help predict protein structures and to design entirely new proteins, including gene editors and fluorescent molecules.
Unlike these models, Evo-2 was trained on genome data that contains both ‘coding sequences’ — which carry instructions for making proteins — and non-coding DNA, which includes sequences that control when, where and how genes are active. The first version of Evo, released last year, was trained on the genomes of 80,000 bacteria and archaea, simple organisms called prokaryotes, as well as on their viruses and other sequences.
The latest model is based on 128,000 genomes, including those of humans and other animals, plants and other eukaryotic organisms, encompassing a total of 9.3 trillion DNA letters. Judged by the computing power needed to digest these data, among other measures, Evo-2 is the biggest biological AI model yet released, says Hsu.

Compared with prokaryotic genomes, eukaryotic genomes tend to be longer and more complex: genes are made of interspersed coding and non-coding segments, and non-coding ‘regulatory DNA’ can lie far from the genes it controls. To handle this complexity, Evo-2 was built to learn patterns in DNA sequences separated by as many as 1 million base pairs.
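To put that window size in perspective, here is a short sketch, using Biopython, of extracting a 1-million-base window around a gene so that distant regulatory elements fall inside the model's context; the FASTA file name and the genomic coordinate are illustrative placeholders.

```python
# Sketch: pull a 1-million-base-pair window centred on a gene of interest,
# the scale of context Evo-2 is reported to handle. The FASTA file name
# and the genomic coordinate below are illustrative placeholders.
from Bio import SeqIO

WINDOW = 1_000_000               # Evo-2's reported context length, in bases
gene_start = 43_044_295          # placeholder coordinate on one chromosome

record = SeqIO.read('chr17.fasta', 'fasta')   # one chromosome per file
start = max(0, gene_start - WINDOW // 2)
window_seq = str(record.seq[start:start + WINDOW]).upper()

print(len(window_seq))  # up to 1,000,000 bases: gene plus distant regulators
```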
To demonstrate its ability to make sense of complex genomes, Hsu and his colleagues used Evo-2 to predict the effects of previously studied mutations in BRCA1, a gene implicated in breast cancer. It did nearly as well as the best bio-AI models at determining whether changes to coding regions would cause disease, said Hsu. “It’s state of the art for non-coding mutations.” In the future, the model could help to identify these hard-to-interpret changes in patient genomes.
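A common recipe for this kind of variant-effect prediction with genomic language models is to compare how likely the model finds the normal (‘reference’) sequence versus the mutated one: a variant that sharply lowers the model’s score is predicted to be damaging. The sketch below illustrates that idea; the `score_sequences` method, returning one log-likelihood per sequence, is an assumed interface, and the sequence window is randomly generated as a stand-in for real BRCA1 context.

```python
# Sketch: variant-effect scoring by delta log-likelihood, the standard
# recipe for genomic language models. `Evo2` and `score_sequences` are
# assumed names; the window below is a random stand-in for real context.
import random

from evo2 import Evo2

random.seed(0)
ref_window = ''.join(random.choice('ACGT') for _ in range(1024))

pos, alt = 512, 'A'  # variant position in the window and alternate base
alt_window = ref_window[:pos] + alt + ref_window[pos + 1:]

model = Evo2('evo2_7b')
ref_ll, alt_ll = model.score_sequences([ref_window, alt_window])

# A strongly negative delta means the mutated sequence violates patterns
# the model has learned, hinting that the variant is deleterious.
print(f'delta log-likelihood: {alt_ll - ref_ll:.3f}')
```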
The researchers also tested the model’s ability to decipher other features of complex genomes — including that of the woolly mammoth. “Evo-2 represents a significant step in learning DNA regulatory grammar,” says Christina Theodoris, a computational biologist at the Gladstone Institutes in San Francisco, California.