AI researchers at Meta say they have developed the largest protein folding model of its kind to date, and that it is capable of predicting the structure of more than 600 million proteins.
On Tuesday, the team released the transformer-based ESM-2 model with 15 billion parameters and a database of its protein structure predictions, called the ESM Metagenomic Atlas. This database contains protein shapes that scientists have not yet observed.
Proteins are complex biological molecules built from up to 20 types of amino acid, and they perform a wide range of biological functions in organisms. Crucially, they fold into complex 3D structures whose shape is vital to how they function. Knowing a protein's shape helps scientists understand how it works, and from there figure out ways to mimic, change, or counter that behavior.
Unfortunately, you can't simply read off a protein's eventual 3D structure from its amino acid sequence. You can run simulations or experiments to work it out, but that is time-consuming. Nowadays, you can feed the chemical composition of a protein to appropriately trained machine-learning software, and the model will quickly and relatively accurately predict the structure.
DeepMind demonstrated this with its AlphaFold model, which topped the biennial CASP international protein-structure-prediction competition in 2020. Given an input string of amino acids, AlphaFold and similar machine-learning software can generate the corresponding three-dimensional structure.
London-based DeepMind has since used its system to predict the structures of more than 200 million proteins known to science. Meta's latest ESM system goes further, predicting hundreds of millions more structures after being trained on millions of protein sequences.
A preprint paper by the Meta team – Lin et al – describes the ESM-2 design. Interestingly, according to the researchers, the system is actually a large language model built to "learn evolutionary patterns and generate accurate end-to-end structure predictions directly from the protein sequence." AlphaFold, by contrast, is not a language model; it takes a different approach.
As the boffins note in their paper, these large language models can be used for much more than manipulating human languages: "Modern language models containing tens to hundreds of billions of parameters develop capabilities such as few-shot translation, commonsense reasoning, and mathematical problem solving, all without explicit supervision.
“These observations raise the possibility that language models trained on protein sequences could exhibit a parallel form of emergence.”
The result is ESM-2, which, despite being a language model, has learned to predict the physical shape of a protein from a text string representing its amino acids.
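To see how a protein can be treated as "text" at all, here is a minimal sketch: each of the 20 standard amino acids becomes a token, and a masked-language-model objective – the kind ESM-2 is trained with – asks the model to fill in a hidden residue from its context. The function names and the example sequence below are purely illustrative, not Meta's actual code.

```python
# Each one-letter amino acid code becomes a vocabulary token, just as
# words do in a text language model. Names here are illustrative only.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
token_to_id = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
MASK_ID = len(AMINO_ACIDS)  # extra id reserved for the <mask> token

def tokenize(sequence: str) -> list[int]:
    """Map a one-letter amino acid string to integer token ids."""
    return [token_to_id[aa] for aa in sequence]

def mask_position(tokens: list[int], pos: int) -> list[int]:
    """Hide one residue, as in masked-language-model training:
    the model must predict the original token at `pos` from context."""
    masked = list(tokens)
    masked[pos] = MASK_ID
    return masked

seq = "MKTAYIAKQR"          # short made-up example sequence
tokens = tokenize(seq)
masked = mask_position(tokens, 3)  # hide the fourth residue
```

A model trained on millions of such masked sequences ends up encoding statistical regularities of evolution – which is the "parallel form of emergence" the researchers describe.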
ESM-2 is the largest model of its kind and apparently predicts structures faster than comparable systems: up to 60 times faster than previous state-of-the-art systems such as AlphaFold or Rosetta, which can take over ten minutes to generate output, according to Meta.
Using the model, Meta built the ESM Metagenomic Atlas, predicting more than 600 million structures from the MGnify90 protein database in just two weeks on 2,000 GPUs. On a single Nvidia V100 GPU, it takes only 14.2 seconds to predict the structure of a protein composed of 384 amino acids. According to the paper, the system approaches, though does not quite match, AlphaFold's accuracy; its speed is the key thing that allows it to predict so many more proteins.
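The cluster figures can be sanity-checked with back-of-envelope arithmetic. The input numbers below come from the article; the derived per-structure figure is our own calculation, not a claim by Meta. It comes out well under the 14.2-second single-V100 number, which is plausible given that many metagenomic sequences are shorter than 384 residues and cluster inference can be batched.

```python
# Back-of-envelope check of the ESM Metagenomic Atlas throughput.
structures = 600_000_000   # predictions in the Atlas (from the article)
gpus = 2_000               # cluster size (from the article)
days = 14                  # "just two weeks"

gpu_seconds = gpus * days * 24 * 3600            # total GPU time spent
avg = gpu_seconds / structures                   # GPU-seconds per structure
print(f"~{avg:.1f} GPU-seconds per structure on average")  # ~4.0
```

So each structure took roughly four GPU-seconds of cluster time on average – consistent with the claimed two-week turnaround.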
"With today's state-of-the-art computational tools, structure prediction for hundreds of millions of protein sequences could take years, even using the resources of a large research institution. To make predictions at the scale of metagenomics, a breakthrough in prediction speed is critical," said the Facebook owner.
Meta hopes that ESM-2 and the ESM Metagenomic Atlas will advance science by helping researchers study evolutionary history or tackle disease and climate change. "To extend this work even further, we are studying how language models can be used to design new proteins and contribute to solving problems in health, disease, and the environment," the biz concluded. ®