Industrial yeasts are a powerhouse of protein production, used to manufacture vaccines, biopharmaceuticals, and other useful compounds. In a new study, MIT chemical engineers have harnessed artificial intelligence to optimize the development of new protein manufacturing processes, which could reduce the overall costs of developing and manufacturing these drugs.
Using a large language model (LLM), the MIT team analyzed the genetic code of the industrial yeast Komagataella phaffii — specifically, the codons that it uses. There are multiple possible codons, or three-letter DNA sequences, that can be used to encode a particular amino acid, and the patterns of codon usage are different for every organism.
The new MIT model learned those patterns for K. phaffii and then used them to predict which codons would work best for manufacturing a given protein. This allowed the researchers to boost the efficiency of the yeast’s production of six different proteins, including human growth hormone and a monoclonal antibody used to treat cancer.
“Having predictive tools that consistently work well is really important to help shorten the time from having an idea to getting it into production. Taking away uncertainty ultimately saves time and money,” says J. Christopher Love, the Raymond A. and Helen E. St. Laurent Professor of Chemical Engineering at MIT, a member of the Koch Institute for Integrative Cancer Research, and faculty co-director of the MIT Initiative for New Manufacturing (MIT INM).
Love is the senior author of the new study, which appears this week in the Proceedings of the National Academy of Sciences. Former MIT postdoc Harini Narayanan is the paper’s lead author.
Codon optimization
Yeasts such as K. phaffii and Saccharomyces cerevisiae (baker’s yeast) are the workhorses of the biopharmaceutical industry, producing billions of dollars’ worth of protein drugs and vaccines every year.
To engineer yeast for industrial protein production, researchers take a gene from another organism, such as the gene for insulin, and modify it so that the microbe produces the encoded protein in large quantities. This requires coming up with an optimal DNA sequence for the yeast cells, integrating it into the yeast’s genome, devising favorable growth conditions, and finally purifying the end product.
For new biologic drugs — large, complex drugs produced by living organisms — this development process might account for 15 to 20 percent of the overall cost of commercializing the drug.
“Today, those steps are all done by very laborious experimental tasks,” Love says. “We have been looking at the question of where could we take some of the concepts that are emerging in machine learning and apply them to make different aspects of the process more reliable and simpler to predict.”
In this study, the researchers wanted to optimize the sequence of DNA codons that make up the gene for a protein of interest. There are 20 naturally occurring amino acids but 64 possible codons, so most amino acids can be encoded by more than one codon. Each codon corresponds to a unique transfer RNA (tRNA) molecule, which carries the correct amino acid to the ribosome, where amino acids are strung together into proteins.
Different organisms use each of these codons at different rates, and designers of engineered proteins often optimize the production of their proteins by choosing the codons that occur the most frequently in the host organism. However, this doesn’t necessarily produce the best results. If the same codon is always used to encode arginine, for example, the cell may run low on the tRNA molecules that correspond to that codon.
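To make that concrete, here is a minimal Python sketch of the conventional frequency-based strategy described above. The usage table and the naive_optimize helper are hypothetical illustrations, not values from the study or real K. phaffii data:

```python
# Hypothetical codon usage frequencies for two amino acids: the fraction
# of times each codon encodes that amino acid in the host genome.
CODON_USAGE = {
    "R": {"AGA": 0.48, "AGG": 0.16, "CGT": 0.14,
          "CGA": 0.10, "CGC": 0.06, "CGG": 0.06},  # arginine
    "S": {"TCT": 0.29, "TCC": 0.20, "TCA": 0.19,
          "AGT": 0.15, "TCG": 0.09, "AGC": 0.08},  # serine
}

def naive_optimize(protein: str) -> str:
    """Encode each amino acid with its single most frequent codon."""
    return "".join(max(CODON_USAGE[aa], key=CODON_USAGE[aa].get)
                   for aa in protein)

print(naive_optimize("RSR"))  # AGATCTAGA: every arginine gets the same
                              # codon, which can deplete the matching tRNA pool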
To take a more nuanced approach, the MIT team deployed a type of large language model known as an encoder-decoder. Instead of analyzing text, the researchers used it to analyze DNA sequences and learn the relationships between codons that are used in specific genes.
Their training data, which came from a publicly available dataset from the National Center for Biotechnology Information, consisted of the amino acid sequences and corresponding DNA sequences for all of the approximately 5,000 proteins naturally produced by K. phaffii.
“The model learns the syntax or the language of how these codons are used,” Love says. “It takes into account how codons are placed next to each other, and also the long-distance relationships between them.”
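As a rough illustration of this kind of setup, the toy PyTorch sketch below treats the amino acid sequence as the encoder’s input and the codon sequence as the decoder’s output. The architecture, dimensions, and tokenization are assumptions made for illustration; this is not the published model:

```python
import torch
import torch.nn as nn

# Toy vocabularies: 20 amino acids as encoder tokens, 64 codons as decoder tokens.
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
CODONS = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]
AA_ID = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
CODON_ID = {c: i for i, c in enumerate(CODONS)}
BOS = len(CODONS)  # start-of-sequence token for the decoder

class CodonTransformer(nn.Module):
    def __init__(self, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        self.src_embed = nn.Embedding(len(AMINO_ACIDS), d_model)
        self.tgt_embed = nn.Embedding(len(CODONS) + 1, d_model)  # +1 for BOS
        # (Positional encodings omitted for brevity; a real model would add them.)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, len(CODONS))

    def forward(self, src_ids, dec_ids):
        # Causal mask: each codon position sees only the codons before it,
        # plus the whole amino acid sequence via the encoder.
        mask = nn.Transformer.generate_square_subsequent_mask(dec_ids.size(1))
        h = self.transformer(self.src_embed(src_ids),
                             self.tgt_embed(dec_ids), tgt_mask=mask)
        return self.out(h)  # per-position scores over all 64 codons

# One toy training pair: a native gene gives aligned (amino acid, codon) data.
protein, native_codons = "MRS", ["ATG", "AGA", "TCT"]
src = torch.tensor([[AA_ID[aa] for aa in protein]])
labels = torch.tensor([[CODON_ID[c] for c in native_codons]])
dec_in = torch.cat([torch.tensor([[BOS]]), labels[:, :-1]], dim=1)  # shift right

model = CodonTransformer()
logits = model(src, dec_in)  # shape: (1, 3, 64)
loss = nn.functional.cross_entropy(logits.reshape(-1, len(CODONS)),
                                   labels.reshape(-1))
```

In a setup like this, training on a host’s native genes teaches the decoder which codon tends to follow which, conditioned on both nearby codon choices and the full amino acid sequence.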
Once the model was trained, the researchers asked it to optimize the codon sequences of six different proteins, including human growth hormone, human serum albumin, and trastuzumab, a monoclonal antibody used to treat cancer.
They also generated optimized sequences of these proteins using four commercially available codon optimization tools. The researchers inserted each of these sequences into K. phaffii cells and measured how much of the target protein each sequence generated. For five of the six proteins, the sequences from the new MIT model worked the best, and for the sixth, it was the second-best.
“We made sure to cover a variety of different philosophies of doing codon optimization and benchmarked them against our approach,” Narayanan says. “We’ve experimentally compared these approaches and showed that our approach outperforms the others.”
Learning the language of proteins
K. phaffii, formerly known as Pichia pastoris, is used to produce dozens of commercial products, including insulin, hepatitis B vaccines, and a monoclonal antibody used to treat chronic migraines. It is also used in the production of nutrients added to foods, such as hemoglobin.
Researchers in Love’s lab have started using the new model to optimize proteins of interest for K. phaffii, and they have made the code available for other researchers who wish to use it for K. phaffii or other organisms.
The researchers also tested this approach on datasets from different organisms, including humans and cows. Each of the resulting models generated different predictions, suggesting that species-specific models are needed to optimize codons of target proteins.
By looking into the inner workings of the model, the researchers found that it appeared to learn some of the biological principles of how the genome works, including things that the researchers did not teach it. For example, it learned not to include negative repeat elements — DNA sequences that can inhibit the expression of nearby genes. The model also learned to categorize amino acids based on traits such as hydrophobicity and hydrophilicity.
“Not only was it learning this language, but it was also contextualizing it through aspects of biophysical and biochemical features, which gives us additional confidence that it is learning something that’s actually meaningful and not simply an optimization of the task that we gave it,” Love says.
The research was funded by the Daniel I.C. Wang Faculty Research Innovation Fund at MIT, the MIT AltHost Research Consortium, the Mazumdar-Shaw International Oncology Fellowship, and the Koch Institute.