In recent years, artificial intelligence (AI) has moved beyond its traditional boundaries and into the intricate world of biology. The emergence of foundation models, advanced AI systems trained on vast datasets, has changed how scientists interpret biological data. These models, built on technology similar to that behind ChatGPT, are now being used to decode the complex structures of proteins, DNA, and RNA, paving the way for precise biomolecule design. According to a recent report in Nature Biotechnology,
“AI methods similar to those that produced GPT-4 are now yielding foundation models that can make sophisticated predictions about the structure and function of DNA, RNA, and protein, and enable the generation of novel biomolecules based on real-world biological and evolutionary principles” (Eisenstein, 2024).
Imagine being able to design a protein from scratch or predict the effectiveness of a gene therapy before it’s ever tested in the lab. With AI’s foundation models, this once-distant dream is becoming a reality. These models learn from massive biological datasets, helping scientists predict molecular behavior and design biomolecules with real-world applications in medicine, agriculture, and beyond. This represents the dawn of a new era where AI-driven insights are transforming the future of biology and biomolecule design.
Foundation Models in Biology: The Future of Biomolecular Research
Foundation models are transforming our understanding of biology by harnessing AI’s ability to interpret vast amounts of unstructured biological data. These models are trained on immense, unlabeled datasets containing genetic, proteomic, and molecular information, enabling them to learn the underlying principles of how biomolecules such as DNA, RNA, and proteins function. This capability allows them to make sophisticated predictions that help scientists solve complex biological problems.
A prominent example is ESM3, a model developed by the techbio startup EvolutionaryScale. This model has demonstrated an extraordinary ability to design novel proteins, producing a green fluorescent protein with only 58% sequence similarity to the closest known fluorescent protein while maintaining its functional brightness. By leveraging real-world biological and evolutionary principles, ESM3 can generate biologically relevant molecules, showcasing the immense potential of foundation models in biomolecule design.
The impact of foundation models goes beyond individual examples. Major biotech companies like Deep Genomics and BioMap are integrating these AI-powered tools into their workflows, revolutionizing drug discovery, gene editing, and various areas of molecular biology. By precisely predicting molecular structures and functions, AI foundation models are enhancing our ability to design novel therapeutics and treatments.
How Foundation Models Are Redefining AI in Biology
Foundation models represent a significant leap forward in applying AI to biology, offering flexibility and predictive power that traditional machine learning models struggle to match. While traditional models are typically trained on smaller, task-specific datasets, foundation models are pretrained on vast amounts of unstructured biological data, such as DNA, RNA, and protein sequences. This broad training enables them to generalize across different biological tasks with little or no additional task-specific data. For instance, one pretrained foundation model can be used to predict protein structures, analyze gene expression, and even design new biomolecules, rather than requiring a separate model and dataset for each task.
Their predictive power is another key differentiator. Foundation models can make accurate predictions even for biological processes they haven’t been explicitly trained on, allowing them to tackle complex problems such as predicting protein folding or gene expression patterns under various conditions. For example, the BigRNA model has shown the ability to predict gene expression levels for genes it had never encountered before, showcasing the advanced capabilities of these models.
Although foundation models require substantial computational resources due to their complexity, their ability to integrate insights across multiple biological domains makes them indispensable for modern research. They can address a wide range of challenges, from drug discovery to synthetic biology, making them essential tools for driving innovation in biomolecule design.
Unlocking Protein Secrets: How AI is Shaping the Future of Biomolecule Design
Foundation models have revolutionized our understanding of protein structure and function by providing sophisticated tools for predicting protein behavior and designing novel biomolecules. These models can predict how proteins will fold, their stability, and their interactions with other molecules—insights that are crucial for understanding their roles in biological processes.
One of the most exciting advancements in this area is the use of foundation models for generative protein design. For instance, ESM3 successfully generated a green fluorescent protein (GFP) variant with only 58% sequence similarity to known GFPs, yet it maintained functional brightness. This ability to design proteins with specific characteristics opens up vast possibilities for applications in biotechnology and medicine, from engineering more efficient enzymes to developing targeted therapies for diseases.
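To make a figure like "58% sequence similarity" concrete, here is a minimal Python sketch of one common way to compare two proteins: percent identity over the columns of a pairwise alignment. This is only an illustration under stated assumptions. The function name `percent_identity`, the choice to skip gapped columns, and the two toy fragments are all hypothetical (the fragments are made up and are not real GFP sequences), and the source does not say which exact similarity measure was used for the ESM3 result.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of identical residues over aligned columns without a gap.

    Both inputs must already be aligned (same length, gaps as `gap`).
    Skipping gapped columns is one convention among several.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    compared = 0  # columns where both sequences have a residue
    matches = 0   # columns where those residues are identical
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a == b:
            matches += 1

    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy, made-up aligned fragments used purely for illustration.
designed = "MSKGEELFTG-VVPILVELDG"
natural = "MSRGAELFSGQVIPVLIELEG"
print(f"{percent_identity(designed, natural):.1f}% identity")
```

In practice the alignment itself would come from a dedicated alignment tool, and the reported number depends on how gaps and alignment length are handled, so figures from different sources are not always directly comparable.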
The predictive capabilities of these models extend beyond proteins they have been explicitly trained on. Foundation models like BigRNA can predict complex structures and gene expression levels based on unannotated genomic data, making them invaluable in drug discovery. They guide the design of custom antibodies and other therapeutic molecules, offering new solutions for disease treatment and prevention. Through their robust predictions and generative design capabilities, AI foundation models are paving the way for innovative breakthroughs in both research and medicine.
Integration of AI with Existing Biological Research Tools
AI foundation models are transforming biological research by integrating seamlessly with traditional tools like X-ray crystallography and cryo-electron microscopy (cryo-EM). These AI models complement conventional methods by enhancing both speed and precision, allowing researchers to decode complex biomolecular structures more efficiently. While X-ray crystallography and cryo-EM have long been gold standards for visualizing molecular structures, foundation models can predict protein structures and behaviors, either before or alongside these techniques, streamlining the experimental process. This synergy accelerates discoveries and reduces the need for time-intensive experiments.
Moreover, foundation models simplify workflows that previously required multiple specialized models. For example, BigRNA can handle tasks such as predicting RNA transcription, splicing, and polyadenylation within a single platform. This unified framework allows researchers to integrate AI-driven predictions with data generated from traditional methods, creating a more comprehensive understanding of biological systems. By enhancing the depth of analysis and providing detailed predictions that can be cross-verified using traditional tools, foundation models ultimately increase the accuracy of results and drive the next wave of precision biology.
The open-source availability of many foundation models also encourages collaboration across the scientific community, allowing researchers to incorporate AI-driven insights into their ongoing experiments. This integration of AI with traditional laboratory techniques is empowering researchers to explore new frontiers in biomolecule design with unprecedented speed and accuracy.
Applications in Drug Discovery, Gene Therapy, and Synthetic Biology
Foundation models are revolutionizing drug discovery by enabling the rapid design and development of new biomolecules. These advanced AI models leverage extensive biological data to predict how proteins and genes will behave under different conditions, which is crucial for identifying potential drug targets and understanding disease mechanisms. For instance, BigRNA can predict gene expression levels and intron-exon structures in genes it has never encountered before, providing deeper insights into genetic regulation and opening new avenues for therapeutic intervention. This predictive capability allows researchers to pinpoint novel targets for drug development, accelerating the discovery of treatments for complex diseases.
In gene therapy, foundation models are facilitating the design of custom RNA therapeutics. By integrating diverse biological data, such as genomic and transcriptomic information, these models provide a holistic view of cellular processes, enabling the creation of RNA molecules that can specifically target and correct genetic disorders. BioMap’s xTrimoPGLM model, for example, predicts the stability and interactions of designed proteins, allowing for the development of customized molecules tailored to specific therapeutic needs. This capability not only speeds up the development of gene therapies but also enhances their precision, reducing off-target effects and improving patient outcomes.
In synthetic biology, foundation models are driving innovation by enabling the generative design of entirely new biomolecules. These models can predict how novel proteins will fold and interact within a biological system, making it possible to engineer enzymes with enhanced functions or create synthetic pathways that do not exist in nature. This has far-reaching implications for biotechnology, from developing sustainable biofuels to creating novel biomaterials. By streamlining research workflows and integrating multimodal data, foundation models empower scientists to explore complex biological systems more efficiently, making groundbreaking applications in drug discovery, gene therapy, and synthetic biology a reality.
Handling the Deluge of Unstructured Biological Data
One of the greatest challenges in modern biology is managing the sheer volume and complexity of unstructured biological data. From vast libraries of protein sequences to intricate gene expression profiles, much of this information remains unannotated, making it difficult for traditional methods to extract meaningful insights. Foundation models address this challenge by pretraining on enormous datasets that encompass diverse biological information, such as protein structures, genomic sequences, and functional annotations. This extensive training allows these models to recognize patterns and relationships across millions or even billions of data points, without relying on manual annotations, which are often sparse and difficult to generate.
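One widely used way to train on sequences without manual annotations is masked-token prediction: hide part of a sequence and ask the model to recover it, so the sequence supplies its own labels. The Python sketch below only builds such a training example. It is a simplified illustration, not the procedure of any specific model named in this article. The `MASK` symbol, the 15% mask fraction, the function name `make_masked_example`, and the example sequence are all assumptions chosen for demonstration.

```python
import random

MASK = "#"  # hypothetical mask symbol; real models use a dedicated token


def make_masked_example(sequence, mask_fraction=0.15, rng=None):
    """Hide a random subset of residues and return (masked_sequence, targets).

    `targets` maps each masked position to the residue that was hidden,
    which is what a model would be trained to predict.
    """
    if not sequence:
        raise ValueError("sequence must be non-empty")
    rng = rng or random.Random()

    # Always mask at least one position so the example carries a label.
    n_mask = max(1, round(len(sequence) * mask_fraction))
    positions = sorted(rng.sample(range(len(sequence)), n_mask))

    tokens = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        tokens[pos] = MASK
    return "".join(tokens), targets


# Made-up amino-acid string; a fixed seed keeps the run reproducible.
original = "MKTAYIAKQRQISFVKSHFSRQ"
masked, targets = make_masked_example(original, rng=random.Random(0))
print(masked)
print(targets)
```

No human annotation is involved at any point: the hidden residues are the training signal, which is why this style of pretraining can scale to very large unannotated sequence collections.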
Similar to large language models like GPT-4, foundation models for biology leverage statistical associations derived from their training data. Using billions of parameters, they can classify and process biological data at a granular level, uncovering hidden connections and making predictions based on high-level prompts or specific molecular details. This ability to process and integrate unstructured data enables foundation models to generate coherent insights into complex biological mechanisms, such as protein folding or gene regulation, that were previously beyond the reach of conventional computational tools.
As these models evolve, their capacity to integrate multimodal data—including structural, sequence, and functional information—will be crucial for advancing our understanding of the fundamental principles that govern life. By analyzing this integrated data, foundation models provide a holistic understanding of biological systems, enabling the prediction of unseen protein structures or the identification of novel therapeutic targets. Their ability to process and interpret vast amounts of unstructured data will drive innovation in biomolecule design and beyond.
Ethical Considerations in Using AI for Biological Research
While AI-driven biological research offers immense potential, it also raises significant ethical challenges that must be addressed. One of the foremost concerns is data privacy and consent, particularly when using large datasets derived from human subjects. The use of genomic and personal health data necessitates stringent measures to protect sensitive information. Researchers must ensure that data is anonymized and that individuals provide informed consent, fully understanding how their data will be used and the potential implications of its use. This is crucial to maintain public trust and avoid misuse of personal information.
Bias and fairness are also critical issues. Foundation models are trained on extensive datasets that may not adequately represent all populations, potentially leading to biased outcomes. For example, if a model is predominantly trained on data from one demographic, its predictions may not be accurate for other groups, exacerbating existing health disparities. Furthermore, algorithmic biases can influence drug design and treatment recommendations, which may not account for biological diversity across different populations. Ensuring diverse and representative training data is essential to mitigate these risks and promote equitable healthcare outcomes.
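A simple first check for the kind of bias described above is to report a model's accuracy separately for each population group rather than as a single overall number. The Python sketch below shows that calculation. The function name `accuracy_by_group`, the group labels, and the records are all made up for illustration, and a real audit would use more careful metrics and far larger samples.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Compute accuracy per group from (group, true_label, predicted_label) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, y_true, y_pred in records:
        total[group] += 1
        if y_true == y_pred:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


# Fabricated toy predictions, only to show the shape of the calculation.
records = [
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 1), ("group_a", 0, 1),
    ("group_b", 1, 0), ("group_b", 0, 0),
]
scores = accuracy_by_group(records)
accuracy_gap = max(scores.values()) - min(scores.values())
print(scores)
print(f"largest accuracy gap between groups: {accuracy_gap:.2f}")
```

A large gap between groups, or a group with very few records, is a signal that the training or evaluation data may not represent that population well.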
The creation of novel biomolecules through AI also raises safety concerns, as it could lead to unforeseen consequences in the environment or human health. The potential to design new proteins or genes must be approached with caution, considering the long-term effects and regulatory challenges. Additionally, the opacity of foundation models, often referred to as the “black box” problem, poses challenges for transparency and accountability. When these models guide critical decisions in drug development or patient care, understanding how they arrive at specific conclusions is vital. Establishing clear guidelines for the responsible use of AI in biological research is imperative to ensure these powerful tools are used ethically and safely, balancing innovation with societal responsibility.
Conclusion
The integration of AI foundation models into biological research is revolutionizing the way we design and understand biomolecules, opening new frontiers in drug discovery, gene therapy, and synthetic biology. These advanced models offer unparalleled capabilities, from predicting complex protein structures to generating novel therapeutics, and are reshaping research workflows by providing a unified framework that can tackle multiple biological tasks. However, this power comes with the responsibility to address ethical concerns such as data privacy, algorithmic bias, and the potential consequences of creating new biomolecules.
As the field continues to evolve, it is essential for researchers to embrace these AI-driven technologies while being mindful of their limitations and ethical implications. By leveraging the strengths of foundation models alongside traditional methods, we can gain a deeper understanding of biological systems and accelerate the development of innovative solutions to some of the most pressing challenges in medicine and biotechnology. The future of biomolecule design lies at the intersection of AI and biology, and researchers who harness these tools thoughtfully and responsibly will be at the forefront of groundbreaking discoveries.
Additional Reading
- Boyd, N. et al. 2023. ATOM-1: A Foundation Model for RNA Structure and Function Built on Chemical Mapping Data. bioRxiv 2023.12.13.571579; https://doi.org/10.1101/2023.12.13.571579
- Celaj, A. et al. 2023. An RNA foundation model enables discovery of disease mechanisms and candidate therapeutics. bioRxiv 2023.09.20.558508; https://doi.org/10.1101/2023.09.20.558508
- Chen, B. et al. 2023. xTrimoPGLM: Unified 100B-scale pre-trained transformer for deciphering the language of protein. bioRxiv 2023.07.05.547496; https://doi.org/10.1101/2023.07.05.547496
- Eisenstein, M. 2024. Foundation models build on ChatGPT tech to learn the fundamental language of biology. Nat. Biotechnol. 42, 1323–1325. https://doi.org/10.1038/s41587-024-02400-2
- Nguyen, E. et al. 2024. Sequence modeling and design from molecular to genome scale with Evo. bioRxiv 2024.02.27.582234; https://doi.org/10.1101/2024.02.27.582234
