Nobel Prize winner David Baker: 13 blue oceans of AI for Science (Part 1)

Vasundhara Mali

4 days ago

Text | AlphaEngineer, author | Fei Binjie

With the rapid development of AI over the past two years, consumer-facing applications of large models have emerged one after another and become widely familiar. In contrast, AI for Science has remained shrouded in mystery.

Recently, the views of the AI industry have begun to change. Jason Wei clearly pointed out that AI for Science contains huge opportunities, and the biggest scenario lies in the protein revolution launched by AlphaFold 2.

Recently, David Baker, winner of the 2024 Nobel Prize in Chemistry, gave an excellent talk titled “De Novo Protein Design”. Standing at the forefront of scientific research, he lifted the veil for us: what the application scenarios of AI for Science are, and what practical value it can deliver.

David Baker predicts that in the next 5-10 years, we will see a variety of new synthetic proteins created with the help of large AI models to solve medical problems including cancer, autoimmune diseases, and Alzheimer’s disease, while also proving their worth in fields such as bioelectronics, catalyst synthesis, and solar energy harvesting.

The article is fairly dense, but the content is valuable. Read it patiently, or save it and read it at your own pace.

(1) Sequence <-> 3D Structure: twin problems in the protein field

The 2024 Nobel Prize in Chemistry focused on two major areas: Computational Protein Design and Protein Structure Prediction. The two are in fact twin problems, two sides of the same coin.

It is known that peptide chains fold into complex three-dimensional structures that are somehow encoded in the sequence of amino acids that make up the peptide chain. In other words, the linear sequence of amino acids determines the three-dimensional structure of the protein.

For this important discovery, Christian Anfinsen was awarded the Nobel Prize in Chemistry in 1972.

This means that in principle we can predict the three-dimensional structure directly from the amino acid sequence. Conversely, given a specific three-dimensional protein structure, we can in theory deduce an amino acid sequence that folds into it.

These two problems, one the inverse of the other, are the core of protein research.

3D Structure -> Sequence, called Computational Protein Design

Sequence -> 3D Structure, called Protein Structure Prediction

The challenge of “protein design” was overcome by David Baker in 2003. He designed a new protein containing 93 amino acids and computed its amino acid sequence. His team then synthesized the protein in the laboratory and confirmed that the prediction was correct.

In contrast, predicting the three-dimensional structure of a protein from its amino acid sequence is a huge search problem, as Cyrus Levinthal pointed out as early as the 1960s.
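To get a feel for why this counts as a huge search problem, here is a rough back-of-the-envelope sketch. The numbers are illustrative assumptions, not figures from the talk: each residue is allowed only 3 backbone conformations, and 10^13 conformations are tried per second.

```python
import math

# Illustrative assumptions (not from the talk):
STATES_PER_RESIDUE = 3        # allowed backbone conformations per residue
SAMPLES_PER_SECOND = 1e13     # conformations tried per second
SECONDS_PER_YEAR = 3.156e7


def conformations(n_residues: int) -> int:
    """Number of backbone conformations under this toy model."""
    return STATES_PER_RESIDUE ** n_residues


def years_to_enumerate(n_residues: int) -> float:
    """Years needed to visit every conformation once at the assumed rate."""
    return conformations(n_residues) / SAMPLES_PER_SECOND / SECONDS_PER_YEAR


if __name__ == "__main__":
    n = 100
    print(f"log10(conformations) = {math.log10(conformations(n)):.1f}")
    print(f"log10(years needed)  = {math.log10(years_to_enumerate(n)):.1f}")
```

Even under these generous assumptions, a 100-residue chain has about 5 × 10^47 conformations, so brute-force enumeration is hopeless. Real proteins fold in far less time, which suggests they do not search at random.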

For decades, progress in this field was very slow, but Demis Hassabis and John Jumper successfully solved the problem in 2020 by training a neural network model.

Now AlphaFold2 can accurately predict the distance map between the amino acids in a sequence and convert it into a three-dimensional structure, achieving accurate prediction of protein structure.

(2) The birth of protein: natural evolution or AI synthesis

Proteins evolved gradually over billions of years. They are like micro-robots that perform a variety of important functions in living organisms.

However, as the average life expectancy has continued to increase in recent years, humans are facing new challenges including cancer, neurodegenerative diseases, global warming, etc.

If we still rely on nature to evolve new proteins to solve these problems, we may have to wait hundreds of millions of years.

But if we can design proteins on demand, we can achieve breakthrough results in just a few years. This is the value of protein design.

In protein design, we first construct a protein that is expected to have a specific function, and then calculate the amino acid sequence corresponding to this protein.

Since this is a completely new protein, there is no gene that can encode it in nature. People need to create a synthetic gene, a synthetic DNA fragment that can encode this protein.

Then we put it into bacteria, which act as protein-producing factories, and finally we extract the protein and test whether it meets the expected functional requirements.
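The step from a designed amino acid sequence to a synthetic gene can be sketched as a simple back-translation. This is a minimal illustration, assuming one arbitrarily chosen valid codon per amino acid from the standard genetic code. The names `CODON_FOR` and `back_translate` are hypothetical, and real gene synthesis also tunes codon choice for the host organism, which this sketch ignores.

```python
# One valid codon per amino acid from the standard genetic code.
# The particular choice among synonymous codons is arbitrary here.
CODON_FOR = {
    "A": "GCT", "R": "CGT", "N": "AAT", "D": "GAT", "C": "TGT",
    "Q": "CAG", "E": "GAA", "G": "GGT", "H": "CAT", "I": "ATT",
    "L": "CTG", "K": "AAA", "M": "ATG", "F": "TTT", "P": "CCG",
    "S": "AGC", "T": "ACC", "W": "TGG", "Y": "TAT", "V": "GTG",
}
STOP_CODON = "TAA"


def back_translate(protein: str) -> str:
    """Return a DNA coding sequence for the protein, ending in a stop codon."""
    codons = []
    for residue in protein.upper():
        if residue not in CODON_FOR:
            raise ValueError(f"unknown amino acid: {residue!r}")
        codons.append(CODON_FOR[residue])
    return "".join(codons) + STOP_CODON


if __name__ == "__main__":
    # A made-up three-residue peptide, for illustration only.
    print(back_translate("MKT"))  # ATGAAAACCTAA
```

Because most amino acids have several synonymous codons, many different DNA sequences encode the same protein. The table above simply picks one of them.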

(3) The number of potentially untapped proteins is astronomical

A typical protein contains a sequence of more than 100 amino acids, of which there are 20 types. This means that there are at least 20^100 types of potential proteins, which is an astronomical number.
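The size of this number is easy to check directly, since Python integers have arbitrary precision. The function name `sequence_space` is only for illustration.

```python
import math

AMINO_ACID_TYPES = 20


def sequence_space(length: int) -> int:
    """Number of distinct amino acid sequences of the given length."""
    return AMINO_ACID_TYPES ** length


if __name__ == "__main__":
    n = sequence_space(100)
    print(f"decimal digits: {len(str(n))}")   # 131
    print(f"log10: {math.log10(n):.1f}")      # 130.1
```

In other words, 20^100 is about 1.3 × 10^130. For comparison, a commonly cited estimate for the number of atoms in the observable universe is around 10^80.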

The proteins born in the natural evolution of life are only a very, very small part of it. The gray area in the figure below represents the potential protein space, and the red area is the protein types that exist in nature.

Since the evolution of life is gradual, proteins in nature are often closely related to one another. For example, the proteins in the human body are highly similar to those in other mammals, so the red dots in the picture appear clustered.

Therefore, when scientists want to design a new protein, the traditional method is to first look in nature for proteins with similar properties, and then make small modifications based on them. This approach is called “bioprospecting”.

But there are many problems with this approach. First of all, the types of proteins that exist in nature are limited, and so are the functions they can achieve. When we want to achieve some special function, there may be no similar natural protein to start from. At the same time, the structures of natural proteins are very complex, and making small modifications to a complex system is not easy, much like debugging millions of lines of software code.

(4) RF Diffusion: Generate proteins like pictures

In recent years, people have begun to use the RF Diffusion method for protein design. This algorithm is actually inspired by image generation algorithms.

In a diffusion algorithm, people first add varying amounts of noise to images, and then train a neural network to remove the noise and restore the original image.

Once this neural network can perfectly remove noise, it can start from completely random noise pixels, gradually remove the noise, and generate a brand new image.

The RF Diffusion algorithm works on a very similar principle. First, we take a large volume of protein structure data from the PDB (Protein Data Bank), inject progressively more noise into it, and then train a neural network to remove the noise from the protein structure data.

After training is complete, we can start with a completely random amino acid configuration and gradually remove the noise to generate a completely new protein structure.
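The noising and denoising idea described above can be sketched on toy 3D coordinates. This is a minimal illustration, not RF Diffusion's actual implementation. The cosine-shaped schedule, the function names, and the three-point "backbone" are all assumptions. The known noise stands in for what a trained neural network would have to predict.

```python
import math
import random


def alpha_bar(t: int, steps: int) -> float:
    """Fraction of the original signal kept after t of `steps` noising steps
    (a cosine-shaped schedule, chosen here for illustration)."""
    return math.cos(0.5 * math.pi * t / steps) ** 2


def add_noise(coords, t, steps, rng):
    """Forward process: blend clean coordinates with Gaussian noise."""
    ab = alpha_bar(t, steps)
    noise = [[rng.gauss(0.0, 1.0) for _ in point] for point in coords]
    noisy = [
        [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * e for x, e in zip(point, eps)]
        for point, eps in zip(coords, noise)
    ]
    return noisy, noise


def remove_noise(noisy, predicted_noise, t, steps):
    """Invert the blend given an estimate of the noise (valid for t < steps).
    In a real model, the estimate comes from the trained network."""
    ab = alpha_bar(t, steps)
    return [
        [(x - math.sqrt(1.0 - ab) * e) / math.sqrt(ab) for x, e in zip(point, eps)]
        for point, eps in zip(noisy, predicted_noise)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy "backbone": three points spaced 3.8 units apart along one axis.
    clean = [[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [7.6, 0.0, 0.0]]
    noisy, eps = add_noise(clean, t=50, steps=100, rng=rng)
    # Using the true noise plays the role of a perfect denoising network.
    restored = remove_noise(noisy, eps, t=50, steps=100)
    print("noisy:   ", noisy[1])
    print("restored:", restored[1])
```

With a perfect noise estimate, the clean coordinates are recovered up to floating-point error. Generation replaces the known noise with the network's prediction and repeats the step many times, starting from pure noise.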

Just as we can constrain the content of a generated picture through methods such as prompts and LoRA, we can also add constraints when generating proteins, so as to produce proteins with specific functional properties.

For example, the figure below shows a designed protein that binds to a given target, the insulin receptor. During training, the neural network has learned the complementary shape characteristics between proteins and is therefore able to generate proteins that closely fit the target. Scientists have now designed proteins that can bind to more than 200 targets.

Next, we will discuss the application value of protein design, with a total of 13 scenarios, corresponding to the three major fields of medicine, electronic technology, and sustainable development.

(4) Protein × New drug research and development: snake antivenom

Snakebite remains an important medical problem, especially in developing countries, because snake venom can directly interfere with basic biochemical reactions.

A snake antivenom must be chemically stable enough and cheap enough to be used in countries without cold chain transport.

The blue part in the picture on the left is a protein designed by AI, which binds tightly to the snake toxin. After it is injected into mice, the venom is completely neutralized and the mortality rate falls from 100% to 0%.

(5) Protein × New drug research and development: autoimmune diseases

Inflammation is a key topic in current medicine. It is closely related to autoimmunity and cancer.

At the heart of inflammation is a protein called the TNF receptor, which is also the target of many drugs currently on the market.

The left side of the picture below shows a protein designed to bind the TNF receptor. Injecting it into animals can effectively inhibit inflammation.

Current drugs used to treat inflammation, such as Enbrel (etanercept), are effective to a degree, but the AI-designed protein binds more tightly to the receptor, so its anti-inflammatory effect is better.

This means that in the near future, people will be able to design new drugs to treat a variety of autoimmune diseases.

(6) Protein × New drug research and development: cancer treatment

Cancer treatment is a key area where protein design shines. Now scientists can design entirely new proteins to activate the immune system and thereby treat cancer.

The red protein on the left in the picture below binds two immune receptors together, causing strong activation of the immune system.

In experiments on treating pancreatic cancer, this method achieved better results than traditional treatments, with tumors shrinking significantly.

(7) Protein × New drug research and development: epidemic antibodies

The gray part on the left side of the picture below is the influenza virus surface protein, on which we can use AI to generate a binding protein.

When generating, we can add a constraint: we want this protein to be an antibody, a special type of protein fold.

The purple part in the right picture above is the antibody protein structure measured in the laboratory, and the gray part is the protein structure generated by the model. The two are almost identical.

Antibodies recognize targets through their CDR loops. The antibody proteins generated by neural networks reproduce the CDR loops closely and can bind tightly to influenza virus surface proteins, so they work well as antibodies.

In fact, the development of epidemic antibodies based on protein design has already entered our daily lives.

As early as 2016, Neil King began trying to make self-assembled nanoparticles. After successfully making them, he realized that he could put some viral protein fragments on them to produce vaccines.

Based on this idea, during Covid he placed the receptor-binding domain of the Covid surface (spike) protein on these nanoparticles, and found that this triggered a very strong immune response.

This research led to SKYCovione, a clinically approved vaccine. In the next few years, more and more similar new drugs are expected to be released.

(8) Protein × New drug development: Alzheimer’s disease

Another medical problem of increasing importance is neurodegenerative disease, such as Alzheimer’s disease, which is associated with the formation of long amyloid fibrils.

The formation of these fibrils involves a variety of proteins, including amyloid β and tau, which bind to one another to form the long fibrils.

We could design a completely new protein that binds to the disordered parts of these proteins, blocking the formation of amyloid and thereby preventing Alzheimer’s disease.

Above, we have sorted out the potential application value of AI for Science in new drug research and development.
