Society of Chemical Industry - Data with discovery

Machine learning technology is being harnessed by researchers in the hunt for the next generation of antibiotics and antivirals, reports Jon Evans

Science is awash with data. Automated synthesis techniques, high-throughput screening and powerful analytical technologies are churning out information at an unprecedented rate.

The challenge now is to analyse this deluge of data to pick out the useful nuggets among all the noise. Fortunately, sifting through all this data would appear to be an ideal task for machine learning.

Now the dominant form of artificial intelligence (AI), machine learning is based on using computers to find patterns in data, which can then be used to reveal the useful nuggets. This involves training the machine learning system with one lot of data to identify the patterns and then applying those patterns to a new set of data.

Machine learning is responsible for computer systems that can beat the best human players of strategy games such as chess and Go. These systems were fed lots of previous games of chess and Go as training data, from which they identified strategies and tactics that they could use when playing against human opponents. It is also responsible for the face-recognition systems that control access to many laptops and smartphones, which were trained with lots of photos of faces to allow them to identify the best facial attributes for distinguishing people.

Given the success of machine learning over the past few years, it’s no surprise it has been readily adopted by scientists to sift through their mounds of data, such that it is now being used for everything from designing novel materials to predicting earthquakes. All these machine learning systems look for patterns by essentially building models that match inputs of training data to specific outputs.

Several mathematical techniques have been developed for matching inputs to outputs, including support vector machines (SVMs) and decision trees, but the most powerful are perhaps artificial neural networks (ANNs), which are behind the machine learning systems able to recognise faces and defeat Go champions. ANNs comprise arrays of connected nodes designed to mimic nerves within the human brain. The connections between the nodes are weighted and the precise output to specific inputs depends on these weightings, which can change as a result of experience.

The idea is to train the network to associate specific inputs with a specific output by altering the weightings between the nodes until it produces the desired output. For face-recognition, the inputs are photos of faces, and the output is accurately determining whether two faces are identical or different. Any mistakes are used to modify the weightings until the network can perform accurately. Because the network can use any combination of weightings to associate inputs with an output, as long as they produce accurate results, the exact association it has identified between them often remains a mystery.
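To make the idea of adjusting weightings concrete, here is a minimal sketch in Python of a single artificial node (a perceptron) learning a toy rule. It is a deliberate simplification: real networks for face recognition chain together many such nodes and use more sophisticated update rules, and the function names and toy data below are illustrative assumptions rather than anything taken from the systems described here.

```python
def train_perceptron(samples, labels, epochs=10, learning_rate=1.0):
    """Adjust the weightings of a single node until it reproduces the labels."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in zip(samples, labels):
            activation = sum(w * x for w, x in zip(weights, inputs)) + bias
            predicted = 1 if activation > 0 else 0
            error = target - predicted  # non-zero only when the node is wrong
            if error != 0:
                # Each mistake nudges the weightings towards the right answer.
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, inputs)]
                bias += learning_rate * error
    return weights, bias


def predict(weights, bias, inputs):
    """Return the node's output (0 or 1) for one set of inputs."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0


# Toy training data (hypothetical): output 1 if either input is switched on.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 1, 1, 1]

weights, bias = train_perceptron(samples, labels)
predictions = [predict(weights, bias, s) for s in samples]
print(predictions)  # [0, 1, 1, 1]
```

Even in this tiny case, the final weightings are simply whatever values happened to stop the mistakes, which hints at why the associations learned by much larger networks are hard to interpret.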

Nevertheless, in the wake of its impressive performances, machine learning is now being applied to some of the most pressing, intractable problems currently facing humankind. Prime among these is the rise of antibiotic resistance.

Towards new antibiotics

It’s well known that the widespread use, or overuse, of antibiotics has led to the rise of antibiotic-resistant strains of bacteria such as Staphylococcus aureus and Streptococcus. Antibiotic resistance already accounts for more than 25,000 deaths/year in Europe and 35,000 in the US, and the situation is only getting worse. Added to this, the rate of discovery of new antibiotics has slowed over the years, as the most obvious ones have already been found. This is because scientists have, understandably, tended to search for antibiotics similar to those already discovered, which are mainly based on antimicrobial metabolites produced by soil-dwelling microbes. Not only does this mean they have already thoroughly searched the chemical space related to current antibiotics, but any new antibiotics they do discover will likely work in a similar way to current versions and thus may not control resistant strains.

With machine learning, however, scientists can now search over a much wider chemical space to find completely novel antibiotics that work in totally different ways. The idea is to train a machine learning model with data on the molecular features of both antibiotic and non-antibiotic molecules, so it can determine the combination of features that best distinguish antibiotics. Scientists then apply this trained model to libraries of molecules to identify those with the combination of features that indicate they could make good antibiotics, which are then tested in the lab.
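The train-then-screen workflow can be sketched in a few lines of Python. The example below uses a nearest-centroid classifier, a far simpler stand-in for the models used in practice, and every molecule name and feature value is made up for illustration.

```python
import math


def centroid(rows):
    """Average each feature across a group of molecules."""
    return [sum(column) / len(rows) for column in zip(*rows)]


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def train(features, is_antibiotic):
    """Learn one 'typical' feature vector for each class."""
    actives = [f for f, label in zip(features, is_antibiotic) if label]
    inactives = [f for f, label in zip(features, is_antibiotic) if not label]
    return centroid(actives), centroid(inactives)


def score(model, features):
    """Higher scores mean the molecule looks more like the known antibiotics."""
    active_centroid, inactive_centroid = model
    return distance(features, inactive_centroid) - distance(features, active_centroid)


def screen(model, library, top_n):
    """Rank a library of unseen molecules and return the best candidates."""
    ranked = sorted(library, key=lambda name: score(model, library[name]),
                    reverse=True)
    return ranked[:top_n]


# Hypothetical training set: two numerical features per molecule.
training_features = [(0.9, 0.8), (0.8, 0.9), (0.1, 0.2), (0.2, 0.1)]
training_labels = [True, True, False, False]

# Hypothetical library of molecules the model has not seen before.
library = {"mol_A": (0.85, 0.9), "mol_B": (0.15, 0.1), "mol_C": (0.6, 0.7)}

model = train(training_features, training_labels)
shortlist = screen(model, library, top_n=2)
print(shortlist)  # ['mol_A', 'mol_C']
```

The shortlist is only a set of predictions; as in the real workflow, the candidates would still need to be tested in the lab.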

Chemists from the US and Canada, led by James Collins at Massachusetts Institute of Technology, did just that. They trained an ANN with experimental data on 2335 different molecules and their effect on the growth of the model bacterium Escherichia coli, and then applied that network to the molecules in multiple chemical libraries (Cell, 2020, 180, 688).

In one of the libraries, their network identified the kinase inhibitor halicin as a promising antibiotic, even though it has a different molecular structure to all known antibiotics. When tested in mice, they found that halicin was active against antibiotic-resistant strains of Clostridioides difficile and Acinetobacter baumannii.

In another one of the libraries, comprising over 100m molecules, their network identified eight potential antibiotic compounds, again all structurally distinct from known antibiotics. Two of these compounds showed broad-spectrum activity, meaning antibiotic activity against a variety of pathogenic bacteria, including antibiotic-resistant strains of E. coli.

Another promising source of novel antibiotic compounds is antimicrobial peptides (AMPs): tiny fragments of proteins with antimicrobial activity produced by many bacteria and fungi. Unfortunately, scientists have struggled to turn known AMPs into effective antibiotics, and so far, only a handful have made it to clinical trials. But researchers are also sure they’ve only scratched the surface in terms of the number of AMPs that could be out there. Hence, they are turning to machine learning to try to discover more, by looking in some unusual places.

For example, as peptides are essentially tiny fragments of proteins, a team led by Cesar de la Fuente-Nunez at the University of Pennsylvania in Philadelphia is searching for AMPs inside human proteins, as this provides a vast, easily accessible but unexplored search space. ‘We are developing pattern recognition algorithms, inspired by image and speech recognition, to discover novel peptide antibiotics in genomes and proteomes,’ he explains.

The idea is to train a machine learning model with data on the properties of known AMPs, including their amino acid sequence and structure, and then search protein databases for human proteins with sections possessing similar sequences and structures. Identified sections can then be synthesised, by inserting the genetic sequence that codes for the section into microbes such as E. coli, to produce peptides for testing.
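The search step can be pictured as sliding a window along a protein sequence and scoring each segment. In the Python sketch below, a crude hand-written score that rewards positive charge and hydrophobic residues stands in for a trained model; the scoring rule, its weighting and the protein sequence are all illustrative assumptions, not the team's actual algorithm.

```python
HYDROPHOBIC = set("AILMFVWY")
POSITIVE = set("KR")
NEGATIVE = set("DE")


def amp_like_score(peptide):
    """Crude stand-in for a trained model: net charge plus hydrophobicity."""
    charge = (sum(aa in POSITIVE for aa in peptide)
              - sum(aa in NEGATIVE for aa in peptide))
    hydrophobic_fraction = sum(aa in HYDROPHOBIC for aa in peptide) / len(peptide)
    return charge + 2.0 * hydrophobic_fraction  # the weighting of 2.0 is arbitrary


def scan_protein(sequence, window, top_n):
    """Score every segment of a given length and return the best candidates."""
    candidates = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        candidates.append((amp_like_score(segment), start, segment))
    candidates.sort(key=lambda candidate: candidate[0], reverse=True)
    return candidates[:top_n]


# Made-up sequence in one-letter amino acid code, not a real human protein.
protein = "MDEEDSGSTKRLLKKWAKRLGDESEDTT"

hits = scan_protein(protein, window=10, top_n=3)
for hit_score, start, segment in hits:
    print(start, segment, round(hit_score, 2))
```

In this toy sequence, the top-scoring segment is the positively charged stretch starting at position 9, which would be the first candidate to synthesise and test.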

Using this method, de la Fuente-Nunez and colleagues have already found three novel AMPs within pepsin A, the main protein-digesting enzyme in the human stomach (ACS Synthetic Biology, 2018, 7, 2105). Testing revealed that the three AMPs have antibacterial activity against several food-borne pathogens. This suggests these AMP sequences weren’t in pepsin A by chance, but might actually help to control pathogenic bacteria in the human stomach.

While de la Fuente-Nunez thinks this approach holds promise, he admits there are challenges. ‘For example, how can we train machines to truly create a drug, taking into account not only potency for a desired application but also toxicity, stability, etc? Another aspect I think will be crucial will be to endow machines with elements of creativity. To be able to create and innovate novel chemistries that represent drugs.’ Nevertheless, he hopes to take the first AI-generated drug to the clinic within five years.

Also using machine learning to look in unexpected places is Hosein Mohimani at Carnegie Mellon University in Pittsburgh, US. But rather than look for AMPs in proteins, he is looking for an unusual type of peptide produced by microbes, known as non-ribosomal peptides (NRPs), many of which have antimicrobial properties.

Normal proteins and peptides are produced by a cellular construct known as the ribosome, in response to genetic instructions carried by messenger RNA. Non-ribosomal peptides, as their name suggests, aren’t produced by the ribosome, but rather by special enzymes encoded in the microbial genome. These enzymes are each designed to carry a specific amino acid and, once expressed from the genome, they join these amino acids together to form a specific peptide.

Unfortunately, identifying NRPs and the regions of the genome coding for the enzymes that produce them has proved difficult. So Mohimani and his colleagues built a machine learning model to identify NRPs from a combination of genomic data and metabolomic data produced by mass spectrometry (Nature Commun., 2021, 12, 3225).

‘We used machine learning approaches to match the signals of a microbe’s metabolites with its genomic signals and identify which likely correspond to a novel antimicrobial peptide,’ explains Mohimani. ‘Using artificial intelligence and machine learning to integrate the two data types, our methods can automatically identify novel bioactive peptides across thousands of microbial samples.’

They were able to identify many NRPs from different environments, including four previously unreported NRP families from soil-associated microbes, two of which featured NRPs with antimicrobial properties, and several from microbes living within humans.

Despite their undoubted power, however, machine learning models still have limitations, especially when searching for novel antibiotic compounds. ‘One big problem is they don’t perform well on data they haven’t seen, which is an issue known as overfitting,’ says Bahar Behsaz, a project scientist in Mohimani’s team. ‘Models trained on limited databases often don’t generalise when evaluated on novel peptides, which makes them challenging to use for discovery.’

‘I think the long-term solution is going to be increasing the scale of our data,’ suggests Mohimani. ‘As the experimental technologies and our knowledge of antimicrobial peptides advances, we can use larger training datasets to resolve this issue.’

The right data

It’s not just more data that’s required, however, but more of the right kind of data. This was recently demonstrated by a team led by Bobbie‑Jo Webb‑Robertson at the Pacific Northwest National Laboratory in Richland, US, when developing a machine learning model for discovering novel antiviral peptides (AVPs). These are a subset of AMPs that target viruses and could thus form the basis for novel antiviral treatments, which have historically proved more difficult to develop than antibiotics.

Webb-Robertson had access to training data on over 1000 peptides, including known AVPs. This comprised information on many different features of the peptides, including multiple aspects of their sequence and structure. But she realised that some of these peptide features would not be useful for identifying AVPs, so decided to identify the most informative features for use as training data. To do this, she again turned to machine learning (Sci. Reports, 2020, 10, 19260).

The model she and her team developed was able to cut down the number of peptide features from 649 to a core set of just 169, with those related to peptide secondary structure proving to be most informative. In tests, she and her team used this core set to construct several different machine learning models, including SVM and ANN models, and showed that they were more accurate at identifying AVPs than models trained with the full feature set.
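The principle of trimming a feature set can be illustrated with a simple filter in Python, which ranks each feature by how well it separates the two classes and keeps only the top few. This is a generic technique chosen for brevity, not the method used in the study, and the feature names and values are invented.

```python
from statistics import mean, pstdev


def separation(values, labels):
    """Gap between the class means, measured in units of overall spread."""
    positives = [v for v, label in zip(values, labels) if label]
    negatives = [v for v, label in zip(values, labels) if not label]
    spread = pstdev(values)
    if spread == 0:
        return 0.0  # a feature that never varies carries no information
    return abs(mean(positives) - mean(negatives)) / spread


def select_features(table, labels, keep):
    """Return the names of the most informative features, best first."""
    ranked = sorted(table, key=lambda name: separation(table[name], labels),
                    reverse=True)
    return ranked[:keep]


# Invented data: one list of values per feature, one entry per peptide.
table = {
    "helix_fraction": [0.8, 0.7, 0.9, 0.2, 0.1, 0.3],
    "length": [20, 35, 28, 22, 33, 27],
    "net_charge": [4, 3, 5, 1, 0, 2],
    "constant": [1, 1, 1, 1, 1, 1],
}
is_antiviral = [True, True, True, False, False, False]

core_features = select_features(table, is_antiviral, keep=2)
print(core_features)  # ['helix_fraction', 'net_charge']
```

Features such as peptide length, which look much the same in both classes of this toy data, are dropped, leaving a smaller core set on which to train the final models.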

Still, Webb-Robertson agrees that having more of the right kind of data would help produce more accurate machine learning models. ‘The main challenge currently is that training data is very limited, so there are likely very large error bars on the probabilities,’ she says. In the race to develop novel antibiotics, training will clearly be key.