The a16z Show - Journal Club: Finding New Antibiotics with Machine Learning, What Coronavirus Structures Tell Us

Starting point is 00:00:00 Hello and welcome to the inaugural episode of the a16z Journal Club. I'm Lauren Richardson, one of our bio editors, and in this episode, we'll cover two topics. First, a novel machine learning-based approach to identify new antibiotics, and second, we'll discuss two articles characterizing the novel coronavirus causing the current pandemic. Journal Club will cover a variety of articles every few weeks, so stay tuned here, and we'll announce its own feed soon. First up is my conversation with a16z general partner Vijay Pande and deal partner on the bio team Andy Tran. We dive into "A Deep Learning Approach to Antibiotic Discovery" by Jonathan Stokes, Regina Barzilay, James Collins, and colleagues.

Starting point is 00:00:41 In this article, published in Cell, the authors create a novel machine learning-based method to identify new antibiotic drugs from two large databases. They then validated one of their candidates, a drug named halicin, showing that it has excellent antibiotic properties, both in vitro and in two different mouse models of bacterial infection. Excitingly, halicin has a distinct structure and appears to have a distinct mechanism of action from other antibiotics, which is important given the problem of antibiotic resistance and the need to find new drugs. Our discussion of the paper covers the business of antibiotics, the methods, how deep learning can identify novel drug structures, and other applications for deep learning in drug discovery

Starting point is 00:01:24 and development. But we begin with what made this paper appeal to us. And the first voice you'll hear is Vijay's. A huge takeaway from this article was the breadth of experimental work that was done to demonstrate the accuracy of the predictions involved. And so while there has been a lot of work about using neural nets for predicting drug compounds, this was probably one of the landmark examples to date of a predictive approach validated experimentally for something

Starting point is 00:01:55 non-trivial in terms of function. Typically doing something in vitro is pretty straightforward, but going to in vivo models was an important step forward to convince, especially, drug hunters and experts in the field that this has real validity. So putting all those pieces together, I think, is what really made this paper stand out. Why do you think it is that it's so difficult to identify new antibiotics? What makes it a challenge is not only the scientific side, but then also the business side. Not only are these antibiotics complex to develop, but the most innovative new products also cannot be sold freely. They are put on the shelf in reserve for more serious cases.

Starting point is 00:02:32 And they're actually dubbed drugs of last resort. And so all of these scientific and business headwinds actually make something like, you know, synthesizing a brand-new antibiotic really challenging to do. Yeah, and in most indications, if you create a drug and it works better than other drugs, that immediately becomes your top choice when you're prescribing. But antibiotics actually get put at the end of the line, because they want to save them for when all the other drugs that are already gaining resistance fail completely. Well, and then given resistance, I think the nightmare

Starting point is 00:03:05 scenario that we're all worried about is that we don't have drugs that work as the last lines of defense go, and we don't have any mechanism for creating new ones. The business side is very critical here, because if there can't be a way to be rewarded for making drugs, it's just hard to put hundreds of millions of dollars into doing it. It's almost like the business of making blockbuster movies, in that you need to have a drug that will make enough money on the other side to be able to support all the effort that goes into it, all the R&D effort as well as all the effort running clinical trials.

Starting point is 00:03:39 Whereas in this case, you know, once you can actually bring down the cost, at least to get something out of preclinical quickly, now you actually have the opportunity to go after indications that aren't blockbusters, that are small ones. It's kind of almost like this shift where, you know, you have certain things that are still going to be done with, like, Marvel movies and so on, but there's a long tail of YouTubers who actually can come up with interesting content. This could be academic labs, startups, maybe funded through philanthropy, maybe governments. There may be now the beginnings of the potential for a long tail of new

Starting point is 00:04:12 drugs coming out that go after indications that will not be blockbusters but that will still have a huge fundamental impact on humanity. There are a lot of interesting different models we could talk about here for how technology like this, coupled with new business approaches that are now possible because of techniques like this, could make a huge change in our ability to develop novel antibiotics. Yeah, and it would be a huge step for the industry at large once we have more of these methods democratized for the broader scientific community. Why don't you break down for me the machine learning approach that they used here and what kind of advance this represents? Yeah, sure. They took this machine learning model that they made and they trained it on about

Starting point is 00:04:50 2,500 molecules and used that to train binary classification models to predict the probability of whether a new compound would inhibit the growth of E. coli or not. They then turned to the Drug Repurposing Hub library, a library of 6,000 compounds that are already in human clinical development for a wide variety of indications. And at this point, they compared several different machine learning models. And after narrowing down those molecules and actually predicting toxicity using different neural networks, they came up with this particular molecule, halicin. And then thirdly, lastly, in the process, they went on to apply this machine learning model

Starting point is 00:05:28 after iteration and optimization to a much broader set, the ZINC15 data set, with over a billion and a half structures. Now, on the machine learning side, what's key here is that the deep learning network that they used didn't really rely on any preset information about the chemical structures of the molecules. It actually built new representations, as they were called. For years, a lot of people represented molecules

Starting point is 00:05:49 with these fingerprint vectors, which, you know, reflected things like the presence or absence of functional groups, or descriptors and computable molecular properties. But relying on known fingerprints didn't really work that well. And that's why, you know, a lot of the old antibiotic screening process gives you a lot of the same classes of molecules over and over again. And what they did here is they actually have these fingerprint descriptors that were built from scratch. Well, you know, what you're describing is still a fingerprint, right? It's a one-dimensional vector to describe molecules. I think perhaps what's different is that in a deep learning approach, you can try to infer what the right descriptors should be.
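To make the classical fingerprint idea described above concrete, here is a minimal sketch in Python. It is not the method used in the paper. It encodes each molecule as a one-dimensional vector of 0s and 1s marking the presence or absence of hand-picked features. The feature list, the example molecules, and the `fingerprint` and `tanimoto` helpers are all hypothetical and chosen only for illustration; Tanimoto similarity is one common way to compare such bit vectors.

```python
# Hypothetical hand-picked features; a real fingerprint would use far more bits.
FEATURES = ["aromatic_ring", "hydroxyl", "carboxylic_acid", "amine", "halogen"]


def fingerprint(present_features):
    """Return a 0/1 vector marking which FEATURES a molecule contains."""
    present = set(present_features)
    return [1 if name in present else 0 for name in FEATURES]


def tanimoto(fp_a, fp_b):
    """Shared on-bits divided by all distinct on-bits (0.0 if both are empty)."""
    both = sum(a & b for a, b in zip(fp_a, fp_b))
    either = sum(a | b for a, b in zip(fp_a, fp_b))
    return both / either if either else 0.0


# Made-up molecules: the feature assignments are for illustration only.
molecule_a = fingerprint(["carboxylic_acid", "amine", "halogen"])
molecule_b = fingerprint(["carboxylic_acid", "amine"])
molecule_c = fingerprint(["aromatic_ring", "hydroxyl", "halogen"])

print(molecule_a)
print(tanimoto(molecule_a, molecule_b))  # many shared features
print(tanimoto(molecule_a, molecule_c))  # few shared features
```

In this sketch a person decides up front which features count. The deep learning approach discussed here instead tries to infer from the training data which descriptors matter.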

Starting point is 00:06:26 That's the hallmark of all of the deep learning approaches for drug design and deep learning in general. Recall that, even when we're just talking about convolutional neural nets for image recognition versus classical computer vision, in the classical approach the person sort of defines what the right features are. And so similarly, you know, what's interesting is that you can feed any representation of a molecule into a computer, but which parts are the interesting ones?

Starting point is 00:06:54 You can have, just like old-school computer vision, a human being say, ah, these are the important ones. But the beauty of a DNN approach, which is used here but also in many precursor works, is that the DNN helps understand what the key aspects are and which are the interesting ones. And that is really, I think, the big difference between what you can get with modern deep learning versus classical machine learning with, like, random forests or something like that. Right. So deep learning helps us figure out what we don't know versus focusing only on what we already know or what we think we know. What makes halicin an attractive candidate for further research and development? What are some of the properties that they discovered?

Starting point is 00:07:33 Without a doubt, it's certainly a really potent inhibitor of E. coli. And further investigations showed that halicin has strong growth inhibitory effects on a wide phylogenetic spectrum of pathogens. They tried it on C. diff and A. baumannii, which is, you know, one of the highest-priority pathogens for which new antibiotics are urgently required. And then, more interestingly, it was even able to eradicate E. coli persister cells that remained after ampicillin treatment. So pretty strong efficacy and pretty low tox based on their screen. It also checks the box of something that is really structurally divergent from conventional antibiotics. And so certainly a very powerful new class of antibiotics that could potentially be, you know, a strong candidate for further development.

Starting point is 00:08:19 Yeah, the fact that they found the antibiotic, they showed it worked in vitro, they showed it worked in vivo, and then they also did some experiments to get at its mechanism of action, suggesting that halicin selectively disrupts the pH potential across the bacterial membrane. This saps the proton motive force, which is like the battery of the cell. So like all antibiotics, it's disrupting an essential cellular function, but this appears to be a distinct and new function that's being targeted. That's right. It's like super elegant work. Such a complete, well-rounded story. Yeah, well, I mean, I think one of the things that really stands out here is that full

Starting point is 00:08:53 stack of experiments that they've done, where it goes all the way from looking at the MICs in a dish to going through mice. And one of the appealing things about studying antibiotics, and this often even pertains to antivirals, is that the animal models are pretty good, with something like Alzheimer's on the far extreme, where animal models are generally not very good. And so it's appealing that one could do all this, you know, with probably not requiring a huge budget and therefore get something on the other side that looks kind of intriguing. Beyond initial discovery, can this kind of machine learning-based approach be applied to other aspects of either early- or

Starting point is 00:09:28 late-stage drug development? Yeah, it's a broad topic, but that's the fun thing about it. There's a reason why it's a broad topic: there's a broad range of things it can do. I mean, you could talk about identifying targets, and there's a lot of work to do there. And especially novel targets; it's a really interesting time to go after novel targets. You could talk about identifying leads. And this is a lot of basically what's been done here, the identification of leads and then the testing of them. These compounds from ZINC are leads, but presumably they're not drug-like, so they have to be optimized. So there's these

Starting point is 00:09:56 types of methods helping in lead optimization. And then along the way, hopefully you'd want to also be screening for tox. And so there's a ton of methods that are getting really surprisingly accurate. And basically, the beautiful thing about a machine learning approach like this is that the approach for the most part is pretty agnostic to what you're predicting, and that the sort of processes you're building up can be useful. One last thing,

Starting point is 00:10:18 and this is maybe the holy grail dream, is that if you're predicting a lot of properties for a lot of different systems with a whole bunch of molecules in some multitask-like framework where one model is predicting all of it, you can learn from all of it. And that you develop, even

Starting point is 00:10:34 though you might not have a lot of data in any single project or any single area here, the sum of all this data now is huge and helps to regularize your predictions to make them less overfit and more robust, such that the predictive capability that emerges from all of that is better than what any one would be individually. I think that really is the big, big future: taking the fact that there is this breadth of what it can do and not just recognizing all those possibilities, but using all those possibilities to actually improve any one of those predictions. Thanks, Andy and Vijay, for joining me for this first segment. So to quickly wrap up on a high level, there are two takeaways from

Starting point is 00:11:10 this article. First, it uncovers new candidates for future development as antibiotics, most critically halicin, which was rigorously validated. Second, it demonstrates the ability of deep neural networks to make accurate predictions for drug lead identification in a mechanism-agnostic manner. This is a broadly useful approach with huge potential to shape the future of drug discovery and development, and it further highlights the increasingly important role of AI in medicine. If you enjoyed this conversation, check out our podcast featuring senior author Jim Collins called All About Synthetic Biology. In this next segment, a16z bio deal partner Judy Savitskaya and I discuss two articles on the novel coronavirus causing the COVID-19 pandemic. The first article

Starting point is 00:11:55 is "Structure, Function, and Antigenicity of the SARS-CoV-2 Spike Glycoprotein" by Alexandra Walls, David Veesler, and colleagues, published in Cell. We refer to this article throughout as Walls et al. or the Walls paper. The second article is "Cryo-EM Structure of the 2019-nCoV Spike in the Prefusion Conformation" by Daniel Wrapp, Nianshuang Wang, Jason McLellan, and colleagues, published in Science. We refer to this article as Wrapp et al. Note that these two articles use different names for the novel coronavirus, SARS-CoV-2 versus 2019-nCoV, which was the original name for the novel coronavirus, but it has now been settled

Starting point is 00:12:36 officially as SARS-CoV-2. And this is a bit confusing because the virus that caused the pandemic SARS, which is an acronym for severe acute respiratory syndrome, primarily in Asia in 2002-2004, is called SARS-CoV. They're both coronaviruses and they're closely related. And each of these articles that we discuss compares SARS-CoV, which we informally refer to as 2002 SARS, with SARS-CoV-2, which we refer to as 2019 SARS. Finally, one last bit of important context to know. The two papers also have similar experimental designs, where they investigate how the spike protein binds to the host cell and whether antibodies that recognize 2002 SARS spike protein

Starting point is 00:13:23 will recognize 2019 SARS spike protein. This is important because it helps guide vaccine and drug design. Since the results of the two papers are not identical, we discuss why this might be and what we can learn. But first, we start with the similarities. Both of these articles are comparing and contrasting the spike proteins on the current coronavirus with the 2002 SARS virus. It's really important to know the structure of the spike protein because that's the part of the virus that is actually going to bind to the host cell.

Starting point is 00:13:54 It's really important to know where it binds in the host cell and what protein it's using, which has been discovered to be ACE2. That's the same protein that the 2002 SARS virus used to infect human cells. So once we know the structure of this spike protein, we can start to understand how it's actually entering cells and have some hypotheses for what types of drugs or what particular molecules could be used to disrupt that interaction and actually treat this disease

Starting point is 00:14:20 or prevent the ability of the virus to initiate an infection in new cells. The spike protein is also really important because it's exposed to the immune system, and so that's where antibodies will probably bind to the virus. So speaking of the immune system, one of the really interesting aspects of these viruses is that the spike proteins are covered in different sugar-like molecules, and those are actually sometimes called a glycan shield: glycan for sugars, and shield because they actually prevent the immune system from recognizing the viral proteins as easily as it otherwise could. Both of these papers showed that there are a lot of glycans on the

Starting point is 00:14:59 spike protein of the 2019 SARS virus, but that they actually differ very slightly from the 2002 SARS virus. Overall, the structure was super similar. But there is one key difference. Do you want to talk about the furin cleavage site? This cleavage is a really key part of the mechanism for viral entry. So the mechanism for the virus to get into the cell is actually a sequence of about four steps. The first is the virus arriving at the lungs.

Starting point is 00:15:27 Once it arrives at the lungs, the spike proteins will bind to the ACE2 receptor, which has now been shown in multiple different papers. And then proteins that are on the surface of the cell called proteases will reach over and cleave a piece of this spike protein. And that helps to pull the virus in closer to the cell to the point where a fusion can occur between the membranes. And so at that point, the virus is inside of the cell and it can replicate. What has been found in the 2019 SARS virus versus the 2002 virus is that there's an additional cleavage site, an additional set of amino acids in the protein that are cleaved.

Starting point is 00:16:08 And that's called a furin site. And that is not present in the 2002 virus. So there are some hypotheses out there that this furin cleavage occurs first, and it actually enhances the probability and sort of the success of the second cleavage and then subsequent viral entry. So there are some theories that the reason this virus is actually more effective at infecting cells than the 2002 SARS virus is because of this furin cleavage site. I think that's so interesting. And one of the facts mentioned in the discussion of both papers was that a similar furin cleavage site is seen in an analogous protein in highly virulent avian and human influenza strains. Let's talk more about how the virus binds the host cell and what happens next.

Starting point is 00:16:51 Tell me about this ACE2 molecule. ACE2 is a really common protein found on the surface of many different cell types in the human body. It stands for angiotensin-converting enzyme 2. And this is a receptor that is involved in regulating blood pressure. So it has nothing to do with viral entry and it definitely didn't evolve for that. But many of these viruses have found ACE2 to be a sort of convenient receptor that they can co-opt to be able to get into the cell. And I think one of the reasons for this is that the protease activity that the virus is using, this sort of cut-and-pull-in mechanism that we talked about before, is part of the natural mechanism of how ACE2 functions. What's important here is that this receptor is the exact same one that was discovered to be involved in the 2002 SARS

Starting point is 00:17:39 entry mechanism. And that's important because if we have a good understanding of how to drug the ACE2 receptor following on the 2002 SARS work, we can potentially apply that to treatments for this current coronavirus. Yeah, the downside of that would be that since ACE2 is involved in extremely critical human physiology, it might be almost impossible to drug without having some kind of adverse side effect. Yeah. Both papers look at the strength of the interaction of 2019 SARS and 2002 SARS with the ACE2 receptor. What did they find here? So using different methods, the Walls paper found that the 2019 SARS virus and the 2002 SARS virus bind to ACE2 with roughly the same affinity, but then Wrapp et al. showed that the 2019 SARS virus binds with 10 to 20 times higher affinity, which is a huge difference.
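To give a rough sense of what a 10- to 20-fold difference in affinity can mean, here is a small sketch using the standard one-site equilibrium binding model, in which the fraction of receptor occupied is [L] / ([L] + Kd). This is not a calculation from either paper. The dissociation constants and the spike concentration below are made-up numbers for illustration, and `fraction_bound` is a hypothetical helper.

```python
def fraction_bound(ligand_nm, kd_nm):
    """Fraction of receptor occupied at equilibrium in a simple one-site model."""
    return ligand_nm / (ligand_nm + kd_nm)


# Assumed values for illustration only; none are taken from the papers.
weaker_kd_nm = 300.0                # hypothetical lower-affinity binder
tighter_kd_nm = weaker_kd_nm / 15   # 15-fold tighter, inside the 10-20x range
spike_nm = 10.0                     # hypothetical spike protein concentration

for label, kd in [("weaker", weaker_kd_nm), ("tighter", tighter_kd_nm)]:
    occupied = fraction_bound(spike_nm, kd)
    print(f"{label}: Kd = {kd:.0f} nM, fraction of ACE2 bound = {occupied:.2f}")
```

A lower Kd means tighter binding, so under this simplified model the tighter binder occupies far more receptor at the same concentration. Real viral entry involves many more steps than this single binding event.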

Starting point is 00:18:36 They conjecture that this tighter binding is part of what makes it so virulent because it's able to more successfully get into cells following that binding event. It's interesting, you know, they're publishing these studies one right after the other. They don't discuss why this might be different. Yeah, and I think it's hard to say. They're using different techniques and this science is happening so live. It's actually really cool to think about this from the perspective of being at the bench. You're working on something, and if a paper comes out that contradicts what you're still working on, that affects your work because it changes how you trust your own work and it changes what kind of data you're going to be looking for. But in this case, this is all happening so quickly that they're finding what they're finding and they're publishing it. And I think it's really cool to see it side by side. Yeah, you get to see the evolution of the field and the thinking kind of at the moment it's happening. And of course, that really bolsters the findings that do concord, like finding that ACE2 is the receptor and that the spike protein is really similar. Since those groups got those same answers without influencing each other, that really is kind of the gold standard of replication in scientific publishing: having, you know, two

Starting point is 00:19:44 independent groups finding things in parallel. Okay, let's talk about the next area where we saw differences between these two articles. So this was in the section where they looked at whether antibodies against 2002 SARS can bind to 2019 SARS. There were key differences between these two papers in the types of antibodies that they were looking at. In the Walls et al. paper, they are looking at polyclonal antibodies, which were generated by injecting mice with the purified S protein from the virus.

Starting point is 00:20:18 So the mice's immune system saw this S protein. It generated antibodies in response. The researchers then took a blood sample and purified the sera. Those sera contain all sorts of different antibodies that bind to all different regions of the spike protein. Wrapp et al. used three monoclonal antibodies, which had already been previously characterized as binding to just a subdomain within the 2002 SARS spike protein. So in the Walls et al. paper, there are all sorts of antibodies in this mix that can bind all over the spike protein. In Wrapp et al., it's different.

Starting point is 00:20:59 These are three highly specific antibodies that bind to three highly specific sites in just one tiny portion. There's also a big difference in what they measure. Using the monoclonal antibodies in Wrapp, they're measuring just binding. So do the monoclonal antibodies bind to the spike protein at all? Whereas in Walls, what they're measuring is whether the polyclonal antibody set prevents the virus from entering into human cells. This is more of a phenotypic readout. It's more of a "does the entire mechanism that you're trying to stop not happen" versus just the binding alone, which is a piece of the process.

Starting point is 00:21:35 In this case, I'd say that the two papers show opposite results. Walls shows that you can stop viral entry using these polyclonal antibodies, and Wrapp shows that there's no binding of the monoclonal antibodies from 2002 SARS to the 2019 SARS virus. They're opposite results, but they're not necessarily mutually exclusive results, because Wrapp et al. is only looking at three antibodies. You know, there's an absolutely enormous number of possible antibodies against these viruses and even just the spike protein. So the fact that three didn't work, you know, that could just be scientific bad luck. There's also the fact that we read other papers to prepare for this segment that supported both the Wrapp conclusion that there is no cross-reactivity and the Walls

Starting point is 00:22:24 conclusion that there is cross-reactivity. So the jury's still very much out. And given that all these studies are using slightly different designs, different sources of antibodies, different kinds of antibodies, different readouts, this is still definitely evolving. And it's part of what makes the research right now so interesting: so many groups are applying their expertise, their backgrounds, their favorite techniques to this problem. And my gut feeling is that there's some kind of nuance to the cross-reactivity here that we just haven't appreciated yet. Why do you think they basically ended up doing the same set of experiments? I think that what we have here is a demonstration of the types of science that you can get data on really quickly.

Starting point is 00:23:12 In the past, to get a structure of this quality, you would have had to do X-ray crystallography, which requires first getting the protein to crystallize in this very specific manner, and it's a tricky thing to do. But with cryo-EM, you just get the protein sample super, super cold and shoot a beam of electrons at it, then determine the structure based on how the electrons bounced off. There has been a real revolution in cryo-EM recently, making this method capable of determining the structure of very large protein complexes. Also, the results of both of these papers are in vitro. So we're getting really important information that can help vaccine design and therapeutics design.

Starting point is 00:23:49 This is also the fastest type of research that can be done on these. As the publication cycles continue and the research continues, we'll get different insights coming out with experiments that take longer and that require mouse work or that require kind of more complex experimental design. Seeing these two papers side by side really speaks to what can be done and what we can learn that's important and valuable in the shortest amount of time. Thanks, Judy, for discussing these articles with me. And that's it for Journal Club this week. You can find all episodes at a16z.com. Thanks for listening.
