a16z Podcast - Journal Club: Finding New Antibiotics with Machine Learning, What Coronavirus Structures Tell Us

Episode Date: April 26, 2020

a16z Journal Club (part of the a16z Podcast) curates and covers recent advances from the scientific literature -- what papers we’re reading, and why they matter from our perspective at the intersection of biology & technology (for bio journal club). This inaugural episode covers 2 different topics, in discussion with Lauren Richardson:

0:26 #1 identifying new antibiotics through a novel machine-learning based approach -- a16z general partner Vijay Pande and bio deal partner Andy Tran discuss the business of pharma; the specific methods/how it works; and other applications for deep learning in drug discovery and development based on this paper: "A Deep Learning Approach to Antibiotic Discovery" in Cell (February 2020), by Jonathan Stokes, Kevin Yang, Kyle Swanson, Wengong Jin, Andres Cubillos-Ruiz, Nina Donghia, Craig MacNair, Shawn French, Lindsey Carfrae, Zohar Bloom-Ackermann, Victoria Tran, Anush Chiappino-Pepe, Ahmed Badran, Ian Andrews, Emma Chory, George Church, Eric Brown, Tommi Jaakkola, Regina Barzilay, James Collins

11:43 #2 characterizing the novel coronavirus causing the COVID-19 pandemic -- a16z bio deal partner Judy Savitskaya shares what we can learn from the protein structures; the relationship to the 2002-2004 SARS epidemic; and more based on these two research articles: "Structure, Function, and Antigenicity of the SARS-CoV-2 Spike Glycoprotein" in Cell (April 2020), by Alexandra Walls, Young-Jun Park, M. Tortorici, Abigail Wall, Andrew McGuire, David Veesler; "Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation" in Science (March 2020), by Daniel Wrapp, Nianshuang Wang, Kizzmekia Corbett, Jory Goldsmith, Ching-Lin Hsieh, Olubukola Abiona, Barney Graham, Jason McLellan

You can find these episodes at a16z.com/journalclub.

Transcript
Starting point is 00:00:00 Hello, and welcome to the inaugural episode of the A16Z Journal Club. I'm Lauren Richardson, one of our bio editors, and in this episode, we'll cover two topics. First, a novel machine learning-based approach to identify new antibiotics, and second, we'll discuss two articles characterizing the novel coronavirus causing the current pandemic. Journal Club will cover a variety of articles every few weeks, so stay tuned here, and we'll announce its own feed soon. First up is my conversation with A16Z general partner Vijay Pande and deal partner on the bio team Andy Tran. We dive into "A Deep Learning Approach to Antibiotic Discovery" by Jonathan Stokes, Regina Barzilay, James Collins, and colleagues.
Starting point is 00:00:41 In this article, published in Cell, the authors create a novel machine learning-based method to identify new antibiotic drugs from two large databases. They then validated one of their candidates, a drug named halicin, showing that it has excellent antibiotic properties, both in vitro and in two different mouse models of bacterial infection. Excitingly, halicin has a distinct structure and appears to have a distinct mechanism of action from other antibiotics, which is important given the problem of antibiotic resistance and the need to find new drugs. Our discussion of the paper covers the business of antibiotics, the methods and how deep learning can identify novel drug structures, and other applications for deep learning in drug discovery
Starting point is 00:01:24 and development. But we begin with what made this paper appeal to us. And the first voice you'll hear is Vijay's. A huge takeaway from this article was the breadth of experimental work that was done to demonstrate the accuracy of the predictions involved. And so while there has been a lot of work about using neural nets for predicting drug compounds, this was probably one of the landmark examples to date of a predictive approach validated experimentally for something
Starting point is 00:01:55 non-trivial in terms of function. Typically, doing something in vitro is pretty straightforward, but going to in vivo models was an important step forward to convince, especially drug hunters and experts in the field that this has real validity. So I think putting all those pieces together, I think is what really made this paper stand out. Why do you think it is that it's so difficult to identify new antibiotics?
Starting point is 00:02:17 What makes it a challenge is not only the scientific side, but then also the business side. Not only are these antibiotics complex to develop, but the most innovative new products also cannot be sold freely. They are put on the shelf in reserve for more serious cases, and they're actually dubbed drugs of last resort. And so all of these scientific and business headwinds actually make something like synthesizing a brand-new antibiotic really challenging to do. Yeah, and in most indications, if you create a drug and it works better than other drugs, that immediately becomes your top choice when you're prescribing, but for antibiotics, they actually get put to the end of the line because they want to save them for when all the other drugs that are already gaining resistance fail completely.
Starting point is 00:03:03 Well, and then given resistance, I think the nightmare scenario that we're all worried about is that we don't have drugs that work as the last lines of defense go, and we don't have any mechanism for creating new ones. The business side is very critical here, because if there can't be a way to be rewarded for making drugs, it's just hard to put hundreds of millions of dollars into doing it. It's almost like the business of making blockbuster movies, that you need to have a drug that will make enough money on the other side to be able to support all the effort that goes into it,
Starting point is 00:03:34 all the R&D effort as well as all the effort running clinical trials. Whereas in this case, you know, once you can actually bring down the cost, at least to get something out of preclinical quickly, now you actually have the opportunity to go after indications that aren't blockbusters, that are small ones. It's kind of almost like the shift where, you know, you have certain things that are still going to be done with, like, Marvel movies and so on, but there's a long tail of YouTubers who actually can come up with interesting content. This could be academic labs, startups maybe funded through
Starting point is 00:04:06 philanthropy, maybe governments. There may be now the beginnings of the potential for a long tail of new drugs coming out that go after indications that will not be blockbusters but that will still have a huge fundamental impact on humanity. There are a lot of interesting different models we could talk about here for how technology like this, coupled with new business approaches that are now possible because of techniques like this, could make a huge change in our ability to develop novel antibiotics. Yeah. And it would be a huge step for the industry at large once we have more of these methods democratized for the broader scientific community. Why don't you break down for me the machine learning approach that they used here, and what kind
Starting point is 00:04:45 of advance does this represent? Yeah, sure. They took this machine learning model that they made and they trained it on about 2,500 molecules and used that to train binary classification models to predict the probability of whether a new compound would inhibit the growth of E. coli or not. And then they turned to the Drug Repurposing Hub library, a library of 6,000 compounds that are already in human clinical development
Starting point is 00:05:07 for a wide variety of indications. And at this point, they compared several different machine learning models, and after narrowing down those molecules and actually predicting toxicity using different neural networks, they came up with this particular molecule, halicin. And then thirdly, lastly, in the process, they went on to apply this machine learning model, after iteration and optimization, to a much broader set, the ZINC15 dataset, with over a billion and a half structures. Now, on the machine learning side, what's key here is that the deep learning network that
Starting point is 00:05:40 they used didn't really rely on any preset information about the chemical structures of the molecules. It actually built new representations, as they were called. For years, a lot of people represented molecules with these fingerprint vectors that, you know, reflected things like the presence or absence of functional groups, or descriptors and computable molecular properties. But relying on known fingerprints didn't really work that well. And that's why, you know, a lot of the old antibiotic screening process gives you a lot of the same classes of molecules over and over again. And what they did here is they actually have these fingerprint descriptors that were built from scratch. Well, you know, what you're describing is still a fingerprint, right? It's a one-dimensional vector to describe
Starting point is 00:06:18 molecules. I think perhaps what's different is that in a deep learning approach, you can try to infer what the right descriptor should be. That's the hallmark of all of the deep learning approaches for drug design, and of deep learning in general. You know, recall, even when we're just talking about convolutional neural networks for image recognition, the idea with CNNs for image recognition versus classical computer vision is that in the classical approach, the person sort of defines what the right features are. And so similarly, you know, what's interesting is that you can feed any representation of a molecule into a computer, but which parts are the interesting ones?
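[Editor's sketch, not part of the episode: the hand-crafted fingerprint idea described above can be illustrated in a few lines of Python. The feature names, the SMILES substrings, and the use of plain substring matching are all illustrative assumptions. Real cheminformatics toolkits do proper substructure matching, and the model in the paper learned its own representation instead of using a fixed list like this one.]

```python
from typing import Dict, List

# Hypothetical, human-chosen features: one bit per SMILES substring.
# Substring matching is a crude stand-in for real substructure search.
FEATURES: Dict[str, str] = {
    "carboxylic_acid": "C(=O)O",
    "nitro_group": "[N+](=O)[O-]",
    "nitrogen": "N",
    "sulfur": "S",
}


def fingerprint(smiles: str) -> List[int]:
    """Return a fixed-length 0/1 vector: 1 if the feature's substring occurs."""
    return [1 if pattern in smiles else 0 for pattern in FEATURES.values()]


acetic_acid = fingerprint("CC(=O)O")
propionic_acid = fingerprint("CCC(=O)O")
glycine = fingerprint("NCC(=O)O")

# Two different molecules collapse onto the same vector because none of the
# chosen features tells them apart.
print(acetic_acid == propionic_acid)
print(glycine)
```

[The point of the toy is the limitation: whatever the person did not think to encode is invisible to the downstream model, which is the gap a learned representation is meant to close.]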
Starting point is 00:06:54 You can have, just like old-school computer vision, you could have a human being say, ah, these are the important ones. But a beauty of a DNN approach, which is used here, but also in many precursor works, is that the DNN helps understand what are the key aspects and what are the interesting ones. And that is really, I think, the big difference between what you can get in modern machine learning with deep learning versus classical machine learning with, like, random forests or something like that. Right. So deep learning helps us figure out what we don't know versus focusing only on what we already know or what we think we know. What makes halicin an attractive candidate for
Starting point is 00:07:30 further research and development? What are some of the properties that they discovered? Without a doubt, it's certainly a really potent inhibitor of E. coli. And further investigations showed that halicin has strong growth inhibitory effects on a wide phylogenetic spectrum of pathogens. They tried it on C. diff and A. baumannii, which is, you know, one of the highest-priority pathogens for which new antibiotics are urgently required. And then, more interestingly, it was even able to eradicate E. coli persister cells that remained after ampicillin treatment. So pretty strong efficacy and pretty low tox based on their screen. It also checks the box of something that is really structurally divergent from conventional antibiotics. And so certainly
Starting point is 00:08:13 a very powerful new class of antibiotics that could potentially be a strong candidate for further development. Yeah, the fact that they found the antibiotic, they showed it worked in vitro, they showed it worked in vivo, and then they also did some experiments to get at its mechanism of action, suggesting that halicin selectively disrupts the pH potential
Starting point is 00:08:31 across the bacterial membrane. This saps the proton motive force, which is like the battery of the cell. So like all antibiotics, it's disrupting an essential cellular function, but this appears to be a distinct and new function that's being targeted. That's right. It's like super elegant work.
Starting point is 00:08:47 Such a complete, well-rounded story. Yeah, well, I mean, I think one of the things that really stands out here is that full stack of experiments that they've done, where it goes all the way from looking at the MIC in a dish to going through mice. And one of the appealing things about studying antibiotics, and this often even pertains to antivirals, is that the animal models are pretty good, with something like Alzheimer's on the far extreme, where animal models are generally not very good. And so it's appealing that one could do all this
Starting point is 00:09:16 with probably not requiring a huge budget and therefore get something on the other side that looks kind of intriguing. Beyond initial discovery, can this kind of machine learning-based approach be applied to other aspects of either early or late-stage drug development? Yeah, it's a broad topic, but that's the fun thing about it. There's a reason why it's a broad topic
Starting point is 00:09:34 because there's a broad range of things it can do. I mean, you could talk about identifying targets, and there's a lot of work to do there on novel targets. It's a really interesting time to go after novel targets. You could talk about identifying leads. And this is a lot of basically what's been done here, the identification of leads and then the testing of them. These compounds from ZINC are leads, but presumably they're not drug-like, so they have to be optimized. So there are these types of methods helping in lead optimization. And then along the way, hopefully you'd want to also be screening for tox.
Starting point is 00:10:03 And so there's a ton of methods that are getting really surprisingly accurate. And basically, the beautiful thing about a machine learning approach like this is that the approach for the most part is pretty agnostic to what you're predicting and that the sort of processes you're building up can be useful. One last thing, and this is maybe the holy grail dream, is that if you're predicting a lot of properties for a lot of different systems with a whole bunch of molecules in some multitask-like framework where one model is predicting all of it, you can learn from all of it. Even though you might not have a lot of data in any single project or any single area here, the sum of all this data now is huge and helps to regularize your predictions to make them less overfit and more robust, such that the predictive capability that emerges from all of that is better than what any one would be on its own. I think that really is the big, big future, is taking the fact that there is this breadth of what
Starting point is 00:10:58 it can do and not just recognizing all those possibilities, but using all those possibilities to actually improve any one of those predictions. Thanks, Andy and Vijay, for joining me for this first segment. So to quickly wrap up on a high level, there are two takeaways from this article. First, it uncovers new candidates for future development as antibiotics, most critically halicin, which was rigorously validated. Second, it demonstrates the ability of deep neural networks to make accurate predictions for drug lead identification in a mechanism-agnostic manner.
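[Editor's sketch, not part of the episode: the train-then-screen workflow discussed in this segment (fit a binary classifier on molecules labeled as growth-inhibiting or not, then rank an unseen library by predicted probability) can be outlined as below. This is a minimal illustration under loud assumptions: a plain logistic regression is swapped in for the paper's deep neural network, and the three-bit feature vectors, labels, and candidate names are invented toy values, not data from the study.]

```python
import math
from typing import Dict, List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights: List[float], bias: float, x: List[int]) -> float:
    """Predicted probability that a molecule with features x is active."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train(X: List[List[int]], y: List[int],
          lr: float = 0.5, epochs: int = 500) -> Tuple[List[float], float]:
    """Fit logistic regression with per-example gradient descent."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            error = predict(weights, bias, x) - label
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def rank(library: Dict[str, List[int]], weights: List[float],
         bias: float) -> List[Tuple[str, float]]:
    """Score every library molecule and sort from most to least promising."""
    scored = [(name, predict(weights, bias, x)) for name, x in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy training set (hypothetical): label 1 means "inhibits growth".
TRAIN_X = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [0, 1, 0], [0, 0, 1], [0, 1, 1]]
TRAIN_Y = [1, 1, 1, 0, 0, 0]

# Toy screening library (hypothetical names and features).
LIBRARY = {
    "candidate_A": [1, 1, 1],
    "candidate_B": [0, 1, 1],
    "candidate_C": [0, 0, 0],
}

weights, bias = train(TRAIN_X, TRAIN_Y)
ranked = rank(LIBRARY, weights, bias)
for name, probability in ranked:
    print(name, round(probability, 3))
```

[Only the top of such a ranking would go on to the bench. The model narrows the search, and the in vitro and in vivo experiments described in the conversation are what validate a candidate.]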
Starting point is 00:11:27 This is a broadly useful approach with huge potential to shape the future of drug discovery and development, and further highlights the increasingly important role of AI in medicine. If you enjoyed this conversation, check out our podcast featuring senior author Jim Collins called All About Synthetic Biology. In this next segment, A16Z bio deal partner Judy Savitskaya and I discuss two articles on the novel coronavirus causing the COVID-19 pandemic. The first article is "Structure, Function, and Antigenicity of the SARS-CoV-2 Spike Glycoprotein" by Alexandra Walls, David Veesler, and colleagues, published in Cell. We refer to this article throughout as Walls et al. or the Walls paper.
Starting point is 00:12:08 The second article is "Cryo-EM Structure of the 2019-nCoV Spike in the Prefusion Conformation" by Daniel Wrapp, Nianshuang Wang, Jason McLellan, and colleagues, published in Science. We refer to this article as Wrapp et al. Note that these two articles use different names for the novel coronavirus, SARS-CoV-2 versus 2019-nCoV, which was the original name for the novel coronavirus, but it has now been settled officially as SARS-CoV-2. And this is a bit confusing because the virus that caused the SARS epidemic (SARS is an acronym for severe acute respiratory syndrome), primarily in Asia in 2002-2004, is called SARS-CoV.
Starting point is 00:12:52 They're both coronaviruses and they're closely related. And each of these articles that we discuss compares SARS-CoV, which we informally refer to as 2002 SARS, with SARS-CoV-2, which we refer to as 2019 SARS. Finally, one last bit of important context to know. The two papers also have similar experimental designs where they investigate how the spike protein binds to the host cell and whether antibodies that recognize 2002 SARS spike protein will recognize 2019 SARS spike protein. This is important because it helps guide vaccine and drug design. Since the results of the two papers are not identical, we discuss why this might be and what we can learn. But first, we start with the
Starting point is 00:13:36 similarities. Both of these articles are comparing and contrasting the spike proteins on the current coronavirus with the 2002 SARS virus. It's really important to know the structure of the spike protein because that's the part of the virus that is actually going to bind to the host cell. It's really important to know where it binds on the host cell, what protein it's using, which has been discovered to be ACE2. That's the same protein that the 2002 SARS virus used to infect human cells. So once we know the structure of this spike protein, we can start to understand how it's actually entering cells
Starting point is 00:14:12 and have some hypotheses for what types of drugs or what particular molecules could be used to disrupt that interaction and actually treat this disease or prevent the ability of the virus to initiate an infection into new cells. The spike protein is also really important because it's exposed to the immune system and so that's where antibodies will probably bind to the virus.
Starting point is 00:14:35 So speaking of the immune system, one of the really interesting aspects of these viruses is that the spike proteins are covered in different sugar-like molecules. And those are actually sometimes called a glycan shield, which is glycan for sugars and then shield because they actually prevent the immune system from recognizing the viral proteins as easily as they otherwise could. Both of these papers showed that there are a lot of glycans on the spike protein of the 2019 SARS virus, but that they actually differ very slightly from the 2002 SARS virus. Overall, the structure was super similar. But there is one key difference. Do you want to talk
Starting point is 00:15:11 about the furin cleavage site? This cleavage is a really key part of the mechanism for viral entry. So the mechanism for the virus to get into the cell is actually a sequence of about four steps. The first is the virus arriving at the lungs. Once it arrives at the lungs, the spike proteins will bind to the ACE2 receptor, which has been now shown in multiple different papers. And then proteins that are on the surface of the cell called proteases will reach over and cleave a piece of this spike protein. And that helps to pull the virus in closer to the cell to the point where a fusion can occur between the membranes. And so at that point, the virus is inside of the cell and it can replicate. What has been found in the SARS 2019
Starting point is 00:15:59 virus versus the 2002 virus is that there's an additional cleavage site, an additional set of amino acids in the protein that are cleaved, and that's called a furin site, and that is not present in the 2002 virus. So there are some hypotheses out there that this furin cleavage occurs first, and it actually enhances the probability and sort of the success of the second cleavage and then subsequent viral entry. So there are some theories that the reason this virus is actually more effective at infecting cells than the 2002 SARS virus is because of this furin cleavage site. I think that's so interesting. And one of the facts mentioned in the discussion of both papers was that a similar furin cleavage site is seen in an analogous protein
Starting point is 00:16:43 in highly virulent avian and human influenza strains. Let's talk more about how the virus binds the host cell and what happens next. Tell me about this ACE2 molecule. ACE2 is a really common protein found on the surface of many different cell types in the human body. It stands for angiotensin-converting enzyme 2. And this is a receptor that is involved in regulating blood pressure. So it has nothing to do with viral entry. And it definitely didn't evolve for that. But many of these viruses have found ACE2 to be a sort of convenient receptor that they can co-opt to be able to get into the cell. And I think one of the reasons for this is that the protease activity that the virus is using, this sort of cut-and-pull-in mechanism that we talked about
Starting point is 00:17:27 before, that's part of the natural mechanism of how ACE2 functions. What's important here is that this receptor is the exact same one that was discovered to be involved in the 2002 SARS virus's entry mechanism. And that's important because if we have a good understanding of how to drug the ACE2 receptor following on the 2002 SARS work, we can potentially apply that to treatments for this current coronavirus. Yeah, the downside of that would be that since ACE2 is involved in extremely critical human physiology, it might be almost impossible to drug without having some kind of adverse side effect. Yeah. Both papers look at the strength of the interaction between 2019 SARS and 2002 SARS with the ACE2 receptors. What did they find here? So using different
Starting point is 00:18:17 methods, the Walls paper found that the 2019 SARS virus and the 2002 SARS virus bind to ACE2 with roughly the same affinity, but then Wrapp et al. showed that the 2019 SARS virus binds with 10 to 20 times higher affinity, which is a huge difference. They conjecture that this tighter binding is part of what makes it so virulent because it's able to more successfully get into cells following that binding event. It's interesting, you know, they published these studies one right after the other, and they don't discuss why this might be different. Yeah, and I think it's hard to say. They're using different techniques and this science is happening so live. It's actually really cool to think about this from the perspective of
Starting point is 00:19:00 being at the bench. You're working on something and if a paper comes out that contradicts what you're finding while you're still working on it, that affects your work because it changes how you trust your own work and it changes what kind of data you're going to be looking for. But in this case, this is all happening so quickly that they're finding what they're finding and they're publishing it. And I think it's really cool to see it side by side. Yeah, you get to see the evolution of the field and the thinking kind of at the moment it's happening. And of course, that really bolsters the findings that do concord, like finding that ACE2 is the receptor and that the spike protein is really similar. Since those groups got those
Starting point is 00:19:36 same answers without influencing each other, that really is kind of the gold standard of replication in scientific publishing, having, you know, two independent groups finding things in parallel. Okay, let's talk about the next area where we saw differences between these two articles. So this was in the section where they looked at whether antibodies against 2002 SARS can bind to 2019 SARS. There were key differences between these two papers in the types of antibodies that they were looking at. In the Walls et al. paper, they are looking at polyclonal antibodies, which were generated by injecting mice with the purified S protein from the virus. So the mice's immune system saw this S protein. It generated antibodies in response. The researchers then took a blood sample and purified the sera. Those sera contain all sorts of
Starting point is 00:20:32 different antibodies that bind to all different regions of the spike protein. Wrapp et al. used three monoclonal antibodies, which had already been previously characterized as binding to just a subdomain within the 2002 SARS spike protein. So whereas in the Walls et al. paper, there's all sorts of antibodies in this mix that can bind all over the spike protein, in Wrapp et al. these are three highly specific antibodies
Starting point is 00:21:01 that bind to three highly specific sites in just one tiny portion. There's also a big difference in what they measure. Using the monoclonal antibodies in Wrapp, they're measuring just binding. So do the monoclonal antibodies bind to the spike protein at all? Whereas in Walls, what they're measuring is whether the polyclonal antibody set prevents the virus from entering into human cells. This is more of a phenotypic readout. It's more of like a
Starting point is 00:21:28 does the entire mechanism that you're trying to stop not happen versus just the binding alone, which is a piece of the process. In this case, I'd say the two papers show opposite results. Walls shows that you can stop viral entry using these polyclonal antibodies and Wrapp shows that there's no binding of the monoclonal antibodies from 2002 SARS to the 2019 SARS virus. They're opposite results, but they're not necessarily mutually exclusive results, because Wrapp et al. is only looking at three antibodies. You know, there's an absolutely enormous number of possible antibodies against these viruses and even just the spike protein.
Starting point is 00:22:09 So the fact that three didn't work, you know, that could just be scientific bad luck. There's also the fact that we read other papers to prepare for this segment that supported both the Wrapp conclusion that there is no cross-reactivity and the Walls conclusion that there is cross-reactivity. So the jury's still very much out, and with all these studies using slightly different designs, different sources of antibodies, different kinds of antibodies, different readouts, this is still definitely evolving. And part of what makes the research right now so interesting is that so many groups are applying their expertise, their backgrounds, their favorite techniques to this problem. And my gut feeling is that there's some kind of nuance to the cross-reactivity here that we just haven't appreciated yet. Why do you think they basically ended up doing the same set of experiments? I think that what we have here is a demonstration of the types of science that you can get data on really quickly. In the past, to get a structure of
Starting point is 00:23:13 this quality, you would have had to do x-ray crystallography, which requires first getting the protein to crystallize in this very specific manner, and it's a tricky thing to do. But with cryo-EM, you just get the protein sample super, super cold, and shoot a beam of electrons off it. Then determine the structure based off how the electrons bounced off. There has been a real revolution in cryo-EM recently, making this method capable of determining the structure of very large protein complexes. Also, the results of both of these papers are in vitro. So we're getting really important information that can help vaccine design and therapeutics design. These are also the fastest type of research that can be done on these.
Starting point is 00:23:53 As the publication cycles continue and the research continues, we'll get different insights coming out with experiments that take longer and that require mouse work or that require kind of more complex experimental design. Seeing these two papers side by side really speaks to what can be done, what can we learn that's important and very valuable in the shortest amount of time. Thanks, Judy, for discussing these articles with me. And that's it for Journal Club this week.
Starting point is 00:24:20 You can find all episodes at A16Z.com. Thanks for listening.
