a16z Podcast - Journal Club: Finding New Antibiotics with Machine Learning, What Coronavirus Structures Tell Us
Episode Date: April 26, 2020
a16z Journal Club (part of the a16z Podcast), curates and covers recent advances from the scientific literature -- what papers we’re reading, and why they matter from our perspective at the intersection of biology & technology (for bio journal club). This inaugural episode covers 2 different topics, in discussion with Lauren Richardson:
0:26 #1 identifying new antibiotics through a novel machine-learning based approach -- a16z general partner Vijay Pande and bio deal partner Andy Tran discuss the business of pharma; the specific methods/ how it works; and other applications for deep learning in drug discovery and development based on this paper:
"A Deep Learning Approach to Antibiotic Discovery" in Cell (February 2020), by Jonathan Stokes, Kevin Yang, Kyle Swanson, Wengong Jin, Andres Cubillos-Ruiz, Nina Donghia, Craig MacNair, Shawn French, Lindsey Carfrae, Zohar Bloom-Ackermann, Victoria Tran, Anush Chiappino-Pepe, Ahmed Badran, Ian Andrews, Emma Chory, George Church, Eric Brown, Tommi Jaakkola, Regina Barzilay, James Collins
11:43 #2 characterizing the novel coronavirus causing the COVID-19 pandemic -- a16z bio deal partner Judy Savitskaya shares what we can learn from the protein structures; the relationship to the 2002-2004 SARS epidemic; and more based on these two research articles:
"Structure, Function, and Antigenicity of the SARS-CoV-2 Spike Glycoprotein" in Cell (April 2020), by Alexandra Walls, Young-Jun Park, M. Tortorici, Abigail Wall, Andrew McGuire, David Veesler
"Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation" in Science (March 2020), by Daniel Wrapp, Nianshuang Wang, Kizzmekia Corbett, Jory Goldsmith, Ching-Lin Hsieh, Olubukola Abiona, Barney Graham, Jason McLellan
You can find these episodes at a16z.com/journalclub.
Transcript
Hello, and welcome to the inaugural episode of the a16z Journal Club.
I'm Lauren Richardson, one of our bio-editors, and in this episode, we'll cover two topics.
First, a novel machine learning-based approach to identify new antibiotics, and second, we'll discuss
two articles characterizing the novel coronavirus causing the current pandemic.
Journal Club will cover a variety of articles every few weeks, so stay tuned here, and we'll
announce its own feed soon.
First up is my conversation with a16z general partner Vijay Pande and deal partner on the bio team Andy Tran.
We dive into "A Deep Learning Approach to Antibiotic Discovery" by Jonathan Stokes, Regina Barzilay, James Collins, and colleagues.
In this article, published in Cell, the authors create a novel machine learning-based method to identify new antibiotic drugs from two large databases.
They then validated one of their candidates, a drug named halicin, showing that it has excellent antibiotic properties,
both in vitro and in two different mouse models of bacterial infection.
Excitingly, halicin has a distinct structure and appears to have a distinct mechanism of action
from other antibiotics, which is important given the problem of antibiotic resistance
and the need to find new drugs.
Our discussion of the paper covers the business of antibiotics, the methods, and how deep learning
can identify novel drug structures, and other applications for deep learning in drug discovery
and development.
But we begin with what made this paper appeal to us.
And the first voice you'll hear is Vijay's.
A huge takeaway from this article was the breadth of experimental work that was done to demonstrate
the accuracy of the predictions involved.
And so while there has been a lot of work about using neural nets for predicting drug compounds,
this was probably one of the landmark examples to date of a predictive
approach validated experimentally for something
non-trivial in terms of function.
Typically, doing something in vitro is pretty straightforward,
but going to in vivo models was an important step forward
to convince especially drug hunters and experts in the field
that this has real validity.
So putting all those pieces together
is, I think, what really made this paper stand out.
Why do you think it is that it's so difficult to identify new antibiotics?
What makes it a challenge is not only the scientific side,
but then also the business side.
Not only are these antibiotics complex to develop, but the most innovative new products also cannot be sold freely.
They are put on the shelf in reserve for more serious cases, and they're actually dubbed drugs of last resort.
And so all of these scientific and business headwinds actually make something like synthesizing a brand-new antibiotic really challenging to do.
Yeah, in most indications, if you create a drug and it works better than other drugs, that immediately becomes your top choice when you're
prescribing, but new antibiotics actually get put at the end of the line because doctors want to
save them for when all the other drugs that are already gaining resistance fail completely.
Well, and then given resistance, I think the nightmare scenario that we're all worried about
is that we don't have drugs that work as the last lines of defense go, and we don't have any
mechanism for creating new ones. The business side is very critical here, because if there can't
be a way to be rewarded for making drugs, it's just hard
to put hundreds of millions of dollars into doing it.
It's almost like the business of making blockbuster movies,
in that you need to have a drug that will make enough money on the other side
to be able to support all the effort that goes into it,
all the R&D effort as well as all the effort running clinical trials.
Whereas in this case, you know, once you can actually bring down the cost,
at least to get something out of preclinical quickly,
now you actually have the opportunity to go after indications
that aren't blockbusters, that are small ones.
It's kind of almost like the shift where, you know, you have certain things that are still going to be done
with, like, Marvel movies and so on, but there's a long tail of YouTubers who actually can come
up with interesting content. This could be academic labs, or startups maybe funded through
philanthropy, maybe governments. There may now be the beginnings of the potential for a long tail
of new drugs coming out that go after indications that will not be blockbusters but that
will still have a huge fundamental impact on humanity. There are a lot of interesting different
models we could talk about here for how technology like this, coupled with new business
approaches that are now possible because of techniques like this, could make a huge change in
our ability to develop novel antibiotics. Yeah. And it would be a huge step for the industry at large
once we have more of these methods democratized for the broader scientific community.
Why don't you break down for me the machine learning approach that they used here, and what kind
of advance does this represent? Yeah, sure. They took this machine learning model that they made
and they trained it on about 2,500 molecules,
using that to train binary classification models
to predict the probability of whether a new compound
would inhibit the growth of E. coli or not.
And then they turned to the Drug Repurposing Hub library,
a library of 6,000 compounds
that are already in human clinical development
for a wide variety of indications.
And at this point, they compared several different machine
learning models, and after narrowing down those molecules
and actually predicting toxicity using different
neural networks, they came up with this particular molecule, halicin. And then,
lastly in the process, they went on to apply this machine learning model, after iteration and
optimization, to a much broader set, the ZINC15 dataset, with over a billion and a half
structures. Now, on the machine learning side, what's key here is that the deep learning network
they used didn't really rely on any preset information about the chemical structures of the molecules.
It actually built new representations, as they were called. For years, a lot of people
represented molecules with these fingerprint vectors that, you know, reflected things like the presence or absence of
functional groups, or descriptors and computable molecular properties. But relying on known fingerprints
didn't really work that well. And that's why, you know, a lot of the old antibiotic screening
processes give you a lot of the same classes of molecules over and over again. And what they did here,
they actually have these fingerprint descriptors that were built from scratch. Well, you know,
what you're describing is still a fingerprint, right? It's a one-dimensional vector to describe
molecules. I think perhaps what's different is that in a deep learning approach, you can try to
infer what the right descriptor should be. That's the hallmark of all of the deep learning
approaches for drug design, and of deep learning in general. Recall that
even when we're just talking about convolutional neural nets for image recognition, the idea
with CNNs versus classical computer vision is that in the classical
approach, the person sort of defines what the right features are. And so similarly, you know,
what's interesting is that you can feed any representation of a molecule into a computer,
but which parts are the interesting ones?
You can have, just like old-school computer vision, a human being say,
ah, these are the important ones.
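As a rough illustration of the classical approach described here, the sketch below trains a binary classifier on hand-picked fingerprint features and then ranks a small library by predicted probability of growth inhibition. It is a toy under loud assumptions, not the method from the paper: the feature list, the SMILES-like strings, the labels, and the helper names (`fingerprint`, `train`, `rank_library`) are all hypothetical, and plain substring matching stands in for real substructure search.

```python
import math

# Hypothetical hand-picked features: SMILES substrings a chemist might flag.
# Plain substring matching is a crude stand-in for real substructure search,
# used here only to keep the sketch dependency-free.
FEATURES = ["N", "O", "S", "Cl", "c1ccccc1", "C(=O)O"]

# Toy training set of (SMILES-like string, label) pairs. The labels are
# INVENTED for illustration (1 whenever both N and S appear); they are not
# measurements of antibacterial activity.
TRAINING_SET = [
    ("CCO", 0), ("c1ccccc1", 0), ("CC(=O)O", 0), ("CCN", 0), ("CSC", 0),
    ("NCCS", 1), ("NC(=O)CS", 1), ("Nc1ccccc1S", 1),
]


def fingerprint(smiles):
    """Fixed-length 0/1 vector: is each hand-picked pattern present?"""
    return [1.0 if pattern in smiles else 0.0 for pattern in FEATURES]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, smiles):
    """Predicted probability that the compound is labelled 1."""
    x = fingerprint(smiles)
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train(examples, epochs=500, learning_rate=0.5):
    """Fit logistic regression with plain stochastic gradient descent."""
    weights = [0.0] * len(FEATURES)
    bias = 0.0
    for _ in range(epochs):
        for smiles, label in examples:
            x = fingerprint(smiles)
            error = predict_proba(weights, bias, smiles) - label
            weights = [w - learning_rate * error * xi
                       for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


def rank_library(weights, bias, library):
    """Score every compound and sort from most to least likely active."""
    scored = [(predict_proba(weights, bias, s), s) for s in library]
    return sorted(scored, reverse=True)


weights, bias = train(TRAINING_SET)
ranked = rank_library(weights, bias, ["CCCl", "NCCCS", "OCC"])
for probability, smiles in ranked:
    print(f"{smiles}: {probability:.2f}")
```

The limitation is visible in the code itself: the model can only ever use the patterns listed in `FEATURES`, so anything a human did not think to include is invisible to it.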
But the beauty of a DNN approach, which is used here but also in many precursor works,
is that the DNN helps work out what the key aspects are and which are the interesting ones.
And that is really, I think, the big difference between what you can get in modern
machine learning with deep learning versus classical machine learning with, like, random forests or something
like that. Right. So deep learning helps us figure out what we don't know versus focusing only on
what we already know or what we think we know. What makes halicin an attractive candidate for
further research and development? What are some of the properties that they discovered?
Without a doubt, it's certainly a really potent inhibitor of E. coli. And further investigations showed
that halicin has strong growth inhibitory effects on a wide phylogenetic spectrum of
pathogens. They tried it on C. diff and A. baumannii, which is, you know, one of the highest
priority pathogens for which new antibiotics are urgently required. And then, more interestingly,
it was even able to eradicate E. coli persister cells that remained after ampicillin treatment.
So pretty strong efficacy and pretty low tox based on their screen. It also checks the box of
something that is really structurally divergent from conventional antibiotics. And so certainly
a very powerful new class of antibiotics
that could potentially be a strong candidate
for further development. Yeah, the fact
that they found the antibiotic, they
showed it worked in vitro, they showed it worked in vivo,
and then they also did some experiments to get
at its mechanism of action, suggesting
that halicin selectively disrupts the pH gradient
across the bacterial membrane. This
saps the proton motive force, which is like
the battery of the cell. So like
all antibiotics, it's disrupting an essential
cellular function, but this
appears to be a distinct and new function that's being targeted.
That's right.
It's like super elegant work.
Such a complete, well-rounded story.
Yeah, well, I mean, I think one of the things that really stands out here is that full
stack of experiments that they've done, where it goes all the way from looking at the MIC
in a dish to going through mice.
And one of the appealing things about studying antibiotics, and this often even pertains
to antivirals, is that the animal models are pretty good, with something like Alzheimer's on the
far extreme, where animal models are generally not very good.
And so it's appealing that one could do all this
with probably not requiring a huge budget
and therefore get something on the other side
that looks kind of intriguing.
Beyond initial discovery, can this kind of machine learning-based
approach be applied to other aspects
of either early or late-stage drug development?
Yeah, it's a broad topic, but that's the fun thing about it.
There's a reason why it's a broad topic
because there's a broad range of things it can do.
I mean, you could talk about identifying targets,
and there's a lot of work to do there on
novel targets; it's a really interesting time to go after novel targets. You could talk
about identifying leads. And this is basically a lot of what's been done here, the identification
of leads and then the testing of them. These compounds from ZINC are leads, but presumably they're
not drug-like, so they have to be optimized. So there are these types of methods helping in lead
optimization. And then along the way, hopefully you'd want to also be screening for tox.
And so there's a ton of methods that are getting really surprisingly accurate. And basically,
the beautiful thing about a machine learning approach like this is that the approach for the most part is pretty agnostic to what you're predicting, and that the sort of processes you're building up can be useful.
One last thing, and this is maybe the holy grail dream, is that if you're predicting a lot of properties for a lot of different systems with a whole bunch of molecules in some multitask-like framework where one model is predicting all of it, you can learn from all of it.
Even though you might not have a lot of data in any single project or any
single area here, the sum of all this data now is huge and helps to regularize your predictions,
to make them less overfit and more robust, such that the predictive capability
that emerges from all of that is better than what any one model would be on its own.
I think that really is the big, big future: taking the fact that there is this breadth of what
it can do and not just recognizing all those possibilities, but using all those possibilities
to actually improve any one of those predictions.
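The multitask idea described above can be sketched as one shared hidden layer feeding several task-specific output heads, with a loss averaged only over the (molecule, task) pairs that actually have a label. This is a minimal structural sketch under stated assumptions, not a model from the paper: the task names, weights, and three-molecule batch are hypothetical, and the code shows only the forward pass and loss, so it does not by itself demonstrate the regularization benefit the speaker describes.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def shared_representation(shared_weights, features):
    """One ReLU layer used by every task: the jointly learned 'fingerprint'."""
    return [max(0.0, sum(w * x for w, x in zip(row, features)))
            for row in shared_weights]


def predict(shared_weights, task_heads, features):
    """Return one probability per task, all computed from the same hidden layer."""
    hidden = shared_representation(shared_weights, features)
    return {task: sigmoid(sum(w * h for w, h in zip(head, hidden)))
            for task, head in task_heads.items()}


def multitask_loss(shared_weights, task_heads, batch):
    """Mean binary cross-entropy over every labelled (molecule, task) pair.

    A label of None means that property was never measured for that molecule,
    so the pair is skipped rather than treated as a negative.
    """
    total, labelled = 0.0, 0
    for features, labels in batch:
        predictions = predict(shared_weights, task_heads, features)
        for task, label in labels.items():
            if label is None:
                continue
            # Clamp to avoid log(0) for saturated predictions.
            p = min(max(predictions[task], 1e-12), 1.0 - 1e-12)
            total -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
            labelled += 1
    return total / labelled


# Hypothetical parameters and data, chosen only to exercise the code.
shared_weights = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
task_heads = {"growth_inhibition": [0.0, 0.0], "toxicity": [0.0, 0.0]}
batch = [
    ([1.0, 0.0, 1.0], {"growth_inhibition": 1, "toxicity": None}),
    ([0.0, 1.0, 0.0], {"growth_inhibition": None, "toxicity": 0}),
    ([1.0, 1.0, 0.0], {"growth_inhibition": 0, "toxicity": 0}),
]
loss = multitask_loss(shared_weights, task_heads, batch)
print(loss)
```

Because every head reads the same hidden layer, gradients from each task's labelled examples would all update `shared_weights` during training, which is how sparse data from many projects can be pooled.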
Thanks, Andy and Vijay, for joining me for this first segment.
So to quickly wrap up on a high level, there are two takeaways from this article.
First, it uncovers new candidates for future development as antibiotics,
most critically halicin, which was rigorously validated.
Second, it demonstrates the ability of deep neural networks
to make accurate predictions for drug lead identification in a mechanism agnostic manner.
This is a broadly useful approach with huge potential to shape the future of
drug discovery and development, and further highlights the increasingly important role of AI in
medicine. If you enjoyed this conversation, check out our podcast featuring senior author Jim Collins
called All About Synthetic Biology. In this next segment, a16z bio deal partner Judy Savitskaya
and I discuss two articles on the novel coronavirus causing the COVID-19 pandemic. The first article
is "Structure, Function, and Antigenicity of the SARS-CoV-2 Spike Glycoprotein" by Alexandra Walls,
David Veesler, and colleagues, published in Cell.
We refer to this article throughout as Walls et al. or the Walls paper.
The second article is "Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation"
by Daniel Wrapp, Nianshuang Wang, Jason McLellan, and colleagues, published in Science.
We refer to this article as Wrapp et al.
Note that these two articles use different names for the novel coronavirus, SARS-CoV-2 versus
2019-nCoV, which was the original name for the novel coronavirus, but it has now been settled
officially as SARS-CoV-2.
And this is a bit confusing because the virus that caused the SARS epidemic (SARS is an acronym for
severe acute respiratory syndrome), primarily in Asia in 2002-2004, is called SARS-CoV.
They're both coronaviruses and they're closely related.
And each of these articles that we discuss compares SARS-CoV, which we
informally refer to as 2002 SARS, with SARS-CoV-2, which we refer to as 2019 SARS.
Finally, one last bit of important context to know. The two papers also have similar experimental
designs where they investigate how the spike protein binds to the host cell and whether
antibodies that recognize 2002 SARS spike protein will recognize 2019 SARS spike protein. This is important
because it helps guide vaccine and drug design. Since the results of the two papers
are not identical, we discuss why this might be and what we can learn. But first, we start with the
similarities. Both of these articles are comparing and contrasting the spike proteins on the current
coronavirus with the 2002 SARS virus. It's really important to know the structure of the spike
protein because that's the part of the virus that is actually going to bind to the host cell.
It's really important to know where it binds on the host cell, what protein it's using,
which has been discovered to be ACE2. That's the same protein that
the 2002 SARS virus used to infect human cells.
So once we know the structure of this spike protein,
we can start to understand how it's actually entering cells
and have some hypotheses for what types of drugs
or what particular molecules could be used
to disrupt that interaction and actually treat this disease
or prevent the ability of the virus
to initiate an infection into new cells.
The spike protein is also really important
because it's exposed to the immune system
and so that's where antibodies will probably bind to the virus.
So speaking of the immune system, one of the really interesting aspects of these viruses
is that the spike proteins are covered in different sugar-like molecules.
And those are actually sometimes called a glycan shield,
which is glycan for sugars and then shield because they actually prevent the immune system
from recognizing the viral proteins as easily as they otherwise could.
Both of these papers showed that there are a lot of glycans on the spike protein of the
2019 SARS virus, but that they actually differ very slightly from the 2002 SARS virus.
Overall, the structure was super similar. But there is one key difference. Do you want to talk
about the furin cleavage site? This cleavage is a really key part of the mechanism for
viral entry. So the mechanism for the virus to get into the cell is actually a sequence of about
four steps. The first is the virus arriving at the lungs. Once it arrives at the lungs, the
spike proteins will bind to the ACE2 receptor, which has been now shown in multiple different
papers. And then proteins that are on the surface of the cell called proteases will reach over
and cleave a piece of this spike protein. And that helps to pull the virus in closer to the cell
to the point where a fusion can occur between the membranes. And so at that point, the virus is
inside of the cell and it can replicate. What has been found in the 2019 SARS
virus versus the 2002 virus is that there's an additional cleavage site, an additional set of
amino acids in the protein that are cleaved, and that's called a furin site, and that is not
present in the 2002 virus. So there are some hypotheses out there that this furin cleavage
occurs first, and it actually enhances the probability and sort of the success of the second
cleavage and then subsequent viral entry. So there are some theories that the reason this virus is
actually more effective at infecting cells than the 2002 SARS virus is because of this
furin cleavage site. I think that's so interesting. And one of the facts mentioned in the
discussion of both papers was that a similar furin cleavage site is seen in an analogous protein
in highly virulent avian and human influenza strains. Let's talk more about how the virus binds
the host cell and what happens next. Tell me about this ACE2 molecule. ACE2 is a really common protein
found on the surface of many different cell types in the human body. It stands for
angiotensin-converting enzyme 2. And this is a receptor that is involved in regulating blood pressure.
So it has nothing to do with viral entry. And it definitely didn't evolve for that. But
many of these viruses have found ACE2 to be a sort of convenient receptor that they can co-opt
to be able to get into the cell. And I think one of the reasons for this is that the
protease activity that the virus is using, this sort of cut and pull in mechanism that we talked about
before, that's part of the natural mechanism of how ACE2 functions. What's important here is that
this receptor is the exact same one that was discovered to be involved in the 2002 SARS entry
mechanism. And that's important because if we have a good understanding of how to drug the ACE2
receptor following on the 2002 SARS work, we can potentially apply that to treatments for this
current coronavirus. Yeah, the downside of that would be that since ACE2 is involved in extremely
critical human physiology, it might be almost impossible to drug without having some kind
of adverse side effect. Yeah. Both papers look at the strength of the interaction of
2019 SARS and 2002 SARS with the ACE2 receptor. What did they find here? So, using different
methods, the Walls paper found that the 2019 SARS virus and the 2002 SARS virus bind to ACE2
with roughly the same affinity, but then Wrapp et al. showed that the 2019 SARS virus binds with
10 to 20 times higher affinity, which is a huge difference. They conjecture that this tighter binding
is part of what makes it so virulent because it's able to more successfully get into cells
following that binding event. It's interesting, you know, they published
these studies one right after the other, and they don't discuss why this might be different.
Yeah, and I think it's hard to say they're using different techniques and this science is
happening so live. It's actually really cool to think about this from the perspective of
being at the bench. You're working on something and if a paper comes out that contradicts
what you're finding while you're still working on it, that affects your work because it changes
how you trust your own work and it changes what kind of data you're going to be looking for.
But in this case, this is all happening so quickly that they're finding what they're
finding and they're publishing it. And I think it's really cool to see it side by side.
Yeah, you get to see the evolution of the field and the thinking kind of at the moment it's
happening. And of course, that really bolsters the findings that do concord, like finding that
ACE2 is the receptor and that the spike protein is really similar. Since those groups got those
same answers without influencing each other, that really is kind of the gold standard of replication
in scientific publishing: having, you know, two independent groups finding things in parallel.
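To give a feel for what a 10 to 20 times difference in affinity can mean, the sketch below uses the standard single-site equilibrium binding relation, fraction bound = [L] / ([L] + K_D), where a higher affinity corresponds to a lower dissociation constant K_D. The K_D values and the spike concentration are hypothetical numbers chosen for illustration, not values from either paper, and the relation assumes simple one-to-one binding with the free ligand concentration roughly equal to the total.

```python
def fraction_bound(ligand_conc_nm, kd_nm):
    """Fraction of receptor occupied at equilibrium for single-site binding."""
    return ligand_conc_nm / (ligand_conc_nm + kd_nm)


# Hypothetical reference dissociation constant and ligand concentration (nM).
reference_kd_nm = 200.0
spike_conc_nm = 20.0

for fold_higher_affinity in (1, 10, 20):
    # N-fold higher affinity means an N-fold lower dissociation constant.
    kd_nm = reference_kd_nm / fold_higher_affinity
    occupancy = fraction_bound(spike_conc_nm, kd_nm)
    print(f"{fold_higher_affinity:>2}x affinity: K_D = {kd_nm:.0f} nM, "
          f"fraction bound = {occupancy:.2f}")
```

Under these made-up numbers, the same concentration of spike protein occupies a much larger share of receptors when the affinity is 10 or 20 times higher, which is why the disagreement between the two measurements matters.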
Okay, let's talk about the next area where we saw differences between these two articles.
So this was in the section where they looked at whether antibodies against 2002 SARS can bind to 2019 SARS.
There were key differences between these two papers in the types of antibodies that they were looking at.
In the Walls et al. paper, they are looking at polyclonal antibodies, which were generated by injecting mice with the purified S protein
from the virus. So the mice's immune systems saw this S protein and generated antibodies in response.
The researchers then took a blood sample and purified the serum. That serum contains all sorts of
different antibodies that bind to all different regions of the spike protein. Wrapp et al. used
three monoclonal antibodies, which had already been previously characterized as binding to just a subdomain
within the 2002 SARS spike protein.
So whereas in the Walls et al. paper,
there are all sorts of antibodies in this mix
that can bind all over the spike protein,
in Wrapp et al.
these are three highly specific antibodies
that bind to three highly specific sites
in just one tiny portion.
There's also a big difference in what they measure.
Using the monoclonal antibodies in Wrapp,
they're measuring just binding.
So do the monoclonal antibodies bind to the spike
protein at all? Whereas in Walls, what they're measuring is whether the polyclonal antibody set
prevents the virus from entering into human cells. This is more of a phenotypic readout. It's more of a question of
whether the entire mechanism that you're trying to stop fails to happen, versus just the binding alone,
which is a piece of the process. In this case, I'd say the two papers show opposite results. Walls
shows that you can stop viral entry using these polyclonal antibodies, and Wrapp shows
that there's no binding of the monoclonal antibodies from 2002 SARS to the 2019 SARS virus.
They're opposite results, but they're not necessarily mutually exclusive results,
because Wrapp et al. is only looking at three antibodies.
You know, there's an absolutely enormous number of possible antibodies against these viruses
and even just the spike protein.
So the fact that three didn't work, you know, that could just be scientific bad luck.
There's also the fact that we read other papers to prepare for this segment that supported both the Wrapp conclusion that there is no cross-reactivity and the Walls conclusion that there is cross-reactivity.
So the jury's still very much out, and given that all these studies are using slightly different designs, different sources of antibodies, different kinds of antibodies, different readouts, this is still definitely evolving. Part of what makes the research right now
so interesting is that so many groups are applying their expertise, their backgrounds, their
favorite techniques to this problem. And my gut feeling is that there's some kind of nuance to
the cross-reactivity here that we just haven't appreciated yet. Why do you think they basically
ended up doing the same set of experiments? I think that what we have here is a demonstration of
the types of science that you can get data on really quickly. In the past, to get a structure of
this quality, you would have had to do x-ray crystallography, which requires first getting the
protein to crystallize in this very specific manner, and it's a tricky thing to do. But with cryo-EM,
you just get the protein sample super, super cold and shoot a beam of electrons at it, then determine
the structure based on how the electrons bounced off. There has been a real revolution in
cryo-EM recently, making this method capable of determining the structure of very large protein
complexes. Also, the results of both of these papers are in vitro.
So we're getting really important information that can help vaccine design and therapeutics design.
These are also the fastest types of research that can be done on these viruses.
As the publication cycles continue and the research continues,
we'll get different insights coming out with experiments that take longer
and that require mouse work or that require kind of more complex experimental design.
Seeing these two papers side by side really speaks to what can be done,
and what we can learn that's important and very
valuable, in the shortest amount of time.
Thanks, Judy, for discussing these articles with me.
And that's it for Journal Club this week.
You can find all episodes at a16z.com.
Thanks for listening.