The a16z Show - Journal Club: Finding New Antibiotics with Machine Learning, What Coronavirus Structures Tell Us
Episode Date: April 26, 2020

a16z Journal Club (part of the a16z Podcast) curates and covers recent advances from the scientific literature: what papers we're reading, and why they matter from our perspective at the intersection of biology & technology (for bio journal club). This inaugural episode covers 2 different topics, in discussion with Lauren Richardson:

0:26 #1 identifying new antibiotics through a novel machine-learning based approach: a16z general partner Vijay Pande and bio deal partner Andy Tran discuss the business of pharma; the specific methods and how it works; and other applications for deep learning in drug discovery and development, based on this paper:
"A Deep Learning Approach to Antibiotic Discovery" in Cell (February 2020), by Jonathan Stokes, Kevin Yang, Kyle Swanson, Wengong Jin, Andres Cubillos-Ruiz, Nina Donghia, Craig MacNair, Shawn French, Lindsey Carfrae, Zohar Bloom-Ackermann, Victoria Tran, Anush Chiappino-Pepe, Ahmed Badran, Ian Andrews, Emma Chory, George Church, Eric Brown, Tommi Jaakkola, Regina Barzilay, James Collins

11:43 #2 characterizing the novel coronavirus causing the COVID-19 pandemic: a16z bio deal partner Judy Savitskaya shares what we can learn from the protein structures; the relationship to the 2002-2004 SARS epidemic; and more, based on these two research articles:
"Structure, Function, and Antigenicity of the SARS-CoV-2 Spike Glycoprotein" in Cell (April 2020), by Alexandra Walls, Young-Jun Park, M. Tortorici, Abigail Wall, Andrew McGuire, David Veesler
"Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation" in Science (March 2020), by Daniel Wrapp, Nianshuang Wang, Kizzmekia Corbett, Jory Goldsmith, Ching-Lin Hsieh, Olubukola Abiona, Barney Graham, Jason McLellan

You can find these episodes at a16z.com/journalclub.
Transcript
Hello and welcome to the inaugural episode of the a16z Journal Club.
I'm Lauren Richardson, one of our bio editors, and in this episode, we'll cover two topics.
First, a novel machine learning-based approach to identify new antibiotics, and second, we'll
discuss two articles characterizing the novel coronavirus causing the current pandemic.
Journal Club will cover a variety of articles every few weeks, so stay tuned here, and we'll
announce its own feed soon.
First up is my conversation with a16z general partner Vijay Pande and deal partner on the bio team Andy Tran.
We dive into "A Deep Learning Approach to Antibiotic Discovery" by Jonathan Stokes, Regina Barzilay, James Collins, and colleagues.
In this article, published in Cell, the authors create a novel machine learning-based method to identify new antibiotic drugs from two large databases.
They then validated one of their candidates, a drug named halicin, showing that it has excellent antibiotic properties,
both in vitro and in two different mouse models of bacterial infection.
Excitingly, halicin has a distinct structure and appears to have a distinct mechanism of action
from other antibiotics, which is important given the problem of antibiotic resistance
and the need to find new drugs.
Our discussion of the paper covers the business of antibiotics, the methods, and how deep learning
can identify novel drug structures, and other applications for deep learning in drug discovery
and development.
But we begin with what made this paper appeal to us.
And the first voice you'll hear is Vijay's.
A huge takeaway from this article was the breadth of experimental work that was done to demonstrate
the accuracy of the predictions involved.
And so while there has been a lot of work about using neural nets for predicting drug compounds,
this was probably one of the landmark examples to date of a predictive approach
validated experimentally for something
non-trivial in terms of function. Typically doing something in vitro is pretty straightforward,
but going to in vivo models was an important step forward to convince, especially, drug hunters and
experts in the field that this has real validity. So putting all those pieces together,
I think, is what really made this paper stand out.
Why do you think it is that it's so difficult to identify new antibiotics?
What makes it a challenge is not only the scientific side, but then
also the business side. Not only are these antibiotics complex to develop,
but the most innovative new products also cannot be sold freely.
They are put on the shelf in reserve for more serious cases.
And they're actually dubbed these drugs of last resort.
And so all of these scientific and business headwinds actually make something like, you know,
synthesizing a brand-new antibiotic really challenging to do.
Yeah, in most indications, if you create a drug and it works better than other drugs,
that immediately becomes your top choice when you're prescribing.
But antibiotics actually get put
at the end of the line because they want to save them for when all the other drugs that are already
gaining resistance fail completely.
Well, and then given resistance, I think the nightmare
scenario that we're all worried about is that we don't have drugs that work as the last lines of
defense go and we don't have any mechanism for creating new ones. The business side is very critical
here because if there can't be a way to be rewarded for making drugs, it's just hard to put
hundreds of millions of dollars into doing it.
It's almost like the business of making blockbuster movies,
that you need to have a drug that will make enough money on the other side
to be able to support all the effort that goes into it,
all the R&D effort as well as all the effort running clinical trials.
Whereas in this case, you know, once you can actually bring down the cost,
at least to get something out of preclinical quickly,
now you have actually the opportunity for one to go after indications that aren't
blockbusters, that are small ones,
kind of almost like this shift where, you know, you have certain things that are still going to be done
with, like, Marvel movies and so on, but there's a long tail of YouTubers who actually can come up
with interesting content. This could be academic labs, startups maybe funded through philanthropy,
maybe governments. There may now be the beginnings of the potential for a long tail of new
drugs coming out that go after indications that will not be blockbusters but that will still have
a huge fundamental impact on humanity. There are a lot of interesting different
models we could talk about here for how technology like this, coupled with new business approaches
that are now possible because of techniques like this, could make a huge change in our ability
to develop novel antibiotics.
Yeah, and it would be a huge step for the industry at large once we have
more of these methods democratized for the broader scientific community.
Why don't you break down for
me the machine learning approach that they used here, and what kind of advance does this represent?
Yeah, sure. They took this machine learning model that they made and they trained it on about
2,500 molecules, and used that to train binary classification models to predict the probability of whether
a new compound would inhibit the growth of E. coli or not. And then they turned to the Drug Repurposing
Hub library, a library of 6,000 compounds that are already in human clinical development for a wide
variety of indications. And at this point, they compared several different machine learning models.
And after narrowing down those molecules and actually predicting toxicity using different neural networks,
they came up with this particular molecule, halicin.
And then, lastly, in the process,
they went on to apply this machine learning model,
after iteration and optimization, to a much broader set,
the ZINC15 data set, with over a billion and a half structures.
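As a rough illustration of the screening workflow described above (score every compound in a library with a trained classifier, then filter on predicted toxicity and keep the top hits), here is a minimal sketch. Everything in it is an assumption made for illustration: the compound names, the scores, the thresholds, and the `shortlist` helper are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    p_inhibits: float  # hypothetical model output: probability of inhibiting E. coli growth
    p_toxic: float     # hypothetical output of a separate toxicity model


def shortlist(candidates, min_activity=0.5, max_toxicity=0.2, top_k=3):
    """Keep likely-active, likely-nontoxic compounds, highest activity score first."""
    kept = [
        c for c in candidates
        if c.p_inhibits >= min_activity and c.p_toxic <= max_toxicity
    ]
    kept.sort(key=lambda c: c.p_inhibits, reverse=True)
    return kept[:top_k]


# Made-up library entries; real scores would come from trained models.
library = [
    Candidate("compound_A", 0.91, 0.05),
    Candidate("compound_B", 0.88, 0.60),  # predicted active but toxic: dropped
    Candidate("compound_C", 0.12, 0.01),  # predicted inactive: dropped
    Candidate("compound_D", 0.67, 0.10),
]

for c in shortlist(library):
    print(c.name, c.p_inhibits)
```

The same ranking step applies whether the library has 6,000 entries or far more; only the cost of scoring each compound changes.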
Now, the machine learning side,
what's key here is the deep learning network that they used
didn't really rely on any preset information
about the chemical structures of the molecules.
It actually really built new representations, they were called.
For years, a lot of people represented molecules
with these fingerprint vectors that, you know, reflected things like presence or absence of functional groups, or descriptors and computable molecular properties.
But relying on known fingerprints didn't really work that well.
And that's why, you know, a lot of the old antibiotic screening process gives you a lot of the same classes of molecules over and over again.
And what they did here, they actually have these fingerprint descriptors that were built from scratch.
Well, you know, what you're describing is still a fingerprint, right?
It's a one-dimensional vector to describe molecules.
I think perhaps what's different is that in a deep learning approach,
you can try to infer what the right descriptors should be.
That's the hallmark of all of the deep learning approaches for drug design
and deep learning in general.
Recall, even when we're just talking about convolutional neural nets
for image recognition, the idea of CNNs for image recognition
versus classical computer vision is that in the classical approach,
the person sort of defines what the right features are.
And so similarly, you know, what's interesting is that you can feed any representation of a molecule into a computer,
but which parts are the interesting ones?
Just like old-school computer vision, you could have a human being say, ah, these are the important ones.
But a beauty of a DNN approach, which is used here, but also in many precursor works,
is that the DNN helps understand what are the key aspects and what are the interesting ones.
And that is really, I think, the big difference between what you can get in modern deep learning
versus classical machine learning with, like, random forests or something like that.
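To make the contrast concrete, the classical, hand-defined fingerprint mentioned in this exchange can be sketched as a vector of 0/1 flags for features a person chose in advance, compared with a standard Tanimoto similarity. This is only an illustration of the older approach, not the learned representation used in the paper; the feature list and the three example molecules are hypothetical.

```python
# Hypothetical, hand-picked substructure features. A real fingerprint would
# use many more bits and compute them from the molecular structure.
FEATURES = ["nitro_group", "thiazole_ring", "carboxylic_acid", "beta_lactam_ring"]


def fingerprint(present):
    """Return a 0/1 vector marking which hand-picked features a molecule has."""
    return [1 if f in present else 0 for f in FEATURES]


def tanimoto(a, b):
    """Tanimoto similarity of two bit vectors: shared on-bits / total on-bits."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return both / either if either else 0.0


mol_1 = fingerprint({"beta_lactam_ring", "carboxylic_acid"})
mol_2 = fingerprint({"beta_lactam_ring", "carboxylic_acid", "thiazole_ring"})
mol_3 = fingerprint({"nitro_group", "thiazole_ring"})

print(mol_1)                   # bits follow the order of FEATURES
print(tanimoto(mol_1, mol_2))  # 2 shared on-bits out of 3 total
print(tanimoto(mol_1, mol_3))  # no shared on-bits
```

A model trained on such vectors can only "see" the features someone thought to include, which is the limitation a learned representation is meant to relax.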
Right. So deep learning helps us figure out what we don't know versus focusing only on what we
already know or what we think we know. What makes halicin an attractive candidate for further
research and development? What are some of the properties that they discovered?
Without a doubt, it's certainly a really potent inhibitor of E. coli. And further investigations showed
that halicin has strong growth inhibitory effects on a wide phylogenetic spectrum of pathogens. They
tried it on C. diff and A. baumannii, which is, you know, one of the highest-priority pathogens for which
new antibiotics are urgently required. And then, more interestingly, it was even
able to eradicate E. coli persister cells that remained after ampicillin treatment. So pretty
strong efficacy and pretty low tox based on their screen. It also checks the box of something that
is really structurally divergent from conventional antibiotics. And so certainly a very powerful
new class of antibiotics that could potentially be, you know, a strong candidate for further development.
Yeah, the fact that they found the antibiotic, they showed it worked in vitro, they showed it
worked in vivo, and then they also did some experiments to get at its mechanism of action, suggesting
that halicin selectively disrupts the pH potential across the bacterial membrane. This saps the proton
motive force, which is like the battery of the cell. So like all antibiotics, it's disrupting an essential
cellular function, but this appears to be a distinct and new function that's being targeted.
That's right.
It's like super elegant work. Such a complete, well-rounded story.
Yeah, well, I mean, I think one of the things that really stands out here is that full
stack of experiments that they've done, where it goes all the way from looking at the MICs
in a dish to going through mice.
And one of the appealing things about studying antibiotics, and this often even pertains
to antivirals, is that the animal models are pretty good, with something
like Alzheimer's on the far extreme, where animal models are generally not very good.
And so it's appealing that one could do all this, you know,
with probably not requiring a huge budget and therefore get something on the other side that looks kind of intriguing.
Beyond initial discovery, can this kind of machine learning-based approach be applied to other aspects of either earlier-
or later-stage drug development?
Yeah, it's a broad topic, but that's the fun thing about it.
There's a reason why it's a broad topic because there's a broad range of things it can do.
I mean, you could talk about identifying targets, and there's a lot of work
to do there. And especially novel targets, it's a really interesting time to go after novel
targets. You could talk about identifying leads. And this is a lot of basically what's been
done here, the identification of leads and then the testing of them. These compounds from ZINC
are leads, but presumably they're not drug-like, so they have to be optimized. So there's these
types of methods helping in lead optimization. And then along the way, hopefully you'd want to
also be screening for tox. And so there's a ton of methods that are getting really surprisingly
accurate. And basically, the beautiful thing
about a machine learning approach like this is that
the approach for the most part is pretty
agnostic to what you're predicting
and that the sort of processes you're building
up can be useful. One last thing,
and this is maybe the holy grail dream, is
that if you're predicting a lot
of properties for a lot of different systems
with a whole bunch of molecules
in some multitask-like framework
where one model is predicting
all of it, you can learn from
all of it. And even
though you might not have a lot of data in any
single project or any single area here, the sum of all this data now is huge and helps to
regularize your predictions to make them less overfit and more robust, such that
the predictive capability that emerges from all of that is better than what any one would be
on its own. I think that really is the big, big future, is taking the fact that there is this
breadth of what it can do and not just recognizing all those possibilities, but using all those
possibilities to actually improve any one of those predictions.
Thanks, Andy and Vijay, for joining me
for this first segment. So to quickly wrap up on a high level, there are two takeaways from
this article. First, it uncovers new candidates for future development as antibiotics, most critically
halicin, which was rigorously validated. Second, it demonstrates the ability of deep neural
networks to make accurate predictions for drug lead identification in a mechanism-agnostic manner.
This is a broadly useful approach with huge potential to shape the future of drug discovery and
development, and further highlights the increasingly important role of AI in medicine.
If you enjoyed this conversation, check out our podcast featuring senior author Jim Collins
called All About Synthetic Biology.
In this next segment, a16z bio deal partner Judy Savitskaya and
I discuss two articles on the novel coronavirus causing the COVID-19 pandemic. The first article
is "Structure, Function, and Antigenicity of the SARS-CoV-2 Spike Glycoprotein" by Alexandra Walls,
David Veesler, and colleagues, published in Cell.
We refer to this article throughout as Walls et al. or the Walls paper.
The second article is "Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation"
by Daniel Wrapp, Nianshuang Wang, Jason McLellan, and colleagues, published in Science.
We refer to this article as Wrapp et al.
Note that these two articles use different names for the novel coronavirus, SARS-CoV-2 versus
2019-nCoV, which was the original name for the novel coronavirus, but it has now been settled
officially as SARS-CoV-2. And this is a bit confusing because the virus that caused the epidemic of
SARS, which is an acronym for severe acute respiratory syndrome, primarily in Asia in 2002-2004,
is called SARS-CoV. They're both coronaviruses and they're closely related. And each of
these articles that we discuss compares SARS-CoV,
which we informally refer to as 2002 SARS, with SARS-CoV-2, which we refer to as 2019 SARS.
Finally, one last bit of important context to know.
The two papers also have similar experimental designs, where they investigate how the spike
protein binds to the host cell and whether antibodies that recognize 2002 SARS spike protein
will recognize 2019 SARS spike protein.
This is important because it helps guide vaccine and drug design. Since the
results of the two papers are not identical, we discuss why this might be and what we can learn.
But first, we start with the similarities.
Both of these articles are comparing and contrasting the spike proteins on the current coronavirus
with the 2002 SARS virus.
It's really important to know the structure of the spike protein because that's the part of the
virus that is actually going to bind to the host cell.
It's really important to know where it binds on the host cell, what protein it's using,
which has been discovered to be ACE2.
That's the same protein that the 2002 SARS virus used to infect human cells.
So once we know the structure of this spike protein,
we can start to understand how it's actually entering cells
and have some hypotheses for what types of drugs
or what particular molecules could be used
to disrupt that interaction and actually treat this disease
or prevent the ability of the virus to initiate an infection into new cells.
The spike protein is also really important because it's exposed
to the immune system, and so that's where antibodies will probably bind to the virus.
So speaking of the immune system, one of the really interesting aspects of these viruses
is that the spike proteins are covered in different sugar-like molecules, and those are actually
sometimes called a glycan shield, which is glycan for sugars, and then shield because
they actually prevent the immune system from recognizing the viral proteins as easily as they
otherwise could. Both of these papers showed that there are a lot of glycans on
the spike protein of the 2019 SARS virus, but that they actually differ very slightly from the 2002
SARS virus.
Overall, the structure was super similar.
But there is one key difference.
Do you want to talk about the furin cleavage site?
This cleavage is a really key part of the mechanism for viral entry.
So the mechanism for the virus to get into the cell is actually a sequence of about four steps.
The first is the virus arriving at the lungs.
Once it arrives at the lungs, the spike proteins will bind to the ACE2 receptor,
which has been now shown in multiple different papers.
And then proteins that are on the surface of the cell called proteases will reach over
and cleave a piece of this spike protein.
And that helps to pull the virus in closer to the cell to the point where a fusion can occur
between the membranes.
And so at that point, the virus is inside of the cell and it can replicate.
What has been found in the 2019 SARS virus versus the 2002 virus is that there's an additional cleavage site, an additional set of amino acids in the protein that are cleaved.
And that's called a furin site. And that is not present in the 2002 virus.
So there are some hypotheses out there that this furin cleavage occurs first.
And it actually enhances the probability and sort of the success of the second cleavage and then subsequent viral entry.
So there are some theories that the reason this virus is actually
more effective at infecting cells than the 2002 SARS virus is because of this furin cleavage site.
I think that's so interesting. And one of the facts mentioned in the discussion of both papers
was that a similar furin cleavage site is seen in an analogous protein in highly virulent avian
and human influenza strains. Let's talk more about how the virus binds the host cell and what happens next.
Tell me about this ACE2 molecule.
ACE2 is a really common protein found on the surface of
many different cell types in the human body. It stands for angiotensin-converting enzyme 2. And this is a
receptor that is involved in regulating blood pressure. So it has nothing to do with viral entry and it definitely
didn't evolve for that. But many of these viruses have found ACE2 to be a sort of convenient
receptor that they can co-opt to be able to get into the cell. And I think one of the reasons for this is
that the protease activity that the virus is using, this sort of cut and pull-in mechanism that we
talked about before, that's part of the natural mechanism of how ACE2 functions. What's important
here is that this receptor is the exact same one that was discovered to be involved in the 2002 SARS'
entry mechanism. And that's important because if we have a good understanding of how to drug the
ACE2 receptor following on the 2002 SARS work, we can potentially apply that to treatments for
this current coronavirus.
Yeah, the downside of that would be that since ACE2 is involved in extremely
critical human physiology, it might be almost impossible to drug without having some kind of
adverse side effect.
Yeah. Both papers look at the strength of the interaction between
2019 SARS and 2002 SARS with the ACE2 receptor. What did they find here?
So using different methods,
the Walls paper found that the 2019 SARS virus and the 2002 SARS virus bind to ACE2 with roughly the same affinity,
but then Wrapp et al. showed that the 2019 SARS virus binds with 10 to 20 times higher affinity, which is a huge difference.
They conjecture that this tighter binding is part of what makes it so virulent because it's able to more successfully get into cells following that binding event.
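For readers less familiar with binding measurements: affinity is commonly reported as a dissociation constant (KD), where a lower KD means tighter binding, so "10 to 20 times higher affinity" corresponds to a KD that is 10 to 20 times lower. The sketch below shows what a gap of that size does to the fraction of spike protein bound under a simple 1:1 equilibrium binding model. The KD values are hypothetical, picked only to give a 15-fold gap, and are not the values measured in either paper.

```python
def fraction_bound(ligand_nM, kd_nM):
    """Equilibrium fraction bound for simple 1:1 binding at a free ligand concentration."""
    return ligand_nM / (ligand_nM + kd_nM)


# Hypothetical dissociation constants, chosen only to illustrate a 15-fold gap.
kd_tight_nM = 10.0   # tighter binder (lower KD)
kd_weak_nM = 150.0   # weaker binder (higher KD)

fold = kd_weak_nM / kd_tight_nM
print(f"fold difference in affinity: {fold:.0f}x")

# Compare occupancy for both binders across a range of ligand concentrations.
for conc in (1.0, 10.0, 100.0):
    tight = fraction_bound(conc, kd_tight_nM)
    weak = fraction_bound(conc, kd_weak_nM)
    print(f"{conc:>6.1f} nM  tight={tight:.3f}  weak={weak:.3f}")
```

By construction, a binder is half-occupied when the ligand concentration equals its KD, which is why the tighter binder reaches high occupancy at concentrations where the weaker one is still mostly unbound.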
It's interesting, you know, they're publishing these studies one right after the other. They don't discuss why this might be different.
Yeah, and I think it's hard to say. They're using different techniques and this science is happening so live. It's actually really cool to think about this from the perspective of being at the bench. You're working on something and if a paper comes out that contradicts what you're still working on, that affects your work because it changes how you trust your own work and it changes what kind of data you're going to be looking for. But in this case,
this is all happening so quickly that they're finding what they're finding and they're publishing it.
And I think it's really cool to see it side by side.
Yeah, you get to see the evolution of the field and the thinking kind of at the moment it's happening.
And of course, that really bolsters the findings that do concord, like finding that ACE2 is the receptor and that the spike protein is really similar.
Since those groups got those same answers without influencing each other, that really is kind of the gold standard of replication in scientific publishing: having, you know, two
independent groups finding things in parallel.
Okay, let's talk about the next area where we saw differences between these two articles.
So this was in the section where they looked at whether antibodies against 2002 SARS can bind
to 2019 SARS.
There were key differences between these two papers in the types of antibodies that they were
looking at.
In the Walls et al. paper, they are looking at polyclonal antibodies, which were generated
by injecting mice with the purified S protein from the virus.
So the mice's immune system saw this S protein.
It generated antibodies in response.
The researchers then took a blood sample and purified the sera.
Those sera contain all sorts of different antibodies that bind to all different regions of the spike protein.
Wrapp et al. used three monoclonal antibodies, which had already been previously characterized
as binding to just a subdomain within the 2002 SARS spike protein.
So whereas in the Walls et al. paper, there's all sorts of antibodies in this mix
that can bind all over the spike protein, in Wrapp et al.,
these are three highly specific antibodies that bind to three highly specific sites in just one
tiny portion.
There's also a big difference in what they measure.
Using the monoclonal antibodies in Wrapp, they're measuring just
binding. So do the monoclonal antibodies bind to the spike protein at all? Whereas in Walls,
what they're measuring is whether the polyclonal antibody set prevents the virus from entering into
human cells. This is more phenotypic. It's more of like a does the entire mechanism
that you're trying to stop not happen versus just the binding alone, which is a piece of the process.
In this case, I'd say that the two papers show opposite results. Walls shows that you can stop
viral entry using these polyclonal antibodies, and Wrapp shows that there's no binding of the monoclonal
antibodies from 2002 SARS to the 2019 SARS virus. They're opposite results, but they're not necessarily
mutually exclusive results, because Wrapp et al. is only looking at three antibodies. You know,
there's an absolutely enormous number of possible antibodies against these viruses and even just
the spike protein. So the fact that three didn't work, you know, that could
just be scientific bad luck. There's also the fact that we read other papers to prepare for this
segment that supported both the Wrapp conclusion that there is no cross-reactivity and the Walls
conclusion that there is cross-reactivity. So the jury's still very much out. And the fact that
all these studies are using slightly different designs, different sources of antibodies, different
kinds of antibodies, different readouts, this is still definitely evolving. And it's part of what
makes the research right now so interesting is that so many groups are applying their expertise,
their backgrounds, their favorite techniques to this problem. And my gut feeling is that there's
some kind of nuance to the cross-reactivity here that we just haven't appreciated yet. Why do you
think they basically ended up doing the same set of experiments? I think that what we have here is
a demonstration of the types of science that you can get data on really quickly.
In the past, to get a structure of this quality, you would have had to do X-ray crystallography,
which requires first getting the protein to crystallize in this very specific manner, and it's a tricky thing to do.
But with cryo-EM, you just get the protein sample super, super cold and shoot a beam of electrons at it,
then determine the structure based on how the electrons bounced off.
There has been a real revolution in cryo-EM recently, making this method capable of determining the structure of very large protein complexes.
Also, the results of both of these papers are in vitro.
So we're getting really important information that can help vaccine design and therapeutics
design.
This is also the fastest type of research that can be done on these.
As the publication cycles continue and the research continues, we'll get different
insights coming out with experiments that take longer and that require mouse work or that
require kind of more complex experimental design.
Seeing these two papers side by side really speaks to
what can be done and what we can learn that's important and valuable in the shortest amount of time.
Thanks, Judy, for discussing these articles with me. And that's it for Journal Club this week.
You can find all episodes at a16z.com. Thanks for listening.
