CppCast - Rosetta
Episode Date: April 16, 2020
Rob and Jason are joined by Andrew Leaver-Fay from UNC and Jack Maguire from Menten AI. They first discuss a proposal to update both C and C++ and create a unified common core for the languages. Then they talk to Andrew and Jack about Rosetta, a C++ protein modeling library, its history of being ported from Fortran, and some of its use cases, such as creating HIV vaccines.
News
C2x Proposal: A Common C/C++ Core
CLion AMA Session on May 7
Third Annual C++ Foundation Developer Survey "Lite"
Links
Rosetta Commons - Software
Menten AI
Designing Peptides on a Quantum Computer
Rosetta@home
fold.it
Sponsors
PVS-Studio. Write #cppcast in the message field on the download page and get one month license
Read the article "Zero, one, two, Freddy's coming for you" about a typical pattern of typos related to the usage of numbers 0, 1, 2
Transcript
Thank you. In this episode, we discuss a common C and C++ core.
Then we talk to Andrew from UNC and Jack from Menten AI.
Andrew and Jack tell us about Rosetta,
a C++ library by C++ developers.
I'm your host, Rob Irving, joined by my co-host, Jason Turner.
Jason, how are you doing today?
I'm all right, Rob. How are you doing?
Doing fine.
Don't think I have too much news to share.
You got anything you want to talk about?
Nothing big at the moment,
although I guess I should mention that I will be coming on CppChat also this week.
So two for one if you really want to listen to me talk.
Awesome.
Who else are you going to be on with on CppChat?
I'm not sure, but it's going to be an episode about... well, I mean, I don't have all the names on hand at the moment, but it's going to be a little round table about training in this COVID era, basically.
Right. Okay, well, I should definitely tune into that one.
Okay. Well, at the top of our episode, I'd like to read a piece of feedback.
We got this tweet from Connor Hoekstra, who we've had on the show before.
And he says,
Episode 242 of CppCast with guest John Turner and hosts Rob and Jason was so amazing that I set up a blog to write about how awesome it was. Hear about TypeScript, Rustlang and other programming languages, Nushell, and more.
And yeah, Connor, thanks for writing this blog post.
I did read through it.
He was very excited about last week's episode, Jason.
He was. He was.
And yeah, I know John read the article also
and liked it.
It's very complimentary towards John
and the work that he has done.
Yeah, yeah. Well, it was really great talking to your cousin last week. It was fun.
Yeah, it was.
Okay, well, we'd love to hear your thoughts about the show. You can always reach out to us on Facebook, Twitter, or email us at feedback@cppcast.com, and don't forget to leave us a review on iTunes or subscribe on YouTube.
Joining us today are Andrew Leaver-Fay and Jack Maguire.
Andrew is a research assistant professor at UNC in the Department of Biochemistry.
He got his BA from UVA in Philosophy and Cognitive Science and his PhD from UNC's Department
of Computer Science.
As a postdoc in Brian Kuhlman's lab at UNC and later in David Baker's lab at UW, he led a team of developers in the rewrite and re-architecting of the Rosetta molecular modeling program ... spun out of the Kuhlman Lab at UNC. Andrew, welcome to the show.
Thank you. It's great to be here.
You have way more credentials than I do.
Sorry, I'm just going back and rereading all of these things here.
Although you could have gone to a better school for your undergrad.
Oh, no.
Sorry, Virginia Tech.
I just had to throw that out there when I had the opportunity. Every Thanksgiving we meet on the field.
Also joining us is Jack. Starting his career as a baby model, he took an early retirement at the age of one. He was first exposed to C++ while pursuing his bachelor's in chemistry at the University of Rochester, where he wrote programs to predict and design RNA folding patterns. He recently completed his PhD at the University of North Carolina, where he wrote programs to predict and design protein folding patterns. Jack now works at Menten AI on a team that uses quantum computing and machine learning to superpower the Rosetta protein modeling software. Jack, welcome to the show.
Thanks, everyone. It's nice to be here.
You're also way more credentialed than I am. But I have to ask a serious question. Were you
actually a baby model? Oh, yeah. Yeah, no, I don't lie about that.
But there's a gimmick, right? So I was pretty heavy as a baby. So, you know, that's really cute.
Got those Michelin folding arms. Yeah. Yeah. If there are any babies listening,
you know, now's a good time to
put on weight, much better than later in life.
So seriously, are you on, like, I don't know, baby food product labels or what? Like, what was it about?
I did a few shows where my mom carried me down a runway, and I don't think I was ever in catalogs.
That was the beginning of my career.
Was that your college fund or what?
Did it actually affect your life in any way?
I don't think I was on that scale.
Okay.
Have you ever been recognized on the street?
Luckily, I lost some of that weight. But if not... maybe not, maybe no. I'm still waiting for them to call back for another show, but, you know, I'm kind of reading between the lines.
That's terrible.
It's funny.
Okay, well, we have a couple news articles to discuss. Feel free to comment on any of these, guys, and then we'll start talking more about Rosetta.
Okay.
All right. So this first one, I don't expect anyone to have actually read this
entire paper because it's some 600 pages long, but it's "Programming Languages: A Common C/C++ Core Specification." And I have to say, that's like the only time I've seen "C/C++" and I'm okay with it.
It was actually used correctly, it would seem.
Yeah. Yeah, that's weird.
So this is a proposal, basically. You know, C and C++ have diverged a little bit over recent years, and they want to kind of bring it back together to have a common core that both languages share, which I think is a good idea.
So a lot of random interesting things in here, if you're up to reading it.
How much of this did you actually read through?
Very little, but I searched for things that I was interested in, like constexpr.
And there is a discussion at the end for future possible changes of adding constexpr and the spaceship operator
into C.
Really? So that would be part of the common core?
Okay. So how much of this, then, is adding new features to C based on what's been added to C++?
My takeaway is that a lot of it is more in that direction, because C++ has more built-in types.
So that was my takeaway,
but I don't know if the rest of you had any chance to read this.
I don't need to...
Yeah, Andrew or Jack, did you read through this at all?
No, I looked at it and thought,
we're just so heavily C++ focused
that I probably won't ever need the C features.
So I kind of thought I would let you guys talk about this.
I thought it was pretty timely after Jason's Doom video
to see just how different the two languages were,
and then to see this proposal come out.
I did the same thing where I just searched for a few terms.
I searched for ABI, and that didn't come up.
Yeah, I don't think ABI is specifically mentioned, but things like
the layout of objects to make sure that they are bit compatible, I think, is at least discussed.
And there's one of those other things in here that was like, "Should padding be represented as an array of void?" is a question for future discussion.
Okay, an array of void. I feel like that's an entirely separate can of worms, because doesn't that imply that void becomes a regular type? And Matt Calabrese's paper on that hasn't gone anywhere.
No, I remember talking to him about that.
Yeah, regular void, it's still in the back of my mind. It comes up whenever I have to do very generic programming.
Yeah.
Mm-hmm.
Okay.
Well, the next thing we have is that JetBrains is going to be hosting a live webinar.
I think this is in a couple weeks.
I think it was like May 7th? Yeah.
Thursday, May 7th at
5 Central Time, 3 GMT.
So if you want to
ask the JetBrains team anything about
CLion or, I'm sure, their other IDEs,
this should be informative.
I'm going to try to join it.
Yeah. And then the last thing
we have, a third annual
C++ Foundation Developer survey is going out.
And I guess they had some questions about whether or not they wanted to even do the survey, but they decided they were going to do it.
So you can go ahead and put in your feedback.
What's in this?
Jason, you have a funny look on your face.
I do. There's a minor problem with this, Rob.
What's that?
It's already been closed.
Has it?
It has.
Am I late?
You're not very late. You're only late by, like, one or two days, but it definitely will still be closed by the time this airs.
Okay, well, maybe we'll just cut this out, or I'll look silly. It'll be fine.
No, no, no. We should say it was out, and look forward to the results soon.
Yeah, there you go.
Good spin. Good spin.
Yeah. How did I miss that? I guess they don't have this out for very long.
No, it was only one week or something like that.
Yeah, it closes in one week and the article is posted on the 5th, so it was from the 5th to the 12th, I guess. Okay. Well, why don't we switch gears
then and start talking about
protein folding and what it is that Rosetta
does, because I don't think a whole lot of our listeners
are going to be well-versed on that. I don't know, I'm not
a biochemist, so could one of you maybe give us an overview
of what exactly it means? I have zero degrees
in chemistry,
just for the record. Oh, I'd be happy to take a stab here. So, proteins are molecules that our
bodies use for basically everything that they need to do. And not just humans, but like all life on
earth. They're actually like the most important of the molecules.
Everybody's heard of DNA from CSI and Law and Order, but DNA is really a rather boring molecule.
It doesn't do anything structurally that's very interesting. And the only thing it really does
is code for proteins. And so proteins, they're the purpose of DNA. They scaffold cells.
They communicate with other cells.
They receive messages from other cells.
They transmit signals.
They open and close pores in neurons to have neurons fire.
They're like everything that a cell does.
And so when you look at DNA, you can read off the sequence of the proteins that
the DNA is coding for, but you can't really know from just the sequence what the protein's going
to do or how it's going to behave. Really, in order to understand that, you need to know the
structure of the protein. Proteins, when you put them in water and not in a vacuum, will adopt this really compact structure. It's really intricately packed, and the chemistry is such that certain amino acids that make up the protein are on the interior and other amino acids will be on the exterior. And so, I mean, in broad strokes you can guess at how things are going to go, but to get the actual conformation that a protein adopts, you have to understand
exactly how things are going to come together. And so, what Rosetta does is it tries to search
through all the different possible combinations, but it can't enumerate them all. There are just vastly too many, many more than there are particles in the universe.
And so we have to be kind of smart
in the way that we do our sampling,
getting bits and pieces from existing structures
and trying to combine them to figure out
what a protein might look like when it's folded.
And so we use a ton of computational resources to do that.
We burn something like 300 million CPU hours per year
through a couple different sources.
One of the main ones is Rosetta at Home.
So people donate their home computers when they're idle
just as like a screensaver that you can download through BOINC,
the Berkeley Open
Infrastructure for Network Computing, I think is what it stands for. And so we use lots and lots
of computers for running Rosetta to try and predict their structures. And we also try to
design new proteins. And proteins are like the smallest motors and actors that we know of. And so if you want to control protein biosynthesis,
or if you, I'm sorry, like small molecule synthesis,
or if you want to try and create new drugs,
proteins are a really attractive target for that.
We actually have a couple really neat things in the clinic.
There's an enzyme that a bunch of undergraduates were able to
design using Rosetta to break down the protein that causes celiac disease. And so, you just, like,
take a pill or you drink that protein, it sits in your stomach. And then when you eat bread,
it breaks down the parts of alpha-glutenin, these glutamine
arginine stretches that can't be broken down by the enzymes that you already have in your stomach.
And so for celiac patients, once these proteins leave the stomach and enter the
small intestine, then they get an immune response. Their antibodies attack this protein
that's not really doing anything, but it messes up the rest of the small intestine. It causes
this large immune inflammation response. And so, if you can just break that down, then you have
like basically a way to eat bread. Yeah, so that's pretty exciting.
We also have people who are pushing a vaccine that's developed by computational modeling in Rosetta for HIV.
The problem with HIV is that when it packages itself up and sends itself out into the bloodstream,
it wraps itself in the membrane of the human cell that it left.
And so as it's floating around, it's basically invisible to the immune system
because it just looks like another human cell, except for one protein that sits on its surface.
And this protein, GP120, is how it recognizes the
next human cell that it's going to invade, the helper T cells that HIV kills. And so the immune
system can only see this one protein and try to memorize the shape of that protein. But that
protein can vary. It has a whole bunch of its surface that isn't terribly important.
And HIV just mutates a ton.
So that by the time the immune system has memorized the shape of GP120,
the HIV virus in your body has already mutated.
So that the next time it goes floating around through the body,
it's no longer recognized by the immune system.
But there's one section of GP120, this surface protein,
that can't mutate away because it's going to bind to a part of the human protein
on the helper T cells, CD4.
And the CD4 binding region of GP120 is like the exhaust port of the GP120 Death Star,
right? It's the place where if the immune system can target that, if it can memorize that section,
then HIV is hosed. So, what the people at Scripps who are developing this vaccine have done is
create a pared-down version of GP120
that just presents that exhaust port to the immune system.
And it's a little more complicated because the exhaust port is also kind of buried deep in a trench of GP120.
And so only certain antibodies of the immune system can recognize things sort of buried that deeply.
And so the vaccine is actually like a series of proteins that you would inject,
where the first one is like, let's take a look at what something in a trench would look like,
and then what is the exhaust port in a trench, and then what's the exhaust port in the trench
with the cannons on the surface firing at you, all these extra glycosylation points that are kind of distracting the immune system. And, you know, it works in apes. So you can take an ape and give it basically a human immune system, and give them the vaccine, and they're able to generate antibodies that neutralize HIV. And so the next step is human trials. And that's going
on right now. It's really exciting. I thought you were going to say the next step would be to
take the ape's immune system and put it back in the human system. It's already been immunized.
Well, I mean, you could imagine trying to treat patients with antibody injections,
and that may be working. I should look at that.
Because there are a couple where we sort of understand how broadly neutralizing antibodies work as a therapy for HIV.
And I guess there's a couple flaws.
One is that you can only inject antibodies.
And so if you were trying to treat someone with HIV, you'd have to inject them every six months and probably for the rest of their lives.
And so that's not a good way to eradicate the disease.
Really, for eradication, you need a vaccine.
You need to be able to have everyone in the world have the vaccine, and then it just disappears, right?
Or at least enough people so that you get herd immunity. But antibody delivery is not going to treat 8 billion people.
So for either one of you, you mentioned Rosetta at home, and it made me think back to my
university days when I remember folding at home was quite popular. And I haven't heard people
talk about it recently so much, but are those projects related in any way? It sounds like the same kind of thing
where they're folding proteins.
Yeah, Folding@home is developed by a different set of scientists
out of Stanford. And they do very short molecular dynamics simulations on people's home computers. Molecular dynamics
is sort of a different approach
to conformational sampling,
where you sort of
have all the atoms, and then
you put forces between them so
that they'll jiggle, and then you just
let them jiggle
very small time steps, like femtosecond
time steps, for a
while, and if you can get enough
trajectories, then you can kind of try to predict how proteins will fold. So there's two separate
problems that have the same name. One is, how do you predict, given the sequence of a structure,
what its folded state is? And people call this protein folding.
And then there's the other question, which is when a protein is in water,
how does it actually sample conformations until it gets to its folded state?
And that's also called protein folding.
So to be very technically correct, I sometimes talk about protein structure prediction is what we try to do with Rosetta. And protein folding is more of the reverse. If you know the folded structure,
then you can unfold it and sort of watch what pathway it travels as it unfolds. And I think
that's mostly what folding at home does. It runs lots of very short molecular dynamic simulations.
The problem with MD is that you need either one of the supercomputers that they've built up in Manhattan to do, like, a millisecond of simulation, or you have to run for a very long time on a single machine. Like, you can do microsecond simulations in MD, but proteins fold on the scale of seconds to tens of seconds.
And so MD is still sort of outside of what we can do for protein structure prediction.
Could we maybe talk a little bit more about the Rosetta library itself?
What's the history on it? How long has it been around?
Yeah, so Rosetta has been around since the late
90s, and it was originally written in Fortran 77 to do protein structure prediction. And so it took small sections of lots of proteins and sort of made little Frankenstein monsters out of them, taking a good section from this protein and a good section from that one
and trying to see what a protein would look like if you glued them together that way.
And it did remarkably well, and sort of the functionality of Rosetta has expanded over time
so that it solves more and more problems, eventually adding protein design, protein loop modeling,
protein small molecule docking, protein-protein docking.
Lots of interesting problems in the field of computational structural biology. It's become sort of the de facto leader in protein structure prediction and protein design.
And so the software itself, it was written in Fortran.
And then back in 2004 or thereabouts,
it was mechanically ported into C++.
So it was compiled from Fortran into C++
instead of into machine code.
And so we had like a C++ version of Fortran code.
So global arrays,
functions that took three parameters but then interacted with a whole bunch of global data in order to get all of the rest of the
parameters. And it was very difficult to understand. I remember there was a project of trying to get this one module in Rosetta that runs protein design to eliminate all the global data in that.
And someone was working on that for like six months
and declared themselves done.
And I kind of went and looked and realized that the only things that had really changed were the three visible parameters; all the other ones that were sort of sneaking in through the back door were still there.
Anyway, so it was very difficult to figure out how the code was going to behave.
Global data makes things very difficult to understand.
And you can't multi-thread it in any way.
So back in 2007, we started a conversion to create a fully object-oriented version of Rosetta.
And we have eliminated most, but not all, of the global data.
There's still like a couple things which are a thorn in my side to this day.
But everything is objects now.
We don't have these big global arrays.
So that's better.
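To make the before-and-after concrete, here is a rough sketch (illustrative names, not actual Rosetta code) of a Fortran-style routine that pulls most of its inputs from globals, next to an object-oriented version where every input is explicit:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch, not code from Rosetta.

// Fortran-style: three visible parameters, while the rest of the inputs
// come in "through the back door" via shared global arrays.
namespace legacy {
    extern double phi[1000];
    extern double psi[1000];
    extern int    nres;

    double score_residue(int i, double weight, bool verbose);  // also reads phi, psi, nres
}

// Object-oriented: all of the state lives in an object that is passed in,
// so the function's real inputs are visible at the call site.
class Pose {
public:
    double phi(std::size_t i) const { return phi_[i]; }
    double psi(std::size_t i) const { return psi_[i]; }
    std::size_t size() const { return phi_.size(); }
private:
    std::vector<double> phi_, psi_;
};

double score_residue(Pose const & pose, std::size_t i, double weight);
```

With the object version, the hidden inputs become visible parameters, which is also what makes it possible to run many independent calculations in parallel.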
How long did that take?
It went remarkably fast.
Let's see, so we started in earnest in February of 2007, and by August we had the design module up and running and, like, most of the functionality there. And then we began porting lots of other things over, so that by the end of 2008 we sort of declared ourselves done. I mean, we have added a lot more functionality since, so it continues to expand. It's now up to three million lines of C++. A lot of that is duplication, but a lot of it isn't.
So did you create a whole new structure and then port the old code into the new structure?
Right. Yeah.
So we wanted to have the algorithms work as best we could still with the old code.
Right. So that as you're porting functions over, the indices that you're using to decide what position you're going to be cutting your loops at, they made sense in both, or it was the same indices before and after.
And so we index from one in Rosetta. Like, we're scientists; most of Rosetta is written by biochemists.
They count from one, so we count from one.
We don't do the C and C++ style counting from zero. So that makes it a little bit different. But yeah, we did try to
keep the code. I mean, preserve as much of the original code as possible, because it's easier,
right? We didn't want to rewrite every line, but we did want to use objects to control how data is stored.
And so that's sort of the biggest change from the Fortran-like C++ to the current version.
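As a rough illustration of what 1-based indexing can look like in C++ (a hypothetical wrapper, not Rosetta's actual container):

```cpp
#include <cstddef>
#include <vector>

// Minimal 1-indexed wrapper: element 1 is the first element, matching the
// way biochemists number residues. Hypothetical sketch only.
template <typename T>
class vector1 {
public:
    explicit vector1(std::size_t n) : data_(n) {}
    T & operator[](std::size_t i)             { return data_[i - 1]; }
    T const & operator[](std::size_t i) const { return data_[i - 1]; }
    std::size_t size() const { return data_.size(); }
private:
    std::vector<T> data_;
};

// Usage: residues run from 1 to size(), inclusive.
// vector1<double> phi(n);
// for (std::size_t i = 1; i <= phi.size(); ++i) { phi[i] = 0.0; }
```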
Jack, what version of C++ is Rosetta using currently?
So we're using C++11, we switched over in 2015-ish because we needed to wait for all of the supercomputers to get more modern compilers
because all of the Rosetta clients have the option to compile their own source code locally.
Right.
And so we needed newer compilers to disperse before we could upgrade.
However, we do have the ability: Rosetta can compile with 14 and 17 with minimal changes, which have already been done. So it's optional, but we still live at C++11.
So, well, this whole story, for both of you, is very close to home
because I've been involved in a project for the last 10 years, off and on, that is Fortran that has been converted to C++ through an automatic conversion tool, and we're struggling through some of these things. We've got, I don't even know, thousands of global variables; I think it's uncountably many. Anyhow, they are currently on C++11 on the project, and I've been saying, well, can we move to 14 or 17? And the answer I get back is, oh sure, we already, you know, we require whatever compilers with Ubuntu 18.04, whatever. But I can't actually get them at
the moment to commit to a specific C++ version.
So I'm still, like, stuck on C++11, which for a few things is really painful, like generic lambdas; it's much easier to write a lambda in C++14 in many cases. And so, yeah, I was just curious, like, are you specifically stuck on C++11? Or did you just draw the line there? Or could you probably switch it to C++14 and no one would even notice
because they're using GCC 4.9 or better?
Or do you know?
The people that would notice would be mad.
So most people would not notice.
I often develop in C++14 and 17
then downgrade when it's time to push to master.
Because that's faster, right?
There's a reason they have the new tools and the newer versions.
I never even considered doing that.
Yeah, because a lot of the new things in 14 and 17 can be hacked into 11.
Yeah, a lot of them. Yeah.
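For instance (an illustrative sketch, not something shown in the episode), a C++14 generic lambda can usually be backported to C++11 as a small function object with a templated call operator:

```cpp
#include <algorithm>
#include <vector>

// C++14: a generic lambda works for any element type with a size() member.
// auto by_size = [](auto const & a, auto const & b) { return a.size() < b.size(); };

// C++11 equivalent: hand-roll the closure as a struct with a templated operator().
struct BySize {
    template <typename A, typename B>
    bool operator()(A const & a, B const & b) const { return a.size() < b.size(); }
};

void sort_by_size(std::vector<std::vector<int>> & rows) {
    std::sort(rows.begin(), rows.end(), BySize());
}
```

The struct version is more verbose, but it behaves the same at the call site.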
We don't do as much constexpr as you.
Well, in this project I'm doing very little constexpr, although I am pushing them towards it in a few places. One of the things which might be familiar to you all is that at the top of every function, we have a const static std::string with the function name. It's just a representation of what the current function name is. I don't know if this has some history in the way Fortran used to be done back in the day, so that if there are any log messages after that, they can just use that function name that we hard-coded in there.
Well, interestingly, that const static std::string is a performance pessimization every single time the function is entered, because it has to do a thread-safe check to see if that string has been initialized yet. And I'm like, if we had C++17, I'd do a find and replace on every single const static string, replacing it with a constexpr static string_view, and that would all be free after that. But, you know, I haven't been able to flip that big switch yet.
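Here is a sketch of the pattern being described, with hypothetical function names rather than anything from the actual codebase:

```cpp
#include <string>
#include <string_view>

// Existing pattern: a function-local static std::string. Every call has to
// perform a thread-safe "has this been initialized yet?" check before using it.
double score_pose_old() {
    static const std::string function_name = "score_pose";
    // ... log with function_name ...
    return 0.0;
}

// C++17 replacement: a constexpr std::string_view is a compile-time constant,
// so there is no runtime initialization and no guard check at all.
double score_pose_new() {
    constexpr static std::string_view function_name = "score_pose";
    // ... log with function_name ...
    return 0.0;
}
```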
Anyhow, that's the kind of thing that we run into on projects like this.
This episode is sponsored by PVS-Studio, a static code analyzer that performs a search for errors and especially typos. The tool supports the analysis of C, C++, C#, and Java code. The article "Zero, one, two, Freddy's coming for you", which has been recently posted, clearly demonstrates the analyzer's outstanding ability to find typos. The article shows how easy it
is to make a mistake, even in simple code, and that no one is immune from making it. You can
find a link to the article in the podcast description. There's also a link to the PVS-Studio download page. When requesting a license,
write the hashtag CppCast and you'll receive a trial license for a full month instead of one week.
One thing I'm curious about is it sounds like both of you are very well versed in C++,
but it sounds like a lot of the people who work on this project are maybe more focused on
biochemistry and maybe don't have as much C++ background. What's it like having all of those
people brought in to work on Rosetta? Is that a challenge?
Yeah, it's definitely a challenge. It's both a good and a bad thing, right? That there's a lot
to protein design that's just imagination limited, right?
There's fewer things that I can imagine
than a whole bunch of biochemists can.
But at the same time,
it's really kind of nice
to have people developing the code
who understand all the rules
of what you should not do.
Like let's try to avoid global variables
or let's not change a structure in the middle of
trying to score it. And so, yeah, it is a challenge to work with as many biochemists as we do. But I
think it is useful that they're there. Yeah, so one thing that we've relied heavily on in C++ is const as a decorator for functions and methods.
And so that helps us describe to other developers what we don't want them to do, and prevents them from doing it with a piece of data in the middle of execution, while still also giving us good performance, right?
You can hand out the coordinates of a structure to inline functions without having to worry about them being modified by those functions you're handing them out to.
And that's important for like making sure the code
is actually computing what you think it should be computing.
There have been multiple times where people have suggested, hey, can we change the coordinates of this residue in the
middle of scoring the structure? And then if you imagine that, then some amount of scoring has
taken place, and then the structure changes, and then some other amount of scoring takes place,
and then you're not really sure what you've computed at the end. So const keeps us from having naive biochemists come in and change it.
And they're naive enough not to know that they can just cast away the const if they so desire.
Right?
That's another feature of the language.
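A minimal sketch of that idea, with illustrative names rather than Rosetta's real API: a const member function hands out read-only coordinates, so a scoring routine cannot change the structure mid-score.

```cpp
#include <cstddef>
#include <vector>

struct Vec3 { double x, y, z; };

class Residue {
public:
    // Const access: callers can read coordinates but not modify them.
    std::vector<Vec3> const & coords() const { return coords_; }
    void set_coord(std::size_t i, Vec3 const & v) { coords_[i] = v; }  // non-const mutator
private:
    std::vector<Vec3> coords_;
};

// Scoring only needs read access, so it takes the residue by const reference.
double score(Residue const & res) {
    double total = 0.0;
    for (Vec3 const & v : res.coords()) { total += v.x + v.y + v.z; }
    // res.set_coord(0, {0.0, 0.0, 0.0});  // would not compile: res is const
    return total;
}
```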
What's scary, though, and we might talk about this more later, is a lot of our Rosetta users use the Python bindings for it. And when you use objects, you have some idea of what you're not supposed to do, but there's no protection for a lot of the users at that level.
Yeah. Yeah, I feel like, in my experience as a trainer, if someone said, can I modify this while it's being scored, I would say I have no idea what you're talking about. But no.
Yeah, the Python thing. There's one particular place in PyRosetta where
expert users have had the same bug where you can get access to the kinematic description of the system.
And it's supposed to be a const access.
Like it's const in the C++ part,
but it comes through the Python bindings
as just you have the object, right?
You have a pointer to it.
And so you can modify it directly
without the object that's holding it.
So the fold tree is describing the kinematics
and it's held by the conformation. When you change the fold tree in the conformation, the conformation needs to update some of its other data members. And so if you change it directly, it breaks. And that's actually messed up multiple people, senior developers even. So that's one of the frustrating parts of having something that is safe in C++ and then not safe in Python; it'd be kind of nice if we could enforce constness in Python.
I don't know how to do that.
So does that imply that at some point in your bindings, there's actually a const cast or something happening?
I think the bindings might return a const object, but Python doesn't do anything with that const. Python doesn't care if an object is const or not.
Right. Maybe I'm misunderstanding the question. I don't know. I'm thinking at what point can they
actually modify it? If it has a const object and then they're able to call a non-const member
function on it, that means at some point in your bindings, the constness was dropped.
Because otherwise, the bindings themselves wouldn't be able to generate the code
to call the non-const member function.
Something had to get lost.
Yeah, probably when the object is being returned,
it needs to be put into some sort of generic container, right?
It knows it's a pointer to a fold tree,
but Python can't distinguish between
pointers to fold trees and pointers to constant fold trees. And so it just casts the const away at that moment.
That sounds probably right. Yeah.
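Rosetta's own bindings are generated with Binder (discussed next), but the effect is easy to show with a tiny pybind11-style sketch, using simplified stand-ins for the fold tree and conformation classes: the C++ accessor is const, yet the object Python receives is freely mutable.

```cpp
#include <pybind11/pybind11.h>
namespace py = pybind11;

struct FoldTree {
    int num_edges = 0;
    void clear() { num_edges = 0; }  // mutating operation
};

struct Conformation {
    // Const in C++: callers cannot mutate the tree through this reference.
    FoldTree const & fold_tree() const { return tree_; }
    FoldTree tree_;
};

PYBIND11_MODULE(example, m) {
    py::class_<FoldTree>(m, "FoldTree")
        .def("clear", &FoldTree::clear);
    py::class_<Conformation>(m, "Conformation")
        .def(py::init<>())
        // Hand back a reference to the internal object. Python has no notion
        // of const, so the returned handle is freely mutable from Python.
        .def("fold_tree", &Conformation::fold_tree,
             py::return_value_policy::reference_internal);
}
```

From Python, `Conformation().fold_tree().clear()` runs without complaint and mutates the conformation's internal state, which is exactly the kind of back-door modification being described.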
I'm curious what tools you're using, now that we've gone this far, to generate your Python bindings. Or are you?
For a while we were using pybind11 and Boost.Python, but at a certain point that was just leading to too many headaches. So a developer in the community, Sergey Lyskov, developed his own Python wrapping library, which he calls Binder. And so that's available and is open source.
And he supports it actively.
I can't tell you too much about how it works, though.
But it's pretty neat stuff.
And I know that it's now much easier to compile PyRosetta yourself.
So if you're a developer and you're adding new functionality,
but you want to use that functionality immediately,
then you can just compile your own Python bindings,
and that used to be very difficult.
The other way of getting Python bindings is to wait until the testing server updates,
and then I think we have a weekly release schedule
so that you can download last week's version of PyRosetta right now.
Yeah, most of our tools are sort of hand-rolled. We're not using a ton of things that maybe we should be. It'd be nice to have, I don't know, like a better sense of what tools are out there
that we should be using. I mean, I've used Swig for Python bindings,
but it has its own different set of headaches.
So it's, you know, you pick your poison.
I personally like Boost Python.
That always treats me well for my own personal projects.
But this binder tool that we use for Rosetta,
like we never have to think about it.
It just works.
And I don't know how much overhead was required
to get that in a state where it is,
but you can pass lambdas between the two languages.
Any complication you can think of, it just works.
It's nice to be able to subclass the C++ classes in Python
and then hand them back to containers
that take those classes
and have Python functions being invoked by C++.
It's a really nice feature,
especially if you're creating a protocol that has multiple stages,
like at stage 5 to have that implemented in Python,
but using the C++ code for stages
one through four.
That feels like magic when you do that kind of thing.
Yeah, it's great.
If the listeners have never
used Python and they've never
tried
bindings before, I highly recommend it.
I never really liked Python until I had that power
and it
just feels incredible. jack i was wondering
if you might want to tell us a little bit more about uh the work you mentioned in your bio with
using quantum computing and machine learning with rosetta yeah sure um so just to quickly re
summarize the problem that we work in um The protein itself is a linear chain of units,
like beads on a necklace, to use Andrew's analogy. And the chain itself is flexible,
but each bead can also adopt one of 20 different chemistries. So there are these two different sampling states, because the chemistries
are flexible themselves. So while the big main chain can fold in on itself in many different ways,
each bead can adopt one of 100 or 1,000 different states. So there's a combinatorial optimization problem with the
beads changing states with a very large sample space problem of the main chain changing shape.
What I do mostly is protein design, which is where you can change the chemistries of the protein to
optimize the behavior of it in the cell.
And when you do that, then you're mostly focusing on the state changing sampling
so you can... the main chain of your protein is already in some
conformation that you want to keep, but the chemistries of each bead on the
chain, you want to be able to change.
And like I said, there are tens or hundreds or thousands of different states for hundreds of different beads.
So the combinatorial optimization of that problem is the inner loop of sampling.
And then the outer loop, you allow the main chain to slightly modify to adopt chemistry changes. So the inner loop as a combinatorial optimization problem translates well for quantum computers. And that's
what we're doing at Menten AI: mapping the Rosetta inner loop, what we call packing, where you try and fit these chemistries into a dense protein core, and you map that into a combinatorial optimization system that can be put on the quantum computer. And there are already tools for that on classical computers, but quantum's on the rise, so we're trying to jump on that. And then with the outer loop, we're trying to drive that with machine learning.
So we're doing quantum wrapped up in machine learning.
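To make that inner loop concrete, here is the usual way the packing objective is written down (a minimal illustrative sketch, not Menten's or Rosetta's code): choose one discrete state per position so that the sum of one-body and pairwise energies is as low as possible.

```cpp
#include <cstddef>
#include <vector>

// rotamer[i]            = which discrete state position i adopts.
// one_body[i][r]        = energy of state r at position i on its own.
// two_body[i][j][r][s]  = interaction energy of state r at i with state s at j.
double packing_energy(std::vector<std::size_t> const & rotamer,
                      std::vector<std::vector<double>> const & one_body,
                      std::vector<std::vector<std::vector<std::vector<double>>>> const & two_body) {
    double e = 0.0;
    for (std::size_t i = 0; i < rotamer.size(); ++i) {
        e += one_body[i][rotamer[i]];
        for (std::size_t j = i + 1; j < rotamer.size(); ++j) {
            e += two_body[i][j][rotamer[i]][rotamer[j]];
        }
    }
    return e;  // the combinatorial problem: minimize this over all rotamer choices
}
```

Minimizing this energy over all the discrete choices is the combinatorial problem that gets handed to a classical solver or, in Menten's case, reformulated for quantum hardware.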
What is it?
I mean, are you programming in C++ on the quantum computer?
Like, how does it, what do you do?
We have collaborators that do the quantum computer programming for us.
All we need to do for that is set up the
set up the problem, and then there's an API. I would be surprised if it was C++; I think it's very low-level machine code. I've never looked at it.
My full understanding of quantum computing in general comes from science fiction. As I understand it, you set up the problem, and then someone in an alternate reality provides you the answer. I'm pretty sure that's how it works.
Yeah. I mean, for us, the alternate reality is in Toronto, but okay, that sounds about right.
Yeah.
It's really neat. So the quantum computer itself manifests your problem in real life at a very small quantum level, and then uses real physics to sample the landscape of solutions to your problem. It leverages advanced physics that I never got deep enough into to understand, to use, you know, real-life physics to find the solution to
your computational problem. So it physically models the problem, effectively.
Yeah. Now, there are some shortcuts that need to be taken
with modern-day quantum computers, so you don't always get the best solution,
but we're investing now so that as quantum computers get better, hopefully we can make better use of them.
Interesting. I have no idea what any of that means.
It's fun, though. It's fun to be at the level where we are, where we don't need to write the code that goes on the quantum computer, but we get to use it.
Right. And just at a ballpark, how much faster is its results than what you were doing with classical algorithms?
Well, we're limited at the moment by the size of the quantum computers,
but that's quickly going away.
It's a rapidly growing field.
But for all of the problems that we're running now,
they can be solved on classical computers.
So at the moment, there's no runtime advantage right now.
Well, to be fair, the problems,
when they're being solved on the classical computer,
I mean, they're NP-complete. So we're just approximating in either case, a stochastic technique for doing the optimization. But, like, the quantum computer is supposed to give you the answer, except when it doesn't. And so sometimes it gets, you know, significantly better energies than what we're doing with our... oh, I've gotten frozen again.
I guess I'll finish the thought.
Sometimes.
The quantum computer doesn't always give you the best answer.
But in theory, if you had a sizable computer, it will.
I don't know the total limitations.
Like they operate at very low temperatures and there are physical limitations to the size of problem you can sample, but you can approximate to fit larger problems.
If listeners are interested in this, there's a preprint out from our company, Menten AI.
And maybe I can send that to you for the show notes.
Absolutely.
If you look it up, Vikram Mulligan is the big name behind it, and Hans Melo.
And it's an interesting problem.
I'm excited to learn more about it.
I've only been at Menten AI a few months now, so I'm still... I haven't personally run the quantum program yet.
Now, again, just based on my science fiction research from what I know of quantum computing,
the other problem is that it could give you the answer yesterday and you missed it, right?
Yeah.
I'm pretty sure that's right, true.
So Vikram, like I mentioned, he pointed out to me that if there are alternate dimensions, and Jack Maguire is running a quantum computer in one dimension and Jack Maguire is running a quantum computer in a different dimension, we'll get different results. So somewhere there's a Jack Maguire that just got a really good Rosetta result from the quantum computer, and then I just got, you know, the average one. Or maybe I'm the one that got the good result. So, you know, the science fiction is very much in play.
That's great. Just on a side note, since we're talking about quantum computing and science fiction,
there was a great show that just started on Hulu called Devs.
I would recommend it.
It's all about quantum computing and software developers using it.
Very cool.
All I can say is you all are using quantum computers to help generate novel proteins.
This sounds like the end of the
world.
Well, we'll find out soon. Or the beginning of the world.
Yeah. Andrew, you talked a lot about, you know, some of the successes of what's being done and solved with Rosetta, and progress with HIV vaccines. At the risk of
pissing off our listeners by talking about COVID again, is there any chance that there's people
out there like trying to make a COVID vaccine using Rosetta proteins? Yeah. So another,
another thing that people have done is design little cages, like little balls of protein. And then on those things, you can present
other proteins on them. And so there's actually a Rosetta lab out of Seattle that's making a COVID
vaccine, basically. So they've taken the S protein from COVID. And I think they
had been looking at it before. So I mean, the coronaviruses have been around for forever.
And in this particular one, SARS-CoV-2, they've taken its protein and put it on their nanocage system and are looking to see if that will lead to a vaccine.
I know, I mean, this isn't Rosetta, this is just biology, but there's this vaccine that, I happen to have a friend in Seattle who was part of the vaccine trial for it, where the idea is to take the gene, basically, one of the genes, as not DNA but instead RNA, and inject it into you, and then you'll just make that protein. Because we're very good at fighting, like, or identifying, pathogens
if they come in looking like pathogens.
But if we just get instructions of how to build proteins,
we'll just take those instructions.
It's like almost a computer virus, right?
And start executing those instructions, make the proteins,
and then you have this foreign protein that's sort of circulating in your body,
and it can elicit an immune response.
So there's a vaccine trial that's going on right now for that, and I've got a buddy who got the injection.
But that's not a Rosetta thing.
That's another company.
But yeah, Rosetta.
Go ahead.
I was just going to say, the short answer is there are a lot of people trying to model COVID right now, and Rosetta is one of the big tools to do it. I didn't mean to interrupt there, Andrew.
We actually had a virtual conference, like, a week and a half ago, where a bunch of people within the Rosetta community were trying to talk about all the different ways that they're using Rosetta to try and tackle this current global pandemic.
And so there's a lot of exciting work going on.
Yeah, but I mean, certainly one of the most exciting is trying to get a vaccine.
But, you know, if you're going to give a vaccine, you're administering it to a lot of healthy people.
And so you need to make sure that it's not going to kill them. And that takes time, right? You can imagine creating the vaccine
in a matter of weeks, but making sure that you could actually give it to people and not kill
them, that's what takes the year and a half. And it's a good thing to do. So, yeah, I mean, while Rosetta and lots of other techniques are, like, gearing up to give us vaccines, knowing whether any one of them could be used is what's sort of delaying our ability to go and use them.
With my extraordinarily limited biology knowledge, though, you're talking about
this protein ball that they could attach other proteins onto. Would the goal be then to expose
that protein to your immune system in a harmless way, basically? Yeah, I mean, it's effectively
harmless if it's not able to infect your own cells, right? It's just a ball. And the purpose of it being a ball is that you have like 50 copies of it. And for biochemistry that I won't go into, like the more copies you have,
the easier it is to sort of get weak binders that become better binders.
And so there are like ways to present to the immune system a starting point and get it ramped
up. So it looks a lot like what you imagine the virus itself would look like, except it doesn't have a virus inside of it. It's just an empty ball, and so it's not going to infect you. It's just a mimic. Your immune system will still respond to it, get its own antibodies, and then be able to react to any invaders that are coming in.
Right.
Okay.
Sounds neat.
Well, it's been great having you both on the show today.
Is there anything else either of you want to go over before I let you go?
If it's okay, I'd like to mention that Menten is hiring data scientists more than C++ programmers. But if you're a data scientist listening, machine learning, feel free to reach out to Menten. I'll include the website, maybe, in the show notes.
We probably have a few listening.
Sure.
Okay. Well, thank you both for coming on the show today.
Thank you for having us. Thanks for coming. Thanks so much for listening in as we chat about
C++. We'd love to hear what you think of the podcast. Please let us know
if we're discussing the stuff you're interested in,
or if you have a suggestion for a topic, we'd love
to hear about that too. You can email
all your thoughts to feedback at cppcast.com.
We'd also appreciate
if you can like CppCast on Facebook
and follow CppCast on Twitter.
You can also follow me at
Rob W. Irving and Jason at Lefticus
on Twitter. We'd also like to
thank all our patrons who help support the show through Patreon.
If you'd like to support us on Patreon, you can do so at patreon.com slash cppcast.
And of course, you can find all that info and the show notes on the podcast website
at cppcast.com.
Theme music for this episode is provided by podcastthemes.com.