CppCast - Rosetta
Episode Date: April 16, 2020
Rob and Jason are joined by Andrew Leaver-Fay from UNC and Jack Maguire from Menten AI. They first discuss a proposal to update both C and C++ and create a unified common core for the languages. Then they talk to Andrew and Jack about Rosetta, a C++ protein modeling library, its history of being ported from Fortran, and some of its use cases, such as creating HIV vaccines.
News
C2x Proposal: A Common C/C++ Core
CLion AMA Session on May 7
Third Annual C++ Foundation Developer Survey "Lite"
Links
Rosetta Commons - Software
Menten AI
Designing Peptides on a Quantum Computer
Rosetta@home
fold.it
Sponsors
PVS-Studio. Write #cppcast in the message field on the download page and get one month license
Read the article "Zero, one, two, Freddy's coming for you" about a typical pattern of typos related to the usage of numbers 0, 1, 2
Transcript
Thank you. In this episode, we discuss a common C and C++ core.
Then we talk to Andrew from UNC and Jack from Menten AI.
Andrew and Jack tell us about Rosetta,
a C++ library by C++ developers.
I'm your host, Rob Irving, joined by my co-host, Jason Turner.
Jason, how are you doing today?
I'm all right, Rob. How are you doing?
Doing fine.
Don't think I have too much news to share.
You got anything you want to talk about?
Nothing big at the moment,
although I guess I should mention that I will be coming on CppChat also this week.
So two for one if you really want to listen to me talk.
Awesome.
Who else are you going to be on with on CppChat?
I'm not sure, but it's going to be an episode about... well, I mean, I don't have all the names on hand at the moment, but it's going to be a little round table about training in this COVID era, basically.
Right. Okay, well, I should definitely tune into that one.
Okay. Well, at the top of our episode, I'd like to read a piece of feedback.
We got this tweet from Connor Hoekstra, who we've had on the show before.
And he says,
Episode 242 of CppCast with guest John Turner and hosts Rob and Jason was so amazing that I set up a blog to write about how awesome it was. Hear about TypeScript, Rustlang and other programming languages, Nushell, and more.
And yeah, Connor, thanks for writing this blog post.
I did read through it.
He was very excited about last week's episode, Jason.
He was. He was.
And yeah, I know John read the article also
and liked it.
It's very complimentary towards John
and the work that he has done.
Yeah, yeah. Well, it was really great talking to your cousin last week. It was fun.
Yeah, it was.
Okay, well, we'd love to hear your thoughts about the show. You can always reach out to us on Facebook, Twitter, or email us at feedback@cppcast.com, and don't forget to leave us a review on iTunes or subscribe on YouTube.
Joining us today are Andrew Leaver-Fay and Jack Maguire.
Andrew is a research assistant professor at UNC in the Department of Biochemistry.
He got his BA from UVA in Philosophy and Cognitive Science and his PhD from UNC's Department
of Computer Science.
As a postdoc in Brian Kuhlman's lab at UNC and later in David Baker's lab at UW, he led a team of developers in the rewrite and re-architecting of the Rosetta molecular modeling program ... spun out of the Kuhlman Lab at UNC. Andrew, welcome to the show.
Thank you. It's great to be here.
You have way more credentials than I do.
Sorry, I'm just going back and rereading all of these things here.
Although you could have gone to a better school for your undergrad.
Oh, no.
Sorry, Virginia Tech.
I just had to throw that out there when I had the opportunity. Every Thanksgiving we meet on the field.
Also joining us is Jack. Starting his career as a baby model, he took an early retirement at the age of one. He was first exposed to C++ while pursuing his bachelor's in chemistry at the University of Rochester, where he wrote programs to predict and design RNA folding patterns. He recently completed his PhD at the University of North Carolina, where he wrote programs to predict and design protein folding patterns. Jack now works at Menten AI on a team that uses quantum computing and machine learning to superpower the Rosetta protein modeling software. Jack, welcome to the show.
Thanks, everyone. It's nice to be here.
You're also way more credentialed than I am. But I have to ask a serious question. Were you
actually a baby model? Oh, yeah. Yeah, no, I don't lie about that.
But there's a gimmick, right? So I was pretty heavy as a baby. So, you know, that's really cute.
Got those Michelin folding arms. Yeah. Yeah. If there are any babies listening,
you know, now's a good time to
put on weight, much better than later in life.
So seriously, are you on, like, I don't know, baby food product labels or what? Like, what was it about?
I did a few shows where my mom carried me down a runway, and I don't think I was ever in catalogs.
That was the beginning of my career.
Was that your college fund or what?
Did it actually affect your life in any way?
I don't think I was on that scale.
Okay.
Have you ever been recognized on the street?
Luckily, I lost some of that weight. But if not... maybe not, maybe no. I'm still waiting for them to call back for another show, but, you know, I'm kind of reading between the lines.
That's terrible.
It's funny.
Okay, well, we have a couple news articles to discuss. Feel free to comment on any of these, guys, and then we'll start talking more about Rosetta.
Okay.
All right. So this first one, I don't expect anyone to have actually read this
entire paper because it's some 600 pages long, but it's "Programming Languages: A Common C/C++ Core Specification." And I have to say, that's like the only time I've seen "C/C++" and I'm okay with it.
It was actually used correctly, it would seem.
Yeah. Yeah, that's weird.
So this is a proposal, basically. You know, C and C++ have diverged a little bit over recent years, and they want to kind of bring it back together to have a common core that both languages share, which I think is a good idea.
So a lot of random interesting things in here, if you're up to reading it.
How much of this did you actually read through?
Very little, but I searched for things that I was interested in, like constexpr.
And there is a discussion at the end for future possible changes of adding constexpr and the spaceship operator
into C.
Really? So that would be part of the common core?
Okay. So how much of this, then, is adding new features to C based on what's been added to C++?
My takeaway is that a lot of it is more in that direction, because C++ has more built-in types.
So that was my takeaway,
but I don't know if the rest of you had any chance to read this.
I don't need to...
Yeah, Andrew or Jack, did you read through this at all?
No, I looked at it and thought,
we're just so heavily C++ focused
that I probably won't ever need the C features.
So I kind of thought I would let you guys talk about this.
I thought it was pretty timely after Jason's Doom video
to see just how different the two languages were,
and then to see this proposal come out.
I did the same thing where I just searched for a few terms.
I searched for ABI, and that didn't come up.
Yeah, I don't think ABI is specifically mentioned, but things like
the layout of objects to make sure that they are bit compatible, I think, is at least discussed.
And there's one of those other things in here that was like, "Should padding be represented as an array of void?" is a question for future discussion.
Okay, an array of void. I feel like that's an entirely separate can of worms, because doesn't that imply that void becomes a regular type? And Matt Calabrese's paper on that hasn't gone anywhere.
No, I remember talking to him about that.
Yeah, regular void, it's still in the back of my mind. It comes up whenever I have to do very generic programming.
Yeah.
Mm-hmm.
Okay.
Well, the next thing we have is that JetBrains is going to be hosting a live webinar.
I think this is in a couple weeks.
I think it was like May 7th? Yeah.
Thursday, May 7th at
5 Central Time, 3 GMT.
So if you want to
ask the JetBrains team anything about
CLion or, I'm sure, their other IDEs,
this should be informative.
I'm going to try to join it.
Yeah. And then the last thing
we have, a third annual
C++ Foundation Developer survey is going out.
And I guess they had some questions about whether or not they wanted to even do the survey, but they decided they were going to do it.
So you can go ahead and put in your feedback.
What's in this?
Jason, you have a funny look on your face.
I do. There's a minor problem with this, Rob.
What's that?
It's already been closed.
Has it?
It has.
Am I late?
You're not very late. You're only late by, like, one or two days, but it definitely will still be closed by the time this airs.
Okay, well, maybe we'll just cut this out, or I'll look silly. It'll be fine.
No, no, no. We should say it was out, and look forward to the results soon.
Yeah, there you go.
Good spin. Good spin.
Yeah. How did I miss that? I guess they don't have this out for very long.
No, it was only one week or something like that.
Yeah, it closes in one week and the article is posted on the 5th, so it was from the 5th to the 12th, I guess. Okay. Well, why don't we switch gears
then and start talking about
protein folding and what it is that Rosetta
does, because I don't think a whole lot of our listeners
are going to be well-versed on that. I don't know, I'm not
a biochemist, so could one of you maybe give us an overview
of what exactly it means? I have zero degrees
in chemistry,
just for the record. Oh, I'd be happy to take a stab here. So, proteins are molecules that our
bodies use for basically everything that they need to do. And not just humans, but like all life on
earth. They're actually like the most important of the molecules.
Everybody's heard of DNA from CSI and Law and Order, but DNA is really a rather boring molecule.
It doesn't do anything structurally that's very interesting. And the only thing it really does
is code for proteins. And so proteins, they're the purpose of DNA. They scaffold cells.
They communicate with other cells.
They receive messages from other cells.
They transmit signals.
They open and close pores in neurons to have neurons fire.
They're like everything that a cell does.
And so when you look at DNA, you can read off the sequence of the proteins that
the DNA is coding for, but you can't really know from just the sequence what the protein's going
to do or how it's going to behave. Really, in order to understand that, you need to know the
structure of the protein. Proteins, when you put them in water and not in a vacuum, will adopt this really compact structure. It's really intricately packed, and the chemistry is such that certain amino acids that make up the protein are on the interior and other amino acids will be on the exterior. And so, I mean, in broad strokes you can guess at how things are going to go, but to get the actual conformation that a protein adopts, you have to understand
exactly how things are going to come together. And so, what Rosetta does is it tries to search
through all the different possible combinations, but it can't enumerate them all. There are just vastly too many, many more than there are particles in the universe.
And so we have to be kind of smart
in the way that we do our sampling,
getting bits and pieces from existing structures
and trying to combine them to figure out
what a protein might look like when it's folded.
And so we use a ton of computational resources to do that.
We burn something like 300 million CPU hours per year
through a couple different sources.
One of the main ones is Rosetta at Home.
So people donate their home computers when they're idle
just as like a screensaver that you can download through BOINC,
the Berkeley Open
Infrastructure for Network Computing, I think is what it stands for. And so we use lots and lots
of computers for running Rosetta to try and predict their structures. And we also try to
design new proteins. And proteins are like the smallest motors and actors that we know of. And so if you want to control protein biosynthesis,
or if you, I'm sorry, like small molecule synthesis,
or if you want to try and create new drugs,
proteins are a really attractive target for that.
We actually have a couple really neat things in the clinic.
There's an enzyme that a bunch of undergraduates were able to
design using Rosetta to break down the protein that causes celiac disease. And so, you just, like,
take a pill or you drink that protein, it sits in your stomach. And then when you eat bread,
it breaks down the parts of alpha-glutenin, these glutamine
arginine stretches that can't be broken down by the enzymes that you already have in your stomach.
And so for celiac patients, once these proteins leave the stomach and enter the
small intestine, then they get an immune response. Their antibodies attack this protein
that's not really doing anything, but it messes up the rest of the small intestine. It causes
this large immune inflammation response. And so, if you can just break that down, then you have
like basically a way to eat bread. Yeah, so that's pretty exciting.
We also have people who are pushing a vaccine that's developed by computational modeling in Rosetta for HIV.
The problem with HIV is that when it packages itself up and sends itself out into the bloodstream,
it wraps itself in the membrane of the human cell that it left.
And so as it's floating around, it's basically invisible to the immune system
because it just looks like another human cell, except for one protein that sits on its surface.
And this protein, GP120, is how it recognizes the
next human cell that it's going to invade, the helper T cells that HIV kills. And so the immune
system can only see this one protein and try to memorize the shape of that protein. But that
protein can vary. It has a whole bunch of its surface that isn't terribly important.
And HIV just mutates a ton.
So that by the time the immune system has memorized the shape of GP120,
the HIV virus in your body has already mutated.
So that the next time it goes floating around through the body,
it's no longer recognized by the immune system.
But there's one section of GP120, this surface protein,
that can't mutate away because it's going to bind to a part of the human protein
on the helper T cells, CD4.
And the CD4 binding region of GP120 is like the exhaust port of the GP120 Death Star,
right? It's the place where if the immune system can target that, if it can memorize that section,
then HIV is hosed. So, what the people at Scripps who are developing this vaccine have done is
create a pared-down version of GP120
that just presents that exhaust port to the immune system.
And it's a little more complicated because the exhaust port is also kind of buried deep in a trench of GP120.
And so only certain antibodies of the immune system can recognize things sort of buried that deeply.
And so the vaccine is actually like a series of proteins that you would inject,
where the first one is like, let's take a look at what something in a trench would look like,
and then what is the exhaust port in a trench, and then what's the exhaust port in the trench
with the cannons on the surface firing at you, all these extra glycosylation points that are kind of distracting the immune system. And, you know, it works in apes. So you can take an ape and give it basically a human immune system, and give them the vaccine, and they're able to generate antibodies that neutralize HIV. And so the next step is human trials. And that's going
on right now. It's really exciting. I thought you were going to say the next step would be to
take the ape's immune system and put it back in the human system. It's already been immunized.
Well, I mean, you could imagine trying to treat patients with antibody injections,
and that may be working. I should look at that.
Because there are a couple where we sort of understand how broadly neutralizing antibodies work as a therapy for HIV.
And I guess there's a couple flaws.
One is that you can only inject antibodies.
And so if you were trying to treat someone with HIV, you'd have to inject them every six months and probably for the rest of their lives.
And so that's not a good way to eradicate the disease.
Really, for eradication, you need a vaccine.
You need to be able to have everyone in the world have the vaccine, and then it just disappears, right?
Or at least enough people so that you get herd immunity. But antibody delivery is not going to treat 8 billion people.
So for either one of you, you mentioned Rosetta at home, and it made me think back to my
university days when I remember folding at home was quite popular. And I haven't heard people
talk about it recently so much, but are those projects related in any way? It sounds like the same kind of thing
where they're folding proteins.
Yeah, Folding@home is developed by a different set of scientists
out of Stanford. And they do very short molecular dynamics simulations on people's home computers. Molecular dynamics
is sort of a different approach
to conformational sampling,
where you sort of
have all the atoms, and then
you put forces between them so
that they'll jiggle, and then you just
let them jiggle
very small time steps, like femtosecond
time steps, for a
while, and if you can get enough
trajectories, then you can kind of try to predict how proteins will fold. So there's two separate
problems that have the same name. One is, how do you predict, given the sequence of a structure,
what its folded state is? And people call this protein folding.
And then there's the other question, which is when a protein is in water,
how does it actually sample conformations until it gets to its folded state?
And that's also called protein folding.
So to be very technically correct, I sometimes talk about protein structure prediction is what we try to do with Rosetta. And protein folding is more of the reverse. If you know the folded structure,
then you can unfold it and sort of watch what pathway it travels as it unfolds. And I think
that's mostly what folding at home does. It runs lots of very short molecular dynamic simulations.
The problem with MD is that you need either one of the supercomputers that they've built up in Manhattan to do, like, a millisecond of simulation, or you have to run for a very long time on a single machine. Like, you can do microsecond simulations in MD, but proteins fold on the scale of seconds to tens of seconds.
And so MD is still sort of outside of what we can do for protein structure prediction.
Could we maybe talk a little bit more about the Rosetta library itself?
What's the history on it? How long has it been around?
Yeah, so Rosetta has been around since the late
90s, and it was originally written in Fortran 77 to do protein structure prediction. And so it took small sections of lots of proteins and sort of made little Frankenstein monsters out of them, taking a good section from this protein and a good section from that one
and trying to see what a protein would look like if you glued them together that way.
And it did remarkably well, and sort of the functionality of Rosetta has expanded over time
so that it solves more and more problems, eventually adding protein design, protein loop modeling,
protein small molecule docking, protein-protein docking.
Lots of interesting problems in the field of computational structural biology. It's become sort of the de facto leader in protein structure prediction and protein design.
And so the software itself, it was written in Fortran.
And then back in 2004 or thereabouts,
it was mechanically ported into C++.
So it was compiled from Fortran into C++
instead of into machine code.
And so we had like a C++ version of Fortran code.
So global arrays,
functions that took three parameters but then interacted with a whole bunch of global data in order to get all of the rest of the
parameters. And it was very difficult to understand. I remember there was a project of trying to get this one module in Rosetta that runs protein design to eliminate all the global data in that.
And someone was working on that for like six months
and declared themselves done.
And I kind of went and looked and realized that the only things that had really changed were the three visible parameters; all the other ones that were sort of sneaking in through the back door were still there.
Anyway, so it was very difficult to figure out how the code was going to behave.
Global data makes things very difficult to understand.
And you can't multi-thread it in any way.
So back in 2007, we started a conversion to create a fully object-oriented version of Rosetta.
And we have eliminated most, but not all, of the global data.
There's still like a couple things which are a thorn in my side to this day.
But everything is objects now.
We don't have these big global arrays.
So that's better.
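To make the before-and-after concrete, here is a rough sketch (illustrative names, not actual Rosetta code) of a Fortran-style routine that pulls most of its inputs from globals, next to an object-oriented version where every input is explicit:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch, not code from Rosetta.

// Fortran-style: three visible parameters, while the rest of the inputs
// come in "through the back door" via shared global arrays.
namespace legacy {
    extern double phi[1000];
    extern double psi[1000];
    extern int    nres;

    double score_residue(int i, double weight, bool verbose);  // also reads phi, psi, nres
}

// Object-oriented: all of the state lives in an object that is passed in,
// so the function's real inputs are visible at the call site.
class Pose {
public:
    double phi(std::size_t i) const { return phi_[i]; }
    double psi(std::size_t i) const { return psi_[i]; }
    std::size_t size() const { return phi_.size(); }
private:
    std::vector<double> phi_, psi_;
};

double score_residue(Pose const & pose, std::size_t i, double weight);
```

With the object version, the hidden inputs become visible parameters, which is also what makes it possible to run many independent calculations in parallel.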
How long did that take?
It went remarkably fast.
Let's see, so we started in earnest in February of 2007, and by August we had the design module up and running and, like, most of the functionality there. And then we began porting lots of other things over, so that by the end of 2008 we sort of declared ourselves done. I mean, we have added a lot more functionality since, so it continues to expand. It's now up to three million lines of C++. A lot of that is duplication, but a lot of it isn't.
So did you create a whole new structure and then port the old code into the new structure?
Right. Yeah.
So we wanted to have the algorithms work as best we could still with the old code.
Right. So that as you're porting functions over, the indices that you're using to decide what position you're going to be cutting your loops at, they made sense in both, or it was the same indices before and after.
And so we index from one in Rosetta. Like, we're scientists; most of Rosetta is written by biochemists.
They count from one, so we count from one.
We don't do the C and C++ style counting from zero. So that makes it a little bit different. But yeah, we did try to
keep the code. I mean, preserve as much of the original code as possible, because it's easier,
right? We didn't want to rewrite every line, but we did want to use objects to control how data is stored.
And so that's sort of the biggest change from the Fortran-like C++ to the current version.
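As a rough illustration of what 1-based indexing can look like in C++ (a hypothetical wrapper, not Rosetta's actual container):

```cpp
#include <cstddef>
#include <vector>

// Minimal 1-indexed wrapper: element 1 is the first element, matching the
// way biochemists number residues. Hypothetical sketch only.
template <typename T>
class vector1 {
public:
    explicit vector1(std::size_t n) : data_(n) {}
    T & operator[](std::size_t i)             { return data_[i - 1]; }
    T const & operator[](std::size_t i) const { return data_[i - 1]; }
    std::size_t size() const { return data_.size(); }
private:
    std::vector<T> data_;
};

// Usage: residues run from 1 to size(), inclusive.
// vector1<double> phi(n);
// for (std::size_t i = 1; i <= phi.size(); ++i) { phi[i] = 0.0; }
```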
Jack, what version of C++ is Rosetta using currently?
So we're using C++11, we switched over in 2015-ish because we needed to wait for all of the supercomputers to get more modern compilers
because all of the Rosetta clients have the option to compile their own source code locally.
Right.
And so we needed newer compilers to disperse before we could upgrade.
However, we do have the ability: Rosetta can compile with 14 and 17 with minimal changes, which have already been done. So it's optional, but we still live at C++11.
So, well, this whole story, for both of you, is very close to home
because I've been involved in a project for the last 10 years, off and on, that is Fortran that has been converted to C++ through an automatic conversion tool, and we're struggling through some of these things. We've got, I don't even know, thousands of global variables; I think it's uncountably many. Anyhow, they are currently on C++11 on the project, and I've been saying, well, can we move to 14 or 17? And the answer I get back is, oh sure, we already, you know, we require whatever compilers with Ubuntu 18.04, whatever. But I can't actually get them at
the moment to commit to a specific C++ version.
So I'm still, like, stuck on C++11, which for a few things is really painful, like generic lambdas; it's much easier to write a lambda in C++14 in many cases. And so, yeah, I was just curious, like, are you specifically stuck on C++11? Or did you just draw the line there? Or could you probably switch it to C++14 and no one would even notice
because they're using GCC 4.9 or better?
Or do you know?
The people that would notice would be mad.
So most people would not notice.
I often develop in C++14 and 17
then downgrade when it's time to push to master.
Because that's faster, right?
There's a reason they have the new tools and the newer versions.
I never even considered doing that.
Yeah, because a lot of the new things in 14 and 17 can be hacked into 11.
Yeah, a lot of them. Yeah.
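For instance (an illustrative sketch, not something shown in the episode), a C++14 generic lambda can usually be backported to C++11 as a small function object with a templated call operator:

```cpp
#include <algorithm>
#include <vector>

// C++14: a generic lambda works for any element type with a size() member.
// auto by_size = [](auto const & a, auto const & b) { return a.size() < b.size(); };

// C++11 equivalent: hand-roll the closure as a struct with a templated operator().
struct BySize {
    template <typename A, typename B>
    bool operator()(A const & a, B const & b) const { return a.size() < b.size(); }
};

void sort_by_size(std::vector<std::vector<int>> & rows) {
    std::sort(rows.begin(), rows.end(), BySize());
}
```

The struct version is more verbose, but it behaves the same at the call site.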
We don't do as much constexpr as you.
Well, in this project I'm doing very little constexpr, although I am pushing them towards it in a few places. One of the things which might be familiar to you all is that at the top of every function, we have a const static std::string with the function name. It's just a representation of what the current function name is. I don't know if this has some history in the way Fortran used to be done back in the day, so that if there are any log messages after that, they can just use that function name that we hard-coded in there.
Well, interestingly, that const static std::string is a performance pessimization every single time the function is entered, because it has to do a thread-safe check to see if that string has been initialized yet. And I'm like, if we had C++17, I'd do a find and replace on every single const static string, replacing it with a constexpr static string_view, and that would all be free after that. But, you know, I haven't been able to flip that big switch yet.
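Here is a sketch of the pattern being described, with hypothetical function names rather than anything from the actual codebase:

```cpp
#include <string>
#include <string_view>

// Existing pattern: a function-local static std::string. Every call has to
// perform a thread-safe "has this been initialized yet?" check before using it.
double score_pose_old() {
    static const std::string function_name = "score_pose";
    // ... log with function_name ...
    return 0.0;
}

// C++17 replacement: a constexpr std::string_view is a compile-time constant,
// so there is no runtime initialization and no guard check at all.
double score_pose_new() {
    constexpr static std::string_view function_name = "score_pose";
    // ... log with function_name ...
    return 0.0;
}
```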
Anyhow, that's the kind of thing that we run into on projects like this.
This episode is sponsored by PVS-Studio, a static code analyzer that performs a search for errors and especially typos. The tool supports the analysis of C, C++, C#, and Java code. The article "Zero, one, two, Freddy's coming for you", which has been recently posted, clearly demonstrates the analyzer's outstanding ability to find typos. The article shows how easy it
is to make a mistake, even in simple code, and that no one is immune from making it. You can
find a link to the article in the podcast description. There's also a link to the PVS-Studio download page. When requesting a license,
write the hashtag CppCast and you'll receive a trial license for a full month instead of one week.
One thing I'm curious about is it sounds like both of you are very well versed in C++,
but it sounds like a lot of the people who work on this project are maybe more focused on
biochemistry and maybe don't have as much C++ background. What's it like having all of those
people brought in to work on Rosetta? Is that a challenge?
Yeah, it's definitely a challenge. It's both a good and a bad thing, right? That there's a lot
to protein design that's just imagination limited, right?
There's fewer things that I can imagine
than a whole bunch of biochemists can.
But at the same time,
it's really kind of nice
to have people developing the code
who understand all the rules
of what you should not do.
Like let's try to avoid global variables
or let's not change a structure in the middle of
trying to score it. And so, yeah, it is a challenge to work with as many biochemists as we do. But I
think it is useful that they're there. Yeah, so one thing that we've relied heavily on in C++ is const as a decorator for functions and methods.
And so that helps us describe to other developers what we don't want them to do, and prevents them from doing it with a piece of data in the middle of execution, while still also giving us good performance, right?
You can hand out the coordinates of a structure to inline functions without having to worry about them being modified by those functions you're handing them out to.
And that's important for like making sure the code
is actually computing what you think it should be computing.
There have been multiple times where people have suggested, hey, can we change the coordinates of this residue in the
middle of scoring the structure? And then if you imagine that, then some amount of scoring has
taken place, and then the structure changes, and then some other amount of scoring takes place,
and then you're not really sure what you've computed at the end. So const keeps us from having naive biochemists come in and change it.
And they're naive enough not to know that they can just cast away the const if they so desire.
Right?
That's another feature of the language.
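A minimal sketch of that idea, with illustrative names rather than Rosetta's real API: a const member function hands out read-only coordinates, so a scoring routine cannot change the structure mid-score.

```cpp
#include <cstddef>
#include <vector>

struct Vec3 { double x, y, z; };

class Residue {
public:
    // Const access: callers can read coordinates but not modify them.
    std::vector<Vec3> const & coords() const { return coords_; }
    void set_coord(std::size_t i, Vec3 const & v) { coords_[i] = v; }  // non-const mutator
private:
    std::vector<Vec3> coords_;
};

// Scoring only needs read access, so it takes the residue by const reference.
double score(Residue const & res) {
    double total = 0.0;
    for (Vec3 const & v : res.coords()) { total += v.x + v.y + v.z; }
    // res.set_coord(0, {0.0, 0.0, 0.0});  // would not compile: res is const
    return total;
}
```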
What's scary, though, and we might talk about this more later, is a lot of our Rosetta users use the Python bindings for it. And when you use objects, you have some idea of what you're not supposed to do, but there's no protection for a lot of the users at that level.
Yeah. Yeah, I feel like, in my experience as a trainer, if someone said, can I modify this while it's being scored, I would say I have no idea what you're talking about. But no.
Yeah, the Python thing. There's one particular place in PyRosetta where
expert users have had the same bug where you can get access to the kinematic description of the system.
And it's supposed to be a const access.
Like it's const in the C++ part,
but it comes through the Python bindings
as just you have the object, right?
You have a pointer to it.
And so you can modify it directly
without the object that's holding it.
So the fold tree is describing the kinematics
and it's held by the conformation. When you change the fold tree in the conformation, the conformation needs to update some of its other data members. And so if you change it directly, it breaks. And that's actually messed up multiple people, senior developers even. So that's one of the frustrating parts of having something that is safe in C++ and then not safe in Python; it'd be kind of nice if we could enforce constness in Python.
I don't know how to do that.
So does that imply that at some point in your bindings, there's actually a const cast or something happening?
I think the bindings might return a const object, but Python doesn't do anything with that const. Python doesn't care if an object is const or not.
Right. Maybe I'm misunderstanding the question. I don't know. I'm thinking at what point can they
actually modify it? If it has a const object and then they're able to call a non-const member
function on it, that means at some point in your bindings, the constness was dropped.
Because otherwise, the bindings themselves wouldn't be able to generate the code
to call the non-const member function.
Something had to get lost.
Yeah, probably when the object is being returned,
it needs to be put into some sort of generic container, right?
It knows it's a pointer to a fold tree,
but Python can't distinguish between
pointers to fold trees and pointers to constant fold trees. And so it just casts the const away at that moment.
That sounds probably right. Yeah.
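Rosetta's own bindings are generated with Binder (discussed next), but the effect is easy to show with a tiny pybind11-style sketch, using simplified stand-ins for the fold tree and conformation classes: the C++ accessor is const, yet the object Python receives is freely mutable.

```cpp
#include <pybind11/pybind11.h>
namespace py = pybind11;

struct FoldTree {
    int num_edges = 0;
    void clear() { num_edges = 0; }  // mutating operation
};

struct Conformation {
    // Const in C++: callers cannot mutate the tree through this reference.
    FoldTree const & fold_tree() const { return tree_; }
    FoldTree tree_;
};

PYBIND11_MODULE(example, m) {
    py::class_<FoldTree>(m, "FoldTree")
        .def("clear", &FoldTree::clear);
    py::class_<Conformation>(m, "Conformation")
        .def(py::init<>())
        // Hand back a reference to the internal object. Python has no notion
        // of const, so the returned handle is freely mutable from Python.
        .def("fold_tree", &Conformation::fold_tree,
             py::return_value_policy::reference_internal);
}
```

From Python, `Conformation().fold_tree().clear()` runs without complaint and mutates the conformation's internal state, which is exactly the kind of back-door modification being described.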
I'm curious what tools you're using, now that we've gone this far, to generate your Python bindings. Or are you?
For a while we were using pybind11 and Boost.Python, but at a certain point that was just leading to too many headaches. So a developer in the community, Sergey Lyskov, developed his own Python wrapping library, which he calls Binder. And so that's available and is open source.
And he supports it actively.
I can't tell you too much about how it works, though.
But it's pretty neat stuff.
And I know that it's now much easier to compile PyRosetta yourself.
So if you're a developer and you're adding new functionality,
but you want to use that functionality immediately,
then you can just compile your own Python bindings,
and that used to be very difficult.
The other way of getting Python bindings is to wait until the testing server updates,
and then I think we have a weekly release schedule
so that you can download last week's version of PyRosetta right now.
Yeah, most of our tools are sort of hand-rolled. We're not using a ton of things that maybe we should be. It'd be nice to have, I don't know, like a better sense of what tools are out there
that we should be using. I mean, I've used Swig for Python bindings,
but it has its own different set of headaches.
So it's, you know, you pick your poison.
I personally like Boost Python.
That always treats me well for my own personal projects.
But this binder tool that we use for Rosetta,
like we never have to think about it.
It just works.
And I don't know how much overhead was required
to get that in a state where it is,
but you can pass lambdas between the two languages.
Any complication you can think of, it just works.
It's nice to be able to subclass the C++ classes in Python
and then hand them back to containers
that take those classes
and have Python functions being invoked by C++.
It's a really nice feature,
especially if you're creating a protocol that has multiple stages,
like at stage 5 to have that implemented in Python,
but using the C++ code for stages
one through four.
That feels like magic when you do that kind of thing.
Yeah, it's great.
If the listeners have never
used Python and they've never
tried
bindings before, I highly recommend it.
I never really liked Python until I had that power
and it
just feels incredible. jack i was wondering
if you might want to tell us a little bit more about uh the work you mentioned in your bio with
using quantum computing and machine learning with rosetta yeah sure um so just to quickly re
summarize the problem that we work in um The protein itself is a linear chain of units,
like beads on a necklace, to use Andrew's analogy. And the chain itself is flexible,
but each bead can also adopt one of 20 different chemistries. So there are these two different sampling states, because the chemistries
are flexible themselves. So while the big main chain can fold in on itself in many different ways,
each bead can adopt one of 100 or 1,000 different states. So there's a combinatorial optimization problem with the
beads changing states with a very large sample space problem of the main chain changing shape.
What I do mostly is protein design, which is where you can change the chemistries of the protein to
optimize the behavior of it in the cell.
And when you do that, then you're mostly focusing on the state changing sampling
so you can... the main chain of your protein is already in some
conformation that you want to keep, but the chemistries of each bead on the
chain, you want to be able to change.
And like I said, there are tens or hundreds or thousands of different states for hundreds of different beads.
So the combinatorial optimization of that problem is the inner loop of sampling.
And then the outer loop, you allow the main chain to slightly modify to adopt chemistry changes. So the inner loop as a combinatorial optimization problem translates well for quantum computers. And that's
what we're doing at Menten AI: mapping the Rosetta inner loop, what we call packing, where you try and fit these chemistries into a dense protein core, and you map that into a combinatorial optimization system that can be put on the quantum computer. And there are already tools for that on classical computers, but quantum's on the rise, so we're trying to jump on that. And then with the outer loop, we're trying to drive that with machine learning.
So we're doing quantum wrapped up in machine learning.
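To make that inner loop concrete, here is the usual way the packing objective is written down (a minimal illustrative sketch, not Menten's or Rosetta's code): choose one discrete state per position so that the sum of one-body and pairwise energies is as low as possible.

```cpp
#include <cstddef>
#include <vector>

// rotamer[i]            = which discrete state position i adopts.
// one_body[i][r]        = energy of state r at position i on its own.
// two_body[i][j][r][s]  = interaction energy of state r at i with state s at j.
double packing_energy(std::vector<std::size_t> const & rotamer,
                      std::vector<std::vector<double>> const & one_body,
                      std::vector<std::vector<std::vector<std::vector<double>>>> const & two_body) {
    double e = 0.0;
    for (std::size_t i = 0; i < rotamer.size(); ++i) {
        e += one_body[i][rotamer[i]];
        for (std::size_t j = i + 1; j < rotamer.size(); ++j) {
            e += two_body[i][j][rotamer[i]][rotamer[j]];
        }
    }
    return e;  // the combinatorial problem: minimize this over all rotamer choices
}
```

Minimizing this energy over all the discrete choices is the combinatorial problem that gets handed to a classical solver or, in Menten's case, reformulated for quantum hardware.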
What is it?
I mean, are you programming in C++ on the quantum computer?
Like, how does it, what do you do?
We have collaborators that do the quantum computer programming for us.
All we need to do for that is set up the
set up the problem, and then there's an API. I would be surprised if it was C++; I think it's very low-level machine code. I've never looked at it.
My full understanding of quantum computing in general comes from science fiction. As I understand it, you set up the problem, and then someone in an alternate reality provides you the answer. I'm pretty sure that's how it works.
Yeah. I mean, for us, the alternate reality is in Toronto, but okay, that sounds about right.
Yeah.
It's really neat. So the quantum computer itself manifests your problem in real life at a very small quantum level, and then uses real physics to sample the landscape of solutions to your problem. It leverages advanced physics that I never got deep enough into to understand, to use, you know, real-life physics to find the solution to
your computational problem. So it physically models the problem, effectively.
Yeah. Now, there are some shortcuts that need to be taken
with modern-day quantum computers, so you don't always get the best solution,
but we're investing now so that as quantum computers get better, hopefully we can make better use of them.
Interesting. I have no idea what any of that means.
It's fun, though. It's fun to be at the level where we are, where we don't need to write the code that goes on the quantum computer, but we get to use it.
Right. And just at a ballpark, how much faster is its results than what you were doing with classical algorithms?
Well, we're limited at the moment by the size of the quantum computers,
but that's quickly going away.
It's a rapidly growing field.
But for all of the problems that we're running now,
they can be solved on classical computers.
So at the moment, there's no runtime advantage right now.
Well, to be fair, the problems,
when they're being solved on the classical computer,
I mean, they're NP-complete. So we're just approximating in either case, a stochastic technique for doing the optimization. But, like, the quantum computer is supposed to give you the answer, except when it doesn't. And so sometimes it gets, you know, significantly better energies than what we're doing with our... oh, I've gotten frozen again.
I guess I'll finish the thought.
Sometimes.
The quantum computer doesn't always give you the best answer.
But in theory, if you had a sizable computer, it will.
I don't know the total limitations.
Like they operate at very low temperatures and there are physical limitations to the size of problem you can sample, but you can approximate to fit larger problems.
If listeners are interested in this, there's a preprint out from our company, Menten AI.
And maybe I can send that to you for the show notes.
Absolutely.
If you look it up, Vikram Mulligan is the big name behind it, and Hans Melo.
And it's an interesting problem.
I'm excited to learn more about it.
I've only been at Menten AI a few months now, so I'm still... I haven't personally run the quantum program yet.
Now, again, just based on my science fiction research from what I know of quantum computing,
the other problem is that it could give you the answer yesterday and you missed it, right?
Yeah.
I'm pretty sure that's right, true.
So Vikram, like I mentioned, he pointed out to me that if there are alternate dimensions, and Jack Maguire is running a quantum computer in one dimension and Jack Maguire is running a quantum computer in a different dimension, we'll get different results. So somewhere there's a Jack Maguire that just got a really good Rosetta result from the quantum computer, and then I just got, you know, the average one. Or maybe I'm the one that got the good result. So, you know, the science fiction is very much in play.
That's great. Just on a side note, since we're talking about quantum computing and science fiction,
there was a great show that just started on Hulu called Devs.
I would recommend it.
It's all about quantum computing and software developers using it.
Very cool.
All I can say is you all are using quantum computers to help generate novel proteins.
This sounds like the end of the
world.
Well, we'll find out soon. Or the beginning of the world.
Yeah. Andrew, you talked a lot about, you know, some of the successes of what's being done and solved with Rosetta, and progress with HIV vaccines. At the risk of
pissing off our listeners by talking about COVID again, is there any chance that there's people
out there like trying to make a COVID vaccine using Rosetta proteins? Yeah. So another,
another thing that people have done is design little cages, like little balls of protein. And then on those things, you can present
other proteins on them. And so there's actually a Rosetta lab out of Seattle that's making a COVID
vaccine, basically. So they've taken the S protein from COVID. And I think they
had been looking at it before. So I mean, the coronaviruses have been around for forever.
And in this particular one, SARS-CoV-2, they've taken its protein and put it on their nanocage system and are looking to see if that will lead to a vaccine.
I know, I mean, this isn't Rosetta, this is just biology, but there's this vaccine that, I happen to have a friend in Seattle who was part of the vaccine trial for it, where the idea is to take the gene, basically, one of the genes, as not DNA but instead RNA, and inject it into you, and then you'll just make that protein. Because we're very good at fighting, like, or identifying, pathogens
if they come in looking like pathogens.
But if we just get instructions of how to build proteins,
we'll just take those instructions.
It's like almost a computer virus, right?
And start executing those instructions, make the proteins,
and then you have this foreign protein that's sort of circulating in your body,
and it can elicit an immune response.
So there's a vaccine trial that's going on right now for that, and I've got a buddy who got the injection.
But that's not a Rosetta thing.
That's another company.
But yeah, Rosetta.
Go ahead.
I was just going to say, the short answer is there are a lot of people trying to model COVID right now, and Rosetta is one of the big tools to do it. I didn't mean to interrupt there, Andrew.
We actually had a virtual conference, like, a week and a half ago, where a bunch of people within the Rosetta community were trying to talk about all the different ways that they're using Rosetta to try and tackle this current global pandemic.
And so there's a lot of exciting work going on.
Yeah, but I mean, certainly one of the most exciting is trying to get a vaccine.
But, you know, if you're going to give a vaccine, you're administering it to a lot of healthy people.
And so you need to make sure that it's not going to kill them. And that takes time, right? You can imagine creating the vaccine
in a matter of weeks, but making sure that you could actually give it to people and not kill
them, that's what takes the year and a half. And it's a good thing to do. So, yeah, I mean, while Rosetta and lots of other techniques are, like, gearing up to give us vaccines, knowing whether any one of them could be used is what's sort of delaying our ability to go and use them.
With my extraordinarily limited biology knowledge, though, you're talking about
this protein ball that they could attach other proteins onto. Would the goal be then to expose
that protein to your immune system in a harmless way, basically? Yeah, I mean, it's effectively
harmless if it's not able to infect your own cells, right? It's just a ball. And the purpose of it being a ball is that you have like 50 copies of it. And for biochemistry that I won't go into, like the more copies you have,
the easier it is to sort of get weak binders that become better binders.
And so there are like ways to present to the immune system a starting point and get it ramped
up. So it looks a lot like what you imagine the virus itself would look like, except it doesn't have a virus inside of it. It's just an empty ball, and so it's not going to infect you. It's just a mimic. Your immune system will still respond to it, get its own antibodies, and then be able to react to any invaders that are coming in.
Right.
Okay.
Sounds neat.
Well, it's been great having you both on the show today.
Is there anything else either of you want to go over before I let you go?
If it's okay, I'd like to mention that Menten is hiring data scientists more than C++ programmers. But if you're a data scientist listening, machine learning, feel free to reach out to Menten. I'll include the website, maybe, in the show notes.
We probably have a few listening.
Sure.
Okay. Well, thank you both for coming on the show today.
Thank you for having us. Thanks for coming. Thanks so much for listening in as we chat about
C++. We'd love to hear what you think of the podcast. Please let us know
if we're discussing the stuff you're interested in,
or if you have a suggestion for a topic, we'd love
to hear about that too. You can email
all your thoughts to feedback at cppcast.com.
We'd also appreciate
if you can like CppCast on Facebook
and follow CppCast on Twitter.
You can also follow me at
Rob W. Irving and Jason at Lefticus
on Twitter. We'd also like to
thank all our patrons who help support the show through Patreon.
If you'd like to support us on Patreon, you can do so at patreon.com slash cppcast.
And of course, you can find all that info and the show notes on the podcast website
at cppcast.com.
Theme music for this episode is provided by podcastthemes.com.