Software Misadventures - Grokking Synthetic Biology | Dmitriy Ryaboy (Twitter, Ginkgo Bioworks)

Episode Date: July 16, 2024

From building a data platform and Parquet at Twitter to using AI to make biology easier to engineer at Ginkgo Bioworks, Dmitriy joins the show to chat about the early days of big data, the conversation that made him jump into SynBio, LLMs for proteins, and more.

Segments:
(00:03:18) Data engineering roots
(00:05:40) Early influences at Lawrence Berkeley Lab
(00:09:46) Value of a "gentleman's education in computer science"
(00:14:34) The end of junior software engineers
(00:20:10) Deciding to go back to school
(00:21:36) Early experiments with distributed systems
(00:23:33) The early days of big data
(00:29:16) "The thing we used to call big data is now AI"
(00:31:02) The maturation of data engineering
(00:35:05) From consumer tech to biotech
(00:37:42) "The 21st century is the century of biology"
(00:40:54) The science of lab automation
(00:47:22) Software development in biotech vs. consumer tech
(00:50:34) SWEs make more $$ than scientists?
(00:54:27) LLMs for language is boring. LLMs for proteins? That's cool
(01:02:52) Protein engineering 101
(01:06:01) Model explainability in biology

Show Notes:
The Death of the Junior Developer: https://sourcegraph.com/blog/the-death-of-the-junior-developer
Dmitriy on Twitter: https://x.com/squarecog?lang=en
Tech and Bio Slack community: https://www.bitsinbio.org/

Stay in touch: Make Ronak's day by signing up for our newsletter to get our favorite parts of the convo straight to your inbox every week :D https://softwaremisadventures.com/

Music: Vlad Gluschenko — Forest
License: Creative Commons Attribution 3.0 Unported: https://creativecommons.org/licenses/by/3.0/deed.en

Transcript
Starting point is 00:00:00 And then the Bigtable paper, I think, came out, and there was just all this stuff coming out of Google, and we were reading it, and their ideas, like, mapped to what we were doing. We're like, oh my God, we're not wrong. Even Google's doing that. Check that out. Robots mutating some aspects of organisms. Can you elaborate a little bit more on what that really means? Because yes, it does sound... Slow down a little bit. Maybe give an example, because that does sound like science fiction. It's kind of like trying to optimize a binary that was compiled from some C code, except you don't have the C code, right? You just have the ASM.
Starting point is 00:00:39 Your problem isn't even to debug it or to understand what it does. It's to make it go 15% faster. Right? By the way, you don't actually know C. Right? And you don't have the ASM manual either. Right? So that's the problem, right? Like you have the code, but it's like the bytecode.
Starting point is 00:00:55 And it codes for something. There's a system down there somewhere, but the only way to really understand what's going on is to keep poking at it in the physical world. Change a little thing, see if it blows up. Change another thing, see if it blows up. And so robots come in because you can do this in high throughput, relatively high throughput for biology,
Starting point is 00:01:10 not high throughput for like web services people. Today, because the speed of software iteration is so high, the cost of mistakes is much lower. So sometimes the incentive to get things right in the first place is a little lower because you know you can fix it in the next PR. Whereas for biotech, that's not true. Like cost of mistakes can be brutally high. So what are some things that are done differently from a software perspective? So yeah, one thing is that you're mostly writing internal tools. So all of that idea of like A-B testing that goes out the window, right? You can't expose 5% of the scientists to this thing that might or might not work,
Starting point is 00:01:47 right? So you still want fast iteration cycles, but you also want to make sure that when things are ready, they're really ready. You'd think that leads to some sort of like crazy waterfall situation. What works much better is just a very tight coupling between the customer and the software. Like I really like all the work that Anthropic is doing and others in model explainability. Because right now, the reason Anthropic cares
Starting point is 00:02:10 is because explainability tells us essentially how to control the system. If it does something weird, we want to know what's going on so we can fix it. Or we want to be able to translate what the model did into some sort of human understandable explanation of like why it made a decision, right? It's sort of a QA function. But in biology, if the model is explainable, like you might discover explanations that
Starting point is 00:02:36 are not just confirmations of like rules you know, but new rules. Welcome to the Software Misadventures podcast. We are your hosts, Ronak and Gwan. As engineers, we are interested in not just the technologies, but the people and the stories behind them. So on this show, we try to scratch our own edge by sitting down with engineers, founders and investors to chat about their path, lessons they've learned, and of course, the misadventures along the way. So Dimitri, I thought a fun place to start is your LinkedIn blurb. So it says VP of AI enablement at Jinko. And I think before that you were a CTO at Zemmergen. And then, so you're the
Starting point is 00:03:22 author of The Missing Readmeme but then you also have data engineer on there um yeah i thought that was pretty cool like is that like do you see that as like part of your identity that's why you have it uh very much so to the extent that i think more of a data engineer than a like vpo of whatever and cto nice i like it i throw kind of data engineer first other stuff i have to put it first because you know it's linkedin it's right career stuff but uh i i come with pretty much everything from kind of a data engineer perspective and that's i guess how it's self-identifying were there skills from like data engineering days that you still find useful like in management or
Starting point is 00:04:01 that sort of thing i think the main skill the main body of knowledge I gained as a data engineer was around organizing, structuring data. And I don't mean for BI, I mean for doing engineering on top of the data, right? So laying things out so that they can then power search, they can power BI, they can power experimentation platforms, all of that, and sort of understanding also the connection between how services are designed and instrumented, web services, and the way that they model their domains, and then how that data gets represented in databases and various other data stores, and then how that
Starting point is 00:04:45 gets analyzed and used for other means, right? And the connection between that, sort of the fact that data recorded about a service is often as much of a product as what the service itself is doing. I think just thinking about how those things are connected, just let me think about systems at large. And so then whether taking on sort of a PFS software role or getting into AI, it just really helps understanding sort of fundamentally both what you need to do in order to analyze data at scale and the impedance mismatch between sort of how you structure for runtime and for serving and then
Starting point is 00:05:27 for the backend and where there's friction there. And that helps, I mean, from a management point of view, it helps in terms of kind of aligning different teams and making people, helping people understand sort of the different priorities and shading light on kind of both camps. I don't know that it's a data engineering skill. I think it was like the kind of data engineer I was when I was actually doing engineering made me very sort of, made me have to deal with a lot of different camps and people who don't normally talk to each other. So that I think really translated in terms of leading larger and larger teams over time. So you've been in biotech for like the past few years. And then before that, you were in like tech tech. So leading to the platform.
Starting point is 00:06:07 Consumer tech, yeah. Leading to the platform at Twitter. Yeah. And working on Parquet. But all the way back, your first job out of undergrad was actually being a software dev, I think, at Lawrence Berkeley National Lab. Yeah. So I thought it was... During undergrad even, yeah.
Starting point is 00:06:22 I see, I see. I thought that was kind of... During and then a bit after, yeah. Nice. I thought that was kind of a cool Yeah, I see. I see. I thought that was during and then a bit after. Yeah. Nice. I thought that was kind of a cool, like full circle. A little bit. Yeah.
Starting point is 00:06:33 I guess along the data engineering, like, like, were there skills that you learned, like during those national lab days that, oh yeah, like an important role, like later on you got into tech. Oh yeah, absolutely. So the job you're referring to actually started as a sophomore, essentially as a way to pay the bills. I was working somewhere close to full time, which partly explains full steam. And I was there when that got published. And then for several years afterwards, as sort of sequencing of genomes improved over time, although this was before what's now called X-gen sequencing or NGS, that those technologies came out a little later. So it was still pretty expensive, but it was every year was much bigger than the year before. And what my job evolved into was working with a couple of other engineers
Starting point is 00:07:29 in building distributed systems for processing all of this genomic data that was coming in. And it was sort of the old school distributed systems, you know, running like Beowulf clusters. And we talked about server racks. It was literally like beige boxes on a rack, like a kitchen, like shelves, wire racks that we put beige boxes on top of in a generous closet that was not built for that kind of power or heat. So there was a lot there about sort of
Starting point is 00:08:00 processing data at scale, coordinating workers across multiple systems. As you might imagine, those machines were not the most reliable things in the world, so having to deal with failures. We also built a web service to allow scientists to interact with the data that we were processing there. So there was a bunch there about visualizing data and serving it and giving people feedback about a process that might run for several hours. So just kind of all the state management and that kind of thing. In many ways,
Starting point is 00:08:33 the technology has changed and the capabilities change, but the high level of what that does has remained. Those lessons, I think, translated and then kind of evolved and my next job was at ask.com it got renamed like a month before i started there it was ask jeeves which is i think still how the thing that people actually remember so if you remember ask jeeves from your like childhood that was that became us.com and i got that job because they were like oh cool this guy like actually knows you know large. And then I joined them and it was like, oh, your data is like 10 or 20 times bigger. I thought it was big, but it wasn't big at all. And so that was fun. And that was, again, even bigger systems, writing our own SQL
Starting point is 00:09:18 parsers and reinventing our database, basically writing a database from scratch, not knowing what the hell we're doing at all. But ASC was using Oracle at the time for data warehousing and for the specific kind of data warehousing tests we were trying to cram into it. It just was not scaling. And we had to write our own sort of system that translated SQL into distributed queries on top of Berkeley DB files, if you all remember Sleepy Cat. And that was super fun. That led to sort of wanting to really study that and understand distributed systems and
Starting point is 00:09:51 databases. That led to a master's in very large information systems at CMU, which is what back then we called big data. And big data is what we now call AI. And that led me to Twitter and so on. By the way, like working at Berkeley National Lab, and in this case, you were at school, you said because of a full-time job, you had terrible grades. Between the two, what do you think helped you more in terms of your actual day job?
Starting point is 00:10:24 This is when you were at ask.com and beyond. Gosh, that is a surprisingly tricky question. I think in terms of helping me get the job, it was having the LBNL background because it spoke to some ability to actually get shit done. In terms of useful skills, the stuff that echoed kind of showed up later from the job, but a lot of what I learned in classes comes up even to this day. I loved the database class, didn't know I would wind up working in database and distributed systems, but that came up in a big way a few
Starting point is 00:10:58 years later. I was really into the AI class, but that was all neural networks. So obviously, nobody did anything with that for like over a decade. And then suddenly, oh, I remember that. Where's the textbook? So I think the sort of gentleman's education in computer science, you know, like Berkeley really makes you sample across and understanding what I like, what I don't like, where I have some sort of, you know,
Starting point is 00:11:24 some stuff that comes more easily to me than other things was super helpful. And just the ability to like sit down and grind through a CRL. It was CRL back then, not CRLS. Wait, I don't know what CRL is. And I don't know how many of our listeners would know CRL. So can you elaborate on what that stands for? Oh, I'm looking for it, but it's currently serving as my monitor stand, which is conveniently bought by my laptop. But it's a classic algorithms textbook that's used at most US colleges. I don't remember the sheet. It's Laserson, Rivest. And of course, I don't know.
Starting point is 00:12:03 Oh, this one. Yeah, yeah yeah you're talking about the algorithms book okay at this i remember introduction to the algorithms it's got all that's right introduction to algorithms right it's a fantastic textbook and it was like one of the hardest classes i took and i just loved that particular class like i remember it's cs170 at uh uc berkeley this was 24 years ago i remember the number yeah just like knowing i think having the sort of skills that working through all of that builds and you even if you don't remember the actual algorithms right i think just being able to read the text understand understand approaches and kind of getting the journal just to it i mean that was fantastic and definitely the thing that translated more and kind of for a longer time. I think the management lessons I learned more
Starting point is 00:12:49 from Lawrence Berkowap. Oh, management lessons from Lawrence Berkowap. Okay. Well, how is elaborate on that, please? Because this is still early in your career. Oh yeah. It wasn't, I wasn't managing anybody. It was sort of what I felt as a junior, what kind of motivation worked, what kind of motivation didn't. I'm kind of a late bloomer, like I'm now reasonably successful, but back then I was kind of not great. And a lot of that was me, and some of that I think was my tech lead. Nobody ever taught them how to be a tech lead. And so they made some mistakes that you know then when i was in a similar position i was like i remember what not to do
Starting point is 00:13:30 but also where like words of encouragement and things like that like really did help and like remembering that experience and trying to bring that forward and i assume this also kind of played a role in later you writing the book? Yes, in some ways, although that was maybe a fairly minor bit of motivation, mostly the writing the book was because I spent a good 10 years sort of teaching and explaining these concepts that weren't written down and like having to remember, reinvent every time and remember like, oh, right, you just graduated.
Starting point is 00:14:03 So you don't know this thing that like everybody who's been around for a couple of years just innately understands. But we have to explain it because nobody teaches you that because you don't learn how to work with latency code in college, right? Like you don't know how to really work in a team when it's not a team that's like last for six weeks or whatever. And, you know, it's the classic college team kind of projects. You don't know why like these processes are there right you don't know what it's like to write code that needs to survive past the grade right needs to be maintainable like why does it
Starting point is 00:14:36 all look this way right there's just like a bunch of that kind of stuff that i felt like oh yeah we like keep teaching that and over and over and over and my my coauthor, Christopher Minney was also saying, experiencing something along those lines. And, you know, he tweeted something like, oh, you know, wouldn't it be great if there was a course or a book or something? And I was like, I said the same thing a year and a half ago. We should like just do that now. And so we got together and did it. That's pretty cool. Like before we move on from that school topic, recently I've been having some conversations with a few folks who have, their kids were about to go to school and they're wondering, hey, what major should we pick? And part of what you mentioned earlier, like you even remember the course number for your algorithms
Starting point is 00:15:18 class and you still remember that to have contributed a lot to your career with LLMs and tools like chat, GT, cloud and whatnot. A lot of the easy problems are becoming way easier and entry-level jobs are becoming much harder. And this is a question, like I recently talked to folks, again, one person's in India, one person's here in Canada, and all of these people are grappling with the same question. Like two years ago, we were planning our kids would go to computer science because our kids were interested in that. At this point, looking at where the industry is going,
Starting point is 00:15:50 we are kind of questioning that choice and thought process. Like, should we even do this or not? What's your perspective on this? Yeah, I think that's, I expected my perspective will evolve over the next couple of years. That's a really hot topic these days. Steve Yeager just recently posted a great blog post
Starting point is 00:16:08 called, I think, like the end of junior software engineer. Oh, yeah, that's a really good one. Yeah, it's quite long, but it's good. And also, that topic has just been coming up. Charity majors from Honeycomb recently posted about the need to create those jobs and how, like, in the 90s, right, like, you just kind of were out of high school, became a sysadmin somewhere because you knew how to type. And, like, that was your on-ramp. And then, like, a lot of those people are, like, hugely successful in the industry now, right?
Starting point is 00:16:41 And, like, that on, even before the AI thing happened, like, those on-rramps are gone, right? Like you have to know a lot more in order to get in. So it becomes harder and harder to find those roles where somebody who is new and not experienced can be successful and learn how to not be new, right? And with AI, it exacerbates the problem, right? Because you don't need a junior to write like the little bash scripts and the basic thing once you've outlined the problem. And outlining the problem and explaining what it actually needs to do and how it fits into the context of everything, that's what seniors do. Pretty much. You can't do that until you're a senior. How do you learn to be senior when those jobs are gone, right? Yeah, it's a huge problem, I think. But to the point of, is it worth studying computer science?
Starting point is 00:17:26 I think just like learning math or something, it's more about teaching you how to think and how things are connected than it is about doing the thing. Like what I think is going to die. And I see you're wearing the Insight t-shirt, which is maybe a little on the nose. So Insight Data Science, but they're not called Insight Data Science anymore, right? They renamed, but it was called Insight Data Science. That's right. That's how we met.
Starting point is 00:17:56 Oh, yeah? Are you both Insight? We both were at Insight. That's how we got into data engineering. And then I went the infrastructure engineering route and Guang worked up the stack. Okay. Yeah. So, I mean, specialize in like essentially reforming like physicists into data scientists, right? Like, or folks who acquire a lot of skills through studying hard sciences, but like those skills maybe aren't as marketable, but are hugely useful.
Starting point is 00:18:26 And with just a little bit of polish and here are the tools and here's how you represent that, you can turn that into a data science or data engineer position because you actually have everything required. It's just that you need some terminology and a little bit of a finishing school, which is what Insight is. I thought it was a great program, by the way. Like I've hired multiple people out of that. The point here is learning like how to think about large problems and break them down, learning how to work through them, right? Like I think those skills are critical. CS is one way to get there. These days, I barely write any code, but having a background in computer science
Starting point is 00:19:09 is hugely valuable for my job as a VP or manager because I need to understand what's going on. I need to understand how systems work together. In fact, organizations are systems. And if you understand distributed systems and how they interact, with a little bit of a transformation, you can start thinking about the organizations, how they interact and how to build that system. So I think systematic thinking is still critical.
Starting point is 00:19:34 And you're not going to get that from like chat GPT, right? Like you can have chat GPT do some routine work for you and incorporate that into a creative process. But the actual thinking about what needs to happen and why it needs to happen, right? You need to have your brain prepared to be able to do that. And to me, CS, SRS curriculum is a really good one for that. But like, you know, my major was actually EECS, so electrical engineering, computer science. I don't do any electrical engineering whatsoever.
Starting point is 00:20:10 I still don't entirely know what a transistor does. You know, like there's a bunch of stuff where I'm like, you know, like that's just somewhere inside the computer and it doesn't matter. But then like GPUs come out and it's really helpful to have that background
Starting point is 00:20:27 to understand how they're actually different from CPUs, right? Because, like, understanding pipelining, for example, understanding other things, right? Like, it comes in, even if you're not designing the stupid chips. Yeah, that's true. So, Ranec is writing down theoretical physics PhD in the message
Starting point is 00:20:46 to his friends in terms of, uh, with their kids. Yeah. For, for me, actually, this has been relevant to like someone I just talking was just talking to like last week, they are like a few years into their career as a software engineer. And then they're thinking about going back to school for like a master's. So you did exactly that. Like, what was the calculus like? Like, I guess, first of all, like, do you think it was worth it? Absolutely. Yes.
Starting point is 00:21:16 Yeah. So what was the calculus, I guess, going into it? What made you decide? So it was huge for my career. I think so. First off, you know, I have a fairly pedigreed resume, but it really didn't feel like that the first few years out of school. So after the Lawrence Berkeley Lab job, I went to ask.com and I was there with a really wonderful data engineering team that worked on sort of all the ETL and data warehousing
Starting point is 00:21:39 and an application on top of that for data analysis. And that's where we sort of started building that distributed system kind of that maybe in retrospect, we shouldn't have been allowed to build, but we were doing it anyway. And it was a lot of like stumbling around in the dark and rediscovering from base principles, like how you build these things, which was a great lot of fun. And it was a team of four or so people, kind of depends on how you count. And two of us,
Starting point is 00:22:08 so it was me and my buddy, Pete Alvaro, which is a name you may recognize if you're in distributed systems or databases. So, and he and I were building these things, most of him, and I was sort of the junior assistant and getting really really excited and like having this problem where Oracle was falling over and it was Oracle Rack a fairly high powered system but like the queries would just run forever we had to we realized that there's particular ways that are not what they teach you in the textbook that we were organizing our tables because they were so big and we had to get Oracle to like give it enough hints to be able to process these summaries. And then our prototype on 17 machines that we scrounged from the search team
Starting point is 00:22:51 because they moved on to bigger and better ones. And we were like, oh, keep those. We'll take them. We had to know which rack which one is because the interconnect on the same rack is higher throughput than in between, so you have to be smart. And we started putting it there and like suddenly our thing is like way faster to do it our way than it is on oracle and oracle is obviously a giant so that was a lot of fun and this is like 2007 2008 era now our management was kind of churning you know as Ask was changing out from CEO on down. And they were always very appreciative of us and very kind, but also strongly felt that the way for them to solve this problem that we're having with data processing was to buy a bigger Oracle machine or to get more licenses for the latest,
Starting point is 00:23:42 whatever feature Oracle was trying to upsell them on, or to change the storage and put it on a SAN and all of this kind of stuff where it was sort of pay more for bigger, fancier hardware, kind of scale up versus scale out. And at the same time, they had a consultant who was a professor, a CS professor from Santa Barbara named Divi Agrawal, who really encouraged us to sort of keep experimenting with our stuff. And eventually it got to a point like the Hadoop, sorry, the MapReduce paper came out from Google. We were super excited about that.
Starting point is 00:24:20 Hadoop wasn't open source yet. And then the Bigtable, I think, paper came out. And there was just like all the stuff coming out of Google. And we were reading it. And we were reading their ideas, like mapped to what we were doing. We're like, oh my God, we're not wrong. Like even Google's doing like, check that out. We started going to technically public database group reading,
Starting point is 00:24:38 paper reading seminars at UC Berkeley, which was like, you know, I was still living in Berkeley, like ASselt Com was headquartered in Oakland. So we'd like take extra long lunches and go to the Berkeley campus and sit in on their paper discussions with like a bunch of grad students talking about database papers. And first off, it was hugely intellectually stimulating. Second, it was sort of slowly dawning on both of us that we could hang. Like we thought we were total, you know, imposters. And we're like, wait, I understand what they're talking about.
Starting point is 00:25:11 Wait, I have opinions about what they're talking about. Like, I'm actually like, I kind of, I'm kind of grokking this. Like, this is cool. And Divi was encouraging us. So one thing led to another and both Pete and I were like, we're not going to get to do the things we want to do at Ask because what we want to do is build distributed systems. The Hadoop MapReduce style world is coming and people at our job weren't listening to us yet about that. They wanted to buy bigger SANs. And so we were like,
Starting point is 00:25:44 good luck with the Bigger Sans. We're going to do the distributed system and database thing. And so we both applied to grad schools. He wound up, Pete wound up going to Berkeley and being a intellectual legitimate part of that paper reading club. And he was Joe Hellerstein's grad student and is now a professor at? It's Santa Cruz, I think. Santa Cruz, yeah. And I wasn't man enough for a PhD, so I went into a master's program at CMU, so I went there. So that was kind of the story there.
Starting point is 00:26:21 It wasn't really a calculus of like, this will make me more, you know, hireable. It was just, I was really enjoying the thing. I saw that this is where, I had this feeling that the problem Ask was experiencing with too much data was a problem other companies were going to experience soon. And Ask was just a little bit ahead of the curve because of the nature of being like a top, you know, one of the top hundred like web destinations at the time and the amount of traffic that was hitting it. I was like, we're seeing a preview. This is going to come. And I want to have the time to really get into it and like read the papers, talk to other people who are reading the papers, like just have that Berkeley seminar, but all the time for a year.
Starting point is 00:27:04 And that's what I did. And Carnegie Mellon was great. It was a really interesting program because it sort of allowed you to pick and choose, you know, if you wanted to go into ML, if you wanted to go into distributed systems, you wanted databases, language technologies, like language understanding and stuff like that. So, and a lot of kind of time for yourself. So I mainly went because, yeah, I wanted some time to have a good excuse to just read the papers and implement stuff and talk to people. And at the end of that, you know, I got into open source because at that point, Hadoop and Pig and various other things that are now like half of all of Apache Foundation projects were starting to get open source and so I made connections with the community because that was
Starting point is 00:27:48 like my master's thesis project that led to interest from Facebook, Cladera, Twitter who were all like also in on that technology right and so like it's just kind of yeah that carried me along. Nice, nice. Was PhD like an intentional, like, not made for PhD? Like, let's break that down. Is that a commitment or like you don't want to go into academia? Like, what was it? I didn't know that there was that much thought put into it, really. Like, I think maybe it was the time commitment. There's a balance of sort of how much you want to be an expert on a thing
Starting point is 00:28:29 versus like have a deep enough understanding of a thing, but like have time for other things. So I think maybe there was part that, maybe it was part sort of, yeah, how much time do I want to spend in academia versus like going back to industry. Obviously, there's a financial component involved there. So it worked out.
Starting point is 00:28:50 But yeah, also, I know a lot of CS PhDs. And man, it's fun to be able to do that, I bet. Really work a problem for six years or something. Some really cool stuff comes out of that. So this joining the the or unofficially being part of the berkeley paper seminars that you mentioned where you were just part of that group reading papers with the folks or the students at berkeley how did you find them and how did you become part of that group i mean it's posted on the website i don't know if they're
Starting point is 00:29:20 still doing it then i certainly don't want to send like hordes of people. But like, I went to Berkeley. So like, I knew where the room is, you know? And yeah, they were like, they had stuff posted on the website. They were like, here it is. It's open to the public. And so we did make sure not to eat the pizza until all the students had pizza. We figured that'd be a step too far. And we like step back. I see.
Starting point is 00:29:45 So one thing you mentioned earlier on the topic of like big data, for example, so it's a thing that we call big data is now AI. Say more. Why do you say that? Well, I was, that was a joke. I understand. I used to say very large information systems
Starting point is 00:30:02 is what we now call the data, but now big data is not sexy anymore right nobody cares about big data but what they do care is your token count and the number of parameters in your in your machine learning model but it is the case that a huge amount of work in building training lms or RAG systems or any of that stuff that is sort of right now a big topic of conversation is where do you get the data? How do you store it? How do you access it?
Starting point is 00:30:35 How do you ensure quality? Right? It's all the same problems. The solutions may be a bit different. Like the requirements are different. Well, exactly how you store it. Actually, most of the data is stored in Parquet anyway. But, you know, like various data set formats and all of that stuff, that's different.
Starting point is 00:30:51 But the fundamental problems of like data quality, data collection, data freshness, data versioning, and doing that at scale, those are the same, right? Like some things change, right? Like now nobody uses HDFS anymore, or at least not if they're starting from scratch. You're going to use S3 or something and build on top of that. Streaming has come out and how to incorporate that. A bunch of things have happened since, but the fundamentals of data engineering are critical to having good AI systems.
Starting point is 00:31:23 So it all feeds one into another. Is this a solved problem at this point? The reason I ask is, when I say solved problem, what I mean to say is there is a recipe for how it should be done largely, and then it's a matter of applying that to a specific domain and getting engineering right.
Starting point is 00:31:38 And the reason I say that is, if I rewind the clock and go back to 2015, around the time when I first got into data engineering, it used to be like, well, there are just so many jobs in data engineering and not many companies have a good ETL pipeline. For example, many of these projects were new and you kept seeing new systems become part of Apache Software Foundation and one replaced the other. You don't see any of that chatter or not enough of that chatter. Maybe I'm just not following that as much. But in your perspective, has this specific domain of ETL matured enough where it's become more of like apply the same principles but to a specific domain? I think some. I think that it's like so many problems. Once you solve a number of them,
Starting point is 00:32:23 you climb that hill and you realize there's another hill behind that hill that's even bigger and you didn't see it before because you were busy climbing the hill. You know? So yeah, like, well, first off, orchestrating your ETL, right? Like, sure, like that was solved in 2015
Starting point is 00:32:38 with Airflow or maybe in 2013 with Luigi or maybe with like whatever came before that, right? Like they were all solved, but it turns out you can keep inventing better mousetraps and now we have flight and we have prefect and we have we have a number of these and we have temporal which is like a whole other like way to to build these things so Daxter I should always mention Daxter because they're really cool for some reason I didn't remember to mention them so first off like as we evolve and
Starting point is 00:33:05 as like the system of tools that we're connecting together changes like that has an effect backwards and assist these changes but also you know despite having been in data engineering since like whatever 2000 early people would talk about data versioning and I would be like, I don't get it. And they'd say it's Git for data. And I'm like, I know all of those words, use Git all the time, work with data all the time, still don't get it. And then you talk to data versioning people and it's actually four different things, what they actually mean by data versioning. And it's only when I got to running the AI enablement team at Ginkgo Bioworks that I finally got why I actually need data versioning.
Starting point is 00:33:53 Because it's so important to know for a given model, what the version of the dataset was that you tested it on, or that you trained it on, and for reproducibility, you need to have the raw source data when you're interacting with the LLM and it's giving you weird answers. And you need to be like, wait, was the data X included or not included in this? And historically, what we've had is data lineage. So you can say these datasets fed into this aggregation or whatever, and that gives you the data lineage. But you really want to trace back to like,
Starting point is 00:34:31 was the, not just this data source was included, but like which version of that data source, like I need the specific tag of which specific datums were in there, right? And that is coming up in a huge way now. And so I think that will also change how people think about things like Iceberg, which by the way, for a soft problem, that's a hell of an acquisition for DataWorks. Oh yeah. That already had basically the same thing. So I think there's a lot to be done for
Starting point is 00:34:56 data engineering. As different systems evolve, as what our expectations are change. You know, like block storage was basically done and then people started like working off of plain S3. Then like Kafka was basically a solved problem. And now like it's being completely re-architected within Confluent because storage technology has changed. Then like how people use storage technology has changed. So I think it just keeps, it keeps moving. Good luck to any LLM writing one of those. So you worked at Twitter after your graduate school, and then eventually you moved to biotech.
Starting point is 00:35:37 That's a jump that's not very commonly seen. So what prompted to go from consumer tech to biotech? Well, first off, it's more commonly seen that you'd think, although more often so into health tech than like viewers in bio. So for me, that was a couple of years after the IPO. You know, I grew up with the company. Like I joined Twitter when it was like 100 odd people. And it was in the thousands by the time I left. And six years, so like a long time also. I was looking to do something new, something
Starting point is 00:36:05 different. And I was having a bit of a sort of struggling at the time with the whole ad supported, like what is actually our product dilemma, right? The users aren't the people that you're getting your money from. There's kind of a three-sided thing going on with advertisers and the company and the users. And I had this desire for something simpler, rightly. Also, I wanted something that would be impacting kind of the physical world. So I started investigating drones and self-driving cars and this and that. And then my friend from when I briefly worked at Cloudera, Aaron Kimball, wanted to grab a beer.
Starting point is 00:36:49 And he started telling me about this company that he joined, that he was CTO of and it was pretty small. And that was Zymergen. And he explained to me what it did. And I had to have him explain it to me like three times because he just thought of like complete science fiction. He was like, okay, so we have software and machine learning that takes DNA sequence and decides what changes to make to the DNA sequence to do something. And then we tell a bunch of robots to actually make that DNA sequence and make that change in an organism. And then we grow that organism. And then the organism makes for us as a tiny little factory, some new molecule that hadn't existed before. And then that molecule is really useful in a bunch of industrial uses, or agriculture, or pharma, or whatever. And then we look at how that little factory works and say,
Starting point is 00:37:37 we can optimize that. And we use the ML to figure out how to optimize it in the software and so you have machine learning driving robots changing dna of microbes that make new chemicals right like every one of those steps is magic indeed we can do what like i even had background in this right like i was around when the human genome project was happening you know like i had the front seat front row seat to that and still he was telling me what's happening. I was like, that's impossible. What are you talking about? You can't have robots changing DNA of organisms. You can, as it turns out. So it just sounded so nuts and so futuristic, sci-fi and cool. I was looking at all these companies that were doing cool things where AI and ML and
Starting point is 00:38:25 data could make an impact. I just kept, my brain kept coming back to that company. I was just like, that would be, that could be, you know, the 21st century is a century of biology. Like, what an awesome dream. Like it just really grabbed me, the vision. So I joined Zymergen and became a VPS software there. Yeah. And then it was eight years of Symbio. So one question, this is mostly to understand some of the aspects that you said. So robots mutating some aspects of organisms. Can you elaborate a little more on what that really means? Because yes, it does sound... Slow down a little bit. Maybe give an example, because that does sound like science fiction.
Starting point is 00:39:07 Yeah. And this was 2015, right? Like, okay. So to make it a little bit more comprehensible, a single cell organism is a tiny factory that has a bunch of chemical reactions going on inside it. And those chemical reactions create new molecules, new enzymes. Like they produce a bunch of small molecules. This is just a process of being a lot. They convert sugar into other stuff. Basically, that's what cells do to
Starting point is 00:39:33 be grossly reductive. We can sequence the DNA of the organism. We can figure out function for a decent chunk of the genes that we discover in the DNA. We can look at the regulatory element of that, and we can say like, hey, you know, this reaction that's catalyzed by an enzyme that is coded for in the DNA is really useful. We want that reaction to happen more because the cell makes something that we want to collect, right? Like, I don't know, vitamins or pharmaceutical enzymes or like
Starting point is 00:40:08 antibodies or what have you, right? In order to do that, we can change other DNA, like other regulatory sequences that essentially are switches that tell the organism like how much of a certain reaction to do or how much of a piece of DNA to do transcription on. DNA turns into RNA. The more of that you do, the more of the RNA you have, the more RNA you have, the more of the protein you're going to make, et cetera, et cetera. But it's kind of like trying to optimize a binary that was like compiled from some C code, except you don't have the C code, you just have the ASM.
Starting point is 00:40:48 And your problem isn't even to debug it or to understand what it does, it's to make it go 15% faster. By the way, you don't actually know C. And you don't have the ASM manual either. So that's the problem. You have the code, but it's like the bytecode. And it goes for something. There's a system down there somewhere. But like the only way to really understand what's going on is to keep poking at it in the physical world.
Starting point is 00:41:15 Change it a little thing, see if it blows up. Change another thing, see if it blows up. And so robots come in because you can do this in high throughput. Relatively high throughput. High throughput for do this in high throughput, relatively high throughput for biology, not high throughput for like web services people. What is high physical real world? What does that translate to in some numbers form? Depends on what you're doing, but the standard unit is a 96 well plate.
Starting point is 00:41:38 So 96 concurrent experiments, actually a few smaller because like you want to replicate them. So divide by two and also some of them are controls, but on that order. And then you can run a few plates at a time. So maybe you run a dozen plates at a time. And that experiment might take you multiple weeks to get the results back. And so you use robotics to do that because you're not going to hand pipette all those things. There's actually a bit of a combinatorial explosion happening there too. But you can do pooled experiments. People who actually know Symbio will think that this was a very gross explanation,
Starting point is 00:42:14 but I'm simplifying things. All right, guys? Thanks for the simplified explanation for all the software engineers out there who might not be as familiar with the Symbio side of things. I'm one of them. So you have robotic automation, sorry. You have lab automation, which is like robots that move these arms and they have little pipettes on them and they can move all the little liquids around. But biology is basically the science of moving clear liquids around
Starting point is 00:42:39 and keep a track of which one's which. And then some of them sometimes turn color and that's very exciting. My parents are both in medical research. Okay, well, ask them. Probably they described to me. See if they agree. So you use robotic automation. So then you need software to drive the robot. Like old school bio was done by hand. Like you pipette things by hand, but you can't do it in high throughput.
Starting point is 00:43:08 So you get a robot, but now you need to tell the robot exactly what to do. And then you're trying to do too much for the robot. And so you need to start like optimizing things and being like, you know, it would be cool. So much of our time is spent on like setting up the experiment. And then the robot is fast, but the setup was slow. Like it'd be cool if we could set up many things at once and like somehow multi-thread the robot and then like what we actually need is like multiple robots working together and handing things off to each
Starting point is 00:43:35 other and like how do we coordinate that whole system so everything becomes workload management right like and it's like gigantic dag in the sky so we needed to do all that and there's a lot of like scientific instruments that read the results of all that stuff and then you need to analyze it and then you need to figure out the correlations between the dna changes you introduced and the results and like what will be the next dna change that you do and these days there's all kinds of cool systems that allow you to do generative modeling of proteins. And then given that protein that you model, get a DNA sequence that would result in that protein. That's one to many.
Starting point is 00:44:16 There are multiple sequences that might result in the same protein. Test them all out by printing the DNA, shoving it into a microorganism, getting it to express, seeing what happens, and so on and so forth. It's a very, it's actually a very high tech and very data intensive process. Like when you say like the iterations are so slow, what comes to mind is that cartoon of like the two devs, like doing sword fighting and then saying like, oh, my code is compiling, like when their managers are asking them. But now you're like sword fighting for like days.
Starting point is 00:44:52 I guess that must have been kind of a surprise going into it, just how long the iteration cycles are. Like how did you guys like kind of work with that? Yeah, I mean, that's coming from like, so I build a product experimentation platform at Twitter where like, you know, my tool or my team's tool is what collected the data on all the A-B test experiments that Twitter ran and, like, provided all kinds of analysis and that side. And, yeah, the amount of data points you can collect on Twitter by, like, turning that thing on versus the amount of data points you collect from a biotech experiment in the lab. I mean, it's orders of magnitude, different in terms of number of data points and the
Starting point is 00:45:34 amount of time inversely, right? And that's really difficult. It means that every data point is massively more expensive in the bio world than it is in the sort the internet world. But you can just get a million more impressions in no time on the web. You cannot get a million more experiments, at least not with high fidelity and a lot of control. You can do random mutagenesis and you'll get all kinds of things. But then most of the time, you don't know what most of them were,
Starting point is 00:46:06 except the ones that were the winners. So you lose some information. So there was a lot of focus on where can we take the slack out? So we're only limited by biology and not by like, well, this team starts their runs on Mondays and this other team starts their runs on Mondays and this other team starts their runs on
Starting point is 00:46:26 Wednesdays. So, you know, you need to make sure to have the handoff from the Monday team to the Wednesday team happen on a Tuesday, because if they're late by one day, like you have to wait a whole week. How do we like remove these strong couplings? Or again, it's like it becomes optimizing systems and just figure like sometimes the problem isn't even the biology. It's like, did we organize ourselves in a way that will be conducive to rapid experimentation? Why are these even different teams? Can the same team do this, right? Like, can we just get rid of the handoff? Especially as you scale, like what you want to keep an eye on is we know there is a fundamental limitation,
Starting point is 00:47:07 which is the speed of biology, like how fast the organism grows or whatever. How much longer than that are we right now per cycle? Where is that coming from? And then driving that down. Assuming speed is your problem, sometimes you will pay some extra time to be able to get higher throughput because you're okay with just getting more data points at the cost of time versus being able to iterate. So a couple aspects I can think which would be different
Starting point is 00:47:39 from just consumer tech or internet web companies and biotech, for example, like you mentioned, speed is something that's different every data point is way more expensive and way more valuable from that perspective i would imagine precision would be another part which would be way more valuable in biotech as opposed to like web companies margin of error and like just the cost of mistakes what are some of the things that are done differently when it comes to software when you look at biotech versus consumer tech and i'll add some context
Starting point is 00:48:12 to this question too we were having a conversation at the on a previous podcast where it was like i'm talking to matt klein who built on y and he was saying that today because the speed of software iteration is so high the cost of mistakes is much lower. So sometimes the incentive to get things right in the first place is a little lower because you know you can fix it in the next PR or you can roll it out very quickly. Whereas for biotech, that's not true. Like cost of mistakes can be brutally high. So what are some things that are done differently from a software perspective then?
Starting point is 00:48:44 So yeah, one thing is that you're mostly writing internal tools. So all of that idea of like A-B testing that goes out the window, right? You can't expose 5% of the scientists to this thing that might or might not work, right? Like, no, that is not how you ship that kind of software. So you still want fast iteration cycles, but you also want to make sure that when things are ready, they're really ready. You'd think that leads to some sort of like crazy waterfall situation.
Starting point is 00:49:20 What works much better is just a very tight coupling between the customer and the software engineer. Also software engineers frequently just don't have the domain knowledge, right? It's not like I can kind of, I can use Twitter, right? Like I can test the thing myself, like maybe not as well as somebody who does a profession name, but like, I get it. Or like ads, like I kind of get it, you know, I can learn it. You're not going to learn like that. Even if you learned a lot of biology, it's just so wide that the particular thing
Starting point is 00:49:45 you're dealing with is probably not a thing you've learned already. Like every month it's some new thing. So very tight collaboration and acceptance testing, you know, which is like a thing people used to do in the nineties. It's a real thing when you're shipping for internal customers and it can just like not work or work incorrectly. There are some practices, like make sure you can redo everything. So like save data every step of the way, like that sort of thing so that nothing is lost.
Starting point is 00:50:12 And, you know, sometimes things get lost and it's a problem. But developing that tight relationship with the customer, like the worst is when the customer team is sort of doesn't have time for whatever the new software is getting developed. So you don't get feedback from them. So the first time they're exposed to it is when like you've changed how the thing works and that's when they discover there's a problem, right? Because like they're super dissatisfied.
Starting point is 00:50:38 You spent like six months working on this thing. It turns out they didn't need it in the first place or whatever. So we put in a bunch of processes like weekly check-ins slash demos, having our scientists walk us through what they're doing, being very open to kind of changing course as we discover and learn new things from these interactions so that by the time that sort of the feature is complete, it's co-developed with the scientists, not sort of developed for the scientists. So one thing which makes, or rather I would say, a lot of engineers are attracted to what Guang was referring to earlier as tech tech or like consumer tech, for example, where software is the product. When you look at companies like Zimmergen, for example where software is the product when you look at companies like zimmergen for example software is a means to getting to the product it's not the end
Starting point is 00:51:30 itself yeah in that case you're always optimizing for what that product needs to do and software is a priority to get there but it's not something that you highlight very often so when it comes to attracting talent is that something you saw different or were there challenges in attracting talent for the company considering this difference? Yeah. So first off, yeah, that's a real thing, right? Like the software is not a product, something else is a product. You're sort of a cost center, right? Like software is also very expensive. It is also the case that on average, software engineers get paid more than scientists are. So you have this really bizarre situation where the science is in the sort of, it's the thing the company does. And scientists are like, why is my support team making all this money and I'm not making all this money? What's going on?
Starting point is 00:52:18 This is true for biotech companies too? Oh, wow. I didn't know that. Hopefully I didn't just create problems for a bunch of companies because scientists listen to those podcasts and go wait what no i mean i think it's like a supply and demand right just because traditionally the there's just been a lot more training for more scientists like i mean yeah and and the fact that like software engineers have options outside of biotech right yeah it's the other companies which are setting that benchmark not necessarily biotech companies and. Yeah. It's the other companies which are setting that benchmark, not necessarily biotech companies.
Starting point is 00:52:45 And you still want good software engineers to work at these companies. Exactly. Yeah. And we were able to attract people who had experience at Google, at Twitter, in my case, all kinds of flagship places. And it's mostly, there's some baseline amount of comp that you have to offer where people don't feel like they're sacrificing their quality of life or not providing for their family or something like that. And that number is different for everybody where they feel like
Starting point is 00:53:17 this is an acceptable number. It maybe is not the number of their dreams, but it's an acceptable number. And then it's, but what do you want to work on? And some people are just really interested in synthetic biology as a problem. They're really into the vision of 21st century being the century of bioengineering and the way that 20th century was era of petroleum, essentially. Right. There's a motivation to get into and have impact on healthcare, on sort of green technologies, all that. Or just like, they're just really into the domain, like really into the problem. It's just really interesting. And it is really interesting.
Starting point is 00:54:00 You know, among people doing AI for bio, I've heard a number of them say, when I'm excited to talk about some latest model from Anthropic, the latest Claude model or whatever, they're like, yeah, but LLMs for language are boring. LLMs for proteins, that's cool. Having an LLM that can spit out a protein for you, that's awesome. That's what I want to work on. Everybody else can work on the whole "write an essay in the voice of a pirate" thing.
Starting point is 00:54:37 I'm just not into it. It's not cool. Proteins are cool. So it's mission- and problem-motivated, as long as there's a baseline of, we respect your skills. We don't get a chance to talk to many people with the right background. So, having done both software and biotech, I'm very curious: what's your take on AlphaFold 3?
Starting point is 00:54:58 I heard it on the All-In podcast, and they were like, oh, this shit's going to change everything, going to make so much money for Google. Yeah, but that's old news. ESM3 came out two days ago; that's what we should be talking about now. Apologies for my ignorance. No, no, I'm joking. So first off, AlphaFold, and AlphaFold 2 in particular, I mean, this was a major moment for biology, right? It didn't solve protein folding, but it was a massive leap forward. Kind of like the AlexNet moment in the AI world, where the game changed, right? What was impossible before became fairly straightforward, and you
Starting point is 00:55:45 don't have to be an expert. And immediately it was like, well, yes, that problem was solved, but that's not actually the problem we care about. What we actually care about is designing a drug that's going to be expressed only in the liver, and is going to pass all of the phase three trials and whatnot, and not have any side effects, or at least no unpredictable side effects, et cetera. Can AlphaFold give me that? And the answer is no, it can't. It tells you how the thing folds, right?
Starting point is 00:56:13 But as a result, it was the hill in front of us. And then there are many, many hills behind that, right? So AlphaFold 3 is a fantastic improvement. ESM3 is also a great improvement. All of these models are getting better, and the field is moving forward really fast. And we are so far from solving biology,
Starting point is 00:56:36 it is hard even to describe how far we are from solving biology. Like this sci-fi dream of, we wave a scanner in front of you and then we'll print the medicine and just stick it in. We're nowhere near that. And it's not a more-tokens problem. It's not like, well, if we just sequence more DNA and shove it in, it's just going to pop out. It's not that.
Starting point is 00:57:01 Remember, the problem is you get a bunch of ones and zeros, and you need to make the program do three more things and go faster, and also run on a different architecture. That is roughly equivalent to the kinds of things we're trying to do with biology.
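That compiled-binary framing maps naturally onto black-box search: you can't read the program, so you mutate it and measure. Purely as an illustration of the idea, here's a toy Python sketch, where run_and_time is a hypothetical harness that executes a candidate blob and reports whether it still works and how fast it ran:

```python
import random

# Toy illustration of the "poke it and see if it blows up" loop:
# treat the program as opaque bytes, mutate, keep what measures better.
# run_and_time is a hypothetical harness returning (still_works, runtime).

def mutate(blob: bytes) -> bytes:
    i = random.randrange(len(blob))
    return blob[:i] + bytes([random.randrange(256)]) + blob[i + 1:]

def blind_optimize(blob: bytes, run_and_time, iterations=10_000) -> bytes:
    best, best_time = blob, run_and_time(blob)[1]
    for _ in range(iterations):
        candidate = mutate(best)
        works, elapsed = run_and_time(candidate)
        # Most mutations break the program outright (the cell dies) or do
        # nothing; very occasionally one survives and measures faster.
        if works and elapsed < best_time:
            best, best_time = candidate, elapsed
    return best
```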
Starting point is 00:57:40 Can I relate it to autonomous vehicles? At the start, people were making so much progress that they were doing linear or even superlinear projections, like, oh yeah, we're going to have self-driving by 2020. Exactly, I remember. And then what they realized is, oh shit, the last five or three percent is substantially harder than all the stuff previous to that. Is that similar to what you're seeing in bio? I think it's a slightly different problem. With self-driving, at least we're able to evaluate whether the thing works. And we kind of understand the problem. We've all driven a car, right? We kind of understand it, even if we don't understand all the complexities.
Starting point is 00:58:03 Like, we didn't account for fog, or pedestrians doing crazy things, or balloons floating in front of windshields, or other things, you know, you start hitting all the edge cases. Here, it's more like, we don't understand how the cell works. We don't understand why the things that happen in biology happen. We understand some of them, and there is a lot more that we just have no clue about. So we can't train a model that will either generate or even predict the effects of things where we do not know the mechanism. We know what will happen when you turn the wheel on a car. We do not know what will happen when you change a bunch of DNA, most of the time. I mean, we kind of do.
Starting point is 00:58:52 Most likely, either nothing will happen or the cell will die. If you just start randomly changing things, those are the two outcomes. But if you're trying to actually achieve something, you just don't know what will happen. So in this case, given these generative models, and especially the part you mentioned earlier around using LLMs to generate new proteins and then finding ways to test and verify how many of them are even viable and actually good, would you say that this helps speed up the process to an extent, but doesn't solve the problem entirely? It speeds up the design process a lot. So where these things are being applied now is in the areas where we have more data and more knowledge, right? There's a bunch of stuff
Starting point is 00:59:34 where we just don't know how stuff works. And it's very hard to develop proteins with certain characteristics. The function might be right, but it might not be powerful enough, or maybe it will be too promiscuous, or maybe it does the right thing but it does it in the brain as well as in the liver, and that means a death sentence. The line between poison and drug is sometimes very thin. So it definitely helps with the design, and with a wider diversity of designs, because you're not limited by human intuition and imagination. You can explore a wider space. But then we're still throughput-limited in terms of synthesis. So these things that are designed, it's only a hypothesis that they will do a thing,
Starting point is 01:00:23 right? So you actually build them in the lab, test them in the lab, get a bunch of data back, maybe do a few cycles of this. And now you have some preliminary results of, this has promise. And then you have, in the case of drugs, your ten-year-long pipeline of further refining it, figuring out how to determine whether or not it really works, what the side effects could be, experiments in mammals, and if everything goes great, eventually recruiting human subjects, and so on and so forth.
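To make that design-build-test-learn cycle concrete, here's a minimal Python sketch. Every name in it (generative_model, synthesize_and_assay, the promise threshold) is a hypothetical stand-in for a real generative model, a real lab workflow, and real assay analysis, not any specific company's pipeline:

```python
# A minimal sketch of a design-build-test-learn loop for protein candidates.
# All names here (generative_model, synthesize_and_assay, the 0.5 threshold)
# are hypothetical placeholders; real loops involve lab automation, QC, and
# far more nuanced analysis.

def design_build_test_learn(generative_model, synthesize_and_assay, rounds=3, batch=96):
    """Run a few cycles: design candidates, test them in the lab, refine."""
    training_examples = []
    for r in range(rounds):
        # Design: sample candidate sequences; each is only a hypothesis.
        candidates = generative_model.sample(n=batch)

        # Build + test: synthesis and assays are the slow,
        # throughput-limited step. Assume a {sequence: activity} result.
        results = synthesize_and_assay(candidates)

        # Learn: keep what the assay says worked, feed it back to the model.
        promising = {s: y for s, y in results.items() if y > 0.5}
        training_examples.extend(promising.items())
        generative_model.update(training_examples)  # hypothetical fine-tune

        print(f"round {r}: {len(promising)}/{batch} candidates show promise")
    return training_examples
```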
Starting point is 01:00:53 Designing a drug is also different than designing a protein, right? Like, actually how you deliver it, how you can scale production, all that stuff. The part where an LLM designed the protein that folded the right way is the very beginning of a very long process. So there's still more work to be done. So this is me asking a new question, and pardon me if this sounds dumb. A lot of what you described, if we had to map it back to the earlier explanation that you shared about Zymergen, for example, where we are trying to make some changes in a microorganism, seeing the kinds of chemicals it produces, and collecting
Starting point is 01:01:32 data along the way, and seeing whether whatever is produced is useful or not. From a protein perspective, can we contrast the two cases, where one is there's no LLM and someone has to design a new protein, versus now there is an LLM? So the LLM is helping you design, but someone still needs to prompt it in a certain direction. So can you maybe differentiate what these two ways of designing a protein look like? Okay. I'll add some more context. The reason I'm asking this question is because, in my head, I don't have a simple explanation that I can give to someone else. When we say design a protein, to me it's like, I have no idea what that means. So how is an LLM able to come up with stuff that
Starting point is 01:02:15 is right? So maybe you want a before and after: without an LLM, how do you design a protein? Yeah. Which is different than the whole Zymergen thing, because that was more about, we know how to make a thing we like, and we want to make more of it, or we want to make it in a different organism. Like, there's a rare plant that produces some pheromone that is very useful for whatever reason. We get those genes and stick them into a microorganism that we can grow in a vat, and have it make that without having to harvest
Starting point is 01:02:46 a rare, exotic plant. Leaving that aside: designing proteins, before and after. First off, the before methods are still very actively used. And it's a scientific process, right? You have some hypothesis about the mechanism of action. You have some literature that suggests that a certain thing can be possible, and you know that certain changes to a protein can affect it in certain ways. So you maybe go through a workflow where you change the protein sequence, you run it through AlphaFold to see how it will fold. And then you literally visualize the folded protein, which is a 3D structure, and it will have a pocket. And you want to see if that pocket will bind to the molecule you're trying to bind to, for example,
Starting point is 01:03:33 if that's the kind of thing you want your protein to do. And so you go through that, and you have ways that you know generally work, and you try a bunch of them. It's sort of a directed search guided by human experts. And if I sound hand-wavy about it, it's because I don't have a PhD in protein engineering. And in fact, there are different kinds of protein engineering, and the different PhDs will do it differently, right? It's a very deep subject.
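As a rough illustration of that expert-guided directed search, here's a sketch under stated assumptions: predict_structure stands in for an AlphaFold-style structure predictor, and pocket_binds for the binding-pocket check a protein engineer would actually do by inspection or with docking tools. Neither is a real API:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical stand-ins: predict_structure() for an AlphaFold-style
# predictor, pocket_binds() for the expert's check of the predicted pocket.

def point_mutations(seq, positions):
    """Substitute at expert-chosen positions; everything else stays fixed."""
    for pos in positions:
        for aa in AMINO_ACIDS:
            if aa != seq[pos]:
                yield seq[:pos] + aa + seq[pos + 1:]

def directed_search(wild_type, positions, predict_structure, pocket_binds):
    """Try literature-motivated mutations; keep ones whose predicted fold binds."""
    hits = []
    for variant in point_mutations(wild_type, positions):
        structure = predict_structure(variant)   # fold-prediction step
        if pocket_binds(structure):              # does the pocket fit the target?
            hits.append(variant)
    return hits
```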
Starting point is 01:04:07 With the LLM, it's getting to the point where you can say, I want to have this backbone, and I want to make it have higher thermostability, or something. So essentially, it's conditioning your generation. If you think not LLMs but diffusion models, you can do conditional generation. And so it's getting easier, where you can say, I have a protein like this, and I want a family of similar proteins, because they have similar functions, but I want to adjust it in certain ways, right? This will generate candidates for you. And then you can go backwards from the shape that you generated to a sequence. And then you can synthesize the sequence, actually build the protein, confirm the shape, confirm the function, and so on.
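Here's a sketch of that conditioned generate-then-invert flow. Everything in it is hypothetical: structure_model.sample stands in for a conditional, diffusion-style backbone generator, inverse_fold for a structure-to-sequence model, and the condition keys are made up for illustration:

```python
# Hypothetical sketch of conditional protein generation:
# 1) generate backbone structures conditioned on desired properties,
# 2) "inverse fold" each structure back to an amino-acid sequence.
# structure_model and inverse_fold are placeholders, not real APIs.

def design_candidates(structure_model, inverse_fold, reference_backbone, n=32):
    designs = []
    # Condition the generator on a reference backbone plus a property
    # target, analogous to conditional generation in diffusion models.
    backbones = structure_model.sample(
        condition={"backbone": reference_backbone, "thermostability": "high"},
        num_samples=n,
    )
    for backbone in backbones:
        # Map the generated 3D shape back to a sequence we can synthesize.
        sequence = inverse_fold(backbone)
        designs.append({"backbone": backbone, "sequence": sequence})
    # Each design is still only a hypothesis until it is built and assayed.
    return designs
```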
Starting point is 01:04:45 And then you can actually go backwards and ask, why did that shape work? Or why did that sequence work, given what I know about the mechanism of action? Maybe even potentially discover things, because it's possible that the model learned some correlations that were not previously obvious, or that nobody had observed. And I think this field is particularly interesting right now, because with language models or image models or any of that, we may not be able to spell out the rules of the English language, or the rules of what looks nice or doesn't look nice, but we can instantly tell whether an image is aesthetically pleasing or whether a sentence
Starting point is 01:05:31 is grammatically correct. We cannot with a protein. We're like, I don't know, it looks like a sequence of amino acids. So what's cool is there aren't implicit rules in our heads that these things have to follow, right? We're still discovering the rules by probing at nature. So to some extent, it's entirely possible that, by getting better at interrogating the state of the models, and how and why they're generating things, when they generate things that are successful, we can go back and see what was actually activated
Starting point is 01:06:12 and find ways to examine that, to lead us towards potential discoveries in science. So from that perspective, I really like all the work that Anthropic and others are doing in model explainability. Right now, the reason Anthropic cares is that explainability tells us essentially how to control the system, right? If it does something weird, we want to know what's going on so we can fix it. Or we want to be able to translate what the model did into some sort of human-understandable explanation of why it made a decision, right?
Starting point is 01:06:47 It's sort of a QA function. But in biology, if the model is explainable, you might discover explanations that are not just confirmations of rules you know, but new rules. That's a hypothesis. It hasn't actually happened, so it might not happen. Maybe I don't know what I'm talking about, but that'd be cool.
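To gesture at what "see what was actually activated" might look like in practice, here's a minimal sketch using PyTorch forward hooks. The hook machinery is real PyTorch; the protein model, the choice of which layers to inspect, and the idea that any recorded feature corresponds to a biological rule are all assumptions for illustration:

```python
import torch

# Hypothetical: protein_model is some transformer over amino-acid tokens.
# Which activations correspond to meaningful biological features is
# exactly the open research question being described here.

activations = {}

def make_hook(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()  # record this layer's output
    return hook

def record_activations(protein_model, token_ids):
    """Run a sequence through the model and capture per-layer activations."""
    handles = [
        module.register_forward_hook(make_hook(name))
        for name, module in protein_model.named_modules()
        if name.endswith("mlp")  # assumption: inspect MLP blocks only
    ]
    with torch.no_grad():
        protein_model(token_ids)
    for h in handles:
        h.remove()
    # Comparing activations between successful and failed designs could,
    # hypothetically, surface features worth interpreting as "rules."
    return activations
```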
Starting point is 01:07:11 For all the listeners out there who didn't know much about biotech or didn't find it cool enough, you definitely make it sound super cool. By the way, are there any resources people could go to to learn a little more about this, and to educate themselves on even just the terminology you've been using here?
Starting point is 01:07:27 Yeah, so there is a great Slack community, and I think there's also a website. I think they're called Tech and Bio. Let's see. You know what, maybe we'll add it later to... Oh, yeah, for sure. We can add it to the show notes or something.
Starting point is 01:07:41 But yeah, there is a community of tech people who are interested in biology and are learning about it. There's a pretty active Slack group. There are meetups and paper discussions and things of that nature. So that'd probably be the first place to go, and then there'll be lots of pointers from there. Well, this has been an awesome conversation, Dmitriy. I know you have to go. We would love to keep the conversation going on the podcast, and hope there is a second time when we get to speak with you again.
Starting point is 01:08:07 But for today. Sounds good. We can do a part two. I do have to go and do a- Oh, no, totally, totally get that. But thank you so much for joining the show today. This was awesome. All right.
Starting point is 01:08:18 My pleasure. Great meeting you guys. Hey, thank you so much for listening to the show. You can subscribe wherever you get your podcasts and learn more about us at softwaremisadventures.com. You can also write to us at hello at softwaremisadventures.com. We would love to hear from you. Until next time, take care.
