Software Misadventures - Grokking Synthetic Biology | Dmitriy Ryaboy (Twitter, Ginkgo Bioworks)
Episode Date: July 16, 2024From building a data platform and Parquet at Twitter to using AI to make biology easier to engineer at Ginkgo Bioworks, Dmitriy joins the show to chat about the early days of big data, the conversation that made him jump into SynBio, LLMs for proteins and more. Segments: (00:03:18) Data engineering roots (00:05:40) Early influences at Lawrence Berkeley Lab (00:09:46) Value of a "gentleman's education in computer science" (00:14:34) The end of junior software engineers (00:20:10) Deciding to go back to school (00:21:36) Early experiments with distributed systems (00:23:33) The early days of big data (00:29:16) "The thing we used to call big data is now ai" (00:31:02) The maturation of data engineering (00:35:05) From consumer tech to biotech (00:37:42) "The 21st century is the century of biology" (00:40:54) The science of lab automation (00:47:22) Software development in biotech vs. consumer tech (00:50:34) Swes make more $$ than scientists? (00:54:27) Llms for language is boring. Llms for proteins? that's cool (01:02:52) Protein engineering 101 (01:06:01) Model explainability in biology Show Notes: The Death of the Junior Developer: https://sourcegraph.com/blog/the-death-of-the-junior-developer Dmitriy on twitter: https://x.com/squarecog?lang=en Tech and Bio slack community: https://www.bitsinbio.org/ Stay in touch: - Make Ronak’s day by signing up for our newsletter to get our favorite parts of the convo straight to your inbox every week :D https://softwaremisadventures.com/ Music: Vlad Gluschenko — Forest License: Creative Commons Attribution 3.0 Unported: https://creativecommons.org/licenses/by/3.0/deed.en
Transcript
And then the Bigtable paper, I think, came out, and there was just like all the stuff coming out of
Google, and we were reading it, and their ideas, like, mapped to what we were doing.
We're like, oh my God, we're not wrong. Like even Google's doing that. Check that out.
Robots mutating some aspects of organisms. Can you elaborate a little bit more on like what
that really means? Because yes, it does sound... Slow down a little bit. Maybe give an example, because that does sound like science fiction.
It's kind of like trying to optimize a binary that was compiled from some C code,
except you don't have the C code, right?
You just have the ASM.
Your problem isn't even to debug it or to understand what it does.
It's to make it go 15% faster.
Right?
By the way, you don't actually know C.
Right?
And you don't have the ASM manual either.
Right?
So that's the problem, right? Like you have the code, but it's like the bytecode.
And it codes for something.
There's a system down there somewhere,
but like the only way to really understand what's going on
is to keep poking at it in the physical world.
Change it a little thing, see if it blows up.
Change another thing, see if it blows up.
And so robots come in because you can do this in high throughput,
relatively high throughput for biology,
not high throughput for like web services people.
Today, because the speed of software iteration is so high,
the cost of mistakes is much lower.
So sometimes the incentive to get things right in the first place
is a little lower because you know you can fix it in the next PR.
Whereas for biotech, that's not true. Like cost of mistakes can be brutally high. So what are some things that are done differently from a software perspective? So yeah, one thing is that
you're mostly writing internal tools. So all of that idea of like A-B testing that goes out the
window, right? You can't expose 5% of the scientists to this thing that might or might not work,
right?
So you still want fast iteration cycles, but you also want to make sure that when things
are ready, they're really ready.
You'd think that leads to some sort of like crazy waterfall situation.
What works much better is just a very tight coupling between the customer and the software.
Like I really like all the work that Anthropic is doing
and others in model explainability.
Because right now, the reason Anthropic cares
is because explainability tells us essentially
how to control the system.
If it does something weird, we want
to know what's going on so we can fix it.
Or we want to be able to translate
what the model did into some sort
of human understandable explanation of like why it made a decision, right? It's sort of a QA function.
But in biology, if the model is explainable, like you might discover explanations that
are not just confirmations of like rules you know, but new rules.
Welcome to the Software Misadventures podcast.
We are your hosts, Ronak and Guang.
As engineers, we are interested in not just the technologies, but the people and the stories behind them.
So on this show, we try to scratch our own edge by sitting down with engineers, founders
and investors to chat about their path, lessons they've learned, and of course, the misadventures along the way.
So Dmitriy, I thought a fun place to start is your LinkedIn blurb. So it says VP of AI Enablement at Ginkgo. And I think before that you were CTO at Zymergen. And then, so you're the author of The Missing README, but then you also have data engineer on there. Yeah, I thought that was pretty cool. Do you see that as part of your identity? Is that why you have it?
Very much so, to the extent that I think of myself more as a data engineer than, like, a VP of whatever, or a CTO.
Nice, I like it.
I'd throw data engineer first. The other stuff I have to put first because, you know, it's LinkedIn, it's career stuff. But I come at pretty much everything from kind of a data engineer perspective, and that's, I guess, how I self-identify.
Were there skills from, like, the data engineering days that you still find useful, like in management or that sort of thing?
I think the main skill, the main body of knowledge I gained as a data engineer, was around
organizing, structuring data.
And I don't mean for BI, I mean for doing engineering on top of the data, right?
So laying things out so that they can then power search, they can power BI, they can
power experimentation platforms,
all of that, and sort of understanding also the connection between how services are designed and
instrumented, web services, and the way that they model their domains, and then how that data gets
represented in databases and various other data stores, and then how that
gets analyzed and used for other means, right?
And the connection between that, sort of the fact that data recorded about a service is
often as much of a product as what the service itself is doing.
I think just thinking about how those things are connected, just let me think about systems
at large. And so then whether
taking on sort of a VP of Software role or getting into AI, it just really helps understanding sort
of fundamentally both what you need to do in order to analyze data at scale and the impedance
mismatch between sort of how you structure for runtime and for serving and then
for the backend and where there's friction there. And that helps, I mean, from a management point
of view, it helps in terms of kind of aligning different teams and making people, helping people
understand sort of the different priorities and shedding light on kind of both camps.
I don't know that it's a data engineering skill. I think it was like the kind of data engineer I was when I was actually doing engineering made me very sort of, made me
have to deal with a lot of different camps and people who don't normally talk to each other.
So that I think really translated in terms of leading larger and larger teams over time.
So you've been in biotech for like the past few years. And then before that, you were in like tech tech.
So leading the data platform.
Consumer tech, yeah.
Leading the data platform at Twitter.
Yeah.
And working on Parquet.
But all the way back, your first job out of undergrad was actually being a software dev, I think, at Lawrence Berkeley National Lab.
Yeah.
So I thought it was...
During undergrad even, yeah.
I see, I see.
I thought that was kind of...
During and then a bit after, yeah.
Yeah.
Nice.
I thought that was kind of a cool, like full circle.
A little bit.
Yeah.
I guess along the data engineering lines, like, were there skills that you learned during those national lab days that played an important role later on when you got into tech?
Oh yeah, absolutely.
So the job you're referring to actually started as a sophomore, essentially as a way to pay the bills. I was working somewhere close to full time, which partly explains my grades. And I was there when the human genome got published.
And then for several years afterwards, as sort of sequencing of genomes improved over time,
although this was before what's now called next-gen sequencing, or NGS; those technologies came
out a little later. So it was still pretty expensive, but every year was much
bigger than the year before. And what my job evolved into was working with a couple of other engineers
in building distributed systems for processing all of this genomic data
that was coming in.
And it was sort of the old school distributed systems, you know,
running like Beowulf clusters.
And we talked about server racks.
It was literally like beige boxes on a rack,
like a kitchen, like shelves, wire racks that we put beige boxes on top of in a generous closet
that was not built for that kind of power or heat. So there was a lot there about sort of
processing data at scale, coordinating workers across multiple systems.
As you might imagine, those machines were not the most reliable things in the world,
so having to deal with failures.
We also built a web service to allow scientists to interact with the data that we were processing
there.
So there was a bunch there about visualizing data and serving it and giving people feedback
about a process that might run for
several hours. So just kind of all the state management and that kind of thing. In many ways,
the technology has changed and the capabilities change, but the high level of what that does
has remained. Those lessons, I think, translated and then kind of evolved and my next job was at
Ask.com. It got renamed like a month before I started there; it was Ask Jeeves, which is, I think, still how people actually remember it. So if you remember Ask Jeeves from your childhood, that became Ask.com. And I got that job because they were like, oh cool, this guy actually knows, you know, large-scale data. And then I joined them
and it was like, oh, your data is like 10 or 20 times bigger. I thought it was big, but it wasn't
big at all. And so that was fun. And that was, again, even bigger systems, writing our own SQL
parsers and reinventing our database, basically writing a database from scratch, not knowing what the hell we're doing at all.
But Ask was using Oracle at the time for data warehousing, and for the specific kind of data
warehousing tasks we were trying to cram into it,
It just was not scaling.
And we had to write our own sort of system that translated SQL into distributed queries
on top of Berkeley DB files, if you all remember Sleepycat.
And that was super fun.
That led to sort of wanting to really study that and understand distributed systems and
databases.
That led to a master's in very large information systems at CMU, which is what back then we
called big data.
And big data is what we now call AI.
And that led me to Twitter and so on.
By the way, working at Berkeley National Lab while you were at school, you said because of the full-time job, you had terrible grades. Between the two, what do you think helped you more in terms of your actual day job?
This is when you were at ask.com and beyond.
Gosh, that is a surprisingly tricky question.
I think in terms of helping me get the job,
it was having the LBNL background because it spoke to some ability to actually get shit done.
In terms of useful skills,
the stuff that echoed kind of showed up later from the job, but a lot of what
I learned in classes comes up even to this day. I loved the database class, didn't know I would
wind up working in database and distributed systems, but that came up in a big way a few
years later. I was really into the AI class, but that was all neural networks. So obviously,
nobody did anything with that for like over a decade.
And then suddenly, oh, I remember that.
Where's the textbook?
So I think the sort of gentleman's education in computer science,
you know, like Berkeley really makes you sample across
and understanding what I like, what I don't like,
where I have some sort of, you know,
some stuff that comes more easily to me than other things was super helpful.
And just the ability to like sit down and grind through CLR.
It was CLR back then, not CLRS.
Wait, I don't know what CLR is.
And I don't know how many of our listeners would know CLR.
So can you elaborate on what that stands for?
Oh, I'm looking for it, but it's currently serving as my monitor stand, which is conveniently buried under my laptop. But it's a classic algorithms textbook that's used at most US colleges.
I don't remember the C... it's Leiserson, Rivest. And, of course, I don't know the S.
Oh, this one. Yeah, yeah, yeah, you're talking about the algorithms book. Okay, now I remember, Introduction to Algorithms.
That's right, Introduction to Algorithms. It's a fantastic textbook, and it was like one of the hardest classes I took, and I just loved that particular class. I remember it's CS170 at UC Berkeley. This was 24 years ago and I remember the number. Yeah, just, I think, having the sort of skills that working through all of that builds, even if you don't remember the actual algorithms, right? I think just being able to read the text, understand approaches, and kind of get the general gist of it. I mean, that was fantastic, and definitely the thing that translated more and kind of for a longer time. I think the management lessons I learned more from Lawrence Berkeley Lab.
Oh, management lessons from Lawrence Berkeley Lab. Okay. Well,
can you elaborate on that, please? Because this is still early in your career.
Oh yeah. It wasn't, I wasn't managing anybody. It was sort of what I felt as a junior,
what kind of motivation worked, what kind of motivation didn't. I'm kind of a late bloomer,
like I'm now reasonably successful, but back then I was kind of not great. And a lot of that was me,
and some of that I think was my tech lead. Nobody ever taught them how to be a tech lead.
And so they made some
mistakes that, you know, then when I was in a similar position, I was like, I remember what not to do. But also where, like, words of encouragement and things like that really did help, and remembering that experience and trying to bring that forward.
And I assume this also kind of played a role in you later writing the book?
Yes, in some ways,
although that was maybe a fairly minor bit of motivation,
mostly the writing the book was because I spent a good 10 years sort of teaching
and explaining these concepts that weren't written down
and like having to remember, reinvent every time
and remember like, oh, right, you just graduated.
So you don't know this thing that like everybody
who's been around for a couple of years just innately understands.
But we have to explain it because nobody teaches you that
because you don't learn how to work with legacy code in college, right?
Like you don't know how to really work in a team
when it's not a team that only lasts for six weeks or whatever.
And, you know, it's the classic college team kind of projects.
You don't know why these processes are there, right? You don't know what it's like to write code that needs to survive past the grade, right, that needs to be maintainable. Like, why does it all look this way? There's just a bunch of that kind of stuff that I felt like, oh yeah, we keep teaching that over and over and over. And my co-author, Chris Riccomini, was also experiencing something along those lines.
And, you know, he tweeted something like, oh, you know, wouldn't it be great if there was a course
or a book or something? And I was like, I said the same thing a year and a half ago. We should like
just do that now. And so we got together and did it. That's pretty cool. Like before we move on
from that school topic, recently I've been having some conversations with a few folks whose kids were about to go to school, and they're wondering, hey, what major should we pick? And part of what you mentioned earlier, like you even remember the course number for your algorithms class, and you credit that education with contributing a lot to your career. With LLMs and tools like ChatGPT, Claude and whatnot,
A lot of the easy problems are becoming way easier and entry-level jobs are becoming much harder.
And this is a question, like I recently talked to folks, again, one person's in India, one person's
here in Canada, and all of these people are grappling with the same question. Like two years
ago, we were planning our kids would go to computer science
because our kids were interested in that.
At this point, looking at where the industry is going,
we are kind of questioning that choice and thought process.
Like, should we even do this or not?
What's your perspective on this?
Yeah, I think that's,
I expected my perspective will evolve
over the next couple of years.
That's a really hot topic these days.
Steve Yegge just recently posted a great blog post called, I think, The Death of the Junior Developer.
Oh, yeah, that's a really good one.
Yeah, it's quite long, but it's good.
And also, that topic has just been coming up.
Charity majors from Honeycomb recently posted
about the need to create those jobs and how, like, in the 90s, right, like, you just kind of were out of high school, became a sysadmin somewhere because you knew how to type.
And, like, that was your on-ramp.
And then, like, a lot of those people are, like, hugely successful in the industry now, right?
And, like, even before the AI thing happened, those on-ramps are gone, right? Like you have to know a lot more in order to get in. So
it becomes harder and harder to find those roles where somebody who is new and not experienced can
be successful and learn how to not be new, right? And with AI, it exacerbates the problem, right?
Because you don't need a junior to write like the little
bash scripts and the basic thing once you've outlined the problem. And outlining the problem
and explaining what it actually needs to do and how it fits into the context of everything,
that's what seniors do. Pretty much. You can't do that until you're a senior. How do you learn
to be senior when those jobs are gone, right? Yeah, it's a huge problem, I think. But to the point of, is it worth studying computer science?
I think just like learning math or something, it's more about teaching you how to think
and how things are connected than it is about doing the thing.
Like what I think is going to die.
And I see you're wearing the Insight t-shirt, which is maybe a little on the nose.
So Insight Data Science, but they're not called Insight Data Science anymore, right?
They renamed, but it was called Insight Data Science.
That's right.
That's how we met.
Oh, yeah?
Are you both Insight?
We both were at Insight.
That's how we got into data engineering.
And then I went the infrastructure engineering route and Guang worked up the stack.
Okay. Yeah. So, I mean, specialize in like essentially reforming like physicists into
data scientists, right? Like, or folks who acquire a lot of skills through studying hard sciences,
but like those skills maybe aren't as marketable, but are hugely useful.
And with just a little bit of polish and here are the tools and here's how you represent that,
you can turn that into a data science or data engineer position because you actually have
everything required. It's just that you need some terminology and a little bit of a finishing
school, which is what Insight is. I thought
it was a great program, by the way. Like I've hired multiple people out of that. The point here
is learning like how to think about large problems and break them down, learning how to work through
them, right? Like I think those skills are critical. CS is one way to get there. These days, I barely write any code,
but having a background in computer science
is hugely valuable for my job as a VP or manager
because I need to understand what's going on.
I need to understand how systems work together.
In fact, organizations are systems.
And if you understand distributed systems
and how they interact,
with a little bit of a transformation, you can start thinking about the organizations, how they interact and how to build that system.
So I think systematic thinking is still critical.
And you're not going to get that from like chat GPT, right?
Like you can have chat GPT do some routine work for you and incorporate that into a creative process.
But the actual thinking about what needs to happen and why it needs to happen, right?
You need to have your brain prepared to be able to do that.
And to me, CS as a curriculum is a really good one for that.
But like, you know, my major was actually EECS,
so electrical engineering, computer science.
I don't do any electrical engineering whatsoever.
I still don't entirely know
what a transistor does.
You know, like there's a bunch of stuff
where I'm like, you know,
like that's just somewhere inside the computer
and it doesn't matter.
But then like GPUs come out
and it's really helpful to have that background
to understand how they're actually different from CPUs, right?
Because, like, understanding pipelining, for example,
understanding other things, right?
Like, it comes in,
even if you're not designing the stupid chips.
Yeah, that's true.
So, Ronak is writing down theoretical physics PhD in the message to his friends, in terms of, uh, their kids. Yeah. For me, actually, this has been relevant to someone I was just talking to last week. They are like a few
years into their career as a software engineer. And then they're thinking about going back to
school for like a master's. So you did exactly that.
Like, what was the calculus like?
Like, I guess, first of all, like, do you think it was worth it?
Absolutely.
Yes.
Yeah.
So what was the calculus, I guess, going into it?
What made you decide?
So it was huge for my career. I think so.
First off, you know, I have a fairly pedigreed resume, but it really didn't
feel like that the first few years out of school.
So after the Lawrence Berkeley Lab job, I went to ask.com and I was there with a really
wonderful data engineering team that worked on sort of all the ETL and data warehousing
and an application on top of that for data analysis.
And that's where we sort of started building that distributed system kind of
that maybe in retrospect, we shouldn't have been allowed to build,
but we were doing it anyway.
And it was a lot of like stumbling around in the dark
and rediscovering from base principles, like how you build these things,
which was a great lot of fun.
And it was a team of four or so people, kind of depends on how you count. And two of us,
so it was me and my buddy, Pete Alvaro, which is a name you may recognize if you're in distributed
systems or databases. So, and he and I were building these things, most of him, and I was
sort of the junior assistant and getting really really excited and like having
this problem where Oracle was falling over. And it was Oracle RAC, a fairly high powered system, but the queries would just run forever. We realized that there are particular ways, not what they teach you in the textbook, that we were organizing our tables because they were so big, and we had to give Oracle enough hints to be able to process these summaries.
And then our prototype on 17 machines that we scrounged from the search team
because they moved on to bigger and better ones.
And we were like, oh, keep those. We'll take them.
We had to know which rack which one is because the interconnect on the same rack
is higher throughput than in between, so you have to be smart.
And we started putting it there, and like suddenly our thing is way faster doing it our way than it is on Oracle, and Oracle is obviously a giant. So that was a lot of fun. And this is like the 2007, 2008 era. Now, our management was kind of churning, you know, as Ask was changing out from the CEO on down. And they were always very appreciative of us
and very kind, but also strongly felt that the way for them to solve this problem that we're having
with data processing was to buy a bigger Oracle machine or to get more licenses for the latest,
whatever feature Oracle was trying to upsell them on,
or to change the storage and put it on a SAN and all of this kind of stuff
where it was sort of pay more for bigger, fancier hardware, kind of scale up versus scale out.
And at the same time, they had a consultant who was a professor, a CS professor from Santa Barbara named Divy Agrawal,
who really encouraged us to sort of keep experimenting with our stuff.
And eventually it got to a point like the Hadoop,
sorry, the MapReduce paper came out from Google.
We were super excited about that.
Hadoop wasn't open source yet.
And then the Bigtable, I think, paper came out.
And there was just like all the stuff coming out of Google.
And we were reading it.
And as we were reading, their ideas, like, mapped to what we were doing.
We're like, oh my God, we're not wrong.
Like even Google's doing like, check that out.
We started going to the technically public database group paper reading seminars at UC Berkeley. You know, I was still living in Berkeley, and Ask.com was headquartered in Oakland.
So we'd like take extra long lunches and go to the Berkeley campus and sit in on their
paper discussions with like a bunch of grad students talking about database papers.
And first off, it was hugely intellectually stimulating.
Second, it was sort of slowly dawning on both of us that we could hang.
Like we thought we were total, you know, imposters.
And we're like, wait, I understand what they're talking about.
Wait, I have opinions about what they're talking about.
Like, I'm actually like, I kind of, I'm kind of grokking this.
Like, this is cool.
And Divi was encouraging us.
So one thing led to another and both Pete and I were like,
we're not going to get to do the things we want to do at Ask because what we want to do is build
distributed systems. The Hadoop MapReduce style world is coming and people at our job weren't
listening to us yet about that. They wanted to buy bigger SANs. And so we were like,
good luck with the
Bigger Sans. We're going to do the distributed system and database thing. And so we both applied
to grad schools. Pete wound up going to Berkeley and being an intellectually legitimate part of that paper reading club. And he was Joe Hellerstein's grad student and is now a professor at...?
It's Santa Cruz, I think.
Santa Cruz, yeah.
And I wasn't man enough for a PhD,
so I went into a master's program at CMU, so I went there. So that was kind of the story there.
It wasn't really a calculus of like, this will make me more, you know, hireable. It was
just, I was really enjoying the thing. I saw that this is where, I had this feeling that the problem
Ask was experiencing with too much data was a problem other companies were going to experience
soon. And Ask was just a little bit ahead of the curve because of the nature of being like a top,
you know, one of the top hundred like web destinations at the time and the amount of traffic that was hitting it.
I was like, we're seeing a preview.
This is going to come.
And I want to have the time to really get into it and like read the papers, talk to other people who are reading the papers, like just have that Berkeley seminar, but all the time for a year.
And that's what I did.
And Carnegie Mellon was great.
It was a really interesting program because it sort of allowed you to pick and choose,
you know, if you wanted to go into ML, if you wanted to go into distributed systems,
you wanted databases, language technologies, like language understanding and stuff like that.
So, and a lot of kind of time for yourself. So I mainly went because, yeah, I wanted some time to have a good excuse to just read the papers and implement stuff and talk to people.
And at the end of that, you know, I got into open source because at that point, Hadoop and Pig and various other things that are now like half of all of Apache Foundation projects
were starting to get open sourced. And so I made connections with the community, because that was like my master's thesis project. That led to interest from Facebook, Cloudera, Twitter, who were all also in on that technology, right? And so, yeah, that kind of carried me along.
Nice, nice. Was not doing a PhD intentional? Like, the "not made for a PhD" part?
Like, let's break that down.
Is that a commitment or like you don't want to go into academia?
Like, what was it?
I didn't know that there was that much thought put into it, really.
Like, I think maybe it was the time commitment.
There's a balance of sort of how much you want to be an expert on a thing
versus like have a deep enough understanding of a thing,
but like have time for other things.
So I think maybe there was part that,
maybe it was part sort of, yeah,
how much time do I want to spend in academia versus like going back to
industry.
Obviously, there's a financial component involved there.
So it worked out.
But yeah, also, I know a lot of CS PhDs.
And man, it's fun to be able to do that, I bet.
Really work a problem for six years or something.
Some really cool stuff comes out of that.
So this joining the the or unofficially
being part of the berkeley paper seminars that you mentioned where you were just part of that
group reading papers with the folks or the students at berkeley how did you find them and
how did you become part of that group?
I mean, it was posted on the website. I don't know if they're still doing it, and I certainly don't want to send, like, hordes of people.
But like, I went to Berkeley.
So like, I knew where the room is, you know?
And yeah, they were like, they had stuff posted on the website.
They were like, here it is.
It's open to the public.
And so we did make sure not to eat the pizza until all the students had pizza. We figured that'd be a step too far.
And we like step back. I see.
So one thing you mentioned earlier
on the topic of, like, big data, for example, so, the thing that we used to call big data is now AI.
Say more.
Why do you say that?
Well, I was, that was a joke.
I understand.
I used to say very large information systems
is what we now call big data, but now big data is not sexy anymore, right? Nobody cares about big data, but what they do care about is your token count and the number of parameters in your machine learning model. But it is the case that a huge amount of the work in building and training LLMs or RAG systems or any of that stuff that is sort
of right now a big topic of conversation is where do you get the data?
How do you store it?
How do you access it?
How do you ensure quality?
Right?
It's all the same problems.
The solutions may be a bit different.
Like the requirements are different.
Well, exactly how you store it.
Actually, most of the data is stored in Parquet anyway.
But, you know, like various data set formats and all of that stuff, that's different.
But the fundamental problems of like data quality, data collection, data freshness,
data versioning, and doing that at scale, those are the same, right?
Like some things change, right?
Like now nobody uses HDFS anymore, or at least not if they're starting from scratch.
You're going to use S3 or something and build on top of that.
Streaming has come out and how to incorporate that.
A bunch of things have happened since, but the fundamentals of data engineering are critical
to having good AI systems.
So it all feeds one into another.
Is this a solved problem at this point?
The reason I ask is,
when I say solved problem,
what I mean to say is
there is a recipe for how it should be done largely,
and then it's a matter of applying that
to a specific domain and getting engineering right.
And the reason I say that is,
if I rewind the clock and go back to 2015,
around the time when I first got into data engineering, it used to be like, well, there are just so many jobs in data engineering and not many companies have a good ETL pipeline.
For example, many of these projects were new and you kept seeing new systems become part of Apache Software Foundation and one replaced the other.
You don't see any of that chatter or not enough of that chatter. Maybe I'm
just not following that as much. But in your perspective, has this specific domain of ETL
matured enough where it's become more of like apply the same principles but to a specific domain?
I think some. I think that it's like so many problems. Once you solve a number of them,
you climb that hill and you realize there's another hill
behind that hill that's even bigger
and you didn't see it before
because you were busy climbing the hill.
You know?
So yeah, like, well, first off,
orchestrating your ETL, right?
Like, sure, like that was solved in 2015
with Airflow
or maybe in 2013 with Luigi
or maybe with like whatever came before that,
right?
Like they were all solved, but it turns out you can keep inventing better mousetraps and now we have flight and we have
prefect and we have we have a number of these and we have temporal which is like a whole other
like way to to build these things so Daxter I should always mention Daxter because they're
really cool for some reason I didn't remember to mention them so first off like as we evolve and
as like the system of tools that we're connecting together changes like that has an effect backwards
and assist these changes but also you know despite having been in data engineering since like
whatever 2000 early people would talk about data versioning and I
would be like, I don't get it. And they'd say it's Git for data. And I'm like, I know all of those
words, use Git all the time, work with data all the time, still don't get it. And then you talk
to data versioning people and it's actually four different things, what they actually mean by data
versioning. And it's only when I got to running the AI enablement team at Ginkgo Bioworks that I
finally got why I actually need data versioning.
Because it's so important to know for a given model, what the version of the
dataset was that you tested it on, or that you trained it on, and for reproducibility, you need to have the raw source data when you're interacting
with the LLM and it's giving you weird answers.
And you need to be like, wait, was the data X included or not included in this?
And historically, what we've had is data lineage.
So you can say these datasets fed into this aggregation or whatever, and that gives you
the data lineage.
But you really want to trace back to like,
was the, not just this data source was included,
but like which version of that data source,
like I need the specific tag of which specific datums were in there, right?
And that is coming up in a huge way now.
And so I think that will also change
how people think about things like Iceberg,
which by the way, for a soft problem, that's a hell of an acquisition for DataWorks. Oh yeah.
That already had basically the same thing. So I think there's a lot to be done for
data engineering. As different systems evolve, as what our expectations are change. You know, like block storage was basically done
and then people started like working off of plain S3.
Then like Kafka was basically a solved problem.
And now like it's being completely re-architected
within Confluent because storage technology has changed.
Then like how people use storage technology has changed.
So I think it just keeps, it keeps moving. Good luck to any LLM writing one of those.
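(As a minimal sketch of the data versioning idea Dmitriy describes above, here is what tying a trained model to the exact dataset versions it saw might look like. The manifest format, class names, and fields are hypothetical illustrations, not any specific tool mentioned in the episode; the point is just being able to answer "was dataset X, and which version of it, included in training?")

# Minimal sketch: tie a trained model to the exact dataset versions it saw.
# All names here are hypothetical, not from any specific tool mentioned in the episode.
import hashlib
import json
from dataclasses import dataclass, asdict


def content_hash(records: list[dict]) -> str:
    """Hash the raw records so a dataset version is identified by its contents."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


@dataclass
class DatasetVersion:
    name: str
    version_tag: str   # e.g. a tag or timestamp for the snapshot
    digest: str        # content hash of the exact records used


@dataclass
class ModelRecord:
    model_id: str
    trained_on: list[DatasetVersion]  # every dataset (and version) that fed training

    def to_manifest(self) -> str:
        """Serialize the lineage so later debugging can check what the model saw."""
        return json.dumps(asdict(self), indent=2)


# Usage: when the model gives weird answers later, the manifest answers
# "which version of which dataset was actually included?"
raw = [{"sequence": "ATGC", "label": 1}, {"sequence": "GGCA", "label": 0}]
ds = DatasetVersion(name="assay_results", version_tag="2024-07-01", digest=content_hash(raw))
model = ModelRecord(model_id="protein-lm-v3", trained_on=[ds])
print(model.to_manifest())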
So you worked at Twitter after your graduate school, and then eventually you moved to biotech.
That's a jump that's not very commonly seen. So what prompted you to go from consumer tech to biotech?
Well, first off, it's more common than you'd think, although more often it's into health tech than, like, pure bio.
So for me, that was a couple of years after the IPO.
You know, I grew up with the company.
Like I joined Twitter when it was like 100 odd people.
And it was in the thousands by the time I left.
And six years, so like a long time also.
I was looking to do something new, something
different. And I was struggling a bit at the time with the whole ad-supported, like, what is actually our product dilemma, right? The users aren't the people that you're getting
your money from. There's kind of a three-sided thing going on with advertisers and the company
and the users. And I had this desire for something simpler, right?
Also, I wanted something that would be impacting kind of the physical world.
So I started investigating drones and self-driving cars and this and that.
And then my friend from when I briefly worked at Cloudera, Aaron Kimball, wanted to grab
a beer.
And he started telling me about this company that he joined, that he was CTO of and it was pretty small.
And that was Zymergen.
And he explained to me what it did.
And I had to have him explain it to me like three times because it just sounded like complete science fiction. He was like, okay, so we have software and machine learning that takes a DNA sequence and decides what changes to make to the DNA sequence to do something. And then we tell a bunch of robots
to actually make that DNA sequence and make that change in an organism. And then we grow that
organism. And then the organism makes for us as a tiny little factory, some new molecule that
hadn't existed before. And then that molecule is really useful in a bunch of industrial uses,
or agriculture, or pharma, or whatever. And then we look at how that little factory works and say,
we can optimize that. And we use the ML to figure out how to optimize it in the software. And so you have machine learning driving robots changing the DNA of microbes that make new chemicals, right? Like, every one of those steps is magic. And we can do... what? Like, I even had a background in this, right? Like, I was around when the Human Genome Project was happening. You know, I had a front row seat to that. And still, as he was telling me what's happening,
I was like, that's impossible. What are you talking about? You can't have robots changing
DNA of organisms. You can, as it turns out. So it just sounded so nuts and so futuristic,
sci-fi and cool. I was looking at all these companies that were doing cool things where AI and ML and
data could make an impact. I just kept, my brain kept coming back to that company. I was just like,
that would be, that could be, you know, the 21st century is a century of biology. Like,
what an awesome dream. Like it just really grabbed me, the vision. So I joined Zymergen
and became VP of Software there.
Yeah. And then it was eight years of SynBio. So one question, this is mostly to understand
some of the aspects that you said. So robots mutating some aspects of organisms. Can you
elaborate a little more on what that really means? Because yes, it does sound... Slow down a little
bit. Maybe give an example, because that does sound like science fiction.
Yeah.
And this was 2015, right?
Like, okay.
So to make it a little bit more comprehensible, a single cell organism is a tiny factory that
has a bunch of chemical reactions going on inside it.
And those chemical reactions create new molecules, new enzymes.
Like, they produce a bunch of small molecules. This is just part of the process of being alive. They convert sugar into other stuff. Basically, that's what cells do, to be grossly reductive. We can sequence the DNA of the organism. We can figure out the function for
a decent chunk of the genes that we discover in the DNA. We can look at the regulatory element of that,
and we can say like, hey, you know,
this reaction that's catalyzed by an enzyme
that is coded for in the DNA is really useful.
We want that reaction to happen more
because the cell makes something that we want to collect, right?
Like, I don't know, vitamins or pharmaceutical enzymes or like
antibodies or what have you, right? In order to do that, we can change other DNA,
like other regulatory sequences that essentially are switches that tell the organism like how much
of a certain reaction to do or how much of a piece of DNA to
do transcription on. DNA turns into RNA. The more of that you do, the more of the RNA you have,
the more RNA you have, the more of the protein you're going to make, et cetera, et cetera.
But it's kind of like trying to optimize a binary that was like compiled from some C code,
except you don't have the C code, you just
have the ASM.
And your problem isn't even to debug it or to understand what it does, it's to make it
go 15% faster.
By the way, you don't actually know C. And you don't have the ASM manual either.
So that's the problem.
You have the code, but it's like the bytecode.
And it codes for something.
There's a system down there somewhere.
But like the only way to really understand what's going on is to keep poking at it in the physical world.
Change it a little thing, see if it blows up.
Change another thing, see if it blows up.
And so robots come in because you can do this in high throughput.
Relatively high throughput, that is. High throughput
for biology, not high throughput for like web services people.
What does high throughput mean in the physical, real world? What does that translate to in numbers?
Depends on what you're doing, but the standard unit is a 96 well plate.
So 96 concurrent experiments, actually a few smaller because like you want to replicate
them. So divide by
two and also some of them are controls, but on that order. And then you can run a few plates
at a time. So maybe you run a dozen plates at a time. And that experiment might take you multiple
weeks to get the results back. And so you use robotics to do that because you're not going to
hand pipette all those things. There's actually a bit of a combinatorial explosion happening there too.
But you can do pooled experiments.
People who actually know SynBio will think that this was a very gross explanation, but I'm simplifying things. All right, guys?
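(To make the throughput numbers above concrete, here is a back-of-envelope sketch in Python. The replicate, control, plate, and turnaround figures are illustrative assumptions based on the rough numbers given in the conversation, not Ginkgo's or Zymergen's actual process.)

# Back-of-envelope throughput for plate-based experiments, using the rough
# numbers from the conversation. All of these figures are illustrative assumptions.
WELLS_PER_PLATE = 96
REPLICATES = 2        # each design run in duplicate, so divide by two
CONTROL_WELLS = 8     # assumed number of control wells per plate
PLATES_PER_RUN = 12   # "maybe you run a dozen plates at a time"
WEEKS_PER_RUN = 3     # "might take you multiple weeks" to get results back

designs_per_plate = (WELLS_PER_PLATE - CONTROL_WELLS) // REPLICATES
designs_per_run = designs_per_plate * PLATES_PER_RUN
designs_per_year = designs_per_run * (52 // WEEKS_PER_RUN)

print(f"{designs_per_plate} designs per plate")    # 44
print(f"{designs_per_run} designs per run")        # 528
print(f"~{designs_per_year} designs per year")     # ~9,000
# Compare with a web A/B test, where a single day can yield millions of impressions:
# that gap of several orders of magnitude is why each biological data point is so expensive.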
Thanks for the simplified explanation for all the software engineers out there who
might not be as familiar with the SynBio side of things. I'm one of them.
So you have robotic automation, sorry.
You have lab automation, which is like robots that move these arms
and they have little pipettes on them
and they can move all the little liquids around.
But biology is basically the science of moving clear liquids around
and keeping track of which one's which.
And then some of them sometimes turn color and that's very exciting.
My parents are both in medical research.
Okay, well, ask them.
That's probably how they described it to me.
See if they agree. So you use robotic automation. So then you need software to drive the robot.
Like old school bio was done by hand.
Like you pipette things by hand, but you can't do it in high throughput.
So you get a robot, but now you need to tell the robot exactly what to do.
And then you're trying to do too much for the robot.
And so you need to start like optimizing things and being like, you know,
it would be cool.
So much of our time is spent on like setting up the experiment.
And then the robot is fast, but the setup was slow.
Like it'd be cool if we could set up many things at once and like somehow multi-thread the robot and then like
what we actually need is like multiple robots working together and handing things off to each
other, and like, how do we coordinate that whole system? So everything becomes workload management, right? Like, it's a gigantic DAG in the sky. So we needed to do all that. And there's a lot of scientific instruments that read the results of all that stuff, and then you need to analyze it, and then you need to figure out the correlations between the DNA changes you introduced and the results, and like, what will be the next DNA change that you do. And these days there's all kinds of cool
systems that allow you to do generative modeling of proteins.
And then given that protein that you model, get a DNA sequence that would result in that protein.
That's one to many.
There are multiple sequences that might result in the same protein.
Test them all out by printing the DNA, shoving it into a microorganism, getting it to express,
seeing what happens, and so on and so forth.
It's a very, it's actually a very high tech and very data intensive process.
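(A minimal sketch of the one-to-many protein-to-DNA relationship mentioned above: because most amino acids are encoded by several codons, a single protein reverse-translates into many candidate DNA sequences. The tiny codon table, the example peptide, and the enumeration limit are deliberately illustrative choices for this sketch, not anything from Ginkgo's tooling.)

# Minimal sketch of protein -> DNA "reverse translation": one protein, many possible
# DNA sequences, because most amino acids are encoded by several codons.
# The codon table below is a small illustrative subset, not the full 64-codon table.
from itertools import product

CODONS = {
    "M": ["ATG"],                                     # methionine (start), one codon
    "K": ["AAA", "AAG"],                              # lysine, two codons
    "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],  # leucine, six codons
    "*": ["TAA", "TAG", "TGA"],                       # stop, three codons
}


def reverse_translations(peptide: str, limit: int = 5) -> list[str]:
    """Enumerate up to `limit` DNA sequences that encode the given peptide."""
    choices = [CODONS[aa] for aa in peptide]
    results = []
    for combo in product(*choices):
        results.append("".join(combo))
        if len(results) == limit:
            break
    return results


# Even the short peptide "MKL*" has 1 * 2 * 6 * 3 = 36 possible encodings,
# which is the one-to-many relationship described in the conversation.
for dna in reverse_translations("MKL*"):
    print(dna)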
Like when you say like the iterations are so slow, what comes to mind is that
cartoon of like the two devs, like doing sword fighting and then saying like, oh,
my code is compiling, like when their managers are asking them.
But now you're like sword fighting for like days.
I guess that must have been kind of a surprise going into it,
just how long the iteration cycles are.
Like how did you guys like kind of work with that?
Yeah, I mean, that's coming from like,
so I built the product experimentation platform at Twitter, where, like, you know, my team's tool is what collected the data on all the A/B test experiments that Twitter ran and, like, provided all kinds of analysis on that side.
And, yeah, the amount of data points you can collect on Twitter by, like, turning that thing on versus the amount of data points
you collect from a biotech experiment in the lab.
I mean, it's orders of magnitude, different in terms of number of data points and the
amount of time inversely, right?
And that's really difficult.
It means that every data point is massively more expensive in the bio world than it is
in the sort of internet world.
But you can just get a million more impressions in no time on the web. You cannot get a million
more experiments, at least not with high fidelity and a lot of control. You can do random mutagenesis
and you'll get all kinds of things. But then most of the time,
you don't know what most of them were,
except the ones that were the winners.
So you lose some information.
So there was a lot of focus on
where can we take the slack out?
So we're only limited by biology
and not by like,
well, this team starts their runs on Mondays
and this other team starts their runs on Mondays and this other team starts their runs on
Wednesdays. So, you know, you need to make sure to have the handoff from the Monday team to the
Wednesday team happen on a Tuesday, because if they're late by one day, like you have to wait a
whole week. How do we like remove these strong couplings? Or again, it's like it becomes
optimizing systems and just figure like
sometimes the problem isn't even the biology. It's like, did we organize ourselves in a way
that will be conducive to rapid experimentation? Why are these even different teams? Can the same
team do this, right? Like, can we just get rid of the handoff? Especially as you scale, like what
you want to keep an eye on is we know there is a fundamental limitation,
which is the speed of biology, like how fast the organism grows or whatever.
How much longer than that are we right now per cycle?
Where is that coming from?
And then driving that down.
Assuming speed is your problem, sometimes you will pay some extra time to be able to get higher throughput
because you're okay with just getting more data points
at the cost of time versus being able to iterate.
So a couple aspects I can think which would be different
from just consumer tech or internet web companies
and biotech, for example, like you mentioned,
speed is something
that's different every data point is way more expensive and way more valuable from that
perspective i would imagine precision would be another part which would be way more valuable
in biotech as opposed to like web companies margin of error and like just the cost of mistakes what
are some of the things that are done differently when
it comes to software when you look at biotech versus consumer tech? And I'll add some context to this question too. We were having a conversation on a previous podcast where I was talking to Matt Klein, who built Envoy, and he was saying that today, because the speed of software iteration is so high, the cost of mistakes is much lower.
So sometimes the incentive to get things right in the first place is a little lower because
you know you can fix it in the next PR or you can roll it out very quickly.
Whereas for biotech, that's not true.
Like cost of mistakes can be brutally high.
So what are some things that are done differently from a software perspective then?
So yeah, one thing is that you're mostly writing internal tools.
So all of that idea of like A-B testing that goes out the window, right?
You can't expose 5% of the scientists to this thing that might or might not work, right?
Like, no, that is not how you ship that kind of software.
So you still want fast iteration cycles,
but you also want to make sure that when things are ready,
they're really ready.
You'd think that leads to some sort of like crazy waterfall situation.
What works much better is just a very tight coupling between the customer and the software engineer.
Also software engineers frequently just don't have the domain knowledge, right?
It's not like I can kind of, I can use Twitter, right?
Like I can test the thing myself, maybe not as well as somebody who does it professionally, but like, I get it.
Or like ads, like I kind of get it, you know, I can learn it.
You're not going to learn like that.
Even if you learned a lot of biology, it's just so wide that the particular thing
you're dealing with is probably not a thing you've learned already.
Like every month it's some new thing.
So very tight collaboration and acceptance testing, you know, which is like a thing people
used to do in the nineties.
It's a real thing when you're shipping for internal customers and it can just like not
work or work incorrectly.
There are some practices, like make sure you can redo everything.
So like save data every step of the way, like that sort of thing so that nothing is lost.
And, you know, sometimes things get lost and it's a problem.
But developing that tight relationship with the customer, like the worst is when the customer
team sort of doesn't have time for whatever new software is
getting developed.
So you don't get feedback from them.
So the first time they're exposed to it is when like you've changed how the thing works
and that's when they discover there's a problem, right?
Because like they're super dissatisfied.
You spent like six months working on this thing.
It turns out they didn't need it in the first place or whatever. So we put in a bunch of processes like weekly check-ins slash demos, having our scientists walk us through
what they're doing, being very open to kind of changing course as we discover and learn new
things from these interactions so that by the time that sort of the feature is
complete, it's co-developed with the scientists, not sort of developed for the scientists.
So one thing which makes, or rather I would say, a lot of engineers are attracted to what
Guang was referring to earlier as tech tech or like consumer tech, for example, where software
is the product. When you look at companies like Zymergen, for example, software is a means of getting to the product; it's not the end in itself. Yeah, in that case you're always optimizing for what that product needs to do, and software is a way to get there, but it's not something that you highlight very often. So when it comes to attracting talent, is that something you saw as different, or were there challenges in attracting talent for the company considering this difference?
Yeah. So first off, yeah, that's a real thing, right? Like the software is not a product,
something else is a product. You're sort of a cost center, right? Like software is also very
expensive. It is also the case that, on average, software engineers get paid more than scientists do. So you have this really bizarre situation where the science is the thing the company does.
And scientists are like, why is my support team making all this money and I'm not making all this money?
What's going on?
This is true for biotech companies too?
Oh, wow.
I didn't know that.
Hopefully I didn't just create problems for a bunch of companies because scientists listen to those podcasts and go wait what
no i mean i think it's like a supply and demand right just because traditionally the
there's just been a lot more training for more scientists like i mean yeah and and the fact that
like software engineers have options outside of biotech right yeah it's the other companies which
are setting that benchmark not necessarily biotech companies and. Yeah. It's the other companies which are setting that benchmark, not necessarily biotech companies.
And you still want good software engineers to work at these companies.
Exactly.
Yeah.
And we were able to attract people who had experience at Google, at Twitter, in my case,
all kinds of flagship places.
And it's mostly, there's some baseline amount of comp that you have to offer
where people don't feel like they're sacrificing their quality of life or not providing for their
family or something like that. And that number is different for everybody where they feel like
this is an acceptable number. It maybe is not the number of their dreams, but it's an acceptable
number. And then it's, but what do you want to work on? And some people are just really interested in synthetic biology as a
problem. They're really into the vision of 21st century being the century of bioengineering and
the way that the 20th century was the era of petroleum, essentially. Right? There's a motivation to get into and have impact on healthcare, on sort of green technologies,
all that.
Or just like, they're just really into the domain, like really into the problem.
It's just really interesting.
And it is really interesting.
You know, doing like AI for bio, I've heard a number of people say, when I'm excited to talk about some latest model from Anthropic, the latest Claude model or whatever.
And they're like, yeah, but LLMs for languages are boring.
LLMs for proteins, that's cool.
Having an LLM that can spit out a protein for you,
that's awesome.
That's what I want to work on.
Everybody else can work on the whole
write an essay in the voice of a pirate.
I'm just not into it.
It's not cool.
Proteins are cool.
So it's mission and problem motivated,
as long as there's sort of a baseline of we respect your skills.
We don't get a chance to talk to many people with the right background. So having done both software and biotech, I'm very curious, what's your take on AlphaFold 3?
I heard it on All In podcast,
and they were like, oh, this shit's going to change everything,
going to make so much money for Google.
Ah, yeah, yeah, old news. ESM3 came out two days ago, that's what we should be talking about now.
Apologies, apologies for my ignorance.
No, no, I'm joking. So first off, AlphaFold, and AlphaFold 2 in particular, I mean, this was a major moment for biology, right?
It didn't like solve protein folding, but it was a massive leap forward.
The way sort of like, kind of like the AlexNet moment in AI world where like the game changed, right?
What was impossible before became like fairly straightforward now and you
don't have to be an expert. And immediately it was like, well, yes, that problem was solved,
but that's not actually the problem we care about. What we actually care about is designing a drug
that's going to be only expressed in the liver and is going to pass all of the phase three trials
and whatnot
and not have any side effects or unpredictable side effects, et cetera.
And, like, can AlphaFold give me that?
And the answer is no, it can't.
It tells you how the thing folds, right?
But as a result, it was the hill in front of us.
And then there are many, many hills behind that, right?
So, like, AlphaFold 3 is a fantastic improvement. ESM3 is also a great improvement.
All of these models are getting better
and the field is moving forward really fast.
And we are so far from solving biology,
it is hard even to describe how far we are from solving biology.
Like this sort of sci-fi dream where we wave a scanner in front of you, then print the medicine and just stick it in? We're just nowhere near that.
And it's not a like more tokens problem.
It's not like, well, if we just like sequence more DNA and shove it in, like it's just going
to pop out.
It's not that.
Remember, the problem is you get a bunch of ones and zeros, and you need to make
the program do three more things and go faster, and also run on a different architecture.
That is roughly equivalent to the kinds of things we're trying to do with biology.
Can I relate it to autonomous vehicles? At the start, people were making so much progress that they were doing projections, super-linear or linear projections, like, oh yeah, we're going to have self-driving by 2020, I remember exactly. And then what they realized is, oh shit, the last five percent or three percent is substantially harder than all the stuff previous to that. Is that similar to what you're seeing in bio?
I think it's a slightly different problem.
The self-driving problem is at least we're able to evaluate whether the thing works.
And we kind of understand the problem.
We've all driven a car, right? We kind of understand what's involved, even if maybe we don't understand all the complexities.
Like we didn't account for whatever fog or pedestrians doing crazy things or balloons floating in front of windshields or like
other things that, you know, you start hitting all the edge cases.
Here, it's more like, we don't understand how the cell works. We don't understand why
the things that happen in biology happen.
We understand some of them, and there is a lot more that we just have no clue about.
So we can't train a model that will either generate or even predict the effects of things
where we do not know the mechanism. We know what will happen when you turn the wheel on a car. We do not know
what will happen when you change a bunch of DNA most of the time. I mean, we kind of do.
Most likely, either nothing will happen or the cell will die. If you just start randomly changing
things, those are the two outcomes. But if you're trying to actually achieve something,
you just don't know what will happen.
So in this case, given these generative models, and especially the part that you mentioned earlier around using LLMs to generate new proteins and finding some ways to test and
verify how many of them are even viable and actually good, would you say that this helps
speed up the process to an extent, but doesn't solve the problem entirely?
It speeds up the design process a lot. So where these things are being applied now is just like
the areas where we have more data and more knowledge, right? There's a bunch of stuff
where we just don't know how stuff works. And it's very hard to develop proteins with certain
characteristics. The function might be right, but it might not be powerful enough, or maybe it will be too promiscuous, or maybe it does the right thing but it does it in the brain as well as in the liver, and that means a death sentence. The line between poison and drug is sometimes very thin. So it definitely helps with the design
and a wider diversity of designs because you're not limited by sort of human intuition and
imagination. You can explore a wider space, but then we're still throughput limited in terms of
synthesis. So like these things that are designed, it's only a hypothesis that they will do a thing,
right? And like you actually build them in the lab, test them in the lab, get a bunch of
data back, maybe do a few cycles of this.
And now you have some preliminary results of like, this has promise.
And then you have, in the case of drugs, your 10-year-long pipeline of further refining it, figuring out how to determine whether or not it really works, what the side effects could be, experiments in mammals,
if everything goes great, eventually recruiting human subjects,
and so on and so forth.
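To make the design-build-test cycle just described a bit more concrete, here is a toy sketch with the synthesis bottleneck made explicit: the model can propose designs cheaply, but only a small batch can actually be built and assayed per cycle. Every function, name, and number here is a hypothetical placeholder, not anyone's real pipeline or a real library's API.

```python
# Toy design-build-test loop: cheap in-silico design, expensive lab throughput.
import random

def propose_designs(n: int) -> list[str]:
    """Placeholder for model-generated candidate designs (cheap to produce in bulk)."""
    return [f"design_{i}" for i in range(n)]

def synthesize_and_assay(designs: list[str]) -> dict[str, float]:
    """Placeholder for building candidates in the lab and measuring their activity."""
    return {d: random.random() for d in designs}

SYNTHESIS_BATCH = 96  # illustrative per-cycle throughput limit

promising: dict[str, float] = {}
for cycle in range(3):                               # "maybe do a few cycles of this"
    candidates = propose_designs(10_000)             # the design space is cheap to explore...
    batch = random.sample(candidates, SYNTHESIS_BATCH)  # ...but only a small batch can be built
    results = synthesize_and_assay(batch)
    promising.update({d: s for d, s in results.items() if s > 0.9})

print(f"{len(promising)} designs look promising enough for further refinement")
```

Anything that survives this loop is still only "preliminary results with promise"; the long drug-development pipeline described above starts after it.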
Designing a drug is also different than designing a protein, right?
Like actually how you deliver it, how you scale production, all that stuff.
The part where an LLM designed the protein that folded the right way is the very beginning of a very long process.
So there's still more work to be done. So this is me asking a new question, and pardon me if this sounds dumb. A lot of what you described, if we had to map it back to the earlier explanation that you shared about Zymergen, for example, where we are trying to make some changes in a microorganism, seeing the kind of chemicals it produces, collecting data along the way, and seeing whether whatever is produced is useful or not from a protein perspective. If we had to contrast the two cases, where one is there's no LLM and someone has to design a new protein, versus now there is an LLM. So the LLM is helping you design, but someone still needs to prompt it in a certain direction. So can you maybe differentiate what these two ways of designing a protein look like?
Okay. I'll add some more context. The reason I'm asking this question is because in my head,
I don't have a very simple explanation that I can explain it to someone else, where when we say design a protein,
to me, it's like, I have no idea what that means. So how is an LLM able to come up with stuff that
is right? So maybe you want like a before and after, like without an LLM, how do you design
a protein? Yeah. Which is different than the whole Zymergen thing; that was even different, because that was more about: we know how to make a thing that we like, we want to make more of it, or we want to make it in a different organism.
Like there's a rare plant that produces some pheromone that is very useful for whatever
reason.
We get those genes and stick them into a microorganism that we can grow in a vat and have it make
that without having to harvest
a rare exotic plant. Leaving that aside, designing proteins before and after. First off, the before
methods are still very actively used. And it's a scientific process, right? You have some
hypothesis about the mechanism of action. You have some literature that suggests that a certain thing can be possible, and you know that certain changes to a protein can affect it in certain ways.
So you maybe go through a workflow where you change the protein sequence, you run it through
AlphaFold to see how it will fold. And then you literally visualize the folded protein,
which is like a 3D structure, and it will have a pocket.
And you want to see if that pocket will bind to the molecule
you're trying to bind to, for example,
if that's like the kind of thing you want your protein to do.
And so you go through that and you kind of have ways
that you know generally work and you try a bunch of them.
And it's sort of like directed search guided by human experts.
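As a rough illustration of that expert-guided, directed search, here is a minimal sketch of a mutate-fold-score loop. The structure-prediction and binding-score functions are hypothetical stubs standing in for an AlphaFold-style fold prediction and a pocket/docking analysis; none of the calls here is a real API, and a human expert would choose mutations far more deliberately than the random proposal below.

```python
# Minimal sketch of directed search over protein variants: mutate, fold, score, keep the best.
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mutate(sequence: str) -> str:
    """Propose a single-point mutation; an expert would pick positions much more carefully."""
    pos = random.randrange(len(sequence))
    return sequence[:pos] + random.choice(AMINO_ACIDS) + sequence[pos + 1:]

def predict_structure(sequence: str) -> dict:
    """Placeholder for a structure-prediction step (an AlphaFold-style model in the real workflow)."""
    return {"sequence": sequence}  # a real run would return a 3D structure

def score_binding(structure: dict) -> float:
    """Placeholder for inspecting the predicted pocket and scoring binding to the target molecule."""
    return random.random()  # a real score would come from docking / pocket analysis

def directed_search(start: str, rounds: int = 10) -> str:
    """Keep whichever variant scores best so far, cycling through propose-fold-evaluate."""
    best_seq = start
    best_score = score_binding(predict_structure(best_seq))
    for _ in range(rounds):
        candidate = mutate(best_seq)
        score = score_binding(predict_structure(candidate))
        if score > best_score:
            best_seq, best_score = candidate, score
    return best_seq

print(directed_search("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
```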
And if I sound hand wavy about it, it's because like I don't have a PhD in protein engineering.
And in fact, there are different kinds of protein engineering and like the different
PhDs will do it differently, right? And it's a very deep subject, right? With the LLM,
it's getting to the point where you can say: I want to have this backbone, and I want to make it have higher thermostability, or something like that.
So essentially, it's conditioning your generation.
If you think not LLMs, but diffusion models, you can do conditional generation.
And so that's getting easier where you can say like, I have a protein like
this and I want to have a family of similar proteins because they have similar functions,
but I want to adjust in certain ways, right? This will generate candidates for you. And then
you can go backwards from the shape that you generated to a sequence. And then you can
synthesize a sequence, actually build the protein, confirm
the shape, confirm the function, and so on.
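A hedged sketch of that conditional-generation flow might look like the following: generate candidate structures conditioned on a desired property, go backwards from each shape to a sequence, and treat the results as hypotheses to hand off to the lab. All of the function names and data here are illustrative placeholders, not real models or tools.

```python
# Sketch: conditional generation of backbones, then inverse folding back to sequences.
from dataclasses import dataclass

@dataclass
class Candidate:
    backbone: str   # stand-in for a generated 3D backbone
    sequence: str   # amino-acid sequence recovered from that backbone

def generate_backbones(reference: str, condition: str, n: int) -> list[str]:
    """Placeholder for a conditional generative model (e.g. diffusion over structures)."""
    return [f"{reference} | {condition} | variant {i}" for i in range(n)]

def backbone_to_sequence(backbone: str) -> str:
    """Placeholder for going backwards from a shape to a sequence (inverse folding)."""
    return "M" + "A" * 20  # dummy sequence

def design_candidates(reference_protein: str, condition: str, n: int = 8) -> list[Candidate]:
    backbones = generate_backbones(reference_protein, condition, n)
    return [Candidate(b, backbone_to_sequence(b)) for b in backbones]

# Each candidate is only a hypothesis: the sequence still has to be synthesized,
# the protein expressed, and the shape and function confirmed in the lab.
for c in design_candidates("my_reference_protein", "higher thermostability"):
    print(c.backbone, "->", c.sequence)
```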
And then you can actually go backwards and be like, why did that shape work?
Or why did that sequence work, given what I know about the mechanism of action?
Maybe potentially even discover things, because it's possible that the model learned some
correlations that were not previously obvious or nobody observed.
But yeah, and I think this field is particularly interesting right now because sort of with language models or image models or any of that, we may not be able to spell out the rules of the English language.
We may not be able to spell out the rules of what looks nice or doesn't look nice, but
we can instantly tell whether the image is aesthetically pleasing or whether the sentence
is grammatically correct.
We cannot with a protein.
We're like, I don't know, it looks like a sequence of amino acids.
So what's cool is there aren't actually implicit rules in our heads that these things have to follow, right? Like we're still discovering the rules by probing at nature,
right? So to some extent, it's entirely possible that, by getting better at interrogating the state of the models and how and why they're generating things, when they generate things that are successful, we can go back and see what is actually activated, and find ways to examine that to lead us towards potential discoveries in science.
So from that perspective, I really like all the work that Anthropic is doing and others
in model explainability.
Because right now, the reason Anthropic cares is because explainability tells us essentially
how to control the system, right?
If it does something weird, we want to know what's going on so we can fix it.
Or we want to be able to translate what the model did into some sort of human understandable
explanation of why it made a decision, right?
It's sort of a QA function.
But in biology, if the model is explainable, like you might discover explanations that
are not just confirmations of like rules you know, but new rules.
That's a hypothesis.
It hasn't actually happened.
So it might not happen.
Maybe I don't know what I'm talking about,
but that'd be cool.
For all the listeners out there
who didn't know enough about biotech
or didn't find it cool enough,
you definitely make it sound super cool.
By the way, any resources that people could go to
to learn a little more about this
and educate themselves with just the terminology
that you've been using here?
Yeah, so there is a great Slack community,
and I think there's also a website called,
I think they're called Tech and Bio.
Let's see.
You know what, maybe we'll later add to...
Oh, yeah, for sure.
We can add, yeah.
Or something.
But yeah, there is a community of people
who are sort of tech people who are
interested in biology and are learning about it. There's a pretty active Slack group. There are meetups and paper discussions and things of that nature. So that's probably the first place to go. And then there'll be lots of pointers from there. Well, this has been an awesome conversation,
Dmitriy. I know you have to go. We would love to keep the conversation going on the podcast and hope there is a second time when we get to speak with you again.
But for today.
Sounds good.
We can do a part two.
I do have to go and do a-
Oh, no, totally, totally get that.
But thank you so much for joining the show today.
This was awesome.
All right.
My pleasure.
Great meeting you guys.
Hey, thank you so much for listening to the show.
You can subscribe wherever you get your podcasts
and learn more about us at softwaremisadventures.com.
You can also write to us at hello at softwaremisadventures.com.
We would love to hear from you.
Until next time, take care.