No Priors: Artificial Intelligence | Technology | Startups - AI-Powered Biological Software with Jakob Uszkoreit, CEO of Inceptive

Episode Date: August 24, 2023

"Biological Software" is the future of medicine. Jakob Uszkoreit, CEO and Co-founder of Inceptive, joins Sarah Guo and Elad Gil this week on No Priors, to discuss how deep learning is expanding the ho...rizons of RNA and mRNA therapeutics. Jakob co-authored the revolutionary paper Attention is All You Need while at Google, and led early Google Translate and Google Assistant teams. Now at Inceptive, he's applying these same architectures and ideas to biological design, optimizing vaccine production, and magnitude-more efficient drug discovery. We also discuss Jakob's perspective on promising research directions, and his point of view that model architectures will actually get simpler from here, and be driven by hardware. Show Links:  Inceptive - CEO & Founder - Jakob Uszkoreit | LinkedIn   Inceptive Sign up for new podcasts every week. Email feedback to show@no-priors.com Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @kyosu Show Notes:  (0:00:00) - Creating Biological Software (0:06:54) - The Hardware Drivers of Large-Scale Transformers (0:14:32) - Challenges in Optimizing Compute Allocation (0:23:25) - Deep Learning in Biology and RNA (0:32:49) - The Future of Drug Discovery (0:41:41) - Collaboration and Innovation at Inceptive

Transcript
Starting point is 00:00:00 What would the world look like if we could create biological software that allows us to compile RNA? That's the big question this week on the podcast. Sarah and I are sitting down with Jakob Uszkoreit, co-founder and CEO of Inceptive. Jakob spent more than a decade at Google, where he co-authored the Attention Is All You Need paper and several other papers that set the foundation for today's AI revolution. He also started and led the research teams that transformed Google Search, Google Translate, and Google Assistant. Now at Inceptive, he's building biological software with the aim of making widely accessible medicines and biotechnologies. Jakob, welcome to No Priors. Thank you. Thank you for
Starting point is 00:00:40 having me. You worked at Google for more than a decade, working on many leading research teams. You were really seminal in the original Transformer paper. And I think when I talked to the other authors of the Transformer paper, people sort of in the know at Google, you're widely credited with really coming up with the idea of focusing on attention, which was sort of the basis for the Attention Is All You Need paper. Could you talk a little bit more about how you came up with that and how the team started working on it, and sort of the origins of that pretty foundational breakthrough in terms of the Transformer? It's really not that simple, right?
Starting point is 00:01:11 It's also really important to keep in mind that, as always in deep learning, you can't make something, in quotes, really work that is maybe pretty far on, say, the theoretical or formal end without really going deep on the engineering and implementation side. It just has to be efficient at the end of the day. In my mind, the one and only thing we know really works if you want to push deep learning forward is to make it faster and more effective and more efficient on a given piece of hardware. There's a lot of evidence that the way we actually understand language, and that's something that then shapes language in terms of its statistical properties, is actually somewhat hierarchical. And the best
Starting point is 00:01:51 piece of kind of just circumstantial, anecdotal evidence for that is just looking at what the linguists do, right? They draw these trees. And while I don't think that they're really true, they're also definitely not always false. And so they do capture some of the statistics that are inherent in language, and probably language actually evolved this way in order to exploit our cognitive capacities in a fairly optimal way. And so you can safely assume that it is not necessary to go through the entirety of a sequential signal beginning to end, and maybe also end to beginning, simultaneously in order to
Starting point is 00:02:28 understand it, but actually you can gain a lot of the understanding, in air quotes, by looking at individual groups of, say, your signal, right? And ultimately, if you now are given a piece of hardware that has the very key strength of doing lots and lots of simple computations in parallel, as opposed to complicated structured computations sequentially, then really that's a kind of statistical property you really want to exploit, right? You want to, in parallel, understand pieces of an image first, and maybe that's not possible in its entirety, but you can actually get a lot of it. And then only once you've done some of that, you put these incomplete understandings or representations together, and as you put them together more and more, that's when you
Starting point is 00:03:12 disambiguate the last remaining, or that's when you get rid of the last remaining ambiguity at the end of the day. And when you think about what that process looks like, it's a tree. And when you think about how you would actually run something that evaluates all possible trees, then a reasonable approximation is that you repeat an operation where you look at all combinations of things, that's this quadratic step, right, that ultimately is at the core of this attention step, and then you effectively pull information into a given representation of a given piece from the representations of all the other pieces, and rinse and repeat. And it seems intuitive, and it also seems intuitively clear, that that's a really good fit
Starting point is 00:03:52 for the kind of accelerators that we had at the time and that we still have today. And so that's really where that idea came from. And if you want to look at, say, the biggest differences, for example, between the Transformer as it was described in the Attention Is All You Need paper and some of its ancestors, like the decomposable attention model, the big difference is just that the Transformer was implemented by folks like Noam and others in a way that's such an excellent fit for the accelerators that we had at the time.
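A minimal NumPy sketch of the "look at all combinations of things, pull information in, and rinse and repeat" step described above. This is an illustration of the idea, not Google's or Inceptive's code; the shapes and names are chosen only for the example.

```python
import numpy as np

def attention_step(x, w_q, w_k, w_v):
    """One round of pulling information into each piece's representation
    from all the other pieces. x has shape (n_pieces, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v               # project each piece three ways
    scores = (q @ k.T) / np.sqrt(k.shape[-1])         # compare all pairs: the quadratic step
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the other pieces
    return weights @ v                                # weighted pull of information

# "Rinse and repeat": stacking this step combines partial, parallel understandings
# of local pieces and disambiguates them layer by layer, and every matrix product
# here maps directly onto the parallel accelerators discussed above.
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 16))                          # six "pieces" with 16-dim representations
w_q, w_k, w_v = (rng.normal(size=(16, 16)) for _ in range(3))
print(attention_step(x, w_q, w_k, w_v).shape)         # (6, 16)
```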
Starting point is 00:04:20 So one question that I've kind of heard people bring up: a lot of the behavior that we've seen in Transformers, to some extent, is most interesting at scale, right? You get interesting emergent properties. Yeah. And there may be other architectures that have equally interesting or perhaps more interesting properties at scale, but there are sort of two impediments. Number one, people just aren't throwing a lot of money and compute at them. And two, the underlying accelerator architecture fits the Transformer so well that it is dramatically less performant to run other architectures, and therefore we may never actually
Starting point is 00:04:43 test them. Do you think that's a true statement? I think that the big question is: does it matter? It would be really interesting to evaluate, especially if we can make them simpler, combinations of different hardware and then models or architectures that fit those like gloves. And I feel at the moment,
Starting point is 00:05:01 given where GPUs came from, they weren't built for this. Why would it be that they are anywhere near optimal? It's not as if they were engineered for this purpose and lots of people banged their heads against walls until they had it somewhat optimized; that's just not how the basic architecture came to be. And so you can talk a lot about, and reason a lot about, and I think that some of that is true,
Starting point is 00:05:25 the generality of basically really fast, scalable matrix multipliers and how that just does everything in scientific computing really well. But there are still lots of bells and whistles, and there are lots of specific tradeoffs, say, for example, things like memory bandwidth, and ultimately inherent parallelism versus latency. I don't think GPUs are at the sweet spot when it comes to large-scale deep learning with respect to exactly those trade-offs. And so it may very well be that if we actually try these combinations, we might actually even quickly find something that's better. When you think about how we get progress from here,
Starting point is 00:06:04 usually people think of software as driving the hardware, right? Do you think we get accelerators designed for the large-scale Transformer architectures we already have, or new hardware designs? Like, is it chicken or egg a little bit here? It's chicken and egg. And if you look at the newest accelerator designs, they are taking this into account to a significant extent, actually, increasingly. So there are a couple of interesting examples.
Starting point is 00:06:33 We had a computer vision architecture that really was just an MLP, called Mixer. And while it wasn't significantly better, it also wasn't significantly worse than the vision transformers. And I think that already goes to show it's not that difficult. And especially if you simplify on the way, it might really be a possibility. I will say one other thing: aside from efficiency, just really raw efficiency in terms of the architecture's fit to the accelerator hardware, the other main contributor, I think, to the success of this architecture was optimism and hope. So suddenly you were in a situation
Starting point is 00:07:08 where, for whatever reason, a bunch of things that people tried with this started to work, and then more started to work, and that's not a coincidence. It's really just because, ultimately, the human cycles invested in getting all these things, all these diverse things, to work are ultimately fueled by suspension of disbelief and
Starting point is 00:07:28 a.k.a. hope, or whatever you want to call it. And really, I mean, the community became so energized so quickly and then just tried everything under the sun, because the fire was just a different one. The fire now was, oh look, we have this thing where it just works, which is just not true.
Starting point is 00:07:44 The reality is, when you try something else the first time, you really have to work hard for a long period of time, and then, lo and behold, sometimes it works. And if you do that many more times, then it will work many more times. And I think that's really what we're seeing. Where do you think people should invest that sort of optimism going forward? Like, what are the big areas that people need to work on to increase the performance of these systems, or add memory, or do other things? If you were to sort of paint the roadmap going ahead in terms of making these really valuable, performant systems, what would you focus on? I mean, I think there's one thing that still boggles my mind, just from first principles, in that it can't be optimal. And that is, if you think about it, the way you today scale the compute that's invested in a given problem, right? Let's say the problem is: what's the response to a prompt in some large language model? Then ultimately, the way you scale that compute depends on the prompt and how long that is. The longer the prompt, the more compute you get. And it depends on, and there are of course many different screws to tweak here, the length of the response. There are many very hard problems where the response is incredibly short.
Starting point is 00:08:52 And you can, in many cases, actually formulate those problems very, very succinctly. So you're not going to be using a lot of compute, even though we know the problem is really, really difficult. Say, I don't know, prime factorization: a problem like that is simply stated, with big potential impact. And right now, there's no knob that you can easily tweak as a user, but also, really, there's no knob that the architecture can tweak itself when it comes to then basically deciding, oh, this is hard, I actually need to use more compute for this.
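As a rough illustration of how compute currently tracks sequence length rather than difficulty, here is a back-of-the-envelope sketch using the common approximation of roughly 2 x parameters x tokens FLOPs for the dense layers plus a quadratic attention term. All model sizes and token counts here are invented for the example, not taken from any real system.

```python
def approx_forward_flops(n_tokens, n_params=7e9, n_layers=32, d_model=4096):
    """Very rough FLOP estimate for one forward pass of a decoder-only
    transformer; the defaults are illustrative, not a real model."""
    dense = 2 * n_params * n_tokens                     # matmul cost grows linearly with tokens
    attention = 2 * n_layers * d_model * n_tokens ** 2  # all-pairs attention grows quadratically
    return dense + attention

# Length is the only knob: a long-winded but trivial prompt is allocated far more
# compute than a terse statement of a genuinely hard problem.
print(f"{approx_forward_flops(2000):.2e}")  # e.g. a rambling request to add 2 and 2
print(f"{approx_forward_flops(20):.2e}")    # e.g. "factor this number: ..."
```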
Starting point is 00:09:23 And ironically, and this comes back to a question that many people ask, I think, around: does it make any sense to train on generated data? Because information theory, foundational information theory, very clearly says: nope, you're not going to get more information out of it. You can do it all you want. But there is an artifact, or maybe even an omission, in that flavor of information theory, which is that it doesn't take into account compute. It doesn't actually take into account the energy expenditure necessary to generate that data. So if you now think back to these problems, right, if you were to just let LLMs run, generate stuff, and then train new LLMs, or even the same LLM, on that output, what you do is you amortize compute that was expended at some point in
Starting point is 00:10:09 time. And so now suddenly, right, you actually have models that, if you retrain them over and over again, start to spend more compute on the same problems, but it's amortized over all of these iterations of the system, effectively. And that seems clunky. That just seems so clunky that ultimately it should be something where, at inference time, at runtime, the model can effectively decide, or maybe even query, right? So there's this notion of anytime algorithms, where it might just depend on your resources. If you have more time or more money, then let it run longer.
Starting point is 00:10:44 But you don't want that to happen in cases where the answer, or the problem in question, is simple. You only want to do that in cases where it's actually hard. And that right now doesn't work. Because if you pose a very, very simple problem, like 2 plus 2, to GPT-4 right now, and you write that in a very long-winded way in a prompt, and you ask GPT-4 to generate a very complicated answer,
Starting point is 00:11:08 then it will actually expend a ton of compute to add two to two, which makes no sense. And so, out of all the different problems that I currently see at a high level, where it's not clear how you would exactly address them, that is maybe the one that boggles my mind most. Yeah. Are there other big research areas that you're excited about right now, or areas where you see enormous progress being made? So in terms of foundations, I think different flavors of elasticity are really interesting. So you could actually claim that a lot of these questions boil down to basically this problem that I just described, right: that compute is, in a certain sense, very crudely allocated. But you can look at different incarnations of this problem. So another one would be: why don't we have models
Starting point is 00:12:00 that in an elegant way manage to consume, say, visual sensor output of different resolutions, different sampling rates, different durations. Right now, it's actually quite tricky to have, other than maybe with recurrent architectures, a model that takes videos of different lengths,
Starting point is 00:12:17 different image resolutions, or ultimately different densities, if you wish, in different sizes, and really elegantly adjusts compute to what you really want to know about this, or how difficult it really has to be to generate the representations that you need in order to do whatever you want to do.
Starting point is 00:12:35 And here, again, an example that makes this, I think, pretty clear: you can take a video, you can scale it up, you can frame-interpolate with trivial algorithms, and then run it again. And if the problem you're trying to solve conditioned on that video is the same, then I wouldn't want more compute to be used. But right now, that's what's going to happen. You're going to use a ton more compute.
Starting point is 00:12:56 And so, effectively, these types of, in a certain sense, elasticity or flexibility of these models, and our lack of techniques addressing those, I believe, is ultimately incredibly wasteful. I've seen increasing attention around, like, two different concepts in these general directions. One is, I think it was some people at Meta that did depth-adaptive transformers, right? So just adjusting the amount of computation for each input, with, like, a prediction on that, right? And then I don't know how much more work has gone in that direction.
Starting point is 00:13:28 And then I think a number of people are more excited about doing test-time search, especially for problems like code generation, where you can evaluate it with compilation or something to sort of get a loop of success in the model itself? I think test-time search is super effective. I do think it's clunky because
Starting point is 00:13:50 it's not something that you can easily optimize end to end. So basically, this is also what I was trying to get at a little bit, maybe, with saying that some of these efficiency improvements that we're not yet really seeing, I believe, would dramatically affect training time. And if you look at kind of how test time
Starting point is 00:14:06 actually affects training, it's just clunky. And I don't think we'll be able to optimize it as well. Although, as an engineering, in a certain sense... I don't know, hack could sound negative. That's not what I mean. I think it's an awesome hack. As an engineering hack around this problem, it's really, really effective. It basically comes back to this whole idea of amortizing compute, in a certain sense, with the stuff you already have lying around and memorized, even though it was the humans that actually put it there in many cases.
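Here is a minimal sketch of the test-time search idea for the code-generation case mentioned above, where an external check such as compilation is available. The sample_candidate argument is a hypothetical stand-in for any model call, and the verifier here is just Python's parser; both are assumptions for the example, not anything from the conversation.

```python
import ast
import random

def test_time_search(prompt, sample_candidate, n_samples=8):
    """Spend extra compute at inference time by drawing several candidates and
    keeping the first one that passes a cheap external check."""
    for _ in range(n_samples):
        code = sample_candidate(prompt)   # hypothetical model call
        try:
            ast.parse(code)               # crude verifier: does the candidate at least parse?
            return code
        except SyntaxError:
            continue
    return None                           # nothing passed; the caller decides what to do next

# Toy usage with a dummy "model" that sometimes emits broken code:
def dummy_model(prompt):
    return random.choice(["def f(x): return x + 1", "def f(x) return x + 1"])

print(test_time_search("increment x", dummy_model))
```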
Starting point is 00:14:31 even though it was the humans that actually put it there in many cases. in terms of adaptive, adaptive time transformers, et cetera, we tried this universal transformer thing actually a long time ago. It just hasn't caught on, and that's because it just doesn't work, right? At this point, it doesn't work well enough. It's not like it doesn't work at all,
Starting point is 00:14:48 but if it worked really well, then because of the fact that compute right now is this incredibly scarce resource, we would see it everywhere. And I think what that tells us is, and I don't think here it's really just for a lack of trial, probably there's too little experimentation, but at least those known or proposed methods here,
Starting point is 00:15:25 So one thing that you've been working on for the last few years is Inceptive, which is really starting to focus on how you can apply machine learning and different aspects of software to biology. Could you share a little bit about the company, how you got interested in bio,
Starting point is 00:16:11 and what you view as some of the interesting problems there? Yeah, so basically, I've always been interested in bio and know nothing about it. And that's a conundrum, because it's difficult to learn a lot about biology when you're not in school, and I didn't want to go back to school. But at the same time, it always felt like something where there's a lot of headroom in terms of efficiency, and actually also where, at least if what you are interested in is really solving acute problems, there's maybe a dire need for alternative approaches: alternatives to basically biology, the science that is trying to develop a complete conceptual understanding of how life works. I don't have very high hopes for humanity developing that complete conceptual understanding to the level that we would need in order to do all the
Starting point is 00:16:48 interventions we want to do. We don't really have great tools in our toolbox, or we didn't have them until somewhat recently, as alternatives to understanding how it works and then, based on that understanding, fixing it if it needs fixing. And I think now we have an alternative that's an extremely good match, and that's deep learning at scale. Really, we can potentially, to a pretty large extent, if not entirely, whatever this even means, work around the following two problems. Number one is we don't know all the stuff that's going on in life, right? So we still just don't even have a complete inventory, let alone really understand all the
Starting point is 00:17:08 mechanisms. And number two, even for the stuff that we do know, we ultimately haven't, in many cases, been able to come up with sufficiently predictive theories to really make that understanding useful. A concrete example here is protein folding, right? Or basically, even if you just act
Starting point is 00:17:08 as if there are no chaperones, there is no other stuff in this environment in which folding or whatever you want to call it, in which that process in which the earliest kinetics during translation happen, even if you make that massively simplifying assumption,
Starting point is 00:17:24 the theory just wasn't practical, and it seems like deep learning is at least potentially a really good answer to both of those aspects, because you can basically treat everything, in quotes, as a black box, and as long as you are able to observe that black box in terms of whatever input-output pairs, enough and at sufficient scale, you might go somewhere with that. So Inceptive is pretty stealthy. Is there anything you can share in terms of how you're applying deep learning or other techniques to biology in the context of the company?
Starting point is 00:17:54 Yep. My daughter was born, my first child, and just that entire process gave me a really fundamentally different appreciation for the fragility of life, a really wonderful one, but also a pretty fundamentally different one. And so here we are: we have this new tool, namely AlphaFold 2, that solves one of these fundamental problems in structural biology. We have instances of a macromolecule family that's basically about to save the world, and I basically want to fix life because I now have this wonderful daughter. It became clear that using the exact tools we had been working on at Google before and applying those to this neglected stepchild, namely RNA, or more specifically, at first, mRNA, could have massive impact
Starting point is 00:18:37 on the world. And ultimately, what we're trying to do is to design better RNA, and at first mRNA, molecules for a pretty broad variety of different medicines. Infectious disease vaccines are, I guess, maybe the obvious first example, given the COVID vaccines. But if you look at the pipelines of Moderna and BioNTech and all those companies, the at least potential applicability of RNA, more specifically, is near limitless. There are already now hundreds of programs underway in different stages of development. That number is expected to climb, hitting high triple digits before the end of the decade. And now we're talking about a modality that might end up, before the end of the decade, being the second or third biggest modality in terms of revenue
Starting point is 00:19:31 and potentially also in terms of impact. And if you now take that in terms of just trajectory, and look at how suboptimal, in a certain sense, the mRNA vaccines were when you compare them to what's possible using RNA, just looking around in nature; looking at how severe the side effects were for what fraction of the patients that ultimately received the vaccines; how few people, comparatively, really had access to any of those vaccines when they were really necessary and needed. And it seems like currently, if we look around in our toolkit, the only tool we have to potentially change that quickly is deep learning. So at Inceptive, we think of this now as something that you could call biological software, where mRNA, and RNA in
Starting point is 00:20:18 general, is maybe the equivalent to bytecode that then forms the substrate, forms like the actual stuff that the software is made of. And what you do is you learn models that allow you to translate biological programs, programs that might look like some bit of Python code that specifies what you want a certain medicine to do inside yourself, inside your cells, and translate those programs, compile them, into descriptions of RNA molecules that then hopefully actually do what you wrote, what you programmed them to do. And ultimately, right now, if you look at mRNA vaccines, our programming language is just a print statement, right: just print this protein. But you can easily imagine that with self-amplifying RNA as one example, and with so-called
Starting point is 00:21:08 riboswitches, basically RNAs that change dramatically in structure or self-destruct in the presence of, say, a given small molecule or so, you can effectively have conditionals, you can have recursion, and as a computer scientist, you squint and you're like, oh, wow, okay, this is basically Turing complete, you have some I/O, and you kind of have all sorts of tools now at your disposal to really build very, very complex, ultimately, medicines that then might also be produced, manufactured, and distributed in a way that is much more scalable than anything that we've been able to do so far. Protein-based biologics oftentimes don't make it to the market because it's just not possible to manufacture them at scale.
Starting point is 00:21:46 If we wanted to medicate everybody in the world with all the protein-based biologics that they should actually receive, the real estate on the planet wouldn't be enough to make all the stuff. But right now, if you look at RNA manufacturing and distribution infrastructure, we're going to have 6 to 8 billion doses two years from now, manufacturable and distributable across the globe. And that number is going to go up really, really quickly. At Inceptive right now in our lab, we can actually print pretty much
Starting point is 00:22:12 any given RNA. And that's just something you can't do with small molecules. You can't easily do it with proteins, certainly not at scale. And that's not something that only matters when you have a product in your hand. If you want to treat this as a machine learning problem, you need to generate training data; it doesn't already exist. And so you also really want to have scalable synthesis and manufacturing, which is unprecedented as a combination.
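To render the "biological software" analogy from a moment ago in code: the sketch below is purely illustrative and not an Inceptive API; every name in it is invented. Today's mRNA vaccines correspond to the single "print statement," while riboswitch-like sensing and self-amplifying RNA loosely play the roles of the conditional and the bounded recursion. The hard part, which a learned model would have to do, is compiling intent like this into an actual RNA sequence.

```python
def express(protein, copies):
    """Stand-in for 'make this protein': the print statement of today's mRNA vaccines."""
    print(f"express {copies}x {protein}")

def in_presence_of(molecule):
    """Stand-in for a riboswitch sensing a small molecule (stubbed for the sketch)."""
    return molecule == "target_metabolite"

def vaccine_program(amplification_rounds=3):
    if in_presence_of("target_metabolite"):          # conditional ~ riboswitch
        express("antigen_protein", copies=1000)
        if amplification_rounds > 0:                 # bounded recursion ~ self-amplifying RNA
            vaccine_program(amplification_rounds - 1)
    # otherwise: degrade quietly outside the intended context

vaccine_program()
```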
Starting point is 00:22:32 So your view is that you can actually search for the program that codes for, let's say, the COVID spike protein at a certain amount, with different stability characteristics, with different immune reaction characteristics, that doesn't need cold-chain logistics, that's conditional on whatever cell type, and I'm saying in the future, not Inceptive today, but that's the goal, out of all of the 10 to the 630 variants. That's right. And it's not certain, I mean, ultimately it's not going to be a search, right? Just like today, the output of an LLM isn't coming out of a proper search procedure.
Starting point is 00:23:12 It has to be a generation procedure, exactly in the same way and for the same reason as you basically see it in large language models or image generation models. But yeah, that's exactly the goal. Because screening is just not going to cut it at 10 to the 60th. And that's really just one antigen that we're coding for there, when we actually want to code for many and update those for any given... For any given, yeah, exactly. When you do personalized cancer vaccines, it is going to be many antigens for each patient over time, right? And there's just no hope of basically tackling this with screening approaches at all. Yeah, I'm excited to just get to the right answer without having to understand or
Starting point is 00:23:52 discover every single mechanism and do the massive, expensive screens we have today. I mean, that's really the big question. Are we here maybe at a crossroads where the discovery and understanding is actually a hindrance? The hope to discover and really get how this works might actually be holding us back. And there is a pretty direct analogy to language understanding. Computational linguistics, and linguistics in general, tried this for a while: to develop a sufficiently accurate and complete theory of language to make this really actionable. Yeah, when you talked about how the Transformer model works, for example, I actually was thinking about genomic sequencing, where you used to do the sequential sequencing contig by contig and you'd have
Starting point is 00:24:32 these big chunks of chromosomes that you'd sequence through sequentially, and then eventually you moved into an era where you just broke it up into tons and tons of tiny little sequences that were randomly generated, and then you'd reassemble it with the machine, right? And that felt like a very interesting parallel or analog to what you were talking about from a language perspective. It's effectively the same thing. It is, exactly. And the parallels are so striking, and they don't end there. So, yeah, it's really, really interesting to see. And the invariant that I feel just holds true across the board is that these formalisms that we make up in order to communicate our
Starting point is 00:25:07 conceptual understanding or intuitive understanding, and conceptualizing explicitly, is great for education. It's also great for many other types of, maybe, reasoning about them. It might actually, because of our limited cognitive capabilities, really not be the right tool to predict what's going to happen with a given intervention. Yeah. And I think the other point that really resonated in terms of what you mentioned was just, if you look at drugs, especially traditionally, we actually didn't understand how most drugs worked until very recently. And so aspirin, we had no idea how it worked when it was taken out of the bark of a yew tree or whatever in the 1800s. And it was fine. Like, people were fine taking these things that had minimal side effects. There are very popular drugs on the market, like metformin, that bind to multiple targets. We still aren't sure exactly how they work. And so a lot of the emphasis right now from a regulatory pathway for drugs is, oh, you need a mechanism of function or you need a proven pathway. And all these things create hurdles that don't necessarily help with drug efficacy. And some of them might actually also be, in a certain sense, kind of, I should say...
Starting point is 00:25:47 It's a waste of time and money. If the thing works, it works. Yes, it's a waste of time and money, and it might not even be true. And we have no way of telling. Because in the end, the ground truth is, right: does it work, and does it actually do more good than harm? And it's empirical. And yeah, maybe there's really just, maybe that should be the focus. Yeah. And everything else should be treated as something that we should at least do after we get the first thing.
Starting point is 00:26:23 In that historical framing, where we don't actually understand many of the things that have been most important in medicine, or we've discovered their mechanisms after the fact, you know, the end-to-end, black-box, deep learning pipeline approach seems a little more rational, a little less heretical, though I think at first blush it certainly is controversial. Yeah, I mean, the part that one can look at as blasphemous is that now suddenly you don't know the theory anymore that you're testing, right? And you might never, because it's not clear to us today, as far as I can tell, that if there is a theory in that black box today, we could get it out.
Starting point is 00:27:15 There are people trying, and I think it's worth trying. I'm not super optimistic about that. I think it'll work for some cases, right, where it's simple enough that we can get it. I think there are many cases where it just isn't, like, say, climate and weather forecasting; I just don't think we're going to get it. We're going to get it in the sense that we understand, I think, the Schrodinger equation and how that could be used, intractably though, in theory, to just solve all these things. But that's not practical. And to develop a theory that is both predictive and practical here might just
Starting point is 00:27:49 not be something we can put in our heads. Yeah. This is kind of interesting because I actually feel like this, again, is the basis of a lot of traditional drug discovery from way back when, as well as just the basis for how you think about genetic screens, right? You'd basically do functional screens, so you'd mutagenize a bunch of organisms. You'd look for output, and then you'd say, okay, I've identified genes that are part of this pathway or output, and I can map in some ways that they're interacting with each other, but before molecular biology, we actually didn't understand anything from a function perspective. We just understood sequencing and output, right?
Starting point is 00:28:17 Exactly. So how do you think about human augmentation in the context of all this stuff? You know, how bullish are you on human augmentation, and what forms do you think it'll take in the near term? I'm very bullish on human augmentation in the very long term, but it's one that I don't see intuitively. I think looking at our brains, even just physically, they seem to be very focused, and this is not surprising, on I/O. And why would there somewhere in there be some kind of computational capacity that, if we just boosted our I/O by a few orders of magnitude,
Starting point is 00:29:02 could still cope? Why would evolution put that there? I don't know why. And so, yes, you could argue, you know, maybe to do long-term planning tasks and so on and so forth, but sure, let's bound it by a lifetime. So, right, it's just not so clear whether there would have been any evolutionary pressures to really make our capacity there much bigger than, say, some multiplier, basically, times our I/O capacity. If you look at the number of tokens that you use to train an LLM, and then you look at the number of tokens or words that are used to train a kid, right, a child,
Starting point is 00:29:36 a human baby or a human toddler. I mean, a human toddler is probably exposed to what? Hundreds of thousands, maybe millions of words before they can speak, like fluently. But I think that's because we confuse fine-tuning and pre-training. Pretraining is all of evolution. Sure. and then basically you arrive at this thing
Starting point is 00:29:54 And then basically you arrive at this thing that's maybe doing something that is, in a certain sense, a completely irrelevant task at first, but it has all the capacity in there to then, with a comparatively small amount of data, maybe it's something in between, be fine-tuned towards something that we would regard as oh-so-advanced cognitively.
Starting point is 00:30:11 The compute has been amortized over the last several tens of millennia of humans, and we come pre-wired for language, and so it only takes a million tokens at the end. Exactly. And now the thing is that you can now say, okay, great. So we come pre-wired. Let's look at our wires and try to find language. That might not, it might not be that simple, right? Because, of course, it's this co-evolution and it's all fuzzy. And so how much we're pre-wired for it, and how much language is, in a certain sense, also pre-wired for it. It might be, it might be the case that it's maybe even impossible, right, to actually read out what it's pre-wired for from just looking at the wires. Yeah. You can see circumstances where people are literally born without a hemisphere of their brain, or there are other sorts of mass-scale deficiencies brain-wise, and then things just rewire to effectively compensate. And so you have parts of the brain taking over other functionality that they're normally not designed for, which is also fascinating, because it seems like certain parts are extremely specialized, visual cortex, et cetera, and then other parts are basically almost general-purpose machines that can be reallocated. I completely agree with what you're saying. I feel general-purpose machines
Starting point is 00:31:21 is a really tricky term because, right, I mean, could they, could the brain after a massive trauma rewire to do something very different? Fair. Unclear, right? So it could be that it's actually still specific, but it is, in a certain sense, general, namely preparing for a certain flavor of redundancy. And this is also why I find AGI as a term particularly problematic, because I don't know what the general means. I think they're referring to General Tso's chicken as part of... no, I'm just, sorry, really dumb joke. I finally get it. I'm sorry.
Starting point is 00:31:54 Finally all makes sense. What's the theory of data generation at Inceptive? I feel like I understand the mission you describe, and then you need to go do wet lab experiments with observation to understand all the properties of these sequences, and you have to figure out how to do that efficiently, right? Still a young company with all your pedigree and resources. Yes.
Starting point is 00:32:17 I would love any intuition on that. Yeah. So let me try to get across how we think about this. So number one, we look at ourselves actually as one anti-disciplinary team. So it's not quite anti-disciplinary, although there is a correlation maybe with a lack of discipline or disregard for fundamental discipline or disciplines and being anti-disciplinary. But we think we're really in the sense pioneers of a new discipline. It doesn't have a name yet, but it draws a lot from deep learning and draws a lot from biology.
Starting point is 00:32:45 We think ultimately designing the experiments or assays that we're using to generate the data that we need to then train the models is, in a certain sense, at the core of this discipline, if you wish, because the experiments or the assays that we're running use the models that we're training on the data that their predecessors actually produced. And so really, if you squint, then in a certain sense, I guess there was always this dream, and I think it's a pipe dream, of having the cycle between experimentation, and then you put that into something in silico, something running on computers, and then that informs the experiments,
Starting point is 00:33:21 and then you kind of iterate that cycle. I think that's just, it would be beautiful and simple and nice. I don't think it's really that easy. So what you see at Inceptive is actually that there is not that one cycle, although maybe now, somewhere hazily, there actually is that cycle too, but by design, actually, there are tons of little cycles. So, right, you start an assay, and the first thing you do is actually you query a neural network,
Starting point is 00:33:46 and then you do some stuff, and then you get certain readouts, and those, you then together with some other stuff feed into yet another model. And then that actually gives you parameters for some instrument. And then you run that instrument on the stuff that you've created. And so it's really just this kind of giant mess where the boundary
Starting point is 00:34:02 actually is increasingly blurry. And so we actually think that our work happens on the beach, because that's where the wet and the dry meet in harmony. Ah, huh. And so initially, folks join Inceptive and they usually, most of them, come from, say, either
Starting point is 00:34:19 quote-unquote side, right? They've spent most of their careers working on deep learning, or on robotics or biology. But ultimately, it doesn't take them that long to start speaking some weird kind of creole of all of these languages and also to think in these ways. And what then happens is magic. It's really amazing, because then you suddenly find solutions to problems that, say, the biologists they were two years ago just wouldn't even think about, and they work together with folks they would have otherwise maybe never even met. And the results sometimes don't work at all,
Starting point is 00:34:56 but sometimes they really are magic. That's a really inspiring note to end on. Thanks, Jakob. Thank you. Find us on Twitter at @NoPriorsPod. Subscribe to our YouTube channel if you want to see our faces. Follow the show on Apple Podcasts, Spotify, or wherever you listen.
Starting point is 00:35:13 That way you get a new episode every week. And sign up for emails or find transcripts for every episode at no-priors.com.
