Grey Beards on Systems - 108: GreyBeards talk DNA storage with David Turek, CTO, Catalog DNA

Episode Date: October 17, 2020

The Greybeards get off the beaten (enterprise) path this month to see what lies ahead, with a discussion on DNA storage. David Turek, CTO, Catalog DNA (@CatalogDNA), is a long-time IBMer who had focused on HPC systems at IBM but left and went to Catalog DNA to pursue the commercialization of DNA storage.

Transcript
Starting point is 00:00:00 Hey everybody, Ray Lucchesi here with Keith Townsend. Welcome to another sponsored episode of the Greybeards on Storage podcast, a show where we get Greybeards bloggers together with storage and system vendors to discuss upcoming products, technologies, and trends affecting the data center today. Hey everybody, this Greybeards on Storage episode was recorded on October 9th, 2020. We have with us today David Turek, CTO of CatalogDNA, a startup focused on bringing DNA storage to the world. So, David, why don't you tell us a little bit about yourself and what CatalogDNA is up to? Okay, thank you. So I was the executive in IBM responsible for our high-performance
Starting point is 00:00:47 computing business prior to joining Catalog this summer. And I made that jump because I saw a real transformational opportunity with Catalog in terms of really bringing the concepts embodied in the efficiency of data encoding in DNA, coupled with the possibility of not only storing vast amounts of information for a long time with low energy, but also beginning to explore the concept of applying DNA for computing as well. I've written some blog posts over the past couple of years, both on the emerging DNA storage technology, as you would say, as well as on using cellular or cell-logic kinds of structures to provide rudimentary computation and that sort of thing. Is that what CatalogDNA is trying to bring together? Our play is really centered on synthetic DNA,
Starting point is 00:01:45 so it's not inside of a cell. We're not invoking cellular machinery to do anything here. Rather, we're taking the attributes of the DNA molecule and leveraging that to store information. And we think that there are ways to encode instructions in DNA as well
Starting point is 00:02:01 to compute on that data. Wasn't this something that Microsoft was dabbling around with a few years ago? A number of companies have dabbled with it. I think a couple of key things differentiate Catalog from the previous efforts, chief among them a very unique encoding scheme that we employ that radically reduces the amount of chemistry involved and really opens up opportunities for automation in terms of encoding data, in ways that other methods that people know about in the industry are simply incapable of pursuing. I was going to make the comment that, you know, we've kind of shied away from chemistry in computer science, at least down at the consumption level, like the idea of combining chemistry with processors and such.
Starting point is 00:02:55 I know there's chemistry there, but that stuff has come a long, long way and it's invisible to us. Hard drives today, there's some chemistry there. There's plenty of chemistry in the magnetic material that's used and all that stuff. Yeah, absolutely. Exactly. So help kind of make this real for us because when I think about DNA using DNA for proper storage, it seems at this time it still seems like kind of Star Trek-ish science fiction versus real.
Starting point is 00:03:29 Where has the innovation happened in the past few years? So I'll speak to our innovation explicitly, and we can talk about other things going on in the market as well. But I think if we step back for a second and think about the problems that people face with respect to storing information, they fall into categories, right? It's volume, it's velocity, it's quality, it's energy, and it's longevity. And by longevity, I mean the ability to store data today that you'll actually be able to read sometime in the future. So if you look at tape technology as an example, and you look at the transition from LTO6 to LTO8, what you find is the change in the media and other factors require you to do a rip and replace strategy with respect to your infrastructure. And this goes on all the time.
Starting point is 00:04:20 It's a reason why people lament, you know, the loss of floppy drives or even today, more recently, even CD players and CDs. Technology advances, and the way in which you stored data previously is put at risk as a result. So when you look at DNA and its fundamental attributes for storing data, you find that it's got tremendous capability in terms of data per unit of density, maybe a million times greater than what you get with conventional digital technologies. You find that you can do things with really, really low energy. And I think critical to this is you can preserve this data forever effectively and know that no matter when in the future you
Starting point is 00:05:05 want to get back at this data, you'll have a way to read it, because DNA is a structure that will forever be capable of being read by a whole bunch of different kinds of current and future tools. So that fixed nature of what the media is, is actually a tremendous strength as far as being able to give clients the confidence that what they store today will be retrieved sometime in the future. It's hard for me to understand. And, you know, correct me if I'm wrong. You know, DNA seems to be such a malleable structure. It's prone to decay, and it has, to some extent, as far as I know, to exist inside some sort of cellular mechanism. How could that
Starting point is 00:05:58 be portrayed as a permanent storage solution. Now, I realize that, you know, they've been extracting DNA from, I don't know, humans that died 20,000 years ago, and then maybe insects that died 65 million years ago and things like that. But even that extraction doesn't necessarily reconstitute the whole DNA strand, does it? Well, that's nature and the way it handles the preservation of DNA. What we're doing is we're building DNA molecules synthetically, so we construct them from the ground up. By virtue of doing that, we can impart certain attributes to them which lend longevity, shall we say. One of the ways you do that, for example, is limit the length of the
Starting point is 00:06:46 DNA molecule. The longer a DNA molecule is, the more fragile it becomes, kind of like a bamboo stick, if you will. You know, it's fine if it's four feet long. If it's 60 feet long, it starts to become a little bit of a problem. So there are tricks that you can do, but we're not talking about embedding this non-biological synthetic DNA into cells. We can desiccate it and store it in solid form for thousands of years, or we can keep it in the liquid form as well. But I think there's one other attribute of DNA, which is important to mention, which is it's really, really easy to cheaply create as many copies of DNA as you want. So when we ligate together building blocks of DNA, you know, stitch them together, think of it that way, and we then amplify these molecules that we create, we can create a million copies in a very short time. And so one of the ways to deal with the risk of data decaying, media decaying over time, is you proliferate copies and you store them in a geographically dispersed kind of way.
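As an aside, the durability argument from cheap copies is easy to quantify. A minimal back-of-the-envelope sketch, with made-up survival numbers rather than anything Catalog has published:

```python
# Back-of-the-envelope durability from replication (illustrative numbers only).
# If each copy of a molecule independently survives some interval with
# probability p, then at least one of n copies survives with probability
# 1 - (1 - p)**n, which approaches certainty very quickly as n grows.

def p_survive(p_copy: float, n_copies: int) -> float:
    """Probability that at least one of n independent copies survives."""
    return 1.0 - (1.0 - p_copy) ** n_copies

# Amplification makes copies nearly free, so n can be enormous:
for n in (1, 10, 100, 1_000_000):
    print(f"{n:>9,} copies -> P(at least one survives) = {p_survive(0.9, n):.10f}")
```

Geographic dispersal is the same idea applied to correlated failures: independent sites keep the copies from all being lost to one event.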
Starting point is 00:08:12 So there are a lot of features of DNA that are contrary to the way we conventionally think about the behavior of media, and that can be exploited in a lot of different interesting ways. So I realize that replication is an inherent aspect of the cellular machinery for DNA and that sort of thing. But what you're saying is that even without using that sort of machinery, you can, through lab types of mechanisms, replicate a single strand of DNA a million times without expending any serious amounts of energy or time? Is that right? Yeah, that's correct. And the flip side of that is that the archives of DNA that we build are also capable of being accessed in a random
Starting point is 00:08:55 fashion. I always thought DNA strands were read sequentially. Ah, well, so here's the difference. And I suppose this characterizes the innovation of Catalog versus other players. So historically, and if you review the literature, most of the efforts to encode data into DNA focused on the base pairs, A, G, C, T. And a scheme was imparted, you know, an A is a 0, 1 and a T is a 1, 0, something like that. But that's a strategy that is really expensive and really slow and really consumptive of a lot of chemistry. Because every time you do another base, there's a chemistry step involved. What we did is we borrowed sophisticated ideas from the world of Tinker Toys and Legos. And we built short strands of DNA, maybe 25 to 50 base pairs in length, and made them diverse in nature. And so think of a Lego box with a lot of pieces of the same size,
Starting point is 00:09:57 but 1,000, 2,000, a million different colors. And we can take those things and we can build structures out of them in the laboratory. We actually invented a machine to do this automatically. And by linking these different colored Lego pieces, or differently characterized pieces of DNA, together, we can create molecules that impart two critical pieces of information. One, the value that you're trying to encode. And secondly, the order in a bitstream of data that you're trying to represent. And by virtue of doing that, every molecule that we produce lets you know where it should be in the bitstream and also what the value is, whether it's a one or a zero. And then, of course, because the world is binary, we actually don't do anything
Starting point is 00:10:43 with the zeros. We just assume if there's not something targeted for a particular space in a bitstream, it must be a zero. So we've elevated this to a point of automation by using building blocks and not base pair kinds of technology. Now, that's really efficient from an automation perspective. We just do ligation of these different Lego building blocks, if you will, to create these different molecules that represent the data we're trying to encode. But it's also very efficient in terms of the reading part of the equation, where, in our case right now, we're using Oxford Nanopore technology, which doesn't require the same degree of fidelity, if you will, that, let's say, an Illumina machine might need in terms of sequencing a DNA molecule. So you're
Starting point is 00:11:35 not looking at base pair resolution. You're looking at strands-of-DNA resolution. So you're talking about Lego pieces with millions of different colors. That's great, but you'd have to have some way of identifying where each Lego piece fits into, I'll call it, the bit stream or the byte stream or the data stream and that sort of stuff. Are you encoding some sort of an address into each of these Lego pieces that says, okay, you belong at section 27, bit 14, or something like that? Is that how this works? In a way, yes. But don't take the representation too precisely in terms of what's happening physically. We actually have a lot of mathematics behind this, embedded in different sorts of algorithmic approaches using combinatorial mathematics and so
Starting point is 00:12:26 on, that will actually look at this universe of different building blocks we have and make selections about which ones go together to convey the kind of information we want to encode. And we have this automated machine I referred to. We actually call it Shannon, after Claude Shannon. And it passes a webbing through the machine at a certain pace. And we have modified inkjet printheads to deposit on this webbing a little ink drop, if you will. But the ink drop is actually composed of these Lego building blocks of DNA. Every dot can contain something different. In fact, we can actually do more than one kind of data representation in a single dot if we wanted to, multiplexing, if you will.
Starting point is 00:13:16 But if you look at a single dot, we put the pieces in there, and then under the right thermodynamic conditions, with the right enzymes and the other kind of chemistry that needs to come along, they stitch themselves together in the right order. And then we have a whole collection of these dots, each of which contains a molecule that's been created. But those molecules now convey both value and location. And we can read those through a nanopore device and spit out what the data is on the back end. So writing and reading is pretty direct. So I'm not struggling with the concept of this immutable storage medium. This seems like we've heard this over the past few years. Perfect storage medium, to your point: it doesn't decay.
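To make the "value plus location" idea concrete in software terms, here is a toy model of position-addressed, ones-only encoding. It is an editorial illustration of the concept as described, not Catalog's actual encoding or chemistry, and the bitstream is invented:

```python
# Toy model of position-addressed, ones-only encoding (not Catalog's scheme).
# Each synthesized "molecule" is stood in for by the position of a 1-bit;
# positions with no molecule read back as 0, so zeros cost no chemistry.

def encode(bits: str) -> set[int]:
    """Return the positions whose bit is 1; nothing is made for zeros."""
    return {i for i, b in enumerate(bits) if b == "1"}

def decode(molecules: set[int], length: int) -> str:
    """Rebuild the bitstream: position present -> 1, absent -> 0."""
    return "".join("1" if i in molecules else "0" for i in range(length))

data = "1001011000"
pool = encode(data)                     # {0, 3, 5, 6}
assert decode(pool, len(data)) == data  # round-trips losslessly
```

The detail worth noticing is that each molecule carries its own address, which is what makes a test tube workable as a file: a solution has no inherent order, so the ordering has to travel with the data.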
Starting point is 00:14:06 It's not going to change. A zero today is going to be a zero. If you're reading the DNA, the format of DNA just doesn't change. What I'm having a hard time working my mind through is that the fact that it's immutable means that I write once and I never change it. So how do you go from that to creating a usable, and I don't want to go all the way up to the file system, but a usable data store or something like that? Yeah. So let's take random access for a second. So I create a file, using that word in quotes, which is actually a bunch of DNA molecules in a solution. Test tube.
Starting point is 00:14:52 Okay. That's my file. It's a test tube. And you say, but that's not representative of the way I think about files. Well, yeah, because we're dealing with chemistry. We're not dealing with physics in a certain sense. But by virtue of having all that data encoded so that every DNA molecule is identified in terms of what it represents in a bit stream, or encoding that data to represent something else for that matter, I can create a search molecule. And that search molecule can be
Starting point is 00:15:25 thrust into that file, and it can map to and connect to what it's searching for. And so I can do that without regard to how big that file is, in the same amount of time, every time. So in other words, if it takes me, I'm going to make the numbers up, but if it takes me half a second to find the search object among 10 million molecules, it's going to cost me half a second to find that search object in 10 billion molecules or 10 trillion molecules. So there is no scaling of time as I execute these kinds of functions that take advantage of the chemical representations of the data. That's a tremendous feature when you're looking at problems that are awash in data, but have just maybe one little needle in that haystack that you're searching for, right?
Starting point is 00:16:18 So you don't have to study everything. You don't have to look at every piece of data. You don't have to compare every piece of data to something else. You just throw that one molecule that represents your needle into that haystack and voila, you've got it. So there's sort of a key portion and some value portion, which is encoded in this DNA. And you create this, I'll call it, search key molecule, and you insert it into this test tube, I guess, and it will go out and find, you know, anywhere from one instance of this to billions of instances of this
Starting point is 00:17:00 without any additional work and stuff like that. And once you've got that key and its associated data, you can read this out somehow. So aren't they all existing in this solution at the same time? Yeah, they are. So what you do is you pass the solution through this nanopore device, and it'll read these molecules. It'll read the molecule that you've been searching for, and that'll be represented by the nanopore device in digital fashion, which you can then convert into conventional ones and zeros, and there's your answer. What type of compute is needed? What level of compute is needed to do that processing?
Starting point is 00:17:47 So there are multiple levels of computing in this. So let me begin by creating a picture of the device first. And I'll preface my comments by saying that what we're talking about is a confluence of hardware, software, and chemistry in one device, right? Now, that's orthogonal to the way we ordinarily think about computing or storage. We think about that as being the confluence of hardware and software. Chemistry is not involved, but here it's different. So imagine, if you will, kind of a conveyor belt that spins out this polypropylene sheet
Starting point is 00:18:24 that enters into a machine. And it goes under these printheads, and the printheads, under the control of a software program, will formulate the contents of these building blocks of DNA that go into a particular drop, that go through a particular nozzle in the printhead, that goes to a particular spot on this webbing that's going through the machine. So the first thing is you have a stack of software that's orchestrating the behavior of the machine. That webbing then carries on and enters into kind of a wet lab process, also automated, where all the DNA molecules are incubated. And what that does is it causes the Lego pieces to connect to one another in the right way.
Starting point is 00:19:15 And then after that, it comes into another stage in the machine where it just gets deposited into this fluid. Essentially, the DNA gets scraped off of the polypropylene webbing. The polypropylene webbing is discarded, and the DNA is what's retained in the solution. All right. So there's this sort of electrochemical mechanical machine that exists. That does all this writing process, I'll call it, right? This is the writing process. And then that end step, where you have everything in what we call a pool, or it's just a collection of DNA in this liquid, is then taken to a wet lab station. And the volume is reduced and the DNA is isolated to the extent that it's now prepared for entry into a sequencer, if you will, either nanopore or Illumina or something like that. As I said, we use Oxford Nanopore. And then those molecules are threaded through the nanopores in the device. The software associated with that device, not from Catalog, but from Oxford, will read that DNA and represent it in a file format digitally,
Starting point is 00:20:30 which we will then reinterpret in terms of all the transformations we've done to the data going through the machine and spit it back out in conventional data formats. And this is, if you're working with Illumina today, or any of these types of devices, this is the part that isn't new. We read DNA. We're pretty good at it. It's a relatively slow process if you're trying to use this for traditional storage methods, but the amount of data that it stores is incredible. So that part I get. It was the writing part that was not as clear to me, but now I'm seeing the challenge, and the innovation is in the writing and indexing and making this a usable format for compute. So with something like this technology, let's say you're a super secret intelligence organization
Starting point is 00:21:26 and you've been recording all the email traffic throughout the internet, and you were able to encode all this information into DNA and put it into this one solution. And let's say you wanted to search for the word, I don't know, terrorist or something like that. You'd create a terrorist search molecule and insert it into this solution. And somehow you'd be able to come out of that with every email in the world that's in that solution that uses the word terrorist. Conceptually, that's correct.
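In digital terms, the search being described behaves like a content-addressed lookup rather than a scan: every matching molecule binds the probe in parallel, so the time doesn't grow with the size of the pool. A loose software analogy, with a hypothetical inverted index standing in for the probe chemistry and toy data throughout:

```python
# Illustrative contrast between scanning and probe-style lookup (toy data).
emails = {"msg-001": "lunch plans for friday",
          "msg-002": "terrorist watchlist update"}

# Scan model (conventional storage): cost grows with the number of records.
hits_scan = [key for key, text in emails.items() if "terrorist" in text]

# Probe model (DNA archive analogy): build an index once, and lookup cost
# is then independent of how many records the archive holds.
index: dict[str, list[str]] = {}
for key, text in emails.items():
    for word in text.split():
        index.setdefault(word, []).append(key)

hits_probe = index.get("terrorist", [])
assert hits_scan == hits_probe == ["msg-002"]
```

The analogy is imperfect, since the chemistry needs no index to be built at all; every molecule in the pool is exposed to the probe simultaneously.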
Starting point is 00:22:07 So it's random access. And again, a molecule or set of molecules that constitute a search target can be injected into the solution, and it can come out with the presence or absence of that particular search item in the solution. Okay, so now you have to talk about, you mentioned a couple of facts. You mentioned volume. So the volumetric efficiency of something like this is millions of times, I think that's the word you said,
Starting point is 00:22:39 more efficient or more dense, I guess, than common magnetic or electronic storage? Yeah, from a volumetric perspective. So last year, we wrote the English language contents of Wikipedia in DNA. So a demonstration of the technology, if you will. And the entire contents are in a tube that's about the size of one of the finger joints in your hand in terms of length and about the diameter of a pencil. So it's a very, very small tube containing all the English language content of Wikipedia encoded in DNA. Is that one image of one replica or is it multiple replicas of Wikipedia at that point?
Starting point is 00:23:30 That's just one. It's in solution. And we can make that smaller by desiccating it and rendering the DNA in solid form, putting it in a pellet or something like that. So, yeah, it's about a six-orders-of-magnitude difference, a million times more dense. Than like LTO8 or something like that? Yeah. Okay.
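The six-orders-of-magnitude figure is easy to sanity-check with rough arithmetic. The capacities and dimensions below are ballpark numbers for illustration, not measured values:

```python
# Rough sanity check of the ~10^6 volumetric density claim (ballpark numbers).
lto8_capacity_tb = 12.0                # native LTO-8 cartridge, ~12 TB
lto8_volume_cm3 = 10.2 * 10.5 * 2.2    # approximate cartridge dimensions, cm

lto8_density = lto8_capacity_tb / lto8_volume_cm3   # ~0.05 TB per cm^3
dna_density = lto8_density * 1_000_000              # claimed 10^6 advantage

print(f"LTO-8 cartridge: ~{lto8_density:.3f} TB/cm^3")
print(f"DNA (claimed):   ~{dna_density:,.0f} TB/cm^3")
# Roughly 50,000 TB (tens of exabytes) per cubic centimeter, which is
# consistent with all of English Wikipedia fitting in a tiny tube.
```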
Starting point is 00:23:52 So if you really wanted to, let's say, have a random access device that was read-writable and such, you would use some sort of a search molecule to find the data that you want to update and some sort of, I'll call it, DNA kind of computation to swap out the old data and insert the new. Is that how this would work? So when DNA molecules are read through a nanopore device, they're actually destroyed. Right. OK. So one of the things we do is we keep multiple copies. And one can simply boost the signal, if you will, of the
Starting point is 00:24:33 new piece of data in the existing copy. So you don't have to rewrite everything. You can just essentially write the missing molecules and add them to what you currently have. In fact, that's the way you would do addition with DNA as well. You would take two different test tubes, if you will, representing different values, and you could put them together. And with chemistry, you could create something completely new, which would represent the addition of tube A with tube B. So it's a disquieting kind of representation for someone who may have only experienced conventional digital kinds of media for storage and computation, because it seems like we're, oh, I don't know, bypassing a lot of the conventional ideas here. But that's the whole point. It's really quite transformational in terms of feature and function that we can exploit here.
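One way to picture the "boost the signal" update model is to treat the pool as a counter of molecules, where reads consume copies and updates only ever add. This is an editorial sketch of the idea under those assumptions, not Catalog's implementation:

```python
from collections import Counter

# Toy pool model: many copies of every molecule; reads are destructive and
# updates only ever add new molecules, nothing is rewritten in place.
pool = Counter({("pos", 0): 1_000_000, ("pos", 3): 1_000_000})  # bits 0, 3 set

def read(pool: Counter, molecule) -> bool:
    """Destructive read: sequencing consumes one copy of the molecule."""
    if pool[molecule] > 0:
        pool[molecule] -= 1
        return True
    return False

def set_bit(pool: Counter, position: int, copies: int = 1_000_000) -> None:
    """'Boost the signal': add copies for a new 1-bit; old data is untouched."""
    pool[("pos", position)] += copies

set_bit(pool, 7)               # bit 7 becomes 1 without rewriting bits 0 and 3
assert read(pool, ("pos", 7))  # True; roughly a million copies still remain
```

Mixing two tubes, the addition example, is the same append-only operation applied to whole pools at once.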
Starting point is 00:25:30 Yeah. Well, you know, at some point you have to talk about the speed of access and the speed of writing and that sort of stuff. But, you know, I've been associated with other storage technologies that have come and gone. And the challenge has always been, you know, if they're only a factor or maybe two ahead of where, you know, current technology is, by the time those things get into production-ready mode, the electronic media and magnetic media have already caught up. Six orders of magnitude, they may never catch up. They may never catch up, especially with some of the limitations that come out of the world of physics. You see that with Moore's law today as an example. No such thing as a zero nanometer technology, right?
Starting point is 00:26:17 But I think here we know we have a ways to go. However, we also know the ways to get there. So, for example, we write in the megabyte-per-second range now. Right. And reading is limited by the speeds of sequencing devices. So as a startup company, and you noted this earlier in your comment, we focused on where we could make innovation first. And it wasn't to try to displace the sequencing community. It was to try to really find something that was very innovative and very effective on the synthetic biology side of this, in writing DNA and encoding data into DNA. But we know how to dramatically increase the speed of writing DNA. And one of the things that's helped us, of course, is we built a machine to do this. I think a lot of what you see going on in academic institutions and other places is still fundamentally wet lab chemistry.
Starting point is 00:27:28 And so everything is sort of abstract and theoretical. We're sitting here actually measuring the speed of devices operating to produce this in an automated fashion, which is the only way you get to reasonable competitiveness. That's point one. Point two, we do understand sort of the inexorable pursuit of faster, better, cheaper. I spent decades in IBM working on a variety of projects, so I'm intimately familiar with the concept. So we're not targeting our endgame here to be where storage is today. We're looking 10 years out into the future, although we expect to be commercial substantially in advance of that. The reason you look 10 years out into the future is you want to test yourself and explore the possibility of other sorts of disruptive ideas that might come along.
Starting point is 00:28:17 So the hidden message in what I'm saying is we have a play that we're running today, but we are far from being dogmatic that this is the only way, or even that the best way isn't still to be discovered and implemented. But given where we are today, we know how to improve the cost of where we currently are by five orders of magnitude. And we know how to increase the speed of where we are by probably four to five orders of magnitude. So these are targets for us that we can get to in reasonably short order. And I think that when you think about markets, so let's step away from technology for a second, because you alluded to this in the predicate of your question when you talked about how do you displace existing technology that's moving ahead at the same time? And
Starting point is 00:29:08 why would customers jump to something when they're still getting sufficient bang for the buck from what they've always used? And I think that's the motivation for why we're exploring this idea of merging compute with storage in a single kind of environment. Storage today has been principally conceptualized as a very passive device. You put data in storage and you pull it out when you want to compute on it. The problem is every time you do that, you incur a cost and you incur latency in terms of getting insight from the data you've stored.
Starting point is 00:29:41 What we'd like to do is see if we can just merge these ideas together, and it doesn't have to be universal, it doesn't have to be for every kind of data or every kind of algorithm or compute, but some substantive set of combinations of those things, coalesced in a single environment, to take storage devices away from being passive and make them active. If I'm hearing you correctly, let's say that I'm a scientist working on drug discovery. Today, when I'm sequencing data, or if I'm getting DNA sequencing done, I need to inject that into my magnetic storage or my flash storage, et cetera, to work on it. So in theory, what you guys are trying to enable is putting the compute right next to that DNA-level data in its native format, having the
Starting point is 00:30:33 compute execute across that data set right where it sits. And I eliminate this problem of having to transfer petabytes of data and process that data when I need it. So I can create distributed algorithms, et cetera, to access the data where it's at, which is a big, big problem in pharma, getting the sequencing data from Germany into Chicago. If that stuff can stay local to Germany and the compute is right next to it, that's a powerful construct. Imagine the scale-out capabilities. If you could effectively create a compute molecule and a storage molecule that exist in the same test tube with the capabilities of DNA replication, that sort of thing, you could create a gazillion copies of this thing in, I don't know, minutes, right? Yeah.
Starting point is 00:31:28 And that's really the objective. So we're trying to get away from this model, this dichotomy of storage and compute being separate things. And as you said, really combine them in a single environment so that, for example, I create a data archive in DNA and I throw compute molecules into it and I put it in the closet and I go out to lunch. And with very low energy, very low everything, it's computing for me. And, you know, maybe I let that go on for a month. Maybe I let it go on for a year. You know, there are simulations that people do today in supercomputers that last for more than a year.
Starting point is 00:32:11 So I think there's a lot of imaginative thinking that needs to be brought to bear here to understand the domains of possibility where this can be applied. And the interesting thing from Catalog's perspective is we're actually in the midst of executing proofs of concept with a number of companies. Notice I didn't say universities, I didn't say laboratories, but companies who are trying to see the efficacy of this approach for the kinds of problems that they see forthcoming in the data world. And I will also say the following thing. These are non-trivial companies. These are Fortune 100 companies
Starting point is 00:32:51 who are really, really quite astute in terms of the evolution of storage technologies, etc. But they're intrigued by this because they see limits to the conventional approaches that this might overcome. Now, they don't, and I would agree with what I'm about to say, obviously, otherwise I wouldn't say it, but nobody is looking at this as a complete displacement, for example, of the tape industry. But it can be a very, very useful augmentation to the compute storage environment in general. And to that extent, it's quite synchronous with the way people are thinking about the evolution of computing today, which is classic von Neumann architectures, the invocation of AI, the invocation of quantum, all these different things are being looked at
Starting point is 00:33:37 to work in concert, which, you know, in a very crude way is nothing more than saying, get the right tool for the problem at hand. Yeah, but, you know, if you're able to do computational actions in the solution, well, Turing proved a long time ago that with a very limited set of functionality, you can pretty much create any universal program. So, I mean, once you get past a limited minimum number of functions, and it's not that difficult a barrier in my mind from a function set perspective, you can do anything in this thing. Well, that's correct. I mean, we know we can create Boolean operators, and by virtue of doing that, we can essentially do anything. But you also have to juxtapose that sort of theoretical notion with the behavior of clients in the marketplace. Our ambition is not to do research forever, but rather to get in the market
Starting point is 00:34:37 with commercially viable products. And to that end, we're not trying to solve all the problems up front. We're trying to be super pragmatic and pick out a handful of problems that we can tackle, focus on, and provide value for in short order, to demonstrate the utility of the path we're going down. And of course, once we do that, I think people will understand the transformational nature of this, and investment will occur and new ideas will emerge and a lot of these ideas will be pursued at speed. Yeah, I spent three years at AbbVie and I can tell you without a doubt, there's applicability today, easily. These are real, real-world challenges, especially when you get to natural uses of where DNA data is used today in research, and how transforming it from its existing format into a different format creates a tremendous amount of inefficiency. We can fix that if we get compute as close as possible to that original set of data. And it's possible because compute is going to become free at some point. And this is probably one of the first applications, or at least companies that I've talked to, looking at how you leverage compute closer. Compute will never be free, Keith. Come on.
Starting point is 00:36:03 There's always a cost at some level. Well, relatively, compute will be cheaper, yes. So there's this concept that, you know, A16Z talked about it, I'll send you the link to reference it, but at some point, obviously, Moore's law is not that we're going to double or triple compute capability, but we're going to continue to halve the cost of compute capability to the point where it's practically free. And that's the current roadmap of semiconductor technology today. You know, they're on this roadmap to halve the cost, maybe the curve is vertical, horizontal, you know, whatever. But they will halve the cost of computational capabilities over the course of time.
Starting point is 00:36:52 Yeah. And one of the things that I wish we had time to delve into is what happens when we cross the point where it's cheaper to throw CPUs at data than it is to move data. Right now, for some data sets, and this is, again, the problem that I ran into practically in pharma, it's cheaper to move the data to the compute than to move the supercomputer to the data. When is good enough, enough?
Starting point is 00:37:25 And where does that strike? And that's a whole other conversation. And this sort of technology turns that on its head, right? Because the compute, if you can do the compute with, I'll call it, DNA processes, and you can do the storage with DNA processes, which we've already proven, then there's no reason to move it out, right? To some extent. There might be some display technology requirements and stuff like that. The other question, of course, David, is how big is this device that reads and writes this thing? I mean, normal storage devices are on the order of a pack of cards these days and can hold terabytes of data.
Starting point is 00:38:04 Right. So that's the device. But what was used to create that device? What was used to put the data into that device? So for our machine right now, and there are pictures on our website, by the way, we're in Boston, think of an L-shaped object. It's about 14 feet long on one leg and about 12 feet long on the other, maybe about three feet wide. Because remember, it's incorporating hardware, software, and chemistry all in one device. We know how to shrink that to a desktop. That's not the issue, right? The issue is, what are the encoding schemes that need to be represented?
Starting point is 00:38:49 What are the computational problems that need to be pursued? Because those by themselves may dictate the nature of the way we evolve this machine over the course of time. So our proof of concepts are actually meant to be as informative as possible to what the next generation of the machine should look like. Is there a market for desktop things? Do we need to modify the encoding scheme to encapsulate a way in which we want to do computing
Starting point is 00:39:15 different than we would do it today? These are all things we're trying to learn in the existing proof of concepts we're running. So, I mean, the proof of concept would be, you know, a company comes to CatalogDNA and says, I've got this data problem. It's petabytes of data. I want to be able to store it, read it, and have it available for, you know, the cost
Starting point is 00:39:38 of near nothing and have it stay around forever. I mean, take the video archives for these movie theaters and the movie companies that produce these videos. The film is rotting. The digital storage requirements change every decade. That forces them to move from one technology to another. If they could put all this stuff on DNA once, they could have a million copies and never have to do it again. Is that what you're saying? Absolutely what I'm saying.
Starting point is 00:40:11 Wow. This sort of stuff is mind-blowing. I mean, I've even talked to reporters about DNA technology, DNA storage and stuff like that. And I said, you know, it's all great and it's all wonderful. It's got the miniaturization, it's got, you know, the longevity and that sort of thing. But the cost and time to read and write it were a significant problem. You mentioned megabytes per second. Is that what your device can do today? You can write megabytes a second into DNA? Yeah. So this will transform over the course of time. We will make adjustments to the mechanical attributes of the machine.
Starting point is 00:40:51 We will make adjustments to the chemistry. We'll make adjustments to the algorithms that are governing this. And all these levers and dials that I'm alluding to can be turned in ways to have an impact on speed, performance, quality, all these kinds of things. And I wouldn't say these are easy problems, but they're tractable. And we understand the nature of the way they need to be attacked to get to the kinds of improvements we're alluding to. So our expectation is that within relatively short order, let's say 18 months, something like that, we could be at a write cost comparable to tape. And we have to do innovation on the read side. So until that innovation comes to fruition, and we're thinking
Starting point is 00:41:46 about what that innovation should be, the technology we have is going to appeal mostly to those institutions that have write-once, read-rarely kinds of requirements. So the movie archive or something like that, as sort of a deep archive in case something goes wrong with the conventional media that you have. Think of it in a sense as sort of an analog to the seed repository that sits up in the Arctic Circle in Norway, where seeds are stored for the sake of preservation of the history of all these different plant examples. But it's not meant to be opened every day for somebody to look at it and use it and inspect it. We have to get the read side of this into rough equivalency with the write side. And then you start to see an expansion of the domains of opportunity. And then at the same time, that's just a conceptualization of using DNA as an
Starting point is 00:42:46 archive-like device. But then the kicker here is, let's figure out the compute side of this and begin to put in place the mechanisms by which we can compute on the data that we're archiving. And that gets us to this point of making storage active and not passive. And then you see a tremendous growth and opportunity in the marketplace. Yeah, it's exponential. I've written blog posts on technology that was talked about, oh gosh, over the course of the last couple of years, about using cellular mechanisms to do Boolean operations and things of that nature. So, I mean, it exists today, at least in research form, to do these sorts of things. I wanted to go back to something you mentioned earlier on. You're
Starting point is 00:43:37 not using normal DNA, per se, the A, G, T, C. I'm not even sure if those are the right letters. But you're using synthetic DNA. Can you explain what that is? Yeah, synthetic just means we're building the molecule using the architecture and principles of DNA. So it still uses the same bases, it still uses the same sugars and so on, and it invokes the same kind of machinery for replication, etc. But it's not derived from any sort of living animal. It's all built in a laboratory. And you can actually buy snippets of DNA in bulk quantities from commercial vendors today who supply the research community worldwide for people who are looking to do different kinds of things. None, as far as I know, or very
Starting point is 00:44:27 few, are looking to encode data in it; they buy it for other purposes. So we've simply leveraged that industry, just like we leveraged the sequencing industry, to give us the building blocks that we've used to support the innovation we're doing on the writing side. The DNA molecules we create are not biologically active. So there's no risk of taking something that we produce and sticking it in a cell and, I don't know, getting Godzilla or something like that. Wikipedia or something like that. Yeah, right. A walking, living encyclopedia. But you put gaps in the DNA, you put different stops in the DNA, you do a lot of different things so that it has no biological activity whatsoever. And in fact, if you try to render it to have biological activity, the cost of doing so would be such that you might as well
Starting point is 00:45:19 just start from scratch. So we've taken those kinds of precautions because we know there are people always trying to do crazy things. But we're pretty secure from that perspective. All right. Well, this has been great. Keith, any last questions for David before we close? No, my mind is pretty much fried. I'm guessing that most listeners are as well. I would say so. My mind's certainly fried. David, anything you'd like to say to our listening audience before we close? Yeah, check out our website. We're Catalog DNA in Boston. And you'll see pictures and videos of our machine. And you'll see white papers.
Starting point is 00:45:55 And you'll see pictures of the team. All right. Well, this has been great. Thank you very much, David, for being on our show today. You're welcome. And that's it for now. Bye, Keith. Bye, Ray. And bye, David. Goodbye. our show today. You're welcome. And that's it for now. Bye, Keith.
Starting point is 00:46:05 Bye, Ray. And bye, David. Goodbye. Until next time. Next time, we will talk to another system storage technology person. Any questions you want us to ask, please let us know. And if you enjoy our podcast, tell your friends about it. Please review us on Apple Podcasts, Google Play, and Spotify, as this will help get the word out.
