Storage Developer Conference - #184: The DNA Data Storage Rosetta Stone Initiative
Episode Date: March 14, 2023
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, episode number 184.
My name is Joel Christner. I'm with Dell. I'm joined by Alessia Morelli, CTO of DNA Algo, and Mark Wilcox.
We're going to talk about the Rosetta Stone Initiative.
It's a project that we are working on with a very good group of people to help bootstrap DNA data storage archives. I do want to put a plug in for the first white paper out of the
DNA Data Storage Alliance on Preserving Our Digital Legacy. There's a lot of graphics and data that
we've taken from that paper. There's a lot of really good information contained within it.
I would highly encourage you to take a look through it. As far as the agenda for the session today,
we have four items to cover. First, we're going to go through some of the
primary differences between DNA as a data storage media versus what we would consider to be
traditional media. We're going to jump into an overview of the Rosetta Stone Initiative and what we're doing, why we're doing
it, where we are, and where it's headed, and then a brief call to action on how to participate,
and then we'll close up with a summary. Sound good? All right. So, differences between DNA
and traditional media. Anybody here a storage veteran? Lots of time in the storage industry?
Any chemists in the room? All right, no chemists in the room. Great. So I like to
look at things through the lens of analogy when trying to dissect them. I've been in the storage industry for 25 years.
I'm a storage geek, not a chemist, just had a personal interest in DNA data storage and thought to myself, why not just jump into the deep end? I'm part of the research and
development office in the office of the CTO at Dell and was given the green light to go ahead
and participate. I wanted to get right into the deep end of the pool. And I like to live
in the world of analogy by trying to understand one thing through the lens of another. When I
think about DNA as a data storage media type, I naturally start thinking about other types of
storage media, whether that's NAND or if that's tape, and all of the things that have to happen in between that media type
and the system and the application that's trying to consume it.
So we have a simple block diagram of a typical SSD
where you have some kind of controller interface,
you have some pool of flash memory behind that,
and there's this series of abstractions that happen where the controller understands how data is placed onto the underlying memory in order to expose the system as a whole to the consumer of that resource so that it can be used.
Lots of different layers of abstraction have to happen. What we think of as LBA0 from a SCSI perspective
probably does not cleanly map to the flash memory that you see on the right.
Some series of abstractions have to happen
in order to make that useful and usable.
Similarly, once you expose that device using some series of standards to the operating system,
the operating system is going to put its own set of abstractions on top of that in order to turn
this logical block device into something consumable like a file system. So you have, again, another
set of abstractions that have to be layered on top of the abstractions that are on top of the
physical media. Now, the nice thing about SSDs is they have a built-in controller, right? It's great.
They expose metadata to the operating system. We can open up disk management on this laptop and
take a look and see who manufactured the SSD and look at its geometry. We can look at all kinds of other fun pieces of data that
nerds like me would really enjoy looking at, right? Some types of media, like tape, don't have
that same benefit, right? They don't have a built-in controller. You need a drive for it.
You have some metadata that is external to the media. You have some metadata in the broader system that is storing
all of these cartridges. What you do have is the barcode with the metadata, and you have your
proximity from the beginning or the end of the tape, and an understanding of the file system
that is laid out on that tape. So with DNA, you don't have those, which presents an interesting
challenge. We've got this, somebody said these little containers of goop. Who was that? Thank
you. I'm going to give you a nickel every time I use that because that is perfect, right? There's
these little things with goop and you have no idea what's in them. It's just some chemical and we have
to somehow put some kind of usable interface on top of that.
And that's really the core of what we're doing in the Rosetta Stone group.
Well, actually not quite that: it's making it possible to put a file system or some other type of interface on top of this goop substrate.
So a brief primer.
And I'm not trying to be redundant with what Dave already covered in his session,
but there are a few nuances that I do want to point out that are important as we get into the discussion about the Rosetta Stone project.
So the first is the fundamental unit of storage is an oligonucleotide, also called an oligo, which is a short strand of synthetic DNA or RNA.
You can think of the strands themselves as the backbone.
Those are, as Dave mentioned, typically a sugar phosphate.
The connections between them are the base compounds. That's the A, C, T, G: adenine,
cytosine, thymine, guanine. Those attach to the strand and to one another. So there's a mating
type of process that happens, and they have a natural affinity towards one another: A with T, G with C. And there's four of them, right?
If you go into RNA, there's another one, the letter U, but, you know, four is a nice
binary number, a two-to-the-power-of-two type of thing. So you can see a very clear path to go from
base two into a four-base type of system.
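To make that base-2 to base-4 correspondence concrete, here's a minimal sketch in Python of the simplest possible two-bits-per-base mapping. This is not any vendor's actual codec; real codecs layer on error correction, GC-content balancing, and homopolymer avoidance, all of which this deliberately ignores.

```python
# Minimal sketch: map every pair of bits to one of the four bases.
# Illustrative only; real DNA codecs are far more sophisticated.
BIT_PAIR_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BIT_PAIR = {b: p for p, b in BIT_PAIR_TO_BASE.items()}

def encode(data: bytes) -> str:
    """Two bits per base: base 2 in, base 4 out."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BIT_PAIR_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(strand: str) -> bytes:
    """Invert the mapping: base 4 in, base 2 out."""
    bits = "".join(BASE_TO_BIT_PAIR[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

assert decode(encode(b"SDC")) == b"SDC"
print(encode(b"SDC"))  # -> CCATCACACAAT
```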
So as Dave mentioned earlier,
this is the end-to-end progression that you go through for writing and then reading data
from a DNA archive.
The first thing that you have to do
is encode your digital data into the bases.
There's a whole lot of opportunity for innovation in the codec
space, to include things like error correction, detection, compression, etc. There's a lot of
work being done there to help mitigate some of the challenges that you run into when you have
a strand or series of strands with a higher concentration of some types of bases than others.
Once the data is encoded, the process of synthesis is what actually writes it, quote-unquote.
From there, you store it, and then when you need to retrieve it, you find the container that has the goop that you need.
You dissect that, you sequence it, and then you decode back from the bases to the binary bits.
And that's what we just talked through right there.
So the interesting thing about DNA, again, is it doesn't have this concept of addressable sectors.
It doesn't have relative position. If you think about just oligonucleotides floating in some substance in a sealed
container, you know, somebody could take that container and shake it up, right? So
the strand might not be in the same place, it might just be naturally moving
inside the container. So there's no concept of proximity. I can't ask a container, can you give me sector zero?
Because it has no concept of sector zero.
If you make big strands, could you make one that was a gigabyte long?
Potentially, yes. But then you would have to find that long strand, and that long strand could have moved, right? So that's a really good question.
So without this concept of a location, proximity to start or finish, or addressable sectors, we have a natural challenge.
I like to use the file system as the canonical example.
That's what we interact with the most, I think.
If we want to put a file system on this, we have a problem. So before we jump into an overview of the Rosetta Stone, there's one more concept to cover: the primer.
I'm going to give you a primer on a primer.
A primer is a short stretch of DNA.
It's targeting a specific sequence.
So remember, Dave mentioned that A naturally wants to bond with T and C with G.
These primers, I like to think of them as magnets, right? They naturally attach to strands that are
of interest to you. So if you're looking for a strand containing a certain piece of data
and you understand what part of that strand looks like, you can use a primer to attach
to that strand. You then go through a process called polymerase chain reaction or PCR,
and that is used to create one or many copies of that strand once you've used a primer to attach
to that strand. So when you go to extract data, I shouldn't say from an archive, but from DNA,
one way to do that is through the use of primers. So with these primers, you can attach to strands
that have the complement of that primer, and it might be a short section, as you can see here.
So you have what is called a front primer and a back primer, and it naturally matches to these bases here.
And once you have that, you capture the entirety there, and you now have access to everything that lies in between.
So that's basically the key.
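To illustrate the idea in code, here's a toy sketch of a front and back primer flanking a payload. The primer sequences are made up, and real PCR chemistry works on double-stranded DNA, where the back primer binds the reverse complement; this simplification only shows the "match the flanks, get what's in between" idea.

```python
# Toy model of primer-based retrieval: find the region flanked by a
# known front primer and back primer. All sequences here are invented.
def extract_payload(strand: str, front_primer: str, back_primer: str):
    """Return the bases between the two primer sites, or None if absent."""
    start = strand.find(front_primer)
    if start == -1:
        return None
    end = strand.find(back_primer, start + len(front_primer))
    if end == -1:
        return None
    return strand[start + len(front_primer):end]

oligo = "ACGTACGT" + "TTGGCCAA" + "CCATCACACAAT" + "GGTTAACC" + "TGCA"
print(extract_payload(oligo, "TTGGCCAA", "GGTTAACC"))  # -> CCATCACACAAT
```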
All right. So we'll jump into an overview of the Rosetta Stone project.
So the big problem is DNA does not share the properties found in other types of data storage media.
There's no built-in controller. There's no concept
of a logical address. We have no understanding of the proximity of a certain strand from the
beginning or the end of the archive. So it's unstructured. It is literally no different than
magnetic tape, except the regions of the tape could be floating around inside this container.
There are many different mechanisms for encoding data into DNA,
and we want to give vendors and academia and other industry constituents the freedom to innovate in how they write to DNA.
But what we do want to make sure of is that we look at this through the lens of a potential
hundred-year lifespan. How do we craft a common path that everyone can follow, to be able to
understand the construction of the archive and then be able to consume the balance of it?
So we don't want to preclude any vendor from potentially using some novelty in their codec
to provide some value in terms of how they write to the DNA.
But if we come up with a common format that provides a descriptor for that archive,
that descriptor can then contain the information about the codec that was used,
such that somebody wanting to consume the archive could then use that information,
retrieve the codec, and then consume the archive itself.
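As a thought experiment, a descriptor along those lines might carry fields like the ones below. To be clear, none of these field names or values come from the actual Rosetta Stone work, which is still being drafted; this is purely a hypothetical illustration of metadata that tells you how to read the rest.

```python
# Hypothetical archive descriptor: field names and sizes are invented,
# not taken from any Rosetta Stone draft or specification.
from dataclasses import dataclass

@dataclass
class ArchiveDescriptor:
    magic: str             # universal marker identifying a DNA archive
    codec_vendor_id: int   # who wrote the archive, via a codec registry
    codec_id: int          # which of that vendor's codecs was used
    codec_version: int     # codec revision, for 100-year forward reading
    ecc_scheme: int        # error correction applied to the payload
    spec_url: str          # where a future reader can retrieve the codec

descriptor = ArchiveDescriptor(
    magic="DNA0",
    codec_vendor_id=0x0042,
    codec_id=0x0001,
    codec_version=1,
    ecc_scheme=0x03,
    spec_url="https://example.org/codecs/0042-0001",  # placeholder URL
)
```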
So with that, I'm going to turn it over to Alessia.
And Alessia, if you'd like to take us through the Rosetta Stone project.
Okay.
So generally this translation is managed by a part of the operating system,
which is the file system in this case.
As Joel pointed out here, we do not have a controller with the DNA media.
And there is also no addressing by a real linear address or by location, and there is also no file system. So how can we communicate with the boot record of the DNA archive?
How do we boot, for example, an SSD? Here is a picture of a standard SSD. We have a controller,
we have all the channels where the NAND are attached, and we also have the EEPROM.
So the first thing the controller does is read information from the EEPROM in order
to understand the hardware configuration. What is stored in the EEPROM is the type of NAND
that is on each channel, how to address them,
for example the timings, the vendor ID,
and also the type of ECC that it needs to use
in order to access the data stored in the NAND.
Of course, the data in the EEPROM must be reliable, so, generally speaking,
it is protected by an error correction code. After that, the controller has all the
information needed to access the NAND: it knows the addressing, it knows the
kind of NAND, and it knows the timings. So it is really able to access the NAND.
What is stored in block zero of the NAND is basically the firmware.
So in order to boot an SSD, we basically have two reads:
one read from the EEPROM in order to understand how to access the hardware,
and then loading the firmware itself from block zero.
Of course, block zero must contain reliable information, because it is key:
it is all the metadata and firmware needed for the SSD to be able to boot.
The controller has the information about the ECC, so it can decode the information read from
block zero, and block zero needs to be a good block.
That's the reason why NAND vendors typically guarantee block zero to be accessible
and to be a good block.
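In pseudocode, the two-read boot flow Alessia describes looks something like this. The controller class below is a toy stand-in, not a real controller API; the point is only the ordering: EEPROM configuration first, then firmware from block zero.

```python
# Schematic of the two-read SSD boot sequence; names are illustrative.
from types import SimpleNamespace

class ToyController:
    def read_eeprom(self):
        # Read 1: hardware configuration (NAND type, timings, vendor ID,
        # which ECC scheme to use). This data is itself ECC-protected.
        return SimpleNamespace(nand_type="TLC", timings="mode-5", ecc_scheme=3)

    def read_nand(self, block, ecc, timings):
        # Read 2: block zero holds the firmware. Vendors guarantee block
        # zero is a good, accessible block, or the drive cannot boot.
        assert block == 0
        return b"firmware-image"

    def run(self, firmware):
        print(f"booting with {firmware!r}")

def boot_ssd(controller):
    config = controller.read_eeprom()
    firmware = controller.read_nand(block=0, ecc=config.ecc_scheme,
                                    timings=config.timings)
    controller.run(firmware)

boot_ssd(ToyController())
```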
What if we want to boot a machine from an SSD?
In this case, of course, the SSD boots itself first, and then the machine asks for sector zero,
which is the master boot record. The controller knows what sector zero is:
it understands that what the machine needs is the metadata, and it knows where that is stored.
It may not even be at a zero location, let me say,
but it must be reliable information that the controller knows,
because it is key for the machine to be able to boot.
But what happens here?
Here we do not have anything.
We just have a capsule with DNA stored inside.
So we do not have a controller.
And the metadata is mixed within the archive itself,
so we need a way to discriminate it
in order to be able to boot the archive.
Also, key information, such as the vendor ID or the error correction code used in the archive,
is mixed within the archive itself.
So it's quite strange, because we need to access the archive,
but we do not have the information needed to access it in a reliable way.
So basically, we need a way to be able to access the archive from a first pass, let me say, and be able to read the key information in order to be able to boot the archive itself.
And basically, this is what the project Rosetta Stone tries to do.
So the Rosetta Stone project is part of the DNA Data Storage Alliance,
as Dave mentioned before.
So it's a subgroup. And basically, in this picture, we have the whole archive.
There is the company-specific data, which is here in green, and then within the archive itself
we have the descriptor data, which is like the metadata: the information needed in order to boot the
archive. What are the goals of our project, the Rosetta Stone project?
We basically need a common identifier
in order to be able to universally bootstrap a DNA archive.
We need to be able to identify
where the key information about the codec is stored within the archive itself.
But at the same time, we only want to identify
where that key codec information is, not constrain its contents,
because we want to leave all companies the flexibility to use the codec that they want.
Yes, right. So we have to have some kind of encoding that makes it impossible for the user data
to be the same as the metadata.
Yeah.
And, of course, we also want to provide faster access to the metadata,
because it is the key data needed to be able to read the archive.
Of course, to define a universal way to find this information, we use some working assumptions.
We assume that a general document about the specification is accessible.
We assume that the archive boot is built with natural DNA,
so the standard bases A, C, G, T,
but the archive itself may also contain different, non-natural bases. Remember that we are talking about
a huge retention period, so this archive
may be accessed in 100 years; we need to
forecast what can happen in 100 years when someone wants to read the data.
We need a standard way
to let the reader of the archive identify where the key information about the error correction code is stored.
We assume that whoever wants to read the archive has some form of connectivity, internet connectivity. And we also assume that the
primary use of DNA is a write-once archive, so basically not data that is read many times. I'll leave the podium to Mark.
Don't worry, I don't have too many
more slides.
So the
project is like a
working group. There's various
different perspectives and
technologies between the digital side
and the molecular side.
So the working group is really to get together and provide those different perspectives.
It's been quite productive already today.
So there's sort of like we're identifying all these different, I would say, like parameters
where in order to build a model and understand,
hey, what's the, you know, for example,
just the example there of, hey,
mixing of the metadata versus the user data,
what are the statistical error rates and things,
we need to be able to actually not just have a general notion of them,
but to be able to say, hey, this is the exact length,
this is how it corresponds to a certain error rate, and so forth.
So we're sort of stacking up these decisions to make,
and I expect that this list is going to get a bit longer.
But at the moment, we're looking at what we need to agree upon:
what's the minimum length of an oligonucleotide,
and we're talking specifically about this sector zero,
this first sector,
needed in order to read the archive; once you get to the codec, that opens up all kinds
of different opportunities.
And then closely related to that is that error recovery metric and the mechanism for that.
Another active discussion is how many sectors we may want to standardize, so there
may be other, you know, sector one or sector two kind of areas that could get standardized
maybe in an open source codec.
And the progress to date essentially is that we've already started seeing several proposals.
So there's proposals being drafted and discussed. They're covering the sector zero
implementation, how those identifiers are constructed,
and then even looking into what kind of contents
we expect different vendors may want to include
in that payload, and the various meanings of them. So there's a lot of active discussion at
the moment and it's still quite early in the working group but we're kind of
working towards hey what exactly are the you know error rates and
recoverability processes that we need to be able to deal with and we're kind of
looking at well what's the roadmap for this whole initiative,
really? So we expect that, you know, the proposals that are there to date are only just the start,
and that we'll get more, we'll see new versions of those existing proposals, and we'll see,
you know, more vendors joining the working group. But there's essentially, you know, a key
deliverable that we're kind of working towards,
which is a standard document that we will sign off on and agree upon,
plus some sort of policy and procedural documentation around how that standard may be expanded
or adjusted and updated and tested over time.
And a key discussion in the group is basically having a registry of codecs.
Cool.
And so if we step back, you know, what's this Rosetta Stone?
If we get a sector zero and we can manage to boot an archive,
what does that enable?
Well, it's really like the main blocker.
It's the main element that we need to actually
go off and start creating drives where you could have some integration and have some
protocol because you can decouple the drive from all of the software and drivers and whatever
we don't even know at the moment, like how these things will look.
So for controllers, drivers, the whole ecosystem,
basically agreeing on a decoding standard enables vendors to start working on consumers
of that first sector, the zero sector.
Interesting comment earlier was,
how do you understand or reason about
the state of the system?
So I expect that that will quickly come up as the next sort of aspect,
which is how do you check a smart status for some strands that are floating around?
And so I would expect that once we get the identifier,
the way that we encode the first sector
and we can start registering metadata,
then we can go, hey, how can we expand the error model,
for example, in order to reason about things like that.
An extra sort of aspect to this
is that, because of the way we're looking at how to interpret the drive,
it's based on the allocation of identifiers.
It's sort of like IPv6 addresses, if you think about that, less so IPv4, because
the IPv4 address space is so short.
But there may be, you know, the alliance may function similar to ICANN in the way that it would help manage the risk of conflicts
and the distance between the vendors' codec IDs.
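One plausible way to reason about that "distance" is a minimum Hamming distance between registered identifier sequences, so that a few misread bases can't turn one vendor's ID into another's. This is my own illustration of the concept, not anything the working group has specified.

```python
# Illustrative only: spacing codec identifiers by Hamming distance so
# that misread bases are unlikely to produce a valid but wrong ID.
def hamming(a: str, b: str) -> int:
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

def can_allocate(candidate: str, allocated: list, min_distance: int) -> bool:
    """Reject identifiers that sit too close to an existing allocation."""
    return all(hamming(candidate, seq) >= min_distance for seq in allocated)

registry = ["AACCGGTT", "TTGGCCAA"]
print(can_allocate("AACCGGTA", registry, 3))  # False: only 1 base away
print(can_allocate("ACGTACGT", registry, 3))  # True: far from both
```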
The other thing is that this technology is just so early, like we don't even know what
a drive would look like or how it would function yet.
There's various different experiments and papers and everything coming through, but
the technology is going to evolve quite a lot,
especially the next five years.
And so I think these core concepts of the codex
and the address space and dealing with the medium
from a specification perspective,
that will fall into the roadmap,
which is the other working group as part of the
alliance. Do you have a question? So the PCR, the simple way to think about it is you do like a test to go,
hey, what format is this drive in? And so there's a false positive issue:
hey, I might get a false positive saying that this is actually in a different format
than I would expect.
So this is just the sector zero, yeah.
Yeah, yeah.
Yeah, but because it's part of that same archive, it creates this.
So in order for us to reason about it, we kind of need to go,
what's the appropriate distance and things?
Because it's not a singular sector in a key value store.
It has a relationship with other strands.
So there's lots of working assumptions on the current technology.
So we'll go through and build an error model and so forth,
but that technology might get better and therefore expand the address space, for example, over
time.
I think we don't have an idea of the likely number of codecs and how quickly that
may evolve.
But, having worked on computational storage briefly a few years ago, I think that the DNA medium is quite fascinating.
So there's lots of aspects,
especially that come out of standardizing codecs,
and then having an archive that you know
you're trying to send queries to
and determine the state of.
I think there may be a lot of nice alignment
with the computational storage standard,
which is in a V1 now, which is great.
Yeah, so that last point,
just especially that there are going to be
a lot of novel mechanisms as a result of DNA.
It's not just for storage,
but if you look at, for example,
like search indexing,
that has lots of capabilities
that would traditionally be abstracted on top of SCSI,
rather than pushing those queries down.
And so how we incorporate that into the storage model,
I think is going to be quite fascinating.
All right.
Back to Joel.
Thank you.
Thank you, sir.
Ready for another dad joke?
All right, so how to participate.
Obviously, standards only succeed when you have a lot of information coming from a lot of different places,
from a lot of different constituents.
So we would love to have everyone join in and help.
Lots of different ways to get involved.
First, go to the DNA Data Storage Alliance website,
sign up for the newsletter.
There's a lot of really good material out there
on DNA data storage
and the underlying fundamental principles
that enable the technology.
Follow us on Twitter, LinkedIn.
Our goal really is to, you know, grow the
working group to the right size. We've got a really good base of representation right now,
but obviously we could use more because with more, we get more opinions and more input and more
expertise. So we want to make sure we have adequate coverage across, you know, public, private, academia, you name it.
So to summarize, so there's a lot of promise
with DNA data storage.
As Dave mentioned, it's extremely dense.
It doesn't have the same power constraints
or cooling constraints that traditional media has.
Lasts a really long time out in the wild. Dave's woolly
mammoth example. So now I have to give two people a nickel every time I use the analogy. So the
goop and the woolly mammoth, those are two really good ones. And you couple that with what's
happening in the world. Dave hit the nail on the head with the infographic that he showed earlier.
People are hanging on to data not just because they need to. Of course, there's a subset of the data that they need to
hold on to, but there's also just the unknown. And I think a lot of us get paralyzed by fear.
And if you are a storage admin, whether it's at a public or private company or a hyperscale or
whatever the case may be, there might be gold that's
unrealized in the data that you're about to toss. And so what are they doing? They're hoarding,
right? Pretty soon, 10 years from now, there's going to be a popular show on Netflix called
Data Hoarders. And we're going to have to go live through the adventure of a random IT storage
admin who's going through these exabytes of data. And he's like, do I keep it? Do I put it on DNA? What do I do?
That's happening.
That's happening now because nobody knows what is going to be able to be extracted from that data in 5, 10, 20 years' time.
So they're hanging on to it.
There's a multi-stage process to go from data through DNA and back to data.
So the process involving encoding, synthesizing,
physical storage, retrieval, sequencing, and then decoding. There's a really good overview of that
in the white paper that's linked at the beginning of the presentation. Would really encourage you to
take a look through it. It was very informative to me. Most of the people that I've shared it with have agreed that it was extremely informative to them too. DNA as a storage media does not share
many of the properties that we enjoy with traditional media. There's no built-in controller.
There's no built-in metadata. There's no proximity or addressability regarding a position of a
certain piece of information versus the head of the
archive or the tail of the archive, which presents a unique problem because we need to be able to
figure out how to read an archive without knowing how to read an archive. So you can see some of the
challenges that we're going through as Alessia and Mark talk to those. So that's really the goal
of the Rosetta Stone. We want to be that box that's
in between a person trying to read an archive and, on the other side of that box, their actual ability
to do something useful and meaningful with that archive. So with that, thank you for your
time. I'd love to, you know, answer any questions that you have. I've got two smart people here with me.
And, you know, if I get in over my head.
Yes, sir.
Yeah, that's a great question. I think that the embodiment that these archives will take on is yet to be determined. Are you familiar with Simon Wardley's mapping? It essentially takes any technology stack, breaks it into its constituent
components, and then lays those components out into one of four areas: either
genesis, custom, product, or commodity. And essentially what this allows you to do is understand
where each of those components is in terms of the general availability and consistency of that particular part of the stack.
And in the case of DNA, the vast majority of it is over in genesis and custom. So we can't say
with any degree of certainty whether an archive is going to be this vial full of the goop, or if it's
going to be a thousand of what look like little nine-millimeter casings
that are in a tray, and that tray gets slotted into a slot in a catalog that has a robotic
mechanism, like tape.
That's my question.
Is there an external controller like a tape library that we have today, or is the metadata cross-referencing everybody
between the images that we have now?
Yeah, I think it's an excellent question
because I think that the state of the art
is going to be continually evolving.
I think that, as the ecosystems mature for sequencing and synthesizing, we're likely to start seeing the technology get faster and smaller.
And there was a point made earlier about amortizing the cost of those fixed elements
over the amount of data that you're storing and the cost per gigabyte of the media, I think as those two domains improve from a cost performance perspective and also a form factor perspective,
it's likely that we'll shift from a traditional central synthesizer sequencer type of architecture with a bunch of trays of containers into something where we might, in the future, and I don't
have any inside information on, this is just me brainstorming, but could potentially see
that actually embedded in something that is pluggable.
That might be in the future of DNA.
That would be like sharding a database, right, where you're just saying, okay, I can parallelize down to hundreds of thousands of potential places.
I think, well, with this key, it's got to be in this one.
Yeah, that's a great point.
So, you know, a couple things on that.
The first is I think that whatever container, and by the way,
did you want to jump in on any of the previous?
You guys are good?
Okay.
Sorry.
You know, I tend to just steamroll.
I think there's two elements to that.
First is whatever the container form factor is, we're likely to have a sector zero in every one of those containers.
I think where there's going to be some incredible innovation, I think Mark touched on it a little bit ago,
is what levels of intelligence you embed in the abstractions above the top of those.
I think the, if I think about a file system today, this is, in my opinion, one of the more powerful analogies that have come to mind.
You have linked lists of inodes, right? But if you think about the construction of an oligonucleotide
having a sequence of bases, if you understand that sequence of bases, rather than traversing a list,
you might be able to do something like build a graph. And you might be able to do that not only
within the container, but across the top of those containers and that
logical representation that sits on top of those. So I think it's going to be really exciting to see
it mature and evolve. So the assumption today is that
sector 0 will be written in when the archive is written.
So one of the underlying assumptions there is that it's a write-once type of archive.
So when you have some number of gigabytes or terabytes that you want to write and you are synthesizing and then storing into the container,
you are also synthesizing Sector 0 into that same container with it.
Are you including regular but updatable?
Are you including no append?
So there's the assumptions that the working group is making,
and then there's, you know,
there's space that we want to leave
for innovation by the vendors
that actually produce the systems
that would adhere to the spec.
The operating assumption is that
there does need to be some extensibility,
whether that is write, replace, or update.
So there's consideration for that.
The ability to, you know, non-destructively read is quite
challenging. Dave alluded to some of the mechanisms for reading, sequencing by synthesis being one
of those. There's a lot of things that have yet to be completely thought through.
Do you want to add anything to that?
Oh, yeah, they are.
That's a faux pas on my part.
I got them uploaded last night.
Sorry about that.
Yeah, there were a couple slides that we had some discussions on internally
that it was going to be this version or a slightly different version.
Sorry about that.
100%.
Yeah.
And if they're not,
I can give you a copy here.
So thanks everyone.
Appreciate it.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list by sending an email to
developers-subscribe at snia.org. Here you can ask questions and discuss this topic further with
your peers in the storage developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.