Storage Developer Conference - #82: Eusocial Storage Devices
Episode Date: December 10, 2018...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org slash podcasts. You are listening to SDC Podcast Episode 82. Let's go ahead and get started. I
got a bunch of slides here. I'm going to try to move through them, but I don't know that we're going to get through them all.
This is the first time I've done this sort of presentation with respect to UC Santa Cruz. This is a new sort of thing that happened over the summer: eusocial storage is now an incubator project inside the University of California, Santa Cruz.
And specifically, it's inside an organization called CROSS,
which is the Center for Research on Open Source Software.
This is an organization that was essentially founded by Sage Weil.
Sage is the author of Ceph.
He did his PhD thesis at Santa Cruz on Ceph
and consistent hashing and all these great things
that we see in scale-out.
And he took Ceph as a graduate research project.
He turned it into an open-source project,
eventually incubated it in a startup,
and then eventually got it sold to Red Hat.
And he wanted to see that success replicated again and again through the university system.
So he's actually the founder of CROSS, and the idea is to replicate that.
So we have research that's going on, education.
We have projects that are part of the research work.
And eventually those projects turn into incubations that hopefully create successful open source projects.
From there, it's up to the students and the people involved to take it beyond that. But it currently is funded by industry.
We have an industry advisory board that pays annual dues and essentially funds all the research and incubation for open source. And there are a number of projects inside this organization today.
It definitely has an open source focus.
Everything we do is open source.
So all the things that we're talking about today in eusocial storage will all be open source based. And currently we have, I forget the exact number, but Toshiba, Micron, Samsung, Seagate, Western Digital, and Huawei are all members funding this organization right now.
And I didn't introduce myself.
I am Philip Kufeldt.
I've been in the storage industry for about 25 years, working, starting at Veritas and working through a number of storage companies,
some my own and some big companies.
My last two gigs were at Toshiba and Huawei.
At Huawei, I was the director of storage standards, and at Toshiba, I created the KV Drive,
which is kind of an example of what we're going to be talking about today,
which is an autonomous storage device.
But as I was working with both of these companies, I had an opportunity to work with CROSS as well, and I saw that CROSS was this neat intersection between industry and research.
And my feeling is that moving forward with offloaded intelligent storage
devices is something that's going to take time. The industry is not going to say tomorrow, yes, this is what we're going to do. We're going to have to stage it in a strategic fashion over a number of years, combined with research that establishes the proof points of why it's a good idea. So we need the research.
We need the industry to provide comments and commentary
to guide and mold it in the direction that it needs to go in.
So I thought CROSS was a great place to put it.
So this summer, I took eusocial storage and put it in CROSS
and actually am now a member of the staff at UC Santa Cruz.
So, eusocial storage.
At CROSS, we had our first presentation at FAST '18. We published an article in USENIX ;login:.
If you want to go there, everything I'm talking about today is in that article.
I'm kind of redoing that article as part of this presentation.
And there's information about CROSS there as well. We also have a symposium coming up at the beginning of October that is a very interesting showing of all the work that's going on,
not only in CROSS, but also bringing in industry and other research organizations
to talk about interesting storage topics.
There will be a eusocial storage topic there, and we're going to focus on the strategic goal of eusocial storage,
which is in-storage compute,
and we're going to talk about the tradeoffs between the different models of in-storage compute.
So I encourage you to come out and take a look at it.
Okay, so half of this presentation is going to be some of the motivations
of why we think offloading is the right direction.
The industry right now is going in two different directions.
We have a lot of people looking at moving intelligence off of the storage devices and into the host, allowing the host to do closer management of the underlying media itself. You see that with Open-Channel and some of the other things going on in the industry.
At the same time, we're also looking at composability and disaggregation.
And those are going to be moving the data farther away,
which means doing data management across the fabric is going to get more expensive.
So we're going to look at the offloading argument and talk about why we think it's a good direction to go in.
And then we're going to talk about the sort of framework, the strategic framework, of what eusocial storage is.
So first of all, I've been using the term eusocial. For those who don't know what that means, eusociality is a social strategy used by insects and other organisms where you have highly specialized groups of individuals that perform specific tasks. They're focused, and that's what they do. You can think of this like ants or termites, those kinds of things. So we thought it was a good metaphor for eusocial storage. Can we devise a system where you have these autonomous, intelligent devices, each doing something specific to the media that it is, because every media is not created equal, but combining together to make a unified whole that provides an amazing amount of services?
And you'll hear me start using the term casts.
This is a term that we've decided to adopt from the eusocial aspect in animals.
This is a grouping of like-minded organisms.
We'll talk more about that in a minute.
So why eusocial storage?
I'm going to talk in depth about three different trends: public and private cloud; server offload, specifically disaggregation; and how the ways of the past
may not fit into the future.
All right.
So public cloud.
This is an area I don't hear a lot of people talking about as a reason for going in a particular technological direction.
Right now we have a
lot of public cloud storage offerings that are out there. There's a handful of big guys out there
providing cloud, and we always talk about how we're migrating our data and resources into these
public clouds. And yet, if you look at some of the marketing statistics that are out there,
even as of late, there was some 451 Research data that shows
that the migration isn't necessarily all one direction. We are not always taking our stuff
and moving it into the public cloud. We're actually doing both. We're migrating back to
the private clouds as well. And there's various reasons for that, you know, from security to
control to cost. There's a bunch of reasons why private cloud still makes sense today.
It's still a management problem, but it still exists.
And I think the big data centers took notice of that.
And you can see that because, for example, a year and a half or a couple of years ago, Azure Stack became available, which is a private cloud version of Azure, right?
So I can now deploy Azure Stack in my on-prem
or off-prem private area,
and then also be utilizing public cloud services as well,
and it provides for an easy migration.
So that's the strategy from the big cloud guys,
is that public or private, let's lock them into what we're doing.
It makes it easier to migrate back and forth.
Google partnered with Nutanix about 18 months ago.
There's even rumors now that maybe they're coming up with their own private cloud offering outside of Nutanix.
It just shows that the industry has taken note that there's not just one market.
It's not just public cloud and everybody's moving there.
There is private cloud as well.
And I think, you know, from my standpoint, one of the worst possible outcomes is that we end up five or ten years from now with five big cloud companies.
And we all do our business with those five cloud companies.
It's bad for the industry. It's bad for vendors. Vendors don't have the ability to get R&D to move the market forward, because they're all essentially servicing these five big customers instead of a wide customer base. But what's missing in the private cloud world, I think, is still an easy-to-use, easy-to-deploy, scalable, scale-out system
that allows you to deploy this in the private cloud without a lot of expense.
The management, the complexity, all of those things are some of the reasons why we move to public cloud. So I think that's one of the things that's still missing today: a good, easy-to-use system. There's lots of stuff out there. Don't get me wrong. But something that's easy to use and provides all the feature set that you get from a public cloud is another story completely.
Server offload.
This was a talk given by Fritz Kruger and Allen Samuels of Western Digital about two years back, and I think it's still relevant today. They were talking about how CPU bandwidth, actually not CPU bandwidth but rather RAM bandwidth, is becoming the bottleneck of tomorrow. Let me back up. The bandwidth of NAND, for example, is skyrocketing right now, and that's what this chart is showing. They basically went back historically and looked at storage bandwidth over time, then looked at network bandwidth over time, and also looked at DDR bandwidth over time.
And the rate at which the network and the storage bandwidth is increasing right now is far outstripping the DDR bandwidth at this point. It's not yet above it. That's why I can still put five or ten SSDs per proc into a system. But as these things get continually faster and faster, the number of SSDs that I can attach on a per-processor basis is going to get smaller and smaller. And when I say that, I mean that you're now hiding bandwidth.
I can add 30 SSDs to a system, but I'm not going to be able to attain
all of the bandwidth that those SSDs provide because they're going to be behind that
bottleneck of the DRAM because I can't get the data in and out
quick enough.
So that means that
essentially
the DMAs right now
are doing two different things.
When we're talking about I.O.,
you're talking about two different kinds of I.O.
There's the I.O. that clients are really trying to do.
I'm trying to read and write my data.
But then there's usually
management of that data
that's going on underneath the covers, right? Behind the file system, behind the volume management,
behind the scale-out for replication. All these things are data management that's going on, which means I'm going to be bringing data in and out of the server, consuming that DMA bandwidth, even though it's not for my primary purpose, which is use by the client application.
I'm going to get a little deeper into that picture in just a second.
Furthermore, just to show you, this was a talk given by Jae Do of Microsoft.
He's in the Microsoft Research Organization, and he basically provided a nice graphic showing
that server mismatch. The fact that I can layer up flash dies behind a flash controller attached
to PCIe that is then attached to the root complex that then has to talk to this CPU,
and I can easily stack up enough dies to give myself a 66x sort of ratio between the available bandwidth in the NAND dies
compared to what I can do at the CPU and in the memory.
And again, the real work is north and south.
It's moving data in and out.
You're going to hear me talk about north and south, east and west a lot today.
North and south is moving the data in and out for the client application.
But then there is data management,
and there's a north-south component to that
for doing translations, compaction, deduplication,
all of those kind of things,
as well as east and west traffic.
And that's where I'm going to be doing redundancy,
recovering, rebalancing, tiering, caching.
All of those things are moving data
from one device to the next.
So those are east and west type of transactions.
So let's look at an example of the north and south.
So here in this picture, I've got essentially a server that's running RocksDB.
I'm doing a key value store.
So that means my clients above the server want to get key value services,
and they want to be able to put and get values.
Unfortunately, for RocksDB to be implemented successfully,
it goes through a number of translation layers.
For example, it uses a file system to place its SST files.
That means I have to translate from a key value to which SST file it lives in.
So that's one of the jobs of the key value store.
But then when I get into the file system,
the file system has to translate between that file and an LBA.
And then the block layer passes it down to the device, and the LBA gets translated again to a physical address by the FTL.
So there's lots of translations that are going on in this stack.
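To make that translation stack concrete, here is a minimal, purely illustrative Python sketch; every mapping layer is simulated with a plain dict, and none of the names correspond to a real RocksDB, file system, or FTL interface.

    # All layer names and mappings below are hypothetical stand-ins.
    sst_index = {}      # key -> (sst_file, offset)      (key-value layer)
    file_extents = {}   # (sst_file, offset) -> LBA      (file system layer)
    ftl = {}            # LBA -> physical NAND page      (FTL inside the SSD)
    nand = {}           # physical page -> bytes         (the media itself)

    next_lba = 0
    next_page = 0

    def kv_put(key, value):
        global next_lba, next_page
        # 1. Key-value layer: decide which SST file (and offset) holds the key.
        placement = ("sst_0001", len(sst_index))
        sst_index[key] = placement
        # 2. File system layer: map (file, offset) to a logical block address.
        file_extents[placement] = next_lba
        # 3. FTL: map the LBA to a physical NAND page.
        ftl[next_lba] = next_page
        # 4. The data finally lands on the media.
        nand[next_page] = value
        next_lba += 1
        next_page += 1

    def kv_get(key):
        # The same chain of translations happens again on the read path.
        placement = sst_index[key]
        lba = file_extents[placement]
        return nand[ftl[lba]]

    kv_put("movie:1234", b"...bytes...")
    print(kv_get("movie:1234"))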
In addition, there's extra I.O. for metadata and things of that nature going on as part of that data path. And because of the characteristics of a key value store, where you keep things in sorted order, there's garbage collection and sorting that has to go on, and that causes files to be read in and written out, back and forth, in addition to the key value operations
that the client is providing.
So, for example, RocksDB is a LevelDB type of implementation, which means that I have different levels in which my data is stored, and that gives me a way of efficiently sorting data
and that gives me a way of efficiently sorting data
as it comes into the system
and not re-sorting data that I don't need to.
And as data fills up a layer, then I have to move it up to another layer,
which causes garbage collection and other things of that nature.
That's when you start to see this reading in of data and writing out of data,
back and forth, not for the actual client,
but to do the data management of the key value store.
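As a rough, back-of-the-envelope illustration of that effect, the numbers below are assumptions chosen only to show how compaction multiplies the traffic crossing the DMA; they are not measurements of any real system.

    # Illustrative arithmetic only; the figures are assumed, not measured.
    client_writes_gb = 100           # what the application actually wrote
    levels = 5                       # levels in a leveled LSM tree
    rewrites_per_level = 1.0         # each byte read and rewritten ~once per level

    # Every compaction pass pulls data back into host memory and writes it
    # out again, so both directions of the DMA are consumed.
    compaction_read_gb = client_writes_gb * levels * rewrites_per_level
    compaction_write_gb = client_writes_gb * levels * rewrites_per_level

    total_dma_traffic_gb = client_writes_gb + compaction_read_gb + compaction_write_gb
    print(f"Client writes:         {client_writes_gb} GB")
    print(f"Compaction traffic:    {compaction_read_gb + compaction_write_gb} GB")
    print(f"Total through the DMA: {total_dma_traffic_gb} GB")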
But it also does things like scrubbing.
Is my data correct?
Periodically, it's going out and looking at the data
to see if it checks some matches.
Reading in the data, writing that data back out.
So that's consuming this DMA bandwidth,
which we've already talked about is a bottleneck.
I'm consuming it not for work on behalf of the client, but for storage and data management work. With scale-out, the situation gets a little worse, because you've got to do the same things in scale-out,
but now I'm doing it not only north and south,
moving data between the device and the host,
but I'm also moving data between the host
and other hosts and other media.
So I not only move data in and out, north and south,
but now east and west.
I move data from clients to media
to other scale-out servers for redundancy.
I also do it for recovery
when I'm recovering erasure-coded data
and things of that nature.
When it's time to rebalance,
when you add a new system with a bunch of storage in it
into a scale-out environment,
you want to then redistribute the data
across to all the available disks
so that you're taking advantage of the capacity and access.
So you have rebalancing that goes on.
And then, typically, a lot of the scale-out systems are providing a single quality of service. In other words, I have a scale-out for my disks. I have a scale-out for my SSDs. I have different scale-outs for different
purposes. And then I have to migrate data when it's time to migrate it for information lifecycle
management type of things, right? I put it on the SSD scale-out to start with because that's when
everybody's hitting it and it needs to be fast. But over time, I'm going to age it and put it
over somewhere else. Again, consuming that bandwidth of the DMA to get it into the host
and then back out to some other server and to some other scale out.
So scale out is a strong candidate for offloading,
as are the key value stores.
And here's just a picture of what I was talking about: all the different internal copies and things like that that are going on between the different systems and across the network. And again, if you start thinking about disaggregation, which we're going to talk more about in the next section, these are network transactions as well. Every one of these north and south is a network. Every one of the east and west is a network.
Here's an example of what it might look like if you offloaded.
First of all, with offloading, you can get rid of the translation layers,
which I think is a key important piece.
You don't need file systems and blocks.
Since my client is talking key value,
it's talking about objects, then I can just shuttle it down to the device itself as an object,
and there doesn't need to be any translation on the host at all. And because the key value database is running next to the media itself, I can get better efficiencies between the key value database and the underlying media technology. I can tailor them specifically for that particular media.
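A minimal sketch of what that offloaded path could look like from the host's point of view, assuming a hypothetical wire format and a device that accepts key-value operations over a plain TCP port; nothing here is a defined protocol, it only shows the shape of the interaction.

    import json
    import socket

    def device_put(device_addr, key, value):
        """Send a key-value put straight to an autonomous storage device.
        The framing and message format are invented for illustration."""
        msg = json.dumps({"op": "PUT", "key": key, "value": value}).encode()
        with socket.create_connection(device_addr) as s:
            s.sendall(len(msg).to_bytes(4, "big") + msg)
            # The device decides how the object maps onto its media; nothing
            # on the host translates keys to files, LBAs, or pages.
            # (Simplified framing: a real client would loop on recv.)
            resp_len = int.from_bytes(s.recv(4), "big")
            return json.loads(s.recv(resp_len))

    # Example (hypothetical device address):
    # device_put(("192.0.2.10", 9000), "movie:1234", "...")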
All right. Okay, so I've got some history lessons here, and I don't think most of you really even care about that.
You know, but just suffice it to say
that disaggregation has been with us
since I started my computer science career
back in the 80s, right?
We were doing NFS way back when,
and the obvious reasons for that
were, you know, to have that consolidated in,
to get rid of some of the physical limitations
of a server plus disk
is to prevent me from having data copies everywhere.
There's limited pathways.
A server can only have so many disks
directly attached to it
before you're either running out of bandwidth
or you're running out of hardware.
And availability.
I can make the disks available,
but I can't make them available with regard to the server
unless I pair up another server and move things around.
But by disaggregating, I can focus everything into a single location,
and I focus my management, I focus my availability problems into one place,
and I have access to it from anywhere.
So if my clients die or fail or do whatever, it's all right.
I just put another box in.
I can still get at my storage.
So those were some of the motivations for why we went to NAS. Scale-out was the same thing. It was disaggregated as well. The idea was, again, I have thousands and thousands of client applications running in the cloud.
I don't want to tie them to a physical box. That would be a mistake, because that means my storage has to be ephemeral, meaning I only use it for that session and then throw it away. If I have needs for persistent data, I have to separate it from the host so that my client application can move from cloud server to cloud server.
So that's why scale-out systems were created: to essentially meet the demands
of those thousands and thousands of client applications
and give me an underlying technology
that allows me to continue to scale out capacity and access
over time by adding more and more nodes.
But I still end up with this fan-in, fan-out architecture
where I have servers talking through some sort of a head to the media.
So I fan in to the guy that's managing the media,
and then back out to the actual media itself.
Scale-out's no different.
So if I have a Ceph environment where I've got a server with 30 drives attached to it,
all of my countless thousands of clients are all fanning into that OSD,
and then he basically fans it out to the drives.
So that's a limitation,
and it's a bad limitation
because it forces you to make hardwired decisions
at implementation time.
You have to figure out what quality of service
you want to provide.
I'm going to provide HDD-level quality of service,
but for cost reasons, I'm going to trap some of that hard drive throughput
behind a server to reduce the cost,
meaning instead of having 10 drives or 15 drives
where I can definitely access all of that throughput through my 10GbE link, I'm going to put 20, 30, 60 drives behind it with the idea that it's okay that I'm trapping some of
that throughput behind that NAS server, but my cost point is lower, right? So we make these
decisions, and no matter what happens in the evolution of our cloud, if those assumptions don't hold true after I make my purchase decision, I can't change it. I'm stuck. My hardware is fixed in the way that I've chosen to architect it. So it's a hard limitation that offloading and disaggregation go toward fixing, meaning that I should have a finer-grained unit of allocation
that allows me to determine how I organize my system.
So here's how scale-out disaggregates.
And I just talked about some of the problems.
Basically, there's a single software component that manages the individual media,
and I stack up these pairs on a single host through a single or multiple Ethernet connection.
So each OSD is like its own little scale-out device, right,
that can either run on this system or wherever the storage is directly attached.
And it's the direct attach that means that I have to run on this server.
If the drive wasn't directly attached, I could run it anywhere.
Kinetic was another disaggregation mechanism.
This was a way that the industry was starting to move in this disaggregation process.
It was a key value API.
It used TCP/IP as the fabric between the two, and had the basic get, put, and delete.
So back to that picture of a key value store,
you could have implemented RocksDB inside of a Kinetic device
and provided that key value disaggregation
and got rid of the file system and volumes
that were in there.
The problems were that it was a unique protocol.
It was brand new.
Nobody had seen it, nobody had heard of it,
and it required applications to be rewritten for it.
And I know this because I implemented one of these devices.
We implemented a key value device at Toshiba,
and the biggest headwind that we faced was around,
how do I use it?
That's a great thing, but now what software uses it?
And at that time, nobody did.
And when people rewrote their applications to use it,
because we had a handful of development programs in place,
they tended to just modify their existing functionality
and leverage the pieces and parts of Kinetic
that were there to match what they were already doing.
So, for example, if you look at Ceph, and we go back to that picture somewhere, it's managing a local disk, and so it was just modified to do puts and gets instead of reads and writes. But it has east and west traffic there, too, because it's talking from one OSD to the next
to do replication and other things.
Kinetic had peer-to-peer operations inside of it, where I could have let the drive deal
with the copies and the replication and the data movement, but nobody took advantage of
those.
And the real power of Kinetic was that it provided a rich tool set
that if you rethought and reimagined your application,
you could get more out of it
by utilizing and offloading not only north and south,
but east and west as well.
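Here is a hedged sketch of that east-west offload idea, with invented class and method names rather than the actual Kinetic API: the host asks the source drive to copy an object directly to a peer instead of pulling it through its own memory.

    # Hypothetical classes; only the shape of the peer-to-peer offload matters.
    class AutonomousDrive:
        def __init__(self, name):
            self.name = name
            self.objects = {}

        def put(self, key, value):
            self.objects[key] = value

        def peer_copy(self, key, peer):
            # The data moves drive-to-drive over the fabric; it never crosses
            # the host's DMA or memory bus.
            peer.put(key, self.objects[key])

    primary = AutonomousDrive("drive-a")
    replica = AutonomousDrive("drive-b")

    primary.put("movie:1234", b"...")
    primary.peer_copy("movie:1234", replica)   # replication without host involvement
    print(list(replica.objects.keys()))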
But unfortunately, there was never multi-vendor support, because Toshiba dropped the product and the software failed to create any colonies out there.
So although it's still being worked on,
it's less of an issue today.
The big one today is NVMe disaggregation.
We're seeing this.
As part of my last job at Huawei,
I worked with NVMe and the Fabrics teams,
and we're now separating, disaggregating across different types of fabrics.
We have InfiniBand and Fibre Channel and RDMA and iWARP,
all these different strategies for disaggregating.
But in addition to that, what's going to come out by the end of the year,
at least the first drafts are almost done,
looking for ratification hopefully in the next six months,
is TCP, and that's being driven by Facebook right now.
It's still a block mechanism, which means that it's not really conducive to scale-out types of environments.
You still need somebody to manage.
Whenever you have a block,
that means you have an address
that you have to manage, right?
I have to know where my data is
and I have to manage that device.
And it gets very difficult to scale out
at the block level.
The big advantage to scale out
is that we use key values
and keys can be consistently hashed.
I can use an algorithm to determine
what system it's on.
I don't have to go talk to somebody to figure out,
oh, it's on this system,
and this is the translation to that.
I can actually just say, this key,
I know exactly where it lives, and go talk to it. And then that device is responsible for returning the value.
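A minimal consistent-hashing sketch of that idea follows; real systems such as Ceph's CRUSH are far more sophisticated, and the virtual-node scheme here is only illustrative.

    import bisect
    import hashlib

    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    class HashRing:
        def __init__(self, devices, vnodes=64):
            # Place several virtual points per device around a hash ring so
            # the keyspace spreads evenly across devices.
            self._ring = sorted(
                (_hash(f"{dev}-{i}"), dev) for dev in devices for i in range(vnodes)
            )
            self._points = [p for p, _ in self._ring]

        def locate(self, key: str) -> str:
            # Walk clockwise from the key's hash to the first device point;
            # any client computes the same answer with no lookup service.
            idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
            return self._ring[idx][1]

    ring = HashRing(["drive-01", "drive-02", "drive-03", "drive-04"])
    print(ring.locate("movie:1234"))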
So it's in the right direction because it's disaggregating,
and TCP could be a game changer
because it could provide for a day
when we see generic TCP ports on our devices,
which is a fundamental piece for offloaded autonomous devices: having a network port.
One of the things that you can do if you have these disaggregated devices (I have Kinetic written down there, but you could substitute NVMe right now; this is an older slide) is that once you do separate them out, the OSDs, those little pieces of software that manage the disk, can run anywhere. So I can now hyper-converge this world by just scheduling my OSDs inside my application cloud. And whenever there's a failure of an application server, they can move with it to another place, another location.
So they can be scheduled
and move around without causing
failure problems.
Alright.
So what is this suggesting?
We moved from aggregated to SAN to scale-out. If we truly disaggregate and we truly put intelligence in the devices, we can get to basically a crossbar instead of a fan-in, fan-out. If I have the ability to essentially take devices and have them participate as intelligent devices that are doing all the data management, taking client requests north and south but then dealing with it internally, I can create an environment where every client can connect to whatever drive it needs to talk to, and whatever drive can talk to whatever drive it needs to talk to for availability, rebalancing, and stuff like that.
And so that's what eusocial is. Eusocial storage is trying to move in this direction.
Now I caution you that this is like an end goal.
This is strategic, long range.
This is not something that tomorrow is going to happen, right?
This is something where we have to plot a path: where are the big pain points, how do we address those pain points, and then start walking in that direction. And that's what we're doing at UC right now: we're trying to plot that path. We're trying to see where in the industry there are pain points that we can address, where we can actually empirically show that this is a direction you want to take right now.
And we walk in that direction.
Next one will be another step and another step.
But eventually, getting to this and in-store compute is the final goal.
So to provide for an overall storage system, these devices need to have all the things that we see in modern scale-out systems today, right? They need to have data availability. They need to scale capacity.
They need to scale access, which means, you know, every time I add another device, I'm getting more throughput.
They also need to define, and this is something I don't see in the existing scale-out products today, lines and classes of service. So think of a class of service as a big bucket of the features and characteristics of a storage system.
So an HDD has a particular performance line of service, right?
It's got high latency, but it's got certain throughput numbers.
And that's a line of service, those performance characteristics.
I then pair it up with data availability.
I can replicate these three times
and get three points of replication.
These are all things that I can build up
into a class of service
that is my cold storage class of service
or my warm storage class of service.
An entire system needs to be aware of these classes of services
and then provide configurable mechanisms to allow users to use them.
In other words, when I write an object, where should it go?
Well, this is the new hit movie, and I need to make sure it's on something super fast, so it's going to go into the class of service that is my persistent memory class. But over time, nobody's going to look at it anymore. It needs to
go somewhere else. And all of that data management as it winds its way through these media needs to
be definable and then have the devices be able to move that storage around as needed.
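Purely as an illustration, a class of service of the kind being described might be expressed as configuration like the following; none of these fields or values are a defined eusocial API, they just sketch the idea of pairing a performance line of service with availability and lifecycle rules.

    # Hypothetical configuration; field names and values are invented.
    classes_of_service = {
        "hot": {
            "line_of_service": {"media": "persistent-memory", "target_latency_us": 10},
            "availability": {"replicas": 3},
            "lifecycle": {"demote_to": "warm", "after_days_idle": 7},
        },
        "warm": {
            "line_of_service": {"media": "ssd", "target_latency_us": 200},
            "availability": {"replicas": 2},
            "lifecycle": {"demote_to": "cold", "after_days_idle": 90},
        },
        "cold": {
            "line_of_service": {"media": "smr-hdd", "target_latency_ms": 20},
            "availability": {"erasure_code": "8+3"},
            "lifecycle": {"demote_to": None, "after_days_idle": None},
        },
    }
    print(classes_of_service["hot"]["lifecycle"])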
So, eusocial storage.
It must have north and south access to allow clients to do actual I.O., and it must have east and west scaling out to provide redundancy,
the quality of service, and lifecycle management.
Since we're essentially talking about adding computational resources into these devices,
at some point, you can conceive of them
doing more than just the data management.
And there is a direct need for that.
When you start looking at edge
and edge computing that is going on,
where our data collection may be outside
of our main compute facilities,
having the ability to scale back the amount of data
we have to send back for further processing is a huge win.
So you can think of cameras attached to storage devices
that maybe need to do filtering of the faces
and only sending faces back,
or actually locating individual faces
before sending back alerts and other things of that nature.
There's a myriad of examples of where, when you get out to the edge, having the ability
to do general compute is an important feature, especially in the world of genomics and other
things like that.
There's lots of places where I can show you this.
And we have an example at the symposium.
We're going to be looking at a database that's doing this, that's distributing the queries
out to all of the devices as well.
So what is eusocial storage?
First of all, don't think hardware.
This is a software abstraction.
Think of this more along the lines of the next evolution of cylinder-track-block mapping
to LBA to key value to the next thing.
It is strictly an API abstraction,
and that allows hardware builders and architects
to build the hardware that matches their particular media.
And so when I think of autonomous devices, with an SSD it might be an SSD with an Ethernet device on it, but when I'm talking about hard drives, it might be three or four hard drives together behind a controller that makes them look like a single key value store.
It all depends on the cost and price point
of what you're trying to achieve as an architect.
So eusocial storage is not trying to make those decisions for you at all.
It's trying to say, here is the rich set of characteristics
that your device can do, right?
So it's a standard object protocol
that disaggregates. It's mechanism-based; policy is all configured.
Cluster operations,
and the cluster operations
are for distributing out the configuration
because we need to know about who our neighbors are.
It is not for creating cluster operations for clients to leverage. So there's no transactional system that I'm
talking about here. I'm talking about strictly managing which devices are participating and
updating it as failures occur or as devices are created. Peer-to-peer operations that allow for copies and data movement
Data integrity mechanisms: being able to essentially checksum your data and make sure it's all good and not rotting away over time. It's an abstraction that provides for improved failure domains. My failure domains become smaller. I don't have a server with 20 drives attached to it where, if the server fails, I lose 20 drives. The failure domain becomes whatever that unit is that is the autonomous storage device itself. And ultimately it would support in-store compute.
So, no
restrictions on media type, form factor,
capacity, components, or fabric type.
That's a decision left up to the hardware architects.
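As a sketch of the kind of API surface being described, the abstract interface below separates the north-south object operations from the east-west cluster, peer-to-peer, and integrity operations; all of the names are hypothetical, since defining the real API is exactly the work described later in this talk.

    from abc import ABC, abstractmethod

    class EusocialDevice(ABC):
        # North-south object operations: what clients actually use.
        @abstractmethod
        def put(self, key: bytes, value: bytes) -> None: ...

        @abstractmethod
        def get(self, key: bytes) -> bytes: ...

        @abstractmethod
        def delete(self, key: bytes) -> None: ...

        # Cluster operations: only for distributing configuration, so every
        # device knows who its neighbors are as members come and go.
        @abstractmethod
        def update_cluster_map(self, cluster_map: dict) -> None: ...

        # East-west peer-to-peer operations for copies and data movement
        # (redundancy, rebalancing, lifecycle management).
        @abstractmethod
        def copy_to_peer(self, key: bytes, peer_address: str) -> None: ...

        # Data integrity: checksum/scrub data so it is not rotting over time.
        @abstractmethod
        def scrub(self, start_key: bytes, end_key: bytes) -> bool: ...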
So what does it look like?
It looks a lot like scale-out systems, right?
So if I was to look at a Ceph box and take that picture of the server with a bunch of OSDs
and strip away the server, it kind of looks like that, right?
Each device is its own object store manager, and it participates with all of the other ones to
provide a unified global object store. So that means that we have these monitors that monitor
the configuration and changes that occur in the configuration, updating all of the participants as that configuration changes.
But we also add the new thing, which is the cast.
The cast is the unit of scale-out.
Scale-out occurs within a cast.
So I can have a scale-out of SSDs, a scale-out of hard drives,
a scale-out of SMR drives.
I can have different qualities of service that have their own scale-out capabilities. And what that allows me to do is that now I have essentially uniform qualities of service that I can manage objects through. And I can do that just by having a defined configuration that everybody shares.
Again, it's an object API, and just like
Ceph, you need things like virtual block devices, because applications still write blocks. There are still applications that use file systems; there are file systems available as well.
So what does some of the
information lifecycle management look like?
So if you look, that's the configuration.
The cluster map is the configuration that we're sharing,
and we make sure that it's consistent between all the members,
and it defines who my neighbors are,
but it also can define a class of service.
For example, here's a content delivery network.
We've got essentially five different casts here,
and for a content delivery network,
we're going to use three of them. And when I put a key value, I tag it with the strategy I want for this object. This object is a CDN object, so the device looks in the cluster map and sees that the first element in the CDN graph is cast3, and so that's where it writes the data.
And then over time, there are triggers that are definable
that tell you when it's time to move,
whether it's age-based or access-based or things of that nature.
Then it gets essentially moved by the device itself
without the client's knowledge.
It just migrates it.
Just like in scale-out, we rebalance without telling the client, because ultimately it's the consistent hashing that tells it where the object actually lives. So the same mechanism for rebalancing and recovery can be used for doing lifecycle management.
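Here is a small, made-up sketch of that CDN lifecycle example: a cluster map with a handful of casts, a policy that walks an object through three of them, and helper functions a device might use to decide where an object starts and where it moves next. The layout and field names are invented for illustration only.

    # Hypothetical cluster map; nothing here is a defined format.
    cluster_map = {
        "casts": {
            "cast1": {"media": "nvme-ssd"},
            "cast3": {"media": "ssd"},
            "cast4": {"media": "hdd"},
            "cast5": {"media": "smr-hdd"},
        },
        "policies": {
            # The CDN policy walks objects through three of the casts over time.
            "cdn": [
                {"cast": "cast3", "move_after_days_idle": 30},
                {"cast": "cast4", "move_after_days_idle": 180},
                {"cast": "cast5", "move_after_days_idle": None},  # final stage
            ],
        },
    }

    def initial_cast(policy_name: str) -> str:
        # A new object tagged with this policy lands in the first cast listed.
        return cluster_map["policies"][policy_name][0]["cast"]

    def next_cast(policy_name: str, current: str):
        # Devices consult the same shared map when a trigger fires; consistent
        # hashing within the target cast then tells everyone where the object
        # lives, so clients never need to be told about the move.
        stages = cluster_map["policies"][policy_name]
        for i, stage in enumerate(stages[:-1]):
            if stage["cast"] == current:
                return stages[i + 1]["cast"]
        return None

    print(initial_cast("cdn"))         # cast3
    print(next_cast("cdn", "cast3"))   # cast4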
What about in-store compute?
So I would not suggest that every eusocial device is an in-store compute device.
Maybe it is, maybe it isn't.
But because of the price point,
I might want devices that are more expensive,
that do more things for me,
and that would be part of my quality of service that I would define.
My class of service here would be
a high-end in-store compute device,
and it would live in its own cast.
And I could scale out those kind of devices together
with each other.
And when it comes time to do a function, again using the consistent hashing function, it essentially will send a function and parameters out to all of the members that have that particular object or object range. So, for example, in a project that's ongoing at UC called SkyhookDB, they objectize the tables in a database,
in a Postgres database, and they're sharding on rows.
And each object has its own set of rows.
And so I can then take a query that says, you know, select all the rows in this table that have some column less than five, or something like that.
I can then figure out those objects that that table is contained in,
because it's going to be contained in a set of objects,
and then push essentially that select down to all of the devices in parallel
and have them select on the portions of that table that they each have.
So that's one of the powers of this in-store compute model: I can essentially associate a certain amount of compute and algorithms with a given object and have that run against that object at any time.
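A minimal sketch of that pushdown idea, with an invented in-memory layout standing in for devices that each hold one object's worth of table rows; in the offloaded model the filtering below would run inside each device, so only matching rows ever cross the fabric.

    from concurrent.futures import ThreadPoolExecutor

    # Pretend each "device" holds one object containing a shard of the table.
    devices = [
        {"rows": [(1, 4), (2, 9), (3, 2)]},   # object 0: (id, value) rows
        {"rows": [(4, 7), (5, 1), (6, 5)]},   # object 1
        {"rows": [(7, 3), (8, 8), (9, 0)]},   # object 2
    ]

    def run_on_device(device, predicate):
        # In the offloaded model this filter runs inside the device itself.
        return [row for row in device["rows"] if predicate(row)]

    def pushdown_select(predicate):
        # Ship the same predicate to every shard in parallel, gather results.
        with ThreadPoolExecutor() as pool:
            partials = pool.map(lambda d: run_on_device(d, predicate), devices)
        return [row for part in partials for row in part]

    # "select * where value < 5", evaluated in parallel on every shard:
    print(pushdown_select(lambda row: row[1] < 5))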
Okay, so one of the things I'm very cognizant of, because I come from industry, from a company that built these devices, is that there needs to be room for open source and proprietary.
If you're going to get investment from the big media players,
there has to be a way that everything can't be open source.
So the devices, what you're seeing there in purple, is kind of the space where companies could provide new products, new devices that speak eusocial.
They don't have to be open.
As a matter of fact, there's all sorts of ways to get competitive advantage by essentially doing smart algorithmic matches
between key value and object stores
to the underlying media.
There's lots of cool things that can be done in that space
when you're right next to the media
and doing things locally.
But there should also be an open source version of it.
So if I wanted to create this with a Raspberry Pi and a device, there should be an open source project
that allows me to do that. And that's what we're doing at UC Santa Cruz, is we're doing that
open source project. We're just starting on it. It's a long way away.
So that's kind of eusocial storage. What are we doing at UC Santa Cruz?
We have, I actually have two grad students now.
Woo-hoo!
I have minions.
So the first thing, like I said, is that this is a strategic view.
This is a five-year view type of thing.
The first thing is, how do we get the industry to believe that offloading is the right direction to go in? We're already looking at it; we're seeing movements in that direction. Samsung has a key value device, and there are other people looking at disaggregation and key value as well. But every time we bring this up in public forums, the first thing we get hit with is, why? Show me. Not just these slides, which are nice, pretty language, but what is the actual empirical evidence that says this is the right direction to go in? So that is actually the first
project that we're engaging in right now is the offload evaluation. We're trying to construct
an empirical model that allows us to compare very disparate systems.
Like if I wanted to show the value of a Raspberry Pi with a hard drive attached to it versus
a big Intel Xeon with 30 hard drives attached to it, how do I compare these two environments?
What is the metric I use? It gets very difficult to show apple-to-apple comparisons.
And so that's what we're trying to do with that first project.
And I'll give you kind of an overview of what we're doing.
I'd love for you to beat me up and tell me I'm, like, wacky
and what's wrong with our strategy.
I only think I have about five minutes left.
And then, finally, the full API definition.
This is talking about a key value store.
What are the APIs for that first key value
store? What should they look like? There's already work going on
in the Object Drive TWG at SNIA. I encourage everybody
to participate in that as well. There's
work going on in NVMe with key
value. A bunch of people are looking at this.
We want to do it.
We want to follow what all that
work is and not reinvent the wheel, but we also
want to look at it with these broader goals in mind.
In other words, what are the other things
we're going to need down the road
and make sure that those things
are moving in that same direction?
Okay.
And then the in-store compute is the other piece.
So the presentation that we're going to be giving
at the symposium is going to talk about
everything from canned functionality.
I conceive of a key value store as canned functionality.
I'm offloading compute that runs on the host onto the device.
That's step one of in-store compute is having a key value store.
There could be compression, encryption, and a bunch of other things that are canned sort of functionality,
to the far end of the spectrum, which is general purpose compute,
where I'm actually allowing people to write whatever they want to write.
And we see examples of that in the industry today.
And we're going to have a representative from NGD Systems give a presentation on a container-based approach,
where I actually deliver a container to my storage device and let it do whatever it needs to do. And then there's a middle ground, an interpreted language, where I'm maybe limiting some of the functionality, but I'm minimizing some of the management overhead, from managing tool chains and debug environments and all sorts of other stuff that needs to go on, down to just the interpreted language itself. So that's in-store compute.
All right, with the last four minutes or so that I have,
I'll talk about this offload evaluation.
So this is the fundamental problem, right?
This is what I described in my presentation,
that we have data management software,
we have media management firmware that runs on the device.
The client I.O. is provided by the client,
but underneath of it all, between the actual media and the server,
is client I.O. plus data management I.O., right?
How do I actually tell you that the right is better than the left?
What is the number I provide you to say from a cost perspective,
from a space perspective,
from a watts perspective,
that this is actually a better approach.
So what we need to do
is we need to define a unit of work.
This is my conclusion anyway.
We need to find a unit of work
that can be equally done in either environment.
It should be independent of the environment
that it's performed on. In other words, it should be able to be equally run on one
versus the other. And equally is an interesting word. It just means that the entire task can be
completed basically. Some environments would be capable of doing multiple units. For example,
if I had 30 drives attached to a four-way Xeon with a ton of memory,
I should be able to do many of these work units.
But a Raspberry Pi may not be able to do
but a fraction of a work unit.
But that's okay because the cost profiles,
the power profiles, and the space profiles
are vastly different between those two environments.
And so when I start looking at it
in terms of dollars per work unit or kilowatt-hours per work unit, this is where I get my apples-to-apples comparison, right? It's because I can now look at these work units as compared to cost and power.
All right?
The problem is, what is this work unit?
What does it look like?
First of all, we need to simplify this.
You can get really complicated really quick.
We're going to start with this research
by just looking at north and south.
We're not going to worry about the other network traffic.
Everything that happens with the network traffic
is just more wins for disaggregation.
And we're going to be looking initially at local traffic, not disaggregated. And we're going to avoid trying to get lost in the details of cycles and particular platform architectures and things of that nature.
So how do we do that?
We introduced the notion of an MBWU, which is essentially a media-based work unit.
And it's some load over time.
It's workload dependent.
But we define it in terms of only the media being the bottleneck.
In other words, how much work can we get out of a particular piece of media?
So there will be a different MBWU for a hard drive, for a Seagate hard drive, for a Western Digital hard drive, for a Samsung SSD. There will be different MBWUs for those that allow you to do these comparisons.
But that is the thing that remains consistent between these environments. If I'm comparing an Intel box using an SSD from Toshiba against a Raspberry Pi using that same SSD, the media becomes the uniform piece of the puzzle.
So if I make the work unit in terms of that media,
then it is directly comparable.
What our plan is right now is to take some key value store. We're using RocksDB, and we're using a known
benchmark of YCSB. We're taking a consistent SSD
and we're going to measure what the unrestricted throughput
of that YCSB benchmark is
to that particular media.
And we want to see that they're not hitting any memory bottlenecks,
any CPU bottlenecks, any other bottlenecks other than the media device itself.
So theoretically, the throughput underneath this,
this throughput here should match what this device is able to do by itself, right?
The maximum it's able to do
is the maximum I see out of the bottom of this, right?
And once I know that I have that,
that becomes my unit of work.
Let's say it's 10,000 transactions per second.
That becomes my unit of work.
I can now figure out how many of these devices I can add onto a server before I start seeing that curve go from a linear progression to a flat progression. And then I know how many MBWUs that server is able to do. On the other side, it's whatever fraction of that 10,000 transactions per second it can actually do.
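As an illustration of how that comparison might be expressed, the sketch below uses entirely made-up numbers: a reference MBWU for the chosen SSD, and two hypothetical platforms scored in MBWUs, dollars per MBWU, and watts per MBWU.

    # All figures are assumptions for illustration, not measurements.
    mbwu_tps = 10_000   # media-limited YCSB throughput of the reference SSD

    platforms = {
        # name: (sustained YCSB transactions/sec, cost in dollars, power in watts)
        "xeon-server-30-ssd": (240_000, 18_000, 900),
        "rpi-plus-1-ssd":     (6_500,      120,   8),
    }

    for name, (tps, dollars, watts) in platforms.items():
        mbwus = tps / mbwu_tps   # can be a fraction of a work unit
        print(f"{name:22s} {mbwus:6.2f} MBWU  "
              f"{dollars / mbwus:8.1f} $/MBWU  "
              f"{watts / mbwus:6.1f} W/MBWU")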
Anyway, that's the work we're doing at UC Santa Cruz. If you have any questions or anything, come find me.
I'd love to talk to you about it and see if there's any interest in what we're doing
and maybe help by participating.
So thank you.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list by sending an email to developers-subscribe
at snia.org. Here you can ask questions and discuss this topic further with your peers
in the storage developer community. For additional information about the Storage Developer Conference,
visit www.storagedeveloper.org.