Storage Developer Conference - #84: Deployment of In-Storage Compute with NVMe Storage at Scale and Capacity
Episode Date: January 7, 2019...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts.
You are listening to SDC Podcast Episode 84. My name is Scott Shadley. I'm a
storage technologist slash VP of marketing. I get to call myself a technologist because at one point in my career I was an engineer.
You can't blame me that I went for the frontal lobotomy to get my marketing half.
I work at NGD Systems. We're a pioneer in computational storage.
So today I'm going to give you an introduction to what we're doing,
but more importantly, since this is a developer conference,
I wanted to show real-world examples of how we're using the technology today with customers.
So yes, it's a tiny bit sales pitchy
because I am a marketing guy,
but I did want to spend a lot of time
more focused on what's causing the need for this.
So some examples of what's going on in the marketplace today.
This is one of my favorite cartoons.
For the first time, they've determined
it's actually hard to find a needle in a haystack.
Our data sets are growing to exponential sizes. We know that. We're getting exabytes,
zettabytes of data, as kind of the video clip I was showing talks to. So what if we had a way
that our storage devices, instead of just being like the storage lockers we all rent from Storm
or U-Haul or whoever else we rent them from, where we have to unload all the boxes to find the one
box we need in the back of the storage room, could just have that box pop out, walk out to us, and, you know, deliver itself to our house
by an Uber or something? So we worked with this company called ActualTech Media. Some of you may
know them. They asked a slew of storage professionals, where do you see the
most storage performance bottlenecks? And we got answers like, I don't know, array
controllers, networking, but the biggest storage performance bottleneck is still, aha, the storage.
And again, that's because we've done a lot in the market to improve on things.
We've introduced things like NVMe.
We've gone to things like the potential of open channel, open everything.
But at the same time, the storage device as a standalone product has not changed drastically from tape.
We put bits on it.
We pull bits off it.
We don't do anything else with it.
We may access it faster.
We may have better reliability in the way of ECC, things like that.
But we're not really moving the data around.
We've done things like software-defined everything.
We're moving to this whole new world.
But at the end of the day, these slides are available.
So I'm going to gloss over the glossy stuff
and get to the meat of it.
The idea is don't leave the storage behind.
There is opportunity to move compute into a storage device
in an effective way that's easy to implement
and also scales,
so that you're not dealing with just one added device.
If I have 24 of these things,
I'll show you what that can do to a system.
And how does this occur?
Today's architectures look very much something like this.
This is an example from one of our partners that
uses, in this case, a very simple
single CPU root complex with 16
lanes of traffic. They're going through a switch
to get all of their SSDs, and they're having
a problem with the fact that they're needing to scale the
capacity of this footprint, but they really
don't want to change the compute architecture they're dealing
with, because the CPU side of things is more expensive and they're space-constrained in their
particular architectural environment. So the devices get bigger, and the number of devices in
the system grows. We've got our friends at Supermicro out there with an EDSFF chassis that can
support half a petabyte in 1U. That's great, but now I have to figure out how to get access to
all of that. And if I'm talking about
this particular example, where I've only got 16 lanes of traffic available and I'm switching it
out, we see that the throughput the NAND and the flash can deliver
over NVMe versus what reaches the root complex is about a 50x or greater delta in performance. So I've got
really, really fast I/O over here and I've got limited I/O over there. I can switch through it.
I can make it expand out.
But at the end of the day,
the number of gigabytes per second here
versus the number of gigabytes per second there
are drastically offset.
So we need to work as an industry and an architecture
to figure out how to solve or implement a way
to make this solution a little more opportunistic.
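To put rough numbers on that mismatch, here is a back-of-the-envelope sketch. The lane count comes from the example above; the drive count and per-drive throughput are illustrative assumptions, not figures from the slide.

```python
# Back-of-the-envelope view of the fan-out problem described above.
# Lane count is from the example; drive count and per-drive speeds are assumptions.

PCIE_GEN3_LANE_GBPS = 0.985            # ~1 GB/s usable per PCIe Gen3 lane
root_complex_lanes = 16                # single-CPU root complex in the example
host_bw = PCIE_GEN3_LANE_GBPS * root_complex_lanes      # ~15.8 GB/s into the CPU

drives = 24                            # e.g. a dense U.2/EDSFF chassis
per_drive_nvme_gbps = 3.0              # typical NVMe SSD sequential read
aggregate_drive_bw = drives * per_drive_nvme_gbps       # 72 GB/s behind the switch

print(f"Host root complex: {host_bw:.1f} GB/s")
print(f"Aggregate drives:  {aggregate_drive_bw:.1f} GB/s")
print(f"Oversubscription:  {aggregate_drive_bw / host_bw:.1f}x")
# The talk's 50x-or-greater figure compares what the NAND inside the drives can
# deliver, which is higher still than the drives' external NVMe speed.
```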
And it's not just us that are talking about it.
So these are a bunch of different articles that have come out over the last several years.
One of the bigger ones here, from an IEEE conference, had about 16 people who
contributed to the concept that near-data processing would be very valuable as a solution for us
in the marketplace.
A lot of this also plays into the edge and IoT space, and I'll explain how that works out as well. If we think about platforms that are going on the edge, we've got things like 5G,
great, but we're still going to run into bandwidth constraints. Moving data is becoming a bigger and
bigger problem. Storing is easy. We've got that down pat. We've got big drives. We've got NVMe.
We've got all these other opportunities. But actually doing something with our data continues
to be a significant struggle for us.
So when we looked at it,
we looked at the value props of this concept
of computational storage.
And of course, in our case,
and what I'm going to be talking to you about today
is moving compute closer to the data.
Now, in this particular instance,
as I look at it from where I work
and from what we're doing,
it's at a drive level.
This does not have to be just at a drive level. You've seen other presentations from our friends like Stephen Bates
and other companies also in the marketplace that are going outside of just the drive level or
looking at it at a system level. So there's lots of opportunity to do it. In this case, we're going
to stay focused at what it is from a drive perspective. So a computation request by any
application can be much more efficient if it's near the data.
Greg Schulz from StorageIO wrote a blog article:
the best I/O is no I/O; the second-best I/O is the I/O I only need.
And that fits very well into the vision statement
of what computational storage is all about,
lessening the number of I/Os.
And when you talk about large data sets and even unstructured data,
being able to do what you can at the data set location is much more valuable to you than
having to constantly load, flush, and reload a DRAM footprint. So if I've got mega-terabytes or
exabytes of data sitting in a server and I've only got a couple hundred gigabytes of DRAM,
simply doing the math on how many times I have to reload that DRAM footprint to scour that data
should tell you that there's a lot of wasted time in today's
architectures. So there's a great opportunity
to look at ways of
eliminating that and helping
augment the CPU root complex.
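As a worked example of that math, here is a small sketch with illustrative numbers for a hypothetical server, not figures from the slides.

```python
# How many DRAM refills does it take just to scan a large data set on the host?
# All numbers are illustrative assumptions.

dataset_tb = 64          # flash capacity in one server
host_dram_gb = 256       # "a couple hundred gigabytes of DRAM"
ingest_gbps = 16         # roughly a PCIe Gen3 x16 link's worth of inbound bandwidth

refills = dataset_tb * 1024 / host_dram_gb       # full DRAM reloads needed
move_time_s = dataset_tb * 1024 / ingest_gbps    # time spent only moving bits

print(f"{refills:.0f} DRAM refills, ~{move_time_s / 60:.0f} minutes of pure data movement")
# 256 refills and roughly an hour before any useful compute happens; that is the
# wasted time in-storage compute is aimed at.
```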
The techniques
that we're talking about here are not in any way looking
to replace a GPU. I'm not
trying to get rid of an in-memory database.
There are applications where what I do
actually slows things down
in the particular implementation,
and I'll highlight those as we go through
some of the case studies.
What we're looking at here is something
that you have stored needs to be analyzed.
I'm going to make it faster to analyze that data
through this concept of in-storage compute.
So we took a dimensional look at this,
and we said there's multiple different ways to do it,
but we wanted to focus on things like the KISS principle,
keep it simple, stupid.
Again, I'm a marketing guy that used to be an engineer,
so they have to really dumb it down for me to be able to communicate it to you as a bunch of folks.
So from an operating system perspective,
bare metal, real-time operating system,
and it needs to be 64-bit OS.
Those are pretty straightforward, simple dynamics. When we get to the hardware side of it, there's
a lot of different opportunities. We started with 32-bit real-time processors. We talked
about hardware acceleration. There's lots of opportunities there around ways we can
use FPGA acceleration. And then we decided to take the leap and say it needs to be 64-bit
only. So for our particular platform, it's a 64-bit operating platform. When you look at the user applications, we had to write firmware
for this. This is something that has been looked at and been tried and then been presented by
others earlier this week in the way of using an existing storage device in these spare cycles
in that controller or that platform to do these types of processes.
They had issues with that because when you start having
contention between compute and
wear leveling, you can't fight it;
the wear leveling has got to take place. You've got to be able to keep
the data intact. So we looked at it
from that perspective. We're going to start from scratch
and we're going to write firmware from ground up that covers
flash management, ECC, data
placement, and then allow us to do
things like add application software at the device level.
In this case, we've actually been able
to do virtualization via container,
so Docker is my new best friend.
And then it turns out,
from talking to the number of customers
that we have over the last couple of years,
that the AI application space,
not just to use the buzzword,
but just the way those applications
do analytics in a storage environment, fares very well for in-storage compute. If you're running inline
in memory or doing stuff that requires you to access it as you see it before you store it,
that's going to be somebody else's product in this particular instance.
And in the future, we have the ability to do more true AI application acceleration. So right now, I'm just
running the app in storage and actually giving you an acceleration out of that fact. Now, if we start
rewriting the applications dedicated to these types of processes and products, we have the
capability to do more with that. When it comes to looking at this as an ecosystem, there's a lot of
things that can be done here. One of the things that we're working together with others in the
industry, as was shown by the Birds of a Feather last night, is the highest
level of this: how this drive identifies itself to the system, and what the API looks like
that would drive this particular system. I have unique versions of those today because
I have a new product in the market, if you will. But we're using the SNIA organization
to help work through a provisional TWG to
actually develop some standardization on how these products are identified to the system,
so you know whether it's my version of a product versus Stephen's version of a product versus
someone else's version of the product, and you're able to intelligently use an OS
and not have to rewrite code every single time you want to use a different version of
these particular products. So we call it in situ processing.
In situ is Latin for "in place."
So we're not being very creative.
Again, we're a small startup.
When you look at it from our perspective, the CTO said when he sat down with the team,
his number one focus was a seamless programming model.
Back to this concept of keep it simple.
If I can't make it easy for you to use, you're not going to touch it.
So I have to start off with something very easy.
Seamless is not easy,
but it's as close as we can get to making it functional.
Scalability was another one.
If you can't put this in process across multiple platforms
with multiple people figuring out how to use it,
you're never going to be able to actually implement this
for a long-term solution.
So scalability becomes very valuable.
And then
capacity growth. You need to make sure that whatever you're doing can support the capacity
growth inside an individual drive. Right now we see a lot of problems today, not with people
wanting bigger drives, but people being able to truly deliver a large drive at a reasonable cost
with the right performance metrics. And those are another key cornerstone of what we're focused on. This is the very basic
block diagram of
what computational storage is.
I'm not taking the CPU out of
the process. I'm simply adding compute
processing into each and every one of my drives.
It scales.
You add compute cores into the system
and you augment the CPU or even
free the CPU up to go do bigger and better
things that are more useful with its time. I've had a lot of questions in the room in previous conversations.
What about things like open channel? Open channel is taking a lot of the control of this drive
out into the CPU. That's not what this product is designed for. I can make you an open channel SSD
if that's what you like and you want to go write your own FTL. But if you take the FTL out of my
drive, you lose the capability to do in-storage compute. So there are definitely trade-offs in the market and there are
customers that are saying we don't like that idea. There's customers saying we really like that idea.
So I just want to be very open as far as a development perspective. This does require
you to have the drive act as a traditional NVMe target. So when we think about moving compute to data,
this is going to be a history lesson
for most people in the room.
So you write data in, it comes into the storage device,
we read it out of the storage device into DRAM,
that particular path,
and then using the host CPU to do the compute,
that's the focus of in-storage compute,
or what we call in-situ processing,
because you're going to sit there and repeat that process
multiple times over again with multiple drives
in multiple systems.
And if we have the ability to limit the number of times
you have to use that host CPU and the time it requires
to offload the entire storage device into the host DRAM,
we're going to be able to show you how we can save time and money.
So with an in-situ storage device,
we do the same thing.
We're going to store your data.
So again, this is back to the point,
I'm not an in-memory platform.
I'm not going to do things real-time.
I'm going to do near real-time.
Once it's in the data storage device,
you now have your internal buffers,
which in our case is a drive with DRAM on board.
And we have computational resources in the way of ARM cores outside of the NVMe data
path. So the data management of the read and write still takes place. We use an alternative path to
be able to process the data in place. The type of in-situ processing that we offer, it's an embedded
Linux platform. So you don't have to worry about having a specific app. If it can run an ARM64,
it can basically be dropped into the drive.
And we offer APIs and solutions
to help support that capability.
Then the results are fed back,
and you're getting a smaller packet size
out of the drive.
You're limiting the I.O.,
and you're freeing the DRAM and CPU
to go do other things
while you're running your storage compute.
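As a conceptual sketch of the two flows just described: the drive-side object and its submit() call are hypothetical names used only to show the shape of the programming model, not NGD's actual host API.

```python
# Conventional flow vs. in-situ flow, in pseudo-real Python.
# The "drive" object and its submit() method are hypothetical.

def parse_records(chunk: bytes):
    """Toy record parser: newline-delimited records."""
    return chunk.split(b"\n")

def host_side_filter(block_device: str, predicate):
    """Conventional flow: pull every block into host DRAM, filter on the host CPU."""
    matches = []
    with open(block_device, "rb") as dev:
        while chunk := dev.read(128 * 1024 * 1024):    # 128 MB at a time
            matches.extend(r for r in parse_records(chunk) if predicate(r))
    return matches                                     # all raw data crossed PCIe

def in_situ_filter(drive, predicate_source: str):
    """In-situ flow: ship the small predicate to the drive's ARM cores,
    let the drive scan its flash locally, and return only the hits."""
    job = drive.submit(code=predicate_source)          # hypothetical call
    return job.results()                               # only results cross PCIe
```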
Yes.
And please, if you have questions, stop me, because I tend to ramble pretty quickly and will be done really, really fast. So, data contention, if you will, from that example.
Yeah, the question in the room was, if I'm doing in situ processing on data in the drive and you
go and write and update that data, what happens in that situation?
Because we're always reading stored data,
it's as if you were doing the same thing
where you're writing to a drive
as you're exiting out into host DRAM.
There is a potential for contention
or for processing on old data.
Since I'm not doing it within the memory buffer
as it's going in,
you have exactly the same contention example
as you would just pulling it into the host system.
We don't have
an overlap issue from that perspective.
Yes?
How much extra DRAM do you need
for in-situ processing?
How many extra cores do you need for
processing? He's asking for the block
diagram. I like it.
Our in-situ processing is built on A53.
We have four 64-bit
ARM cores.
From a DRAM buffer perspective,
we're not using more than about 2 to 4 gigabytes of DRAM
depending on the size of the drive.
So that's an extra 4 gigabytes?
On top of the standard amount we have for the drive processing.
I'll show you in a bit.
We do have patents related to the DRAM on the NVMe side,
which limit that footprint requirement of space on the drive as well.
Yes?
So in terms of the capability of that process,
what kind of processing is that, encryption or compression,
or what kind of processing do you refer to?
The beauty is this is effectively the way we've designed this drive.
It's a micro server in a drive.
It's got a runtime Linux OS on the drive.
You can drop any application.
So encryption, we've already done with the customer.
Compression, we've done with the customer.
It's running as an application in the drive.
It's not a compression engine inside the drive.
So it's even more flexible than an FPGA.
I'll show six different ways we've already done things with customers
as part of my kind of real-world examples.
Yes?
You mentioned that you have an RTOS
as the choice of operating system.
Is that embedded Linux or an RTOS?
Yeah, so our choice for our general distribution
is a Ubuntu 16.04 core,
where they've stripped out all the drivers
for everything like external peripherals
and displays and stuff like that,
so it's a very small footprint, full-fledged Linux OS.
But not an RTOS. Is that a real-time
operating system? Yes, it is a real-time operating system.
It boots up just like a server would
be running a natural server. Effectively
what our drive does is that
Linux OS looks at the NAND placements
as if they're like drives in a system.
Just an FYI,
I wouldn't consider Ubuntu small.
It's 64 megabytes or something like that.
I would compare it to like Alpine Linux,
which is like over 2 megabytes.
Compared to Windows, it's teeny tiny.
And for the purposes of this product,
we can run Ceph or we can run any other operating system
as the core operating system.
Just for our development platform and
our release product, it's Ubuntu.
There's definitely options.
Another way to answer the question is to say, well, how does
60 megabytes compare to
how much you said you've got,
2 gigabytes or 8 gigabytes of RAM?
Then it's tiny. Exactly.
Yes, sir.
I have a couple questions.
This is awesome, by the way.
The fact that they run Docker containers,
you can literally write any algorithm you like,
and it can run down there.
I think it's amazing.
So, Scott, I think I have two questions.
One is, is there a concept of networking?
So if I'm running on this Ubuntu operating system on those A57s,
do I see a network
interface that I can use to talk
to the outside world?
Or is that something that's not available to me?
So that's kind of one. And then the other is
you mentioned that
the operating system sees
the Flash media as some kind
of device. Can you elaborate a little more on
how it's presented to the OS?
If I have an algorithm that wants to look for the largest value across these die, how does that look? Yeah.
So the A53 cores utilize the M7 data path.
We talk to and pull data from the drive
as if it's working through this as a pseudo host.
So the direct connection between those two
is an arm-to-arm NAND flash interface,
so it's much faster than even the PCIe Gen 3 bus.
So even though I have a smaller footprint, a smaller DRAM,
my application processors I chose for power versus performance.
An A73 or A74 would be much faster,
but the power budget tradeoff wasn't worth it for what I'm dealing with within the storage device itself
because I'm talking 8, 16,
32, maybe 64 terabytes of data, not petabytes of data behind those cores. And the amount of time I save transferring across that bus, as I hope to show you in a minute, highlights the capabilities
of the A53 being more than powerful enough to support that. To your other question around
network capability, for this
connection to this
migrated application or host processor, we do run
a small host API that
uses a TCP connection over
the NVMe protocol. So there's been
conversations here about NVMe over TCP/IP.
I'm already doing it the other way around.
So every drive has a pseudo
IP address that I can see from my
host side, and I can address all of those devices in the system,
which also allows it to extrapolate over a fabric-based platform,
and we're working with a couple of fabric sets.
And it'll still be fast.
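A minimal sketch of what addressing those pseudo IPs could look like from the host side; the addresses, port, and one-line text protocol are invented for illustration and are not the vendor's actual API.

```python
import socket

# Each in-situ drive shows up to the host with a pseudo IP address, tunneled
# over NVMe per the talk. Addresses, port, and message format are placeholders.

DRIVE_IPS = [f"10.10.0.{i}" for i in range(1, 17)]   # 16 drives in the chassis
CONTROL_PORT = 5555                                  # hypothetical service port

def send_command(ip: str, command: str) -> str:
    """Send a text command to one drive's in-situ agent and read the reply."""
    with socket.create_connection((ip, CONTROL_PORT), timeout=5) as sock:
        sock.sendall(command.encode() + b"\n")
        return sock.recv(4096).decode()

if __name__ == "__main__":
    for ip in DRIVE_IPS:
        print(ip, send_command(ip, "STATUS"))
```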
Yes, sir?
So, one question.
In terms of the input buffers or computation buffers,
do you use the SMPF on the device to divide?
So, crawl, walk, run.
Crawling is where we're at today as a startup,
so we have individual devices.
Walk is peer-to-peer.
Run is where I eliminate his problem of hopping, hopping, hopping. So yes, today it's just individual devices, but because they run concurrently and/or simultaneously, I may be redundantly running an application within multiple drives
and getting similar results. There's not going to be the exact same data file on any
drive that I'm looking for, or shouldn't be if you're a good storage architect. You're
going to be spreading your data across it, so your results will come back concurrently.
What is the package? How much is the package?
How much is the what?
Package.
The power?
So we designed this to be a low-power solution,
so it's 16 terabytes in a 2.5-inch U.2;
at full data rate and running an application, it's 12 watts.
So we have the lowest watts per terabyte
available on a drive today.
Yes, and then I'll get you.
What's the ballpark of the added cost for the SSDs?
Ah, there we go.
All right, I did not pay him for that one.
So we looked at it from that perspective.
There is what we classify as a licensing cost
for the support of the in-situ platform to support that.
It's on the order of magnitude of pennies per gigabyte
above a standard SSD product.
So we do it
in both a monthly, yearly, or
lifetime type of a license fee
and because we know some customers
like the idea of having a very large
very low-power drive as a standalone
product, we can sell it as a storage
device and be competitive in the market with any of the big guys,
because we're just going to buy their NAND.
Our BOM is less than their BOM outside of the NAND,
so we offset a lot of what we consider assumed cost.
And then we can add on the computational resources.
So it's an on-off switch capable product.
You can buy it with or without,
and you can turn it on later if you want to.
Yes?
My question is kind of a follow-up on the earlier one.
Depending on the application or the workflow pattern,
if you actually have write traffic going in,
you said there's no overlap,
but what is the limitation
on your on-device compute? That's going to be the latency; I can't finish the compute if the application writes keep coming in.
So based on the way that we're doing this,
it's not a high-write-performance product,
and we give read prioritization inside the drive
unless the customer wants a unique firmware
where we can shift priority to write performance.
So as a standalone NVMe,
it's 3 gigabytes per second read, 1 gigabyte per second write.
It's not designed to be a 3 gig, 3 gig.
And we read prioritize everything,
so if a write's coming in,
we'll actually stall the write in favor of finishing the reads,
even if it's from the application processor.
Yep.
All right.
Oh, I'm going to hate that slide later.
So we look at it from a perspective of today's SSD,
and I have a longstanding debate with people inside the company
of the use of animation.
As you can tell, I think they won today.
So you have the idea of a smart SSD with in situ processing,
which gives you abundant resources inside the SSD.
We've added those cores to give that to you
where others haven't had it in the past.
Because we are running a virtualized platform
with a 64-bit application processor,
we create this concept of a disruptive trend
in the marketplace within storage compute.
This slide I will not take credit for.
One of our partners that helped present some of this material for us at FMS created the slide, and this was
their interpretation of what in-storage compute can do for the marketplace. So, solutions. This is
the fun part. My first one, that previous slide, was from our friends at Microsoft Research. They
were kind enough to get on stage with us at FMS a couple months ago. So this is
a little more detail than we were able to present there because
it was a much shorter session. Basically
what we're doing with them is they use
this tool from Facebook called FAISS, the AI Similarity
Search. It takes images, converts
them to three-dimensional vector
identifiers, XYZ, effectively,
stores them on disk, and then does
comparisons of those. So it's an inference engine
that utilizes an approximate nearest-neighbor neural network.
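For reference, here is a toy host-side example of that workload using the open-source FAISS library; the vector counts and dimensions are made up, and in the in-situ case the same style of search runs on the drive's ARM cores against the vectors stored on that drive rather than on the host.

```python
import numpy as np
import faiss   # Facebook AI Similarity Search, the library referenced above

# Toy version of the workload: feature vectors are indexed, then queried for
# their nearest neighbors. Counts and dimensions are illustrative only.
d = 128                                                    # vector dimension
stored = np.random.random((100_000, d)).astype("float32")  # vectors already "on disk"
queries = np.random.random((5, d)).astype("float32")       # e.g. "show me cats"

index = faiss.IndexFlatL2(d)        # exact L2 search; IVF/HNSW variants do ANN
index.add(stored)                   # build the index over the stored vectors
distances, ids = index.search(queries, 10)   # 10 nearest neighbors per query
print(ids[0])                       # IDs of the closest stored images
```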
And up until 2017, with a billion images,
on their current platforms they just threw more compute at it.
So it's more storage, more compute, keep throwing servers.
We're starting to run into power problems.
And now with 2019 coming along,
the concept of a trillion images in a single data set
is becoming more challenging for these guys.
So the premise of this is you're going to Bing,
you go to the image search on Bing,
and you say, I want to see cats.
And you want to see how many cats it can come back with,
and we're all real-time people,
so how fast can I get those cats back?
Google used to put 0000012
to show you how fast they were responding.
That number's creeping the other way
because they can't see all the files
as efficiently as they want.
So by way of working with us on this concept
of doing queries inside the system,
I start with a balloon, I get balloons out.
Now I'm not finding the same image and I'm not trying
to. That's not the purpose of this particular app.
They use various different platforms.
So this tool is an open source application
that multiple different companies use,
including Microsoft in this case.
So how their architecture is put together,
there's multiple ways they do it too,
and we talked to them about that as well.
But the premise here is more just about the app
and what we can do with the application.
So if you look at the way the tool set's built today,
it uses what we're all used to seeing.
You load a training set, you index it,
you have to reload the database,
and if you add a new file,
you have to add it back to the database.
These are all multiple steps that go back and forth
between storage and host processor.
So this is an example of one way they put it together.
This is not by any means the only CPU structure you can use.
But what we're doing here from an in-storage compute perspective is as the file's written,
we're able to automatically update the index already on the drive.
We're able to create and modify the database, and then we're able to run the inference real-time on it.
So it takes milliseconds of time to do all that work versus seconds of time to get
it done in the quote unquote real world. So as we put this in actual numbers with our customer,
and they called it an intelligent SSD just to give reference. I didn't want to modify his slide.
With just running the system on the host, he was averaging, from a queries-per-second standpoint, or the metric
of this particular application, just shy of 500 queries per second. With 16
drives in the system, which he already needs
to hold the size of the database in that
platform, by turning on the in situ
processing, he got a 4x improvement with
no modifications to the app, and
all we did was port his existing application
into all 16 drives and run them
concurrently for him.
Yes?
So all these SSDs you're reporting here,
it's U.2 NVMe SSDs?
Yes, there are U.2 NVMe SSDs.
They're all, in this case, for this particular example,
they were 8 terabytes apiece,
and they were fully loaded, all 16 drives fully loaded
with 8 terabytes of image database.
What kind of bandwidth do you have
between the flash and your internal processors?
So we utilize the ARM-to-ARM communication.
So, sorry, the question was,
what do we use for internal bandwidth
between the application processors and the NAND?
Since we're talking ARM-to-ARM,
we're using the native interface there.
It runs around 16 gigabytes per second.
So ARXR?
You know, to be honest, I don't know the specifics
of it. I asked not to because then I end up
divulging IP to a room full of people.
Yes?
One confusion I have is
that in the storage systems
you're trying to serve, the images
from the client systems come in as,
say I upload an image as a Facebook user,
it finally hits the storage as a shard.
Yes.
It's a piece of that image.
It's not the entire image,
and it perhaps goes through some transform.
Will the individual SSD see the full images
so you can index things,
or is that artificial, or is that a real use case?
This is how the FAISS tool works:
by converting that image into a three-dimensional vector,
that vector is a much smaller footprint.
It's a couple kilobytes per file.
So I'm not actually sharding the physical image.
I am re-representing the image in this case
through that vector.
So it's a different implementation.
There are applications like that that I cannot accelerate.
So there are limitations to what we can do.
In the cases for what we've done so far today,
everything's a direct attach,
and the files are intact but spread across multiple drives.
There's no RAID cards involved either
because RAID virtualizes out my drives.
So did Microsoft...
I know this is maybe a little bit off topic.
Did Microsoft... You had indexing a year earlier on this slide; I don't know exactly what the premise is.
They're going to be presenting the final results of this
at a technical conference in December.
And it'll be public record.
And a bunch of this is already public on their site.
It's called the Microsoft Soft Flash Research Project.
S-O-F-T-F-L-A-S-H.
Did I get everybody's hands?
So this is queries per second.
This is what makes the application guy happy.
So we also thought about, well, what about other ways
to look at the way this analysis can be done?
Because I got to find TCO.
I got to make other customers happy. This is processing time. So not only can
I do queries per second, but how long does it take to get the processing done? Same 16 drives,
host in orange, host plus my drives in blue. If we take a closer zoom in, oh, did I not put the
zoom in on this one? I probably didn't. So if you look all the way
down here, you'll see that if I'm one drive to one host in this particular instance, this application
is actually slower running in situ on my drive because the database is not big enough and the
host is that much faster because it all fits into DRAM. But as you scale up to 16 drives,
two things happen. One, the amount of time it takes to process, to move
the data for processing into the host goes up.
And then you can see if we went beyond 16,
it's literally an exponential curve.
No matter how
big the database gets, no matter how many drives I
add, my results are consistent and
stable.
Is there any meaning that, like, around five or
six cores, is it a wash?
Is there any way to kind of, like to make a rule of thumb out of that?
So this result will be unique to every application you run.
So for this application, when you hit host plus four,
I just saved you four servers is the best way to look at that
because you don't have to have those extra servers in place to support that.
Because if I do host plus 16,
I'd have to throw, basically
to get from 42 seconds back down to half a
second, how many CPU cores
in new servers do I have to add?
At 4, it's 4. At 16, it's
about 26 servers is what we figured out
based on the way that this particular platform
is built. If you have higher performance
processors, more cores, all that kind of stuff,
this math will change. So this is
definitely a point, or a line in the sand, if you will, for this particular app.
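A rough reconstruction of that trade-off: the 42-second and half-second data points come from the talk, while the assumption that host-side throughput scales with added servers is illustrative.

```python
# Processing-time comparison at 16 drives, using the figures quoted above.
host_only_s = 42.0     # 16 x 8 TB pulled through host DRAM and processed there
in_situ_s = 0.5        # same query with the 16 drives computing in place

print(f"{host_only_s / in_situ_s:.0f}x faster wall clock at 16 drives")   # ~84x

# Buying that back with host compute alone means adding servers: roughly 4 extra
# at 4 drives and about 26 at 16 drives in the talk's estimate, with the exact
# count depending on how much compute each replacement server brings.
```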
On your previous slide when you said the performance is 4x,
was that sustained?
Yes.
That is sustained.
Do you have actual numbers?
Do you have, like, 4x times?
Because it depends.
How fast is the app that you're using?
Valid point.
In this particular instance,
this is the same data just represented differently.
These are all our drive,
and they are based on our prototype FPGA solution.
Our ASIC-based solution will show
even more of a delta to that system
because our ASIC-based solutions
are actually a faster processing drive.
And like I said,
the research project will be publicized by Microsoft
when it's final.
The current target's December.
Yes?
Why is the search time increasing super linearly
with respect to the number of drives?
From our perspective?
No, for the conventional one.
From the perspective in this case,
it's just because every time we increase the drive,
we're adding 8 terabytes to the database as well.
And so that 8 terabytes has to be moved from DRAM,
or into DRAM.
And so you're scaling it at 8 terabytes per step, basically.
So it takes that much longer to use the amount of DRAM,
in this case 32 gigabytes of DRAM,
to do that large of a data set through 32 gigabytes of DRAM,
it just takes longer.
So I mentioned we can also use this type of technology at the edge.
So this particular instance,
this is how we used to find images.
So if I wanted to find his face in this slew of things,
I'd have to start somewhere,
or I can use an AI algorithm to find them in one,
and then I have to figure out
where the timestamp matches in another.
We actually had some fun with the video clip,
and now it's not going to play for me, right?
So this is showing three individual cameras
directly connected to each of the individual drives,
so it's a one-to-one relationship.
You can see that we're tracking the image,
but we're slightly off because this is near real-time.
I'm storing the image, doing the object analysis,
and then sending that result to the host.
The relevant example for this particular case
is a 60-frames-per-second input.
I'm about 2.25 frames slower in my response time.
That's why you see a slight shift in some of the boxes. But I'm not losing the image, and I can
track from camera to camera. So the customer we're working with in this particular instance wants to
set up a chassis of 24 drives with 24 cameras in a circle, and have someone walk around the room,
and our drives can keep track
of that person and report back to the host while the host sits idle. That's in storage compute in
an edge style application. Steven was kind enough to point out the concept of a container.
So this is called OpenALPR. We literally took that and dropped it straight from the Docker
container store for ARM into the drive. You can see that it's an IP address that goes into our individual device. And
we're executing this license plate recognition application inside the drive. The host is,
again, sitting idle in this particular instance. So right now, he's basically clicking on an
image to get that result back in the confidence level. Native app, no changes to the platform.
We just wrapped a GUI around it. In this case, we're going to upload a brand new image
from a different drive into the existing drive
that we're talking about as a standard.
I'm uploading a picture.
And then he's going to go ahead and send it in.
It's going to get added to the database,
and then it can be, again, recognized
by this particular application.
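A sketch of what dropping a stock ARM64 container onto one drive could look like with the Docker SDK for Python, assuming the drive's embedded Linux exposes a Docker engine endpoint on its pseudo IP. The address, port, image tag, and paths are placeholders, not the vendor's documented interface.

```python
import docker   # Docker SDK for Python (pip install docker)

DRIVE_IP = "10.10.0.3"                                   # placeholder pseudo IP
client = docker.DockerClient(base_url=f"tcp://{DRIVE_IP}:2375")

# Run an aarch64 OpenALPR image against a picture already stored on the drive.
container = client.containers.run(
    "openalpr/openalpr",                                 # placeholder image tag
    command=["-c", "us", "/data/plate.jpg"],
    volumes={"/mnt/images": {"bind": "/data", "mode": "ro"}},
    detach=True,
)
container.wait()                                         # let it finish on-drive
print(container.logs().decode())                         # plate text + confidence
```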
So you have four A53 cores.
Yes.
Does the OpenALPR container automatically load balance,
so it's doing one-fourth of the work on each of them?
In this case, it's designed for 64-bit ARM,
that particular app, so we could drop it right in.
If there's not a native ARM 64-bit,
then it would not be able to do that.
Let me ask the question differently.
To do this demo, did you use one core or four?
It utilized all four.
And I kept that picture
because we're all developers.
We like our vodka.
So that's an example
of being able to run
a container inside the drive
with no changes
or modifications to the platform.
And again,
this is just that app running in place
on the drive, acknowledging and being able to read this.
So it doesn't matter what the container is.
I could grab a different app.
We've done things with TensorFlow as well
that we've shown in some of the instances.
So outside of this, we then started thinking,
well, what about if we go into the space
of our friends in biotech?
Because I've got to be friendly to a lot of different people. BLAST is an application that's doing protein sequencing based
on files in various different places and with different file sizes. This particular graphic
shows, across here, the number of cores in the system. Here it shows the number of drives with the
in situ processing turned on. And this is the percent improvement as you add drives to the system or as the database grows.
So I made a little animation of the slides to show you that as you build it out,
you can see that you continue to see an improvement as the database goes up, but it sits
somewhat idle. But as my database grows and I use more of my drives and more of my storage,
the shift in how much performance
gain I get, just by turning on the in situ cores inside the storage I already have, is
about 100% improvement. Yes? We could, yes.
So it was a conscious trade-off
to look at a reasonable processor
that still gives me benefit for an existing architecture
versus going for 100% acceleration.
A trade-off, if I went to the A7 class of application processors
or put more of them in there, my 12 watts per drive goes up.
And then I start becoming uncompetitive from an SSD-only perspective.
Yes?
So if I wanted to upgrade or patch the Ubuntu embedded OS,
how painful are the processes?
How long does it take?
It's all done through our software API.
I don't have the specific details on that,
but I can certainly follow up with you on it.
We support an upgrade path
through our host API;
it's able to push software updates to the drive.
Do we need to do it with the host?
No.
It doesn't require a reboot,
because at the level that we deliver the drive
that embedded Linux and that whole processing capability
is shadowed from the customer.
You don't actually know that it's there
as far as the drive being plugged into the system.
It's just executing the applications on your behalf.
So do you expose any vendor-specific
NVMe admin commands
to let the host manage this embedded OS?
Anything there?
Today, we don't open that option to customers
because that creates a lot more support requirements.
If you change something just wrong,
it breaks the rest of the IP in the system.
We are working to build a developer kit version of it
that would allow users to have more access
to the internal aspects of that processing.
What is the total internal flash bandwidth available?
And you may have answered this question.
So for the NAND itself,
we're agnostic since we don't make it.
We can run ONFI or Toggle,
so we run at those interface speeds.
Whatever the NAND choice is,
our net performance is relatively the same
because there's not enough difference between those.
Can you give us a ballpark?
I think the current NAND interface is like 400
megatransfers per second is how they quote it.
Across all the channels?
We're using a 16
channel controller in all of our different
versions. Last I checked
we're around 16 gigabytes per second just
at that interface. Per channel.
16 gigabytes per channel.
This might be a hard question
to answer, but did your CTO or somebody that designed the
chip say that, okay, if we're controlling 8 terabytes of NAND and we have 4 A53 64-bit
cores, is that a good mix?
As opposed to saying, let's just put 2 cores on there or 8 cores.
The 4 was a solid balance.
They went anywhere from two to 16.
They were thinking of doing a core per channel, for example.
And between a gate count limitation
inside of an ASIC and or FPGA
and the performance characteristics gain you get,
it made sense to do four as the first point
for this particular iteration.
Yes?
So adding these four cores,
does that add any cooling concerns
or thermal challenges?
So because my entire drive runs at 12 watts in a U.2,
I actually have less thermal constraints
than a lot of the drives on the market today
that are running at 25 watts.
I don't add substantial thermal
the way that this particular solution is designed today.
It doesn't get in the way.
In fact, I'm saving thermal budget
at the CPU level in the host
because the host is not actually running,
or it's off doing something else.
So I'm going to skip my nasty slide
that keeps transitioning on me
and go straight to this one.
So when we, so this is kind of the nuts and bolts of everything we did. So we had to start with a
new ASIC. So we did design an ASIC device. It's a 14 nanometer SSD controller that has those A53
cores embedded, single SOC. There's no two-part build to this particular platform. So you get
your NVMe SSD with the compute on the side if you want it, or you can just run it as an NVMe drive.
We weren't about to put it in anything
but the standard form factor,
so you get your U.2, your M.2, your EDSFF.
And if you really want size,
I have an add-in card that can support over 100 terabytes.
We had to make sure the management
of the product was correct,
so we wrote the firmware and algorithms
around reliability of the NAND from ground up.
We have about 12 different patents
on various versions of ECC, LDPC, and error recovery,
including one that allows for the drive
to have failed devices removed, replaced, and rebuilt
as a field-replaceable upgrade module.
And then we wrote the firmware,
and we focused on things like QoS. So this is
your five-nines window for my FPGA-based product, let alone my ASIC product, which is going to shrink
that. I will call out this is a very good marketing slide because I make it look like we're really
cool right in the middle of the other guys. This red window actually sits in the time scope ahead
of my drive. I'm not faster than they are. I'm more consistent than they are.
And half the customers I talk to
will take that consistency today
over the speed of the response.
And that's with computation running
while doing I/O to the drive.
Then on top of all that,
we put in the in-situ platform.
So the in-situ platform, to kind of recap,
has hardware accelerators alongside the ARM cores,
has the full-fledged, in this case, Ubuntu.
Again, we can run other OSs if desired,
and we added the Docker container capability to the solution.
Going into VMware, especially with now ARM being supported,
is a potential for us.
So at the end of the day, it looks something about like this.
Proprietary controller, so yes, I am another controller vendor.
No, I'm not going to sell you a controller.
I will sell you a drive in various different form factors or variants,
but I'm not going to sell you a controller.
And all these different form factors.
As I mentioned here, the AIC shows up to 64.
With QLC, I can do 128.
So for us, finding the needle faster,
having the in-situ processing as a core tenet of what we're up to, that's what we call computational storage today.
It's all about near-data processing. We're moving the compute closer to the data, and we're getting as
close as we humanly possibly can at this point by putting it inside the ASIC on the drive. There
are absolutely other ways to do near-data processing, and there are absolutely ways this
is not going to solve your problem.
So keep that in mind.
We wanted to make sure that we found people that can actually use it today
and help the rest of us figure it out,
so having the weight of Microsoft Research has been a great blessing to us.
We did have to do the flash agnostic controller.
It does support TLC and QLC today.
We've already characterized early QLC parts from one of our vendors, and we can
get half a drive write per day
on a full line write,
one gigabyte per second write to QLC
with our flash management algorithm.
Just a general
question.
In-situ processing,
you know, that's an idea,
and then NGD has a good execution
of that idea.
Yeah.
Are there other vendors who are executing on this idea?
Yes, absolutely.
We have friends in the marketplace.
So Stephen's sitting in a couple chairs over from me,
just shaking his head.
He has his own version of in-situ.
My friend that's hiding in the market around here somewhere and not presenting today has another solution
that's based on a host-based FTL
that can run this type of thing.
They run compute and storage,
but they use a host-managed drive.
So they're not an NVMe target.
They're a block storage target.
So they're similar.
They're as close to what we're doing as we have
in the marketplace as a like product.
So where ours is a standalone,
there's a host-based.
In the back.
Okay.
For the part of the application
that you showed
for the camera tracking images,
you mentioned that you'll first store the image,
which basically means you have to go through
the main system memory anyway.
In that case, is it because you're scaling? Because I have to pay the cost of in-memory anyway.
Right. Right. Right.
We're focused on in storage.
So in the example with the cameras,
it's a lot more about post-processing
than it is real-time processing.
So I'm not going for in-memory
or real-time data management
where it's sitting in memory first.
I have to store the bits first,
whether it's video image, file,
whatever the case may be.
Our product requires it to be stored
and then pulled back into application processors.
That's just a choice from our perspective.
Yes, right here.
How is the wear leveling handled on the flash?
Is it taken care of by the flash?
Yes.
Is it taken care of by the in-situ processors?
So the question was,
wear leveling, garbage collection,
the standard NVMe stuff,
that's managed by the NVMe half of our drive,
which is a separate ARM core, which is designed
for that, which is the M class.
The A class processors are only for
application execution.
Yeah, so, the question was,
has there come a point where networks
and interconnects
are so fast
that doing the compute off-storage
actually makes more sense?
Even as we double bandwidth, anything that's a stored bit,
if you have to move it off of that
into some host memory base that is not of like size,
I'll still have an acceleration factor.
Yes, you can make it faster and you have other ways to offload it.
You can offload it to multiple systems.
But if I've got 100 gig to one petabyte, I'm still going to be faster in some way.
The acceleration value will definitely drop.
So Gen-Z, for example, doesn't change that?
No. In fact, we're capable of doing a Gen Z interface on our devices if we wanted to.
So these are enterprise class drives.
There is power loss protection built into every single one of the drives.
One thing I didn't specifically call out, because I don't want to put too much IP on the thing,
our DRAM footprint, we use a thing called Elastic FTL.
It's a homegrown FTL
that requires less than the traditional
1 gigabyte to 1 terabyte that most
controllers use today. I use 1
gigabyte to control 8 terabytes.
So that way I have the room
for the DRAM I need in my application processors
and I still use less total DRAM than any other
drive of my like capacity.
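To make the ratio concrete for a 16-terabyte drive, here is a small worked example using the figures above.

```python
# DRAM needed for the flash translation layer at 16 TB, per the ratios above.
capacity_tb = 16

conventional_ftl_gb = capacity_tb * 1.0   # classic ~1 GB of DRAM per 1 TB of flash
elastic_ftl_gb = capacity_tb / 8.0        # ~1 GB per 8 TB with the Elastic FTL
app_buffer_gb = 4                         # upper end of the in-situ buffer quoted earlier

print(f"Conventional FTL:         {conventional_ftl_gb:.0f} GB")
print(f"Elastic FTL + app buffer: {elastic_ftl_gb + app_buffer_gb:.0f} GB")
# Even with the application processors' buffer included, total DRAM stays well
# under what a conventional mapping table alone would need at this capacity.
```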
So your power loss protection will ensure that both the application DRAM and the user data are protected?
Both the application DRAM and the user DRAM are protected.
The paradigm here is a post-processing paradigm,
but for streaming, live streaming,
if you wanted to process the data before storing,
could this architecture do it?
This architecture, no.
His architecture, yes.
I'll be honest.
There are so many different ways to do this type of architecture.
We're absolutely focused on in-storage compute.
I'd love to do both, but you've got to start somewhere.
Any other questions?
Yeah, very early.
Yes, in the back. So we use a host-based API library
that you use as your point as your storage target,
and then it pushes the application, a copy of the application,
into all the different drives.
And the CPU is sitting idle waiting for the application response from storage.
We're faking it out, effectively, by only sending up the results
and not sending the whole data set back.
So there's still some further quote-unquote post-processing required by the CPU,
but it's substantially less.
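The shape of that host-side pattern, as a hedged sketch: query_drive() stands in for whatever call the vendor library actually exposes, and the drive addresses are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

DRIVES = [f"10.10.0.{i}" for i in range(1, 17)]   # 16 in-situ drives (placeholder IPs)

def query_drive(ip: str, query: bytes) -> list:
    """Stand-in for the vendor host API: run the query on one drive, return its hits."""
    raise NotImplementedError("replace with the real host-API call")

def scatter_gather(query: bytes) -> list:
    # Push the same query to every drive, let them run concurrently,
    # and merge only the small per-drive result sets on the host.
    with ThreadPoolExecutor(max_workers=len(DRIVES)) as pool:
        partials = pool.map(lambda ip: query_drive(ip, query), DRIVES)
    return [hit for part in partials for hit in part]   # host only post-processes this
```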
Any other questions?
Thanks for listening. If you have questions about the material presented in this podcast,
be sure and join our developers mailing list by sending an email to developers-subscribe
at snia.org. Here you can ask questions and discuss this topic further with your peers in the Storage Developer community.
For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.