Storage Developer Conference - #112: Computational Storage Architecture Development
Episode Date: October 29, 2019...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org
slash podcasts. You are listening to SDC Podcast, episode 112.
Well, good morning, everybody. It's 8:30 on a Thursday after a long week of presentations, so we're going to try to have a little bit of fun this morning with some educational information and, of course, some product pitchy stuff because I am a vendor today.
I've spent the last day being the co-chair of the technical working group. Today, I'm going to give you a little bit more pitch about what my company is up to and what we see going on in the market. So with all good technical problems and technologies,
you have to have plenty of three-letter acronyms
to keep you busy for the day.
So today's learning opportunities, also known as TLOs,
are going to talk to you about things like the edge
and the need for...
Yeah, I know. There's a few.
Edge technologies, things like the...
It's going to do it to us again today.
I was doing that last night.
Same mic, yeah.
Yeah.
You should use my bad voice.
I should use my bad voice.
I'm going to talk to you like this today.
If there's not something wrong with my SDC presentation like last year,
it just wouldn't be fair.
So I'm going to give you kind of four highlights.
Edge needs CSDs.
So if you were here yesterday, you heard about a computational storage drive.
So my company manufactures computational storage drives.
And I'm going to talk to you about
the different form factor availabilities for those.
And my cohort in crime, Mr. Ellie,
is going to go grab my samples
that I left on the top of the room signage
because I was playing around out front.
And then I'm going to give you a little bit
about how we put our architecture together for this CSD,
and it uses a PCSS,
or a Programmable Computational Storage Service.
Again, that's a four-letter acronym to go with our three-letter ones,
because three wasn't enough for us.
And I'll give you some examples of how we use them
for artificial intelligence and machine learning
and show you some example workloads
that we've already demonstrated
for both customers and technical events. And then I'll give you a little bit of Hadoop and database
examples of what we're doing. So the idea here today is to talk about computational storage
deployments and how we've deployed them, how they would work if you would like to deploy them
yourselves, things like that. So it was brought up yesterday that we all have a lot of data, and we all know that we need to do a lot with that data.
And a lot of the times people don't want to hear me tell you why I think you need my technology,
because if I tell you why I need my technology, then I'm trying to sell you my Kool-Aid.
So what I wanted to do is spend a few moments and let you listen to a few industry experts talk about some of the problems in the market with data.
You can use their words for it, not mine.
Here's a quick video for you.
Going to be adequate
anymore to support the emerging
internet of things.
Oh, cool.
I'm telling you.
I'm going to have one of those days.
PowerPoint crashed.
There we go.
Internet of Things is approaching us faster and faster.
Things like your fridge, dishwasher, and coffee maker
will all have their own Internet connection.
And they will be able to gather data.
50 billion networked devices,
or 50 billion devices that can drive network traffic;
it's not the devices, it's the amount of data
that is just growing exponentially,
and at a certain point the network
cannot support any more data.
One of the problems that has plagued cloud applications is the latency required to get data back and forth to the cloud.
It just functionally wouldn't work.
There would not be enough bandwidth.
The servers themselves would get overloaded.
We need the network for more and more content.
You really need to have a device that can process the information that's coming in real time.
As these new use cases evolve, the autonomous car, the connected plane,
you've got this need for speed and latency and locality of compute
that's going to drive you to do some of these functions at the edge.
When it comes to making important real-time decisions,
edge computing significantly reduces the latency.
Instead of us adapting to computers to learn their language, computers are becoming more and more intelligent in the sense that they adapt to us.
It is estimated that 45% of created data will be handled at the edge.
That means storage, processing, analytics, and decisioning.
And that's going to drive a need for some new capabilities and new technologies.
So hopefully that gives you an idea. That is definitely a spin of the conversation related
to edge computing. And granted, there are a lot of different buzzwords we're using today.
We've got micro-edge, mini-edge, near-edge, far-edge.
And there's still the traditional data centers that still have to cope with some of these
problems around data.
So realistically, definitely spend a little bit of focus here from an edge perspective
because it does provide a perfect platform for technologies like computational storage
to be a value add to the ecosystem.
So one of the things that we want to make sure and highlight as we walk through the technology discussions
and hopefully throughout the course of the day as the other vendors and other people present is,
this is not a technology that we're designing or developing to replace existing architectures,
but to augment or to simplify, in some cases, the existing architectures.
Because there's been a lot of debate about what this is trying to fix or replace. It's not. It's
trying to look at it in a new way. And you can see here, this is an example from Gartner talking
about where they see all the different opportunities for data generation. And it basically means that
we have so many opportunities to expand the way compute is used, the way storage is used,
at various different locations in very
differing ways. The way these systems
are being built, the way they're being deployed,
the way people are using them,
is not just rack upon rack of
data. So we went from big
iron SANs to, I'm going to build a
rack of white-box whatever, dirt cheap.
Now we have to start architecting boxes that
are designed for hardened environments, for autonomous
vehicles, connected planes,
or just even the new POPs, points
of presence, that type of stuff. And those
architectures provide an
opportunity for something like computational storage
to come in at the beginning of those
architectural developments so that they're not
trying to retrofit, but they're able to actually
start off that way. And then you can look
at it as far as the retrofit ideas for the larger data centers, if you will.
So this is an example I tried to put together in the simplest way I could as a marketing guy
trying to talk to technical people. With all the data being moved to the edge, this funnel
represents the idea that I'm looking for the blue dots. And the blue dots are located somewhere, and I need to get them to the appropriate filter
within my funnel.
And as we move that data through the funnel,
you'll see that I have actually lost data,
but I've gotten the data I wanted,
but it took time to get there.
Some showed up sooner, some showed up later.
But if I go back, you'll see there's actually
seven down at the bottom, and only six come out.
That's actually a problem that exists
for one of our customers today: they're losing visibility into
some of their data because it's getting filtered in the wrong spot or it's not getting to them
fast enough to let the systems manage that. So, and I totally forgot my introduction pitch.
So as you've probably seen already, I've swapped slide templates a couple times today. I'm going
to continue to do that throughout the day. So I've got two or three more transitions that will take place through the deck.
I'm trying to keep you guys all awake and interested and paying attention, so if you can
come up to me either at the break, because I don't want to take away from the next presenter,
I have a table outside. I will happily pay you 10 bucks from Amazon if you can tell me how many
different PowerPoint templates I used, and they are posted online already, so you do
have a cheat sheet if you need to go out and look at it.
So, when we look at where
we can deploy these products,
the concepts of computational storage,
and specifically for us, these are all
workloads where we have engaged customers
today. But again, it's
not just the data center. So I can talk
to the hyperscalers and the second-tier
hyperscalers and third-tier hyperscalers all day long and get a lot of good business from those
folks. But really, as we move down this infrastructure, the edge devices and the
center infrastructure are really where we're starting to see more attention and more interest.
As mentioned in the video, connected planes, autonomous cars, we've got business there
today because they do see the value in this technology and what it can bring because their
infrastructure has power limitations. It has processing limitations. It has architectural
design limitations that can prevent it from being able to accomplish what it's really trying to do.
So when you think about an autonomous car, they want to be able to use one of these guys, your
M.2, because it's small and it offers an opportunity
to have some capacity attached to it.
But if I can put compute in something they're already buying
and give them some extra processing horsepower
for no extra millijoules of energy,
why wouldn't you want to try using that technology
and see if you can make it work?
There's going to be workloads it doesn't work for, I guarantee you.
From that point of view,
this is the different kind of ecosystems that we see as opportunities
for what we classify as our technology known as in-situ processing.
So processing within the device.
And in that case, you can see several different workloads or even infrastructures that we've looked at today.
TensorFlow's machine learning, I'll show you some examples of that a little bit later on. Last year, we introduced the FACE algorithm
and what we did with a co-project with Microsoft
around image similarity search,
which is an inferencing-like or even a search-like architecture.
We've got databases that we've played with.
We've got things that we'd throw in Docker containers
that we drop in the drive.
And you'll hear a little bit more about some other people's ideas
on how the Linux subsystem inside a drive will be of value later today
from one of our partners and co-sponsors of this event.
Content delivery is a great example
of where we can see these architectures work
because we're putting...
If you think about where the ecosystem is being built
and you read through the industry
about where we're doing edge deployments
and who's really building the edge data centers,
it has a lot more to do with the telcos
and the content delivery guys
because we're streaming everything. Disney Plus
goes live in a month or just over
a month. We've got Hulu. We've got Netflix.
All those guys have infrastructures that are now
existing further away from the data centers.
They're not just buying Amazon anymore. They're putting it
somewhere else where they need it. These types of
architectures can help them.
Machine learning, as I mentioned, we've
even talked about HPC as an
opportunity to offload. We had a nice conversation last night, and an HPC guy's like, I see a use case
for this. It's a great opportunity to look at this technology moving forward. So my one big
product pitch slide, if you will, is this guy right here. So I already showed you the M.2.
That's eight terabytes. This guy can run up to about 16 terabytes. This is our new fun EDSFF.
You may have seen the presentation yesterday
that talked about the new form factors.
And then we also have your standard 2.5-inch drive.
And I can make this guy fit up to 32 terabytes.
And that's where we get into an interesting conversation
is blast radius or ability to actually deal
with the gravity of the data on this device.
If I've stored 32 terabytes of data
and I only really care about a couple hundred gigabytes,
why do I want to pull all 32 terabytes
back into host memory to do some work?
Why not let the system that you already have
that has the capability to do it
search through, sort, rearrange, do whatever else,
not wear leveling and garbage collection,
but actual data management, data analytics,
data transformation at the device level,
it saves you the bandwidth problem
that is always going to exist.
No matter how many lanes we put on the front of it,
no matter what form factor we put it in,
no matter how much power we give it,
we will always fill the lanes of traffic we create for data.
And the data size is not shrinking,
so that's why these things are of value.
The fun part is things like this.
I think we started with one form factor.
There's now like 12.
So there is a lot of debate going on in the industry
what to do with form factors around these technologies
as much as it is what we can do inside of them.
So our architecture today,
we build an ASIC-based computational storage processor
that is built into our NVMe SSD controller.
We looked at it from a perspective
that there is enough of a market adoption
for products and technology today
that putting it in an ASIC format,
saving the hops,
giving you something that you can scale
in the right form factors was very paramount.
But we are delivering an off-the-shelf NVMe SSD,
so we have to do all your traditional data management,
wear-leveling, garbage collection,
flash characterization,
because I don't care which flash vendor you give me
because I'm not a flash guy.
That's one benefit that my solution offers
is I can work with any of the NAND vendors.
We put it in the right form factors
and then we add what we classify
as our startup value add,
which is this in situ processing stack.
We took a look at it and said
we want to make this as flexible
for our customers as possible
and there's absolutely opportunity where what we've done doesn't fit for all workloads.
But for the workloads and for the customers we've talked to, this is what they would like to see come about.
So we've got a full drive Linux running or an OS.
So the application cores that we've installed in our ASIC or built into our ASIC let us run an OS.
Does it have to be Linux?
No.
I've talked to folks about FreeBSD.
We had fun as an experiment in a lab.
We actually got Windows running inside the drive.
Why you do that, I don't know, but it works.
We can offer the virtualization concepts
by letting you drop containers in the solution.
This, again, is a flexibility play.
It gives you the opportunity to be more flexible
or easier to deploy your solution.
And then it's built off, in this case, an ARM quad core,
as I mentioned, a partner of ours.
And we've even got the ability to throw hardware acceleration in it.
So this first solution gives you quite the opportunity
to look at these devices as basically a Linux subsystem
within your particular platform.
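To make the container idea a little more concrete, here is a minimal host-side sketch, assuming the drive's on-board Linux runs a Docker daemon that the host can reach over the tunnel discussed later; the endpoint address and image are placeholders, not the vendor's actual tooling.

```python
# Hedged sketch: drop an unmodified container into a drive's on-board Linux,
# assuming its Docker daemon is reachable from the host. The endpoint and image
# are placeholders.
import docker

# Point the standard Docker SDK at the drive's Linux instead of the local daemon.
drive = docker.DockerClient(base_url="tcp://10.10.0.11:2375")  # hypothetical drive endpoint

# Run any ARM-compatible image; a real workload would operate on data the drive
# already stores rather than printing a message.
output = drive.containers.run(
    image="python:3.11-slim",
    command=["python", "-c", "print('hello from inside the SSD')"],
    remove=True,
)
print(output.decode())
```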
As we go through the course of the day,
you're going to hear various other ways
that these particular solutions are being designed and built
for the concept of computational storage.
Hopefully what we have to offer in some of the examples I'm going to walk through in the next few minutes
will give you an idea of what you can really do with it.
So as we look at it from more of a realistic implementation perspective,
we've taken some of the resources from a CPU, we've stuck them in the design,
and then we've built a solution stack around it to support you.
And I'm going to keep reiterating a couple of key points
because I want to make sure that they come across correctly
for at least our type of implementation
of computational storage.
It's an off-the-shelf NVMe drive;
it will do, look, act, and treat your data
as any other NVMe solution will today.
You have the opportunity to turn on
the additional ARM processors
within our product
that then can offload different types of workloads.
And I've got AI workloads.
I've got training ML workloads that I'll walk through, show you a Hadoop example.
But the other part of this is we have to pay very much attention to the ecosystem it's being plugged into.
So, for example, this form factor only gives me 8 watts.
So I can't blow that budget while turning on the compute.
So we had to write the architecture and design the architecture correctly
so that your data can come in and out as an NVMe drive,
and I can still do processing, and I don't overpower the system.
And this is where things like the connected plane, autonomous car,
or near-edge device platforms are really taking a hard look at this
because I'm staying in that envelope.
I've optimized the solution to give you the best of both worlds. And we've classified that as watts
per terabyte because it's a high density, low power consuming compute offload. Throw that into
a TCO model for the overall system and your overall system power comes down as well. And I
have an example of that in the deck. So when we looked at it, we said, here's your traditional
SSD. You've got a media controller, you've got some DRAM, and you've got some NAND.
And there were some companies back in 2012, 2013, that took this off-the-shelf design
and actually implemented a version of what we're now calling computational storage.
The trick that they ran into as a problem, if you will, is that singular media controller has to do too much.
They don't have enough processing power to do true compute offload and manage the device.
So we said, well, if we're going to do this right,
we're going to add into the solution
a secondary application core dedicated to compute offload.
And that's where these ARM quad cores come in.
We load the OS.
And then the next part of it is,
well, if I want my customers to be able to use this,
I can't add another interface.
I can't create another path
because it needs to be able to plug in and work
and be able to be operating from that perspective.
So we're able to take the application
or a version of it, or a part of it,
depending on the level of complication
and engagement with the customer,
and migrate the application
to actually execute user code inside the device.
So I'm not modifying the code.
I'm taking an instance of it, dropping it in,
and doing it in parallel across multiple drives.
That way you get this concept of parallel and distributed computing
with very little effort.
Today, the way that we're doing it, it is a custom library and a custom API.
That's one reason why we joined up with SNIA
and created this computational storage working group
along with about 40 other companies, because we realized that the way we're doing it is a little bit
new and innovative, but I can't make everybody adopt just my way of doing it, because
there are other people in the room that are doing it differently as well. But we all agree that if we make
at least the discovery and the way to plug it in and see what it can do common, then it'll work
better for everybody. And since I already had DRAM in the device,
I can share that DRAM between managing the drive, wear leveling, garbage collection,
data placement, and data manipulation in the way of transformation or any of the workloads
that I'm about to walk through. So with that, I wanted to get into a couple
actual architectural designs because you got to do a little bit of a mix and match of product pitch and technology and the natural execution
of it. So this first example is where we took the concept of using our on-drive Linux. We
loaded Keras APIs into it that now run a TensorFlow application known as MobileNet V2. Now this
MobileNet application is an object identification
or object recognition application.
The little video is a GIF file,
so I didn't bomb poor Brooke and the SNIA team
with a large file,
so that's why it's a little jumpy from that perspective.
But as you can see,
this application is taking those particular objects,
it's identifying them and telling you what they are,
and giving the four closest representations of it. Now, the trick to understand is I have not modified that application. That is
simply the application executing real-time in my device where the USB camera is passing information
through the CPU to my SSD. I'm running the code, and I'm replying back to the host with the answer.
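For reference, an unmodified Keras/TensorFlow loop of the kind described here looks roughly like the sketch below; it is illustrative only, and the camera handling and result reporting are assumptions, not the actual demo code.

```python
# Illustrative sketch of the unmodified MobileNet V2 workload described above,
# running under the drive's on-board Linux. Camera index, loop, and reporting
# are assumptions, not the demo's actual code.
import cv2
import numpy as np
from tensorflow.keras.applications.mobilenet_v2 import (
    MobileNetV2, decode_predictions, preprocess_input)

model = MobileNetV2(weights="imagenet")        # stock model, no modifications

cap = cv2.VideoCapture(0)                      # USB camera frames passed through by the host
while True:
    ok, frame = cap.read()
    if not ok:
        break
    rgb = cv2.cvtColor(cv2.resize(frame, (224, 224)), cv2.COLOR_BGR2RGB)
    x = preprocess_input(np.expand_dims(rgb.astype("float32"), axis=0))
    preds = model.predict(x, verbose=0)
    # Reply back to the host with the four closest matches, as in the demo.
    for _, label, score in decode_predictions(preds, top=4)[0]:
        print(f"{label}: {score:.2f}")
```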
This can be done at scale where we've done examples of where we plugged multiple cameras into a system. Each camera goes to an independent drive. They all run concurrently
and the host is sitting idle. The host is simply doing data pass-through. None of the actual
execution of that MobileNet V2 is being done on the host resources. So this is an example where
you can use it. In this case, it could be surveillance or it could be some other form of
useful machine learning type of workload
from TensorFlow. So another example is when we get into the conversation around neural networks
and weightless neural networks versus convolutional neural networks. Last year with our example with
Microsoft, it was a convolutional neural network. Weightless neural networks are starting to gain
traction, but some of the problems we're running into when you look at these, for example, this particular academic study, it's this concept of federated
learning or distributed learning. Google's brought it up in 2017. There's been a lot of papers on it.
They tried doing it using cell phones for purposes of image capture and stuff like that with the
Android platform, but it needs to be able to migrate it into more realizable and useful technology
as well. And so we took an effort and said, well, we need to look at it from a different point of view
and show that parallel and distributed training
can be done in our drive.
We can do training in our drive.
Because it's a federated or transfer learning process,
we're taking advantage of technology
that others have deployed and implemented
and being able to make the world, quote-unquote,
better, faster, stronger.
Again, you're not buying extra hardware. You're simply using a device that has this resource
available. So I'm not adding to a system. I'm just using existing platforms with a new piece
of technology that you'd buy anyway because you need the storage and making it do something a
little different. So this is going to be a quick kind of tutorial, if you will, of a walkthrough
of how federated learning works
and why it's valuable to the industry from that perspective.
So I've got two systems here.
On the left side is your traditional machine learning training algorithm path,
and on the right side is how you do it with our technology
or a computational storage solution.
So the first thing you always have to do,
and this is always going to be the case,
is you've got to put data in the drives. Data you're going to store in this case, storing
pictures, storing whatever you want. For example, the object tracking that was done in the GIF on
the previous slide. So that we don't change. You need the storage to work. It has to be common
storage, hence off-the-shelf NVMe. But then it starts to get interesting when you get into the
next step, because what we then do is we migrate a copy of the existing training model into each one of our compute resources on our drives.
Now, each of these drives have independent data being stored on them.
They're not identical copies.
It looks like I'm replicating the same image, but they're actually storing massive amounts of data in parallel across different devices.
So now each of these training models are going to start doing some work,
and when you do the training inside the device,
each of those products, or each of those drives,
are doing real-time training,
while on the other side you're re-migrating all that data
back into the host resources and using the host
to do that model training.
So this is your traditional versus your opportunity
to save a lot of bandwidth
and host power and host resources. The trick that then becomes is as you start to evaluate and
update that training model, you'll see that on the left-hand side, I've got a model that the
host CPU has managed through all of that data. On the right side, I've got a slightly different
variation of that model, because each of the individual devices has done a sparse model update. They've transferred that sparse model update up to the main model, and it's created an even
stronger model because I've distributed it across so much more resources, and I've done it on a
more localized set of data. And since that data set is smaller than the models being trained on,
it's actually more efficient. The trick was to be able to get it back up to the host and recombine it, if you will, into something in the way of a new or innovative workload.
And that's kind of what's been going on with this particular focus.
So then what you do next is, on the left, you have to continually repeat these steps.
Train the data, load a whole bunch of information, evaluate the data, create a new model. And you're doing that by migrating data back and forth
as I was showing with my fancy green arrows.
On the right side, I've simply migrated that new model down
and I'm going to reiterate again,
but I'm going to continue to be a level of model value
to the customer or to the person using this workload
because I'm creating a more distributed
and useful example of this training. And it saves
host resources to go off and do what it needs to do in a way of gathering new data to put into
those devices because you're constantly updating the storage with new information to create the
need for an updated or new model. So it gives you the ability to walk through this path over and
over again where I'm always going to consume the data, but I don't always have to return the data. I can return the value of the data in the way of this, in this case, a training
model. So I'm doing useful work on the data. The data has never actually left my device, but yet
the value of that data has been presented back. And that's really where this concept of computational
storage starts to gain even more net value, if you will, to the market.
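As a rough illustration of that loop, the sketch below mimics the flow just described with NumPy: each drive trains only on the data it already stores and returns a model delta, and the host merges the deltas. The training function and shapes are toy placeholders, not the actual on-drive code.

```python
# Toy NumPy sketch of the federated update loop described above. train_locally()
# is a placeholder for whatever training the on-drive application really does.
import numpy as np

def train_locally(model, local_data):
    """One round of on-drive training; returns only the (sparse) model delta,
    never the raw data."""
    updated = model + 0.01 * np.sign(local_data.mean(axis=0) - model)  # toy update rule
    return updated - model

def federated_round(global_model, per_drive_data):
    # Host pushes the current model down; each drive works on its own local data.
    deltas = [train_locally(global_model.copy(), d) for d in per_drive_data]
    return global_model + np.mean(deltas, axis=0)   # host merges the returned deltas

rng = np.random.default_rng(0)
model = np.zeros(16)                                        # toy model weights
per_drive_data = [rng.normal(size=(1000, 16)) for _ in range(4)]  # data already on each drive
for _ in range(4):                                          # a few federated iterations
    model = federated_round(model, per_drive_data)
```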
So what does that look like in reality?
So this is an example of a database that was run, and it is a somewhat small database,
but the concept here is I want to get to the most efficient level possible.
And this particular training algorithm shows that after only doing four iterations with the federated model, I've reached a 94% accuracy of the data.
So you can see that it asymptotically gets up to 98% over a very long period of time in the
existing model. But by doing it in a federated fashion, it only took four iterations within the
device to get to the same level of model training that it took an entire ecosystem of GPUs and other
things that we're using for machine learning. So these are the values that we're bringing.
I've saved power.
I've left the host alone.
I've stopped moving data,
freed up the data pipe to ingest more data
because I'm not moving data back out
because you always end up with a two-way street problem.
And yet I'm still providing what the customer really needs
at the end of the day, which is the value of that data.
And that's what our computational storage
and what computational storage products are working to provide. So in real time, as an example, this particular video
clip is showing that we're going to do a model training of that object. So as this person moves
the object around, the camera is tracing, following, and learning what that object is from all the
different angles. You can see up there, in the model update counter, that there have been 18 model updates in the short period of time
the video loop has been running
because this particular box here at the top,
this is the quad core in my device.
This is not the host CPU.
Those guys are the ones doing all the work off of the different drives
doing those model updates.
I don't have the host actually executing this code real time
on that image file or that video stream.
The devices themselves are executing those model updates for that particular device.
And this can go on.
We kept it short, again, to keep file size down.
But you can see that the ability to run real-time compute algorithms with no modification to the code,
this WiSARD machine learning algorithm was just copied into the multiple devices
that are used to do these model training updates.
So that's where the value of computational storage
and what we're offering our customers provides.
So AI and ML is great.
It's awesome.
We know there are big buzzwords,
and I'm going to continue to ride that wave as long as I can
because it's fun, and I'm a marketing guy
with a little bit of techie.
But there's also some real-world big data stuff like Hadoop that are useful to talk about, as well as some databases,
and different ways of looking at how you can manage these things.
So you're going to hear a lot about different database implementations, different Hadoop implementations of this product.
So I wanted to give you our spin on it for today.
So this represents a Hadoop cluster that was built, and down here kind of represents what we're trying to do with our data,
which you see that we've basically taken
a portion of the Hadoop workload,
the data management node,
and we've migrated into multiple NGD SSDs or CSDs.
And up at the top, you can see the two flat lines
that are called the 16-core host.
We've dedicated 16 cores of this Xeon processor
in the baseline of this product to normalize that result.
So no matter how many drives are in the system,
this particular workload on this set of 12 drives,
the performance is the same
as it's doing this particular application,
which is a sort application.
So we're like, well, that's great.
We can make it faster, but let's make it more efficient.
So we turned off eight of those,
or in this case, yeah, 12 of those cores,
three quarters of the processing power.
So if I turn those off,
of course, with no computational storage drives turned on,
it's going to run slower,
and it's going to consume more energy
because it takes more time to complete the task.
But then I start turning on the computational resources
within just a couple of the drives,
which manage part of the data.
And as you can see, as we start turning on these drives, at 2, 4, 6, right around 9 drives,
I am now processing this application on this data set at the exact same rate of speed as
the host was doing with 12 additional host Xeon cores versus using the computational cores inside my drive.
And again, my drive is consuming no additional power
to provide that performance benefit,
and I'm saving you power because the Xeon is not running as fast,
or it's off doing other things.
So over here you can see that we looked at it from a power perspective
because that's part of the TCO model.
And you can see, again, as I start to turn on the drives,
my crossover point on power savings is actually before I reach a performance benefit.
But then as I get significantly more power consumption savings, I've also gained 40% in execution at just 12 drives.
And I'd challenge anybody to tell me someone that does a Hadoop workload with only 12 drives.
Now, will that asymptote off? Absolutely.
It's not going to constantly be a forever better improvement.
But the simple fact that at 12 drives or half a server,
I can provide you execution 40% faster on a given workload,
that provides value to what this technology
and what our products do for our customers.
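To show the arithmetic behind that crossover argument, here is a back-of-the-envelope sketch; the wattages and runtimes are illustrative placeholders, not the measured numbers on the slide.

```python
# Back-of-the-envelope energy math behind the crossover argument above.
# All wattages and runtimes are illustrative placeholders, not measured results.
def job_energy_wh(host_watts, drive_watts_each, n_drives, runtime_hours):
    return (host_watts + drive_watts_each * n_drives) * runtime_hours

# Baseline: full 16-core host doing the sort, drives acting as plain storage.
baseline = job_energy_wh(host_watts=150, drive_watts_each=8, n_drives=12,
                         runtime_hours=1.0)

# Offloaded: host throttled to 4 cores, 12 CSDs computing inside their existing
# 8 W envelope and finishing, say, 40% faster.
offloaded = job_energy_wh(host_watts=60, drive_watts_each=8, n_drives=12,
                          runtime_hours=0.6)

print(f"baseline : {baseline:.0f} Wh")
print(f"offloaded: {offloaded:.0f} Wh "
      f"({100 * (1 - offloaded / baseline):.0f}% less energy for the job)")
```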
And then if I look at how you may build one,
and this is definitely a singular representation
of how you may consume this particular technology,
but I'm using SAS HDDs in this case
because that's dirt cheap
because that's what people want to do
when it comes to building these big data platforms.
It takes nine rack servers
to get me to 864 terabytes of data storage
or three quarters of a petabyte.
And it has a single Xeon processor in each box
because I'm keeping it dirt cheap.
I don't want to put a lot of effort into it.
On this side, we take our 8 terabyte M.2s
and some fancy new platforms that support 36 of those.
And in three single 1U chassis,
I can give you that same exact density.
So now here's your, I can shrink it and make it better play.
But what this shows up here
is I've now added 432
capable processing cores
to that subsystem.
So I went from 9 Xeons
with a whole bunch of cores
within the Xeon
to 432 additional drive cores
that can be used to manage data.
Now, TeraSort works great
because, again, data's consumed
and I'm just simply looking through it.
WordCount is another great example.
I'm counting information through the data.
I'm not looking to move the data. I'm not even looking to transform the data.
I'm simply looking for the value of the data. And that's really what comes into play for this
particular example. So another way to look at it, we did a MongoDB example, and we took a different
spin on this particular workload. So this is an example of a retail website that's running in
Mongo, and you can see the
data being generated by the different websites
kind of scrolling through this simple little video clip.
So from this perspective, I'm not
necessarily looking at computational
storage as an acceleration
of an architecture, but an ability
to scale the architecture.
I turn those things off three, four times, I swear.
So this is
going to probably slip on me again, but
let's see if
I can get it to stop.
But the idea here is I can provide scale
because as I add a storage device to increase
the size of this particular
website or a retail footprint
that's using a MongoDB platform,
I provide them the opportunity to not have to
add more processing, simply more
storage. And that's really what it comes down to value.
Again, sorry folks for that.
So,
scalable computational storage.
The ability to drive
what I classify as the new cloud, because we have
a cloud, we have a fog,
we have somebody at one point called
it mist to get closer to the edge.
However you look at it, our data
is moving all over the place. It's no longer
in just one location. It's no longer in just
one type of architecture. It's no longer
in just one type of workload.
And being able to provide
this flexibility of workloads and
deployment is very much a key
for what we're doing from this perspective.
And that's what this edge data growth is
doing when it's challenging these platforms. If you look
at some of these new servers, we have a partner of ours, Lenovo, that has built an edge server.
They were running around at the event when they first launched it by pulling it out of their backpack.
It can plug into a wall outlet.
It's got 5G engines on it, and it pulls right out of their backpack.
Well, they need storage in that server.
How are they going to get enough storage in that server?
And there's compute in that server, but if it's just plugged into any Joe Schmo outlet, it's not going to be
multi-core dual processor Xeons in there. It's a smaller, lower power, lower cost solution,
but it still needs to provide value. So if we put our particular drives in with it,
we give them some extra boost at no additional cost to the system because it's just storage.
They have to have it anyway. And if you can give them a high-density storage solution, that's even better.
And then, as I've tried to illustrate
this concept of in-situ processing
or our version of computational storage,
it has a wide range of opportunities
to engage with our customers.
You can SSH right into our drive
if you want to have that capability
and treat each drive as a Linux microserver.
You can just simply use the GUI we provide and recompile on ARM.
That's the extent of some of the software changes required for some customers.
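A minimal sketch of that Linux-microserver view might look like the following, assuming the host can reach each drive's on-board Linux at some address exposed by the vendor tooling; the addresses, credentials, and command are placeholders.

```python
# Hedged sketch of treating each CSD as a Linux microserver over SSH. Drive
# addresses, credentials, and the command are placeholders; how the drive's
# Linux is exposed to the host is vendor tooling that is not shown here.
import paramiko

DRIVES = ["10.10.0.11", "10.10.0.12"]           # hypothetical per-drive endpoints

def run_on_drive(addr, command):
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(addr, username="csd", password="csd")  # placeholder credentials
    try:
        _, stdout, _ = client.exec_command(command)
        return stdout.read().decode()
    finally:
        client.close()

for drive in DRIVES:
    # e.g. check what is running on the drive's quad-core ARM complex
    print(drive, run_on_drive(drive, "uname -a && nproc"))
```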
Our particular workload example that we did last year,
I didn't want to reuse the same slide two years in a row,
showed that we could save about 500 times the load effort
for the algorithm that Microsoft was using for their AI workload
because they're not loading
all the images, they're not loading all the data into the host memory, and they're not loading
memory, processing on it, flushing it, reloading. They're simply getting the value of the data out
of the storage devices at scale. So I'm running a little bit ahead of schedule because I talk way
too fast when I get excited about my technology. So I've got a few minutes for some questions in the room. If people have any thoughts, I see some
curiousness on some people's faces. So please feel free to ask a question. I can go back over
anything or I'll let you free for a few extra minutes.
How does that communication scheme work between your onboard Linux and the host? So for our particular implementation of this,
if I go back to my fancy little drawing here,
we use, in this case, a tunnel over the NVMe bus.
So we actually embed TCP packets within the NVMe transfers
to move the compute resources over.
So I've heard people say that TCP can be a little slow
or a little odd as a choice,
but it's a tried-and-true way that exists within Linux
as a platform.
Again, we're reusing things that are already known,
not trying to reinvent the wheel.
So yes, thank you for the question.
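To give a feel for what that tunnel buys you, here is a hedged host-side sketch: once the tunnel presents the drive as an ordinary TCP endpoint, plain socket code is enough. The address, port, message format, and the on-drive agent are all assumptions, not the actual API.

```python
# Hedged sketch of host-to-drive communication once the TCP-over-NVMe tunnel
# presents the drive as an ordinary network endpoint. Address, port, and the
# JSON request format are placeholders, not the vendor's actual protocol.
import json
import socket

DRIVE_ADDR = ("10.10.0.11", 5000)   # hypothetical tunnel endpoint for one drive

def ask_drive(request: dict) -> dict:
    with socket.create_connection(DRIVE_ADDR, timeout=5) as sock:
        sock.sendall(json.dumps(request).encode() + b"\n")
        reply = sock.makefile().readline()       # one JSON line back from the on-drive agent
    return json.loads(reply)

# Ask the drive to search data it already stores and return only the matches,
# never the raw data.
print(ask_drive({"op": "grep", "pattern": "blue_dot", "path": "/data/logs"}))
```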
Cool.
Yes, sir?
You talked on a previous slide about your 16-core system versus four cores.
You talked about power on that.
And what you used as a reference was you did not have the additional processing power for the computational services?
So the question he was asking is, in this example here, this particular case, we did compare both with our drives on and off, which would be simple.
I don't just turn my computational resources on.
He asked, did we also do a comparison
of using non-NGD or just traditional drives?
We did do that as well.
It shows a similar type of performance challenge,
but I wanted to just...
We didn't put that particular data in this slide,
but we have done that work.
If you're interested in knowing more about that,
I can certainly provide it.
This was our drives on and our drives off
for this particular representation, for sure.
Yes, sir?
You talked about the architecture
having a processor with a hardware acceleration.
For the example,
was it largely just using the RISC cores of the CPU,
or did you actually offload something to the hardware acceleration? So right now, as far as the products we provided,
we do have a hardware acceleration engine inside the CPUs.
We have not actually had to turn it on yet.
We're actually working right now with a customer
to turn on that hardware acceleration
to further advance some of the AI workloads that we're using.
Because we do have workloads today that we've tested, that customers asked us to look at,
that are not optimized yet, and so we actually slow the system down.
I'll admit that we can be slower in some workloads.
We are in the process of turning that on, and there's a lot of opportunity
because it's a very open-ended engine from a perspective of what we can do.
But it's part of that ARM subsystem that we've embedded inside the drive.
How are you balancing the power of the 8 watts between the NAND and the ARM?
So the question,
and I should have repeated the last one,
so I apologize.
So the question is,
how do we balance the power
between the overall subsystem of this?
So when you look at my particular solution
as an NVMe off-the-shelf drive,
I'm not quoting a million IOPS.
I'm not quoting three gigabyte writes.
I'm not the fastest drive on the planet because I don't need to be.
So we've challenged and taken an effort looking at it as a random read,
heavy read-centric-like device, which is what everybody's calling a read-intensive drive.
There's overhead in those devices if you build the rest of the ASIC correctly.
So we did an ASIC from the ground up;
we're on 14-nanometer process technology,
which gives us a bunch of power savings.
And then for the workloads, we've optimized the writes
so that we're not conflicting with that particular 8-watt limit
for the M.2.
And then when you get into the EDSFF and U.2,
they have a much larger envelope,
so the most challenging one we have
is definitely the M.2.
And you'll see that in the raw performance numbers.
But again, as I've shown,
raw performance versus compute,
I don't have an issue with the speeds of the drive
when you're using it in a computational example.
Yes, sir. Sorry.
You first. Your hand went first.
You're showing a direct connection there
between the NAND and the host, right?
Does that work?
So this is an oversimplified marketing diagram.
No, so the trick here is
the data comes in over the NVMe protocol,
the transport, PCIe transport NVMe protocol.
Once it's in the media controller,
we're transforming that data to be stored in the NAND
using the NAND, in this case,
toggle mode or on-feed type of architectures.
This really shouldn't be touching that.
So we do use the media controllers attached to the ARM processors
to do that so that we can see the
data structure. Because we have to know the data layout,
we have to know which LBAs are where, that kind of stuff.
But the trick is I'm not
using anything in the way of a storage protocol,
I'm simply creating a bus
between the application core and
the media cores to do that management, which is still
significantly faster than using
the NVMe protocol overhead.
So you're going through the media controller, but you're sharing the RAM and then you're accessing it?
Basically, yes.
There's a couple of additional internal buses that reroute where the application cores
talk to the media,
but the trick is we did not put,
in our case, we did not put the application processor
in line blocking the NVMe
so that we don't create a contention between
writing data and managing data.
Yes, sir?
So,
does your host API
stack support getting data
from a different CSD
into the application processor
on another one?
So the question came from my friend who loves peer-to-peer,
asking if I support peer-to-peer.
Or CMB.
Yeah, or CMB.
So this first version of the product
is not designed for that particular workload.
It is fairly new.
It's your host memory, right?
Yes.
So the API isn't true peer-to-peer,
but if I want to get input data from a couple of my drives
and bring it to one of the ARM cores, you can do that through host memory.
Correct.
Any time you want, right?
Exactly.
So there are multiple ways to deploy this in an environment.
Today, our view of it with our current solution is I'm going to have a whole bunch of these in a system.
I'm going to push an exact copy of whatever I want to do to every single device that's in that particular workload environment. If the information I'm interested in is not located on one of those drives, I'm simply going to get nothing back from that drive
because it's looked through it and has no value added. It comes back with a null or
whatever you want to call it. Longer term, now that things like peer-to-peer, CMB, and
some other new features are coming out within NVMe and the PCIe architectures,
they will certainly be enabled in the next solution from that perspective.
Yeah, absolutely.
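The deployment pattern just described, pushing the same task to every drive and letting drives with nothing relevant return null, looks roughly like the toy sketch below; submit_to_drive() is a stand-in for the vendor's host-side call, and the "drives" here are just in-memory dictionaries.

```python
# Toy sketch of the scatter-gather deployment pattern described above. The
# "drives" are plain dictionaries and submit_to_drive() stands in for the
# vendor's host-side call; a real CSD would run the search in-situ.
from concurrent.futures import ThreadPoolExecutor

DRIVES = [
    {"id": 0, "records": ["blue_dot_17", "red_dot_3"]},
    {"id": 1, "records": ["green_dot_9"]},
    {"id": 2, "records": ["blue_dot_42"]},
]

def submit_to_drive(drive, pattern):
    hits = [rec for rec in drive["records"] if pattern in rec]
    return (drive["id"], hits) if hits else None   # nothing relevant -> null back

def scatter_gather(drives, pattern):
    # Push the exact same task to every device; combine results in host memory,
    # since there is no drive-to-drive peer-to-peer in the current product.
    with ThreadPoolExecutor(max_workers=len(drives)) as pool:
        results = pool.map(lambda d: submit_to_drive(d, pattern), drives)
    return [r for r in results if r is not None]

print(scatter_gather(DRIVES, "blue_dot"))   # -> [(0, ['blue_dot_17']), (2, ['blue_dot_42'])]
```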
Another one I did have.
What are your thoughts on Ethernet?
So, good question.
Is Ethernet a good place to put in line with the NVMe?
So, we've had a lot of discussions with customers about that.
We have no qualms with necessarily doing that.
As of today, there's no immediate plans to do it on this solution.
There's always the dongle or the attach point from certain other vendors in the networking space that can provide that solution for us.
So, there's opportunity to do that, but it also creates a lot of challenges that a lot of people may or may not know about around drive management and things like that.
So for today's products, we're going to stick with an NVMe solution.
So, yes, sir.
You already asked a question.
Yeah.
No, you.
SW.
Sorry.
So there's some implied data structure in there.
You're processing video, which is not necessarily blocks. So you're talking probably about this guy right here, right, for example?
Any of those where you're...
So this particular application is being used to capture and store information,
and in the host memory version,
a non-computational version,
you're streaming that video into the application.
It's doing the data manipulation,
identifying the object and sending it out.
And it's being shipped to a drive
to be stored as the drive stores it.
I'm copying that instance into our device.
So all I'm doing is moving that ability,
that exact application that already knows that file structure,
already knows how that file system's working,
it's attached to the device just a little closer.
I'm not recreating that structure from that perspective.
It could be file, it could be block,
it's whatever that particular application
and this API are expecting,
there's a replication of it in the drive.
So you're using the drive's file system
or the host's file system. Is that true?
How does the Linux kernel
on the drive that has
a file system, I assume,
and then you have the host's file system.
Which file system is actually
used to
sort of create the structure of
any...
It's based on the host file system.
And you're able to pass
the sort of LBAs to that file
and the details to the...
Right, which
our friend over here in the corner, I didn't get
his name, made the comment about my data path
drawing being wrong. I am talking through
the media controller that's
understanding the media structure
from my on-drive OS. So it
understands that because, again, if I were to turn it on or off, the host will still do the same work.
I'm just moving an exact copy of it in there to allow that application to run closer. The data
path for my internal controller is talking to the same engine that is in line of the data path from
the host engine. So I don't have that overlap problem.
So I may have drawn it incorrectly,
but it functions correctly.
So you're saying that the ASIC
is issuing NVMe read commands
using LBAs provided to it by the host engine?
In a manner of speaking, yes.
There's a little bit of a difference
because we're not using NVMe as such.
We're using an internal bus,
and that's some of the IP that I've created,
or we have created, if you will.
So it's not a specific way of doing things?
Yeah.
It's on a separate internal bus through that process, yes.
It does create...
It creates a lot of questions.
I'm happy to talk about them one-on-one,
potentially under NDA.
There's a very good reason they don't put the CTO in the room
because he would give all those answers.
But yeah, so again, the architectural structure of it,
this is a very simplified version of it.
There are definitely ways we can show
how the programming model works. This is more of a
hardware look to it, if you will.
We have talked a little bit in the past about
the software and how the data
actually flows within the device.
But again, that tends to be a little bit more
low-level conversation.
Sure.
Sure. So typically then, your customers, is the host API assigned directly to the block device,
like in namespaces?
Or do customers put a file system
on the host,
on the block device
and then the host API
talks through the file system?
So...
Or are we getting to
another area you want?
That would be an area where
I've chosen to be
slightly dumb enough
that I can't answer it or I can say I don't know.
Every once in a while I get into that conversation with the team and they're like, you don't want to know that because you don't want to have to lie to people saying you don't know what it means.
There's definitely some secret sauce there that I am not comfortable with sharing.
I think the idea is there's a lot going on, and there's definitely a lot of questions, but at the same time, there's a lot of examples
of where it's already working.
I think that's an interesting point in general.
Yeah.
Something like that.
Flexible, programmable computation.
The one that you're providing with your ARM cores
is linked to this as well.
What does that mean?
The file system, the key values, whatever they are. Just remember, because we're recording this. I know, they're not hearing him.
So Stephen, my friend in the TWG
and also co-worker in the computational storage products department
was discussing the concept of the file systems
and how they work and aspects of what we need to do
in the way of work within the computational storage TWG
to make this more understandable for
our customers. So there are aspects
he was asking about how the interaction between the
API and the file systems work that
I was admitting that I don't know the answers to.
And that is definitely some
of the secret sauce and some of the patented stuff that
they've done inside the company itself.
Cool.
Yes, sir.
How do you offload your
operations? Do you use a... Yeah, so we have multiple different models.
We have a GUI-like model for simple engagements
that use the API, where you do a storage call to our API,
and it does the work of migrating
an ARM-compiled version of your application
so that as a user, if you're doing it at the highest level,
you have to recompile whatever application you want
into an ARM instance so that we can migrate that
into the ARM core.
So if you're working on x86...
Yes.
So the host side will have to do the compile of the code
into ARM to allow it to be migrated into the device.
Beyond that, it's not changing the code
or rewriting the code.
It's simply recompiling for a different architecture.
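The "recompile, don't rewrite" flow could look something like this sketch, using the stock Linux ARM cross-compiler; the drive endpoint, paths, and the copy-and-run steps are placeholders standing in for whatever the GUI or API does for you.

```python
# Hedged sketch of the "recompile for ARM, don't rewrite" flow described above.
# aarch64-linux-gnu-gcc is the stock Linux cross-compiler; the drive endpoint,
# paths, and the scp/ssh steps are placeholders for the vendor's GUI/API.
import subprocess

SRC, BIN = "wordcount.c", "wordcount.arm"
DRIVE = "csd@10.10.0.11"                         # hypothetical drive endpoint

# Same source code, different target architecture.
subprocess.run(["aarch64-linux-gnu-gcc", "-O2", "-o", BIN, SRC], check=True)

# Ship the ARM binary into the drive's on-board Linux and run it there against
# data the drive already stores.
subprocess.run(["scp", BIN, f"{DRIVE}:/opt/app/{BIN}"], check=True)
subprocess.run(["ssh", DRIVE, f"/opt/app/{BIN} /data/input"], check=True)
```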
Yes, sir?
With the ARM core
versus trying to do the
application,
are you saying that the ARM cores
are more efficient, or just that
the whole, that it's more efficient because it sits
closer and you've got a better bandwidth?
So the question came
out, is it using the ARM cores
more efficient, or is it just because
it's closer that it's creating a benefit
to the customer? Is that a proper repeat?
Okay. So from that perspective,
the ARM cores we chose
definitely have some trade-offs. We could go
much more powerful cores. We could add more
cores. We chose, for this particular
case, a quote-unquote less
efficient as far as processing core,
but a more power-efficient core
because we didn't want to engage that
power offset. That's why I said earlier
in the presentation, there are absolutely workloads
where I slow things down today.
Because it's closer, because I'm
not requiring a storage call of actual
media out of the device,
at the size of our devices they are today and the
speeds that we have within the flash, the cores we've chosen are efficient enough for 90% of what we've seen today
in the market. So the idea is if I have to pull media from here over NVMe more than two times to
fill a host memory buffer so the host can process on it, I can probably provide you some sort
of either acceleration or equal performance at lower power. Because the more you have
to pull that data out and put it into host memory, flush it and repeat and keep doing
that iteration, the more efficient this subsystem becomes. Our goal is to limit the IOs coming
out while you're allowing continued IOs to go in. And that's really where our focus is. So it's a storage-centric computational storage solution.
As you can see in the server complex over there,
we still show a GPU.
Real-time data coming in, say, for example,
front-end LiDAR and cameras from an autonomous car,
you're probably not going to use my solution for that.
You'll have some other form of high-performance NVIDIA
or something doing that,
because that's true, in-line, real-time,
I can't hit something.
But all of the rest of the surrounding data
and all the other information that that car has
from an ecosystem perspective
can easily be processed by this particular solution.
Yes?
Today, the drive presents itself as a block device.
Do you see, for the application use cases going forward,
that it is more beneficial, perhaps,
to be a file or an object device?
Sure.
So the question is, today it's a block storage NVMe device.
Do we see value in potentially having it be file or object-oriented type of a solution?
So for today, the size of the company, the efforts we put forward,
it's going to stay as an NVMe device.
As the ecosystem evolves and the needs for products in that space continue to evolve,
we'll certainly look at adding those other variants of the product to our portfolio.
But right now, it's strictly an NVMe block storage device.
All right.
If there aren't any other questions,
we're at the actual end time,
so I'm glad I left some time for some questions from you guys.
Thank you very much for your time and attention.
And appreciate it.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further with your peers in the storage developer community. For additional information about the Storage Developer Conference,
visit www.storagedeveloper.org.