Storage Developer Conference - #100: A Comparison of In-storage Processing Architectures and Technologies
Episode Date: June 24, 2019...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org
slash podcasts. You are listening to SDC Podcast, Episode 100.
I am Jérôme Gaysse, and I am a market and technology analyst. Today, I will talk about a comparison of in-storage processing architectures and technologies.
Today is a very fun time for hardware architects and system architects. I show this picture because I have two boys; they are eight and ten. When they ask me, "Dad, what kind of job do you do?", I answer that I help my customers play with Legos. And the Legos today, for the server market, are the different technologies that we have in processing, in IP, in memories. We have many technologies available today. In the past, building a server was very simple: just one CPU, one memory, storage, and a network card, and that's it. Now we have multiple flavors of all of them. And in the future, we will have a lot of different technologies and many ways to assemble them into different kinds of products.
Computational storage is just one example of doing this. In the server technology landscape, there are many different systems and servers, and computational storage is just one of them. And even for this one segment of products, there are many ways to do it. So here is the agenda of this presentation. First, I will talk about the concept of in-situ processing and computational storage.
I will explain how it works, what the benefits are, why we want to go in this direction, and what the state of market adoption is. Then I will focus on a more advanced application, which is deep learning, and have a look at the roadmap in terms of architectures and technologies.
With in-situ processing technologies, we are trying to solve the problem of data movement. Over time there is more and more data to process, with big data, and more and more computing capability, but the bottleneck is the way the data is moved to the processing element. Here is just an example with a local SSD: you first have to move the data from the SSD to the RAM, and then from the RAM to the CPU. It takes time and energy to do this data movement. So the way to solve it is to bring the data and the processing closer together.
We can call this computational storage, smart SSD, in-situ processing, in-storage processing, or near-data processing. There are many names for the same concept. With the work of the SNIA and NVMe workgroups on standardization, and I will come back to this later, I hope a universal name will emerge and it will be easier for everybody. The benefits here are
very easy to understand. If you reduce the distance between the compute and the storage, then you reduce the latency, so you gain performance at the application and system level, and you save energy on the data transfer, which leads to lower power consumption at the system level.
So, making data and processing closer: there are two ways to do that. The first one is to move the storage to the computing, by using NVDIMM. I will not talk more about NVDIMM in this presentation; the market is already moving there, and you have multiple players here at the Storage Developer Conference talking about that. The second way is to move the computing into the storage.
As an example of why we are moving to in-situ processing, here is a very basic example of a data movement of one gigabyte of data.
On the left, standard architectures, and on the right, a smart SSD.
In terms of power budget and timing, moving data from the storage to the computing is four times faster on the smart SSD and 10 times better in terms of power efficiency. This is a very basic example,
but here you can understand that, oh yes, we have to do that. If you bring the computing into the storage, into the SSD, you will gain performance and power efficiency.
How it works. There are many ways to implement in-situ processing. But basically, on the right side here, you have this smart storage system where you will find the SSD controller, the memory, and the computing core. It could be an FPGA or whatever. But how do you talk to this new product? Because you have to deal with a storage interface and you have to deal with a computing interface. The NVMe driver is the perfect answer for that, because obviously it is made for data transfer, for storage access.
It provides performance, low latency, so that's very good.
And on top of that, it provides mechanisms like the vendor-specific commands that will bring flexibility
and configuration options
to deal with the
computing part.
Also, you can play with different namespaces. As an example, I was at the Eideticom presentation this morning. With the Eideticom accelerator, they play with different namespaces for different accelerators. So this is one way to use namespaces. For example, you can write your data into a specific namespace. Then you send a vendor-specific command, with a few parameters, just to start the processing. And with another vendor-specific command you can get the result, or you can go to another namespace to get the result of the processing.
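To make this flow concrete, here is a minimal host-side sketch using the Linux nvme-cli tool. The vendor-specific opcode (0xC0), the parameter layout in CDW10/CDW11, and the "input data in namespace 1, result in namespace 2" convention are illustrative assumptions; every smart-SSD vendor defines its own command set, and nvme-cli flag names can vary between versions.

```python
# Illustrative host-side flow for a hypothetical smart SSD.
import subprocess

def sh(*args):
    # Run one nvme-cli command and fail loudly if it errors.
    subprocess.run(args, check=True)

# 1. Stage 1 MiB of input data in namespace 1 with a standard NVMe write
#    (256 x 4 KiB blocks; --block-count is zero-based).
sh("nvme", "write", "/dev/nvme0n1",
   "--start-block=0", "--block-count=255",
   "--data-size=1048576", "--data=input.bin")

# 2. Start the in-storage computation with a vendor-specific admin command;
#    cdw10/cdw11 carry a few parameters (here: a job id and the data length).
sh("nvme", "admin-passthru", "/dev/nvme0",
   "--opcode=0xC0", "--cdw10=1", "--cdw11=1048576")

# 3. Fetch the result: either another vendor-specific command, or a
#    standard NVMe read from the result namespace (namespace 2 here).
sh("nvme", "read", "/dev/nvme0n2",
   "--start-block=0", "--block-count=255",
   "--data-size=1048576", "--data=result.bin")
```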
In terms of implementation, again, you can find many ways to implement in-situ processing in an SSD. The first one is an SSD controller with computing capabilities. I see that in the Samsung key-value store SSD: this is how it is done, by using the SSD controller, still playing with flash memory, and using the SSD controller cores to do some sub-functions and some processing.
We can also, as a different way to implement the storage part, use local M.2 SSDs: you will find on the market FPGA boards with M.2 sockets, so you can use the FPGA just as an HBA product, but on top of the HBA function you can add some hardware accelerators in the FPGA as well. And on the bottom, it could be a standard SSD controller with an external connection in order to plug in a coprocessor. The coprocessor could be an FPGA or a dedicated ASIC. I still call that in-situ processing, using peer-to-peer between a standard SSD and an accelerator.
This is not on this slide, but we can also imagine an all-flash array system using in-line processing: on the storage controller, when you get the data from the network, you can do some processing on the fly, or when you read the data, still do some local computing.
In terms of applications, today we basically see, I would say, standard software and applications: compression, encryption, search, key-value store, and erasure coding as well. And that's good; you can benefit from this local processing. What is interesting is that the smart SSD is a new concept.
In fact, the question we can ask when we talk about this concept is: is it storage with embedded processing, or is it an accelerator, computing with local storage? Depending on the product and on the application, it could be one or the other. It's not obvious for the system guys or the application guys how to integrate this new product in the system and benefit from it. So starting with very easy-to-understand applications is the right way.
In terms of market players, you have some products available today from NGD Systems, from ScaleFlux, Eideticom, Samsung, and others. There are also a few technology providers who are pushing this new concept: Xilinx, ARM, and Marvell. I was at the Flash Memory Summit this summer; there was a specific session on computational storage, with very interesting presentations from Xilinx, ARM, and Marvell, who all think that, yes, bringing compute to the storage is the right way to go.
But in terms of market adoption, what we absolutely need is a standard. Fortunately, there is a SNIA technical workgroup on that. What is interesting is that if you look at the SDC agenda this week, there are three, four, maybe five sessions about this concept, so this is a good sign. There will be a specific Birds of a Feather session today at 7 p.m., in the Cypress room I think. I invite you to attend; I'm sure we'll learn more about this concept, about the state of this workgroup at the SNIA, and about when we will get a specification, a standard for that.
So that was the first part. I think the real value of computational storage is for data analytics and deep learning, because for these applications you really need very low latency between compute and storage to get the performance. You may also have power budget limitations for edge computing, because you may find this kind of application at the edge, and at the edge the power supply is not the same as in a data center. On top of that, we will see huge improvements in hardware accelerators for deep learning, and if we don't change the way we implement them in a system, the data movement problem I mentioned at the beginning will grow more and more. So we absolutely need this new concept of computational storage in order to provide the required performance within the right power budget.
So now we'll spend some time on deep learning. I don't know what your level of understanding of deep learning is. In deep learning, there is inference and training; I will focus now on the training problem. For training, you have to play with a dataset. This dataset is sent through the neural network in order to train it, and then you get your neural network parameters for your application. It takes a long time to do, or if you want to reduce this time, you will need expensive resources, which today means many GPUs, and even then it can take a long time on very expensive hardware. When I say a long time, it can be in the range of weeks or months to do one training. So imagine the business impact: when you get a new dataset, or if you want to train a new model to be applied in the field, on your market, you will have to wait a few weeks or a few months. If you can reduce that to a few days, a few hours, or, I would say, real time, it is easy to understand the business value you can gain.
Today, typical deep learning training is based on GPUs, so we have many GPUs in these systems. In terms of performance, that's perfect; you can't do better today, this is the best way. But in terms of cost and power, yeah, it's not easy. This is not for everybody. I have selected the ResNet-50 neural network for this presentation, for my study. ResNet-50 is a very well-known neural network, so you have a lot of information on how it works, what the topology is, and the different parameters. And you have a lot of benchmarks available from HPE, from Dell, from NVIDIA, and all these benchmarks are on different architectures: one GPU, two GPUs, eight GPUs, with different configurations.
All these configurations and benchmarks helped me design a performance model with two kinds of parameters: the hardware parameters, so the FLOPS of the GPU, the memory bandwidth, the number of compute nodes; and also the configuration parameters, that is, the number of model parameters, the resolution, and the number of images you take for the model. I implemented this whole model in an Excel file, and I use this basic model to estimate the performance on a new architecture, a new system. In terms of performance, on the current systems you have today, typically the maximum you can reach with the ResNet-50 network is about 400 images per second per GPU using FP32 resolution.
So that is for the computing side. For the storage side, because you have to read all the data from your dataset storage, that is 40 megabytes per second, using 100-kilobyte images: 40 megabytes per second per GPU. This is not a real problem today in terms of architecture and storage bandwidth. At the system level, you just multiply by 8, and the storage read throughput you have to manage is 320 megabytes per second. This is still reasonable.
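As an illustration of the kind of spreadsheet model described here, a minimal sketch in Python: the roofline-style min() of a compute bound and a memory-bandwidth bound is an assumption about how such a model is usually built, and the only figures taken from the talk are 400 images per second per GPU, 100 KB per image, and 8 GPUs.

```python
# Back-of-the-envelope model of the type described above (the speaker's
# version lives in an Excel sheet).
def images_per_second(gpu_tflops, mem_bw_GBps, flops_per_image, bytes_per_image):
    # Whichever bound is lower limits the throughput (roofline-style assumption).
    compute_bound = gpu_tflops * 1e12 / flops_per_image
    memory_bound = mem_bw_GBps * 1e9 / bytes_per_image
    return min(compute_bound, memory_bound)

# Storage-side arithmetic with the numbers quoted in the talk:
imgs_per_gpu = 400        # images/s per GPU, ResNet-50 training, FP32
image_size = 100e3        # 100 KB per image
gpus = 8

per_gpu_read = imgs_per_gpu * image_size      # 40 MB/s per GPU
system_read = per_gpu_read * gpus             # 320 MB/s for 8 GPUs
print(per_gpu_read / 1e6, system_read / 1e6)  # 40.0 320.0
```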
But what is interesting is that you will see huge improvements in deep learning processing in the coming years, through different techniques. There has been a lot of work from data scientists and mathematicians to see how we can reduce the resolution of the parameters in order to gain performance. Maybe we can move from FP32, floating point 32-bit, to floating point 16-bit, to integer 8-bit, or even integer 4-bit, and why not down to the bit level. On the other hand, if you reduce the resolution of the weights, you decrease the quality of the network. But if you only reduce the quality by 3 to 5 percent, for many applications that's enough.
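As a minimal illustration of what reducing the weight resolution means, here is a sketch with NumPy that quantizes FP32 weights to INT8 with a single symmetric scale; real mixed-precision or quantized training uses more elaborate schemes, so this only shows the basic mechanics.

```python
# Fewer bits per weight means fewer bytes to move and cheaper
# multiply-accumulates, at a small cost in accuracy.
import numpy as np

w = np.random.randn(1000).astype(np.float32)       # FP32 weights
scale = np.abs(w).max() / 127.0                     # one scale for the whole tensor
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_back = w_int8.astype(np.float32) * scale          # dequantized view

print("max abs error:", np.abs(w - w_back).max())   # small, but not zero
print("bytes: fp32 =", w.nbytes, " int8 =", w_int8.nbytes)  # 4x smaller
```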
Pruning is in the same category of neural network optimization. Pruning, I don't know if you know this term, is in deep learning the concept of optimizing the network. In a neural network, you have many neurons with many connections; I would say the architecture is very symmetrical. But if you analyze the importance of the different weights and neurons, some weights contribute a huge, very important part of the quality of the neural network, while for some of them the contribution is very minor. So if we remove some connections, or even some neurons, then, as with lower resolution, you reduce the quality of the deep learning network, but in terms of computing you reduce the number of operations to do. So altogether you decrease the computing requirements and you decrease the memory bandwidth requirements. On top of that, in terms of architecture, the implementation of hardware accelerators for deep learning will move to very massively parallel architectures, mainly multiply-accumulate implementations, because the multiply-accumulate is the very basic operation for deep learning.
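Here is a minimal sketch of magnitude pruning with NumPy, showing the mechanics only; in practice pruning is done iteratively with retraining, and the 30 percent keep ratio is just an illustrative choice.

```python
# Weights whose absolute value is below a threshold are zeroed out, so the
# corresponding multiply-accumulates can be skipped and less data is read.
import numpy as np

w = np.random.randn(512, 512).astype(np.float32)    # a dense layer's weights
keep_ratio = 0.3                                     # keep the largest 30%

threshold = np.quantile(np.abs(w), 1.0 - keep_ratio)
mask = np.abs(w) >= threshold                        # True where the weight is kept
w_pruned = w * mask

ops_before = w.size                                  # one MAC per weight
ops_after = int(mask.sum())                          # MACs that remain
print("MACs removed: %.0f%%" % (100 * (1 - ops_after / ops_before)))
```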
So I estimate that very soon we will be able to increase the number of frames processed per second by a factor of about 25, somewhere between 20 and 30. What will be the impact at the storage and I/O level? Coming back to the GPU-based system implementation, that means maybe reaching up to 80K frames per second at the system level, leading to 8 gigabytes per second of reads. This is much higher than 320 megabytes per second, wow, but I would say still reasonable in terms of data throughput by just using a few SSDs.
But the key point here is that this is not just a problem of bandwidth; it is mainly a problem of latency. Why? Because for deep learning, if you want to run your training correctly, you have to do random accesses, and random every time. When you run the full training, all the images are read multiple times, but in different orders. And in addition, that will be at low queue depth. So in terms of IOPS, in terms of real storage accesses, just using a few SSDs locally will not be enough for this training system.
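A rough illustration of why IOPS and latency, not bandwidth, become the limit; the 100-microsecond queue-depth-1 random-read latency is a generic NVMe SSD assumption, not a number from the talk.

```python
# At low queue depth, a device delivers roughly one read per read latency,
# no matter how high its sequential bandwidth is.
target_images_per_s = 80_000     # future system-level target from the talk
qd1_read_latency_s = 100e-6      # assumed random-read latency at QD1

iops_per_ssd_at_qd1 = 1 / qd1_read_latency_s        # ~10,000 reads/s per device
ssds_needed = target_images_per_s / iops_per_ssd_at_qd1
print(ssds_needed)   # ~8 devices even in this idealized, perfectly parallel case;
                     # real access patterns and headroom push this higher
```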
Then you will need additional storage, I would say an additional all-flash array with a very high-bandwidth network interconnect. That means additional volume, power consumption, and cost. So what can we do about that with computational storage?
There will be multiple hardware options: based on FPGA, based on an ASIC CPU, so mainly software processing, or based on many-core or AI chips. The target here is a 2U server using 24 devices in the U.2 form factor, the standard SSD form factor, with a target of 1 kilowatt.
In terms of performance estimation, on the top we already saw the numbers for the GPU system: 80K frames per second, which would be a 5U server, an efficiency of 20 frames per second per watt, and a performance density of 16 kiloframes per second per U. My estimation for computational storage is 1 kilowatt for this 2U server, and I would say that we will be in the same range of performance and power consumption: a little bit better in terms of power efficiency, and a little bit lower in terms of computing density and performance density.
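Working out the numbers quoted above; the last figure, what each of the 24 U.2 devices would have to deliver to match the GPU system, is derived here for illustration and is not from the talk.

```python
# GPU-based reference numbers quoted in the talk, worked out.
gpu_fps, gpu_fps_per_w, gpu_height_u = 80_000, 20, 5
print(gpu_fps / gpu_fps_per_w)   # 4,000 W implied for the GPU-based system
print(gpu_fps / gpu_height_u)    # 16,000 frames/s per U, as quoted

cs_u2_slots = 24                 # the 2U, 1 kW target described earlier
print(gpu_fps / cs_u2_slots)     # ~3,300 frames/s per smart SSD to match it
```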
I would have loved to see that we could do much better in terms of performance with computational storage. The problem with deep learning is that it is very compute intensive and very memory-bandwidth intensive, and working on the storage interface can provide some benefit, but it will not provide a huge improvement in terms of raw performance. But this is assuming that in terms of computing capability we are 10 times lower than a GPU, and I would say that we can use better numbers with new hardware technologies, especially with AI chips.
At the beginning I wanted to show some price numbers and so on, but I would prefer not to. In terms of system cost, I'm sure that with computational storage we will be much lower than a GPU-based system. I don't know if you know the price of a GPU or of a GPU-based system, but I invite you to go to the web and check it. But I think the main advantage here is scalability. With computational storage, if I come back here, if you want to increase the performance, you just have to add one computing element, a U.2 smart SSD, and that's it. And if you want more, just add more. With a GPU-based system it is not easy to do that, because you have a full system with 8 GPUs, or even 16 for a few systems, and in terms of scalability and flexibility it is not made for that.
So in terms of implementation, I will not go into the details, but here, for an FPGA architecture, we obviously have some FPGAs with an HBM interface; I mention that because deep learning is very memory read- and write-intensive. So: implementing the NVMe and ONFI interfaces just like a standard SSD, and using an embedded CPU. It could be an ARM CPU in a Zynq FPGA, for example, or a MicroBlaze, or a RISC-V. And then using accelerator IP: there are a few companies who provide neural network accelerators; it could be a neural network processor or even a full hardware implementation of the neural network. And in terms of power budget, that would be okay. I mention that because, again, the target here is to fit in a U.2 form factor.
With the SoC architecture, and here by SoC I mean a very CPU-based SoC, what is interesting is that, obviously, a CPU-based system provides flexibility. But doing deep learning training on a CPU is not the right choice for performance; it is much too slow. If you want to do deep learning training with a CPU-based system, you absolutely need, as for the FPGA, some hardware accelerator for the neural network computation.
In terms of many-core processors: in the two previous examples, we used just one CPU or one controller for both storage and computing. Here we are using two controllers, one for the storage and one for the computational part. So we use, I would say, a small FPGA that handles the NVMe and flash interfaces, and we connect a many-core processor with a PCIe interconnect or an AXI interconnect, I don't know if that exists, but in any case implementing a very low-latency connection between the FPGA and the coprocessor. The same principle applies for an AI coprocessor. And that's it for deep learning.
And now let's have a look at how I see the evolution of computational storage. Today we are at the beginning of this new concept, with a few products available. But I think there is room for improvement, because there are two problems I see, or two things we can improve. First, with computational storage we are not able to share data between the different elements: if you want to share it, you have to go through the CPU, through the controller, and it is not made for that. Secondly, there is no cache coherency: if you want to run an algorithm across a few computational storage devices, there is no cache coherency and you will lose some performance. So the question is how we can benefit from the new interconnects, Gen-Z, CCIX, or OpenCAPI, and how we can apply them to computational storage.
So here I will give you some examples of architectures and how we can use them, and I think we can play Lego for a long time. CCIX, just to give you an overview: CCIX is a new interconnect standard, introduced two years ago by multiple leading companies, and you can find these slides on the CCIX website. The goal is very easy to understand: it is to implement cache coherency between the main CPU of your server and another component. This other component could be an accelerator, or it could be memory, or it could be a network controller as well. And then you can imagine the number of topologies that we can implement. So we can apply this to computational storage. Instead of implementing the NVMe interface, we can reuse a CCIX implementation. It will be the same on this product: you will have storage and computing together on the same board, and the target is the same, reducing the latency between the storage and the computing. Or we can use CCIX as the interconnect, with coherency between the two controllers, or the multiple controllers, of the computational storage product. Or, why not, still use NVMe as the standard interface, and use CCIX just for the interconnection between the computational storage products.
With Gen-Z, another standard introduced two years ago. If you have any questions regarding Gen-Z, the right guy is in the room, and he will give a talk on Wednesday morning; I invite you to attend that talk because Gen-Z is very interesting. Here is a way to use Gen-Z as a new interface: instead of using NVMe as the interface, you can use Gen-Z. You still have the compute and the NVM, and in addition the benefit of Gen-Z is that you can disaggregate this product and use it in different ways in terms of the mechanical implementation in your server. We can also use Gen-Z for data sharing. Here, I would say, we can compare that with the NVLink interconnect for NVIDIA GPUs. With NVIDIA, there is the PCIe interface, but there is also NVLink, which allows you to interconnect the different GPUs together; but it is specific to NVIDIA GPUs. Gen-Z is a more open standard: you can share data between all the compute elements. You could also do it with NVMe, but this is more interesting. And here is a way to share the data for processing, the dataset for the training: if you don't want to have a copy of your dataset on multiple systems, you can have the dataset on the system on the left, and the training system on the right can access the dataset stored on the left. So, just by playing a few minutes with these slides and imagining what we can do, there are, I would say, an infinite number of ways to implement all these architectures. But again, the same concept you will find every time is that in the same place you have computing and storage.
So, in terms of the technology roadmap, the goal here is to go to a higher level of integration. Today you have the computing and the SSD on different boards. With computational storage, the goal is to have compute and storage on the same board, the SSD. The next step is to have storage and computing on the same die, the same silicon. This is a more long-term roadmap technology: having a kind of smart memory, and I know a few companies working on that. The challenge here is the silicon process, and how we can bring a computing process onto a memory process. It's not easy. Maybe a smart way to do that is to play with 3D integration, using a 3D interposer.
In terms of silicon technology, that means using the die of a CPU, the die of a memory, and a third die just for the interconnect. And why not, in a more long-term view, use silicon photonics for very high bandwidth between all the parts. Or have the memory inside the SSD controller.
Here, of course, in terms of storage capacity the size will be lower, but in terms of computing efficiency, yes, this is excellent. I remember a presentation from Crossbar at the Flash Memory Summit; they were talking about ReRAM technology used for deep learning computation. I think it was for inference, not for training, but still, I'm sure you can find the slides on the Flash Memory Summit website. All the deep learning neural network weights are implemented locally with the processor, and then in maybe one or two clock cycles the processor is able to read all the weights and do the computation. So in terms of efficiency, that's excellent.
So, as a conclusion: there is a demand for computational storage driven by high-performance requirements, and there are some existing solutions. A standard is needed, but fortunately a few companies and the SNIA are working on that, and that will help drive market adoption. But I think the standard must be flexible enough to support new architectures and new interconnects like Gen-Z, CCIX, and others. Thank you very much.
Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org. Here you can ask questions and discuss this topic further with your peers
in the storage developer community. For additional information about the Storage Developer Conference,
visit www.storagedeveloper.org.