Storage Developer Conference - #100: A Comparison of In-storage Processing Architectures and Technologies

Episode Date: June 24, 2019

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, episode 100. I am Jérôme Gaysse, and I am a market and technology analyst. And today, I will talk about a comparison of in-storage processing architectures and technologies. Today, this is a very fun time for hardware architects and system architects.
Starting point is 00:01:01 And I show this picture because I have two boys. They are eight and ten. And when they ask me, "Dad, what kind of job are you doing?", I answer that I help my customers play with Legos. And the Legos today, for the server market, these are the different technologies that we have in processing, in IP, in memories. We have many technologies available today. And in the past, to build a server, it was very simple: just one CPU, one memory, storage, and a network card, and that's it.
Starting point is 00:01:36 Now we have multiple flowers of all of them. And in the future, we'll have a lot of different technologies and many ways to assemble all of them in different ways to do different kinds of products. And computational storage, this is just an example of doing this. In several technology landscapes, there are multiple different systems and servers. Computational storage, this is
Starting point is 00:02:06 just one of them. And even just for this one, this segment of product, there are many ways to do it. So, sorry. So here, this is the agenda of this presentation. So first, I will talk about the concept of in-situ processing computation storage. Just explain to you how it works and what are the benefits and why we want to go in this direction and what is the state of the art of the market adoption. Then focusing on more advanced applications,
Starting point is 00:02:43 which is deep learning, and having a look at the roadmap in terms of architecture and technologies. With in-situ processing technologies, we are trying to solve the problem of the data movement. By the time there are more and more data to process, with big data, more and more computing capabilities, but the bottleneck, this is the way to move the data
Starting point is 00:03:16 to the processing element. Here, this is just an example with a local SSD. You have to first move the data to the RAM and then to the RAM to the CPU. It takes time and energy to do this data movement. So the way to solve it, this is to make the data and the processing more close together. We can co-list computational storage, smart SSD, in-situ processing, in-storage processing, and near-data processing. There are many names to do the same concept. I hope that with the work of the SNIA or NVMe workgroup in order to standardize, and I will come back
Starting point is 00:04:00 later on this new concept, I hope that it will come with a universal name and it will be more easy for everybody. Here the benefits are very easy to understand. If you reduce the distance between the compute and storage, then you reduce the latency, then you save performance at the application and system level, and you save performance at the application and system level and you save energy for the data transfer which is leading to a lower power consumption
Starting point is 00:04:33 at the system level. So making data and processing closer. There are two ways for that. The first one is to move the storage to the computing and this is by using NVDIMM. I will not talk more about NVDIMM on this presentation because today the market is more going on and you have multiple players here at the storage developer conference talking about that by moving the computing into the storage.
Starting point is 00:05:16 As an example of why we are moving to in-situ processing, here is a very basic example of a data movement of one gigabyte of data. On the left, standard architectures, and on the right, a smart SSD. In terms of power budget and timing, moving data from the storage to the computing is four times faster on the smart SSD and 10 times better in terms of power efficiency. This is a very basic example, but here you can understand that, oh, yes, we have to do that. If you bring the computing in the storage SSD, yes, you will save performance and power efficiency. How it works. There are many ways to implement in-situ processing. But basically, on the right side here, you have this smart storage system where you will find the SSD controller, the memory, and the computing core. It could be FPGA or whatever. But how to talk to this new product? Because you have to deal with storage and you have to deal with computing interface. The NVMe driver,
Starting point is 00:06:42 this is the perfect answer for that. Because obviously, it is done for data transfer, for storage access. It provides performance, low latency, so that's very good. And on top of that, it provides mechanisms like the vendor-specific commands that will bring flexibility and configuration options to deal with the computing part. Also, you can play with different
Starting point is 00:07:16 namespaces. As an example, I was this morning at the ATT.com presentation with the ATT.com accelerator. They play with different namespaces for different I was this morning at the ATT.com presentation. With the ATT.com accelerator, they play with different namespaces for different accelerators. So this is a way to use namespaces.
Starting point is 00:07:33 For example, you can write your data and send your data in a specific namespace. Then you will send a vendor-specific command just to start processing with some few parameters. And with another vendor-specific command just to start processing with some few parameters. And with another vendor-specific command, you can get the result. Or you can get to another namespace to get the result of the processing.
Starting point is 00:07:57 In terms of implementation, again, you can find many ways to implement in-situ processing in SSD. The first one, so SSD controller with computing capabilities. And I see that in a Samsung key value store management, this is how it is done by using a SSD controller, still playing with flash memory, and using the SSD controller cores to do some sub-function and to do some processing.
Starting point is 00:08:34 We can also, just for a different way to implement the storage part, using M.2 local SSD, and you will find on the market FPGA board with M.2 socket, so you can use this FPGA just for HBA product, but on top of the HBA function,
Starting point is 00:08:55 you can add some hardware accelerator and FPGA as well. And on the bottom, so it could be a standard SSD controller with an external connection in order to plug a coprocessor. A coprocessor could be FPGA, could be an ASIC, dedicated AS, I would say, I call still that in-city processing using peer-to-peer between a standard SSD and an accelerator. This is not in this slide, but we can also imagine in all-flash array system using in-line processing, so on the storage controller, when you will get the data from the network,
Starting point is 00:09:46 then you can do some processing on the fly, or when you read the data, yes, still doing some local computing. In terms of application, today, basically, we see, I would say, standard software or application. Compression, encryption, search, key value store, error coding as well. And that's good.
Starting point is 00:10:18 You can benefit from this in local processing. And what is interesting in this is that smart SSD is a new concept. This is not... In fact, the question we can ask when we talk about this concept, is it storage with embedded processing or is it an accelerator or computing with local storage?
Starting point is 00:10:44 And depending on the product, depending on the application, it could be one or the other one. It's not obvious for the system guys or the application guys to integrate in the system and to benefit from this new product in the system. So by starting by very easy to understand application, that's the right way. In terms of market players, so you have some products available today from NGD system, from ScaleFlux, Adeticom, Samsung, and other one. Also, some few technology providers who are pushing this new concept from Xilinx, ARM, and Marvel.
Starting point is 00:11:28 I was at the Flash Memory Summit this summer. There was a specific session for computational storage. It was a very interesting presentation from both Xilinx, ARM, and Marvel, who are thinking that, yes, bringing compute to the storage, this is the right way to go. But in terms of market adoption, what we absolutely need, this is a standard. Hopefully, there is some group from the SNEA workgroup. There is a technical workgroup on that.
Starting point is 00:12:07 What is interesting is that if you look at the SDC agenda this week, there are three, four, maybe five sessions about this concept. So this is a good sign. There will be a specific buzzer freezer session today at 7 p.m. on the Cypress room, I think. So I invite you to attend, and I'm sure that we'll learn more about this concept and what is the state of this workgroup at the SNEA, and when we will get a specification, a standard for that. So here for the first part,
Starting point is 00:12:57 I think that the real value in computational storage, this is for data analytics and deep learning. Because for this application, you really need very low latency to get the performance, low latency between compute and storage. You may have some power budget limitation for edge computing, because you may find this kind of application at the edge, and at the edge, the power supply is not the same in a data center, the same as a data center. And on top of that, we will see huge improvements in terms of hardware accelerator for deep learning.
Starting point is 00:13:39 And if we don't change the way to implement this in a system, the data movement problem I mentioned at the beginning will increase more and more. So we absolutely need this new concept of computational storage in order to provide the required performance with the power budget at the right level. So now we'll spend a few times on deep learning. I don't know what is your level of understanding of deep learning.
Starting point is 00:14:14 In deep learning, there is inference and training. I will focus now on the training problem. Training problem, in fact, the training. So you have to play with a data set. This data set will be sent through the neural network in order to train them. Then you will get your neural network parameters for your application. It takes a long time to do it or if you want to reduce this time you will have to require expensive resources
Starting point is 00:14:51 this is many GPU today for that or even it will take a long time on a very expensive hardware when I say a long time it could be in the range of weeks or months to do one training. So imagine in terms of business impact, when you get your new data set, or if you want to train a new model to be applied on the field, on your market, you will have to wait for a few weeks or a few months. If you can reduce it to a few days, a few hours,
Starting point is 00:15:27 or I would say real time, this is easy to understand the business that you can save with that. Today, a typical deep learning training based on a GPU. So we have many GPUs in the systems. In terms of performance, that's perfect. You can't do better today. This is the best way. But in terms of cost, in terms of power, yeah.
Starting point is 00:16:01 It's not easy. This is not for everybody. I have selected this ResNet-50 neural network. ResNet-50, this is a very well-known neural network. I have selected this one for this presentation for my study. Because you have a lot of information on this neural network, how it works, what is the topology, the different parameters. And you have a lot of benchmarks available from HPE, from Dell, from NVIDIA, and all the benchmarks are on different architecture,
Starting point is 00:16:37 on one GPU, two GPUs, eight GPUs, with different configurations. So all these configurations and benchmarks help me to design a performance model where you have two kinds of parameters. So the hardware parameters which are used, so the flops of the GPU, the memory boundaries, the number of compute nodes, and also the configuration parameters. le nombre de nodes de compute et aussi les paramètres de configuration
Starting point is 00:17:10 Oui, le nombre de paramètres, la résolution et le nombre de images que vous prenez pour le modèle. Et donc, j'ai impliqué tout ce modèle dans un fichier Excel et je l'utilise, ce modèle basique, pour estimer la performance sur une nouvelle architecture, un nouveau système. En termes de performance, dans le système actuel que vous avez aujourd'hui, typiquement, le maximum you can reach with the ResNet-50 network is about 400 images per GPU using FP32 resolution.
Starting point is 00:17:59 So this is for the computing side. For the storage side, because you have to read all the data from your dataset storage, this is 40 megabytes per second and using 100 kilobyte images. 40 megabytes per second per GPU. This is not a real problem today in terms of architecture and storage boundaries. At the performance, at the system level, you just multiply by 8,
Starting point is 00:18:34 and the storage read that you have to manage is 320 megabytes per second. This is still reasonable. But what is interesting is that you will see huge improvements in terms of deep learning processing in the coming years by different techniques. There was huge work from data scientists and from mathematicians guys in order
Starting point is 00:19:04 to check how we can reduce the resolution of the parameters in order to save performances. Maybe we can move from FP32, so floating point 32-bit, to floating point 60-bit, to integer 8-bit or even integer 4 bit and why
Starting point is 00:19:26 not at the bit level. On the other hand, if you reduce the resolution of the weights,
Starting point is 00:19:36 you will decrease the quality of the network. But if you just reduce the quality by 3 to 5
Starting point is 00:19:44 percent, for many applications, that's enough. Pruning, this is the same concept of optimization of the neural network. Pruning, I don't know if you know this term, in deep learning, this is the concept of the optimization of the network. In a neural network, you have many neurons with many connections. I would say this is very symmetrical in terms of architecture. But if you analyze the importance of the different weights of neurons, some weights will provide a huge, very important part of the quality of the neural network,
Starting point is 00:20:32 but some of them, this is very minor. So if we remove some connections or even some neurons, then, like with the lower resolution, you will reduce the quality of the deep learning network. But in terms of computing capabilities, you reduce the number of operations to do. So altogether, you will decrease the computing requirements. You will decrease the memory bandwidth requirement. And on top of that, in terms of architecture, implementation of hardware accelerator for deep learning,
Starting point is 00:21:12 you will move to a very massively architecture, mainly multiply accumulation implementation. Because this multiply accumulation, this is the very basic operation for deep learning. So I estimate that very soon we will be able to increase by 25, between 20 and 30, the number of frames to be processed per second. So what will be the impact at the storage and the I.O. level? So to come back to this system implementation based on GPU,
Starting point is 00:21:59 so maybe reaching up to 80K frames per second at the system level and leading to 8 gigabytes per second on the grid. This is higher than 320 megabytes per second. Wow. But I would say still reasonable in terms of data throughput by just using few SSD. But the key point here, this is not just a problem of boundaries. This is mainly a problem of latency.
Starting point is 00:22:26 Why? Because for deep learning, if you want to run your deep learning training correctly, you have to do random access, and every time, random. And when you will do this full training system, all the images will be read multiple times, but in different ways.
Starting point is 00:22:52 And in addition, that will be at low Q-depth. So in terms of IOPS, in terms of real storage access, by just using a few SSD locally, that will not be able to be used for this training system. Then you will need additional storage, so I would say additional all-flash array with a very high-boundary network interconnect. Then you will have to use additional volume and power consumption and cost. So what we can do for that with computational storage?
Starting point is 00:23:30 So there will be multiple hardware options based on FPGA, based on ASIC CPU, so mainly software processing, mini-core or AI chips. The target here is to use a 2U server using 24 U.2 form factor, the standard SSD form factor, and target is 1 kilowatt.
Starting point is 00:24:00 In terms of performance estimation, so on the top we already saw the numbers, 80K frames per second. So that would be a 5U server. And in terms of efficiency, 20 frames per second per watt. And in terms of performance density, 16 kiloframes per second per U. My estimation on computational storage, this is 1 kilowatt for this 2U server.
Starting point is 00:24:36 And I would say that we will be in the same range of performance and power consumption. A little bit better in terms of power efficiency and a little bit lower in terms of computing density and performance density. I would love to see that we will be very better in terms of performance with computational storage. The problem with the deep learning is that it's very computing intensive and very memory-bound intensive.
Starting point is 00:25:09 And working on storage interface could provide some benefit, but that will not provide huge improvement in terms of performance. But here, this is assuming that in terms of computing capabilities, we are 10 times lower versus a GPU. And I would say that we can use better numbers with new hardware technologies, especially with AI chips. But at the beginning, I wanted to show some price number and so on,
Starting point is 00:25:54 but I would prefer not. In terms of system cost, I'm sure that with computational storage, we'll be very lower than a GPU-based system. I don't know if you know the price of GPU or GPU-based system, but I invite you to go to the web and check it. But I think here, the main advantage,
Starting point is 00:26:16 this is about scalability. With computational storage, if I come back here, if you want to increase the performance, you just have to add one computing element, a U.2 smart SSD, and that's it. And if you want more, just add it. With a GPU-based system, this is not easy to do that
Starting point is 00:26:43 because you have your own full system with 8 GPUs or even 16 for a few systems. And in terms of scalability and flexibility, this is not done for that. So in terms of implementation,
Starting point is 00:27:01 so I will not go into the details for that, but here, for FPGA architecture, so obviously we have some FPGA with an HBM interface. I mention that because deep learning is very memory, read and write intensive. So implementing NVMe and on-field interface just for the standard SSD, and using an embedded CPU. It could be an ARM CPU and a Zinc FPGA, for example, or Macroblaze or RISC-V.
Starting point is 00:27:38 And using Accelerator IP. So there are a few companies who are providing neural network accelerator. It could be a neural network processor or even the implementation in a full hardware of the neural network. And in terms of the power budget, that would be okay. I mention that because, again, here the target will be to be compatible in a U.2 form factor. With the SOC architecture,
Starting point is 00:28:14 here SOC, this is what I call a very CPU-based SOC. What is interesting here, obviously, for a CPU-based system, it provides flexibility. But playing with deep learning training on a CPU, this is not the right choice for performance. It's very too slow.
Starting point is 00:28:42 If you want to do deep learning training with a CPU-based system, you absolutely need, like for the FPGA, some hardware accelerator for the neural network computation. In terms of many-core processors, for the two previous examples, this is by using just one CPU or one controller
Starting point is 00:29:07 for both storage and computing. Here we are using two controllers, one for the storage and one for the computational part. So using, I would say, a small FPGA that will handle NVMe and the flash interfaces and connecting a mini-core processor with PCIe interconnect or AXI interconnect, I don't know if it exists, but implementing absolutely a very low latency connection between the FPGA and the coprocessor. And the same principle for AI coprocessor. And that's it for
Starting point is 00:29:47 deep learning. And now let's have a look at the evolution and how I see the evolution of computational storage. So today we are at the beginning of this new concept with
Starting point is 00:30:04 few products available. But I think there is room for improvement. Because two problems I see, or two things that we can improve, is that with computational storage, we are not able to share the data between the different elements. If you want to share it, you have to go through the CPU, through the controller, and this is not done for that. Secondly, there is no cache currency. If you want to run an algorithm on few computational storage systems.
Starting point is 00:30:48 There is no cache currency and you will lose some performance. So the question is how we can benefit from the new interconnect Gen Z, C6, or OpenKP and how we can apply it to computational storage. So here I will give you some examples in terms of architecture
Starting point is 00:31:08 and how we can use it. And I think we can play Lego for a long time. C6, just to bring you an overview. So C6, this is a new interconnect standard introduced two years ago by multiple leading companies and here this is a very
Starting point is 00:31:33 so you can find these slides on the C6 website here the goal is very easy to understand, this is to implement cache currency between the main CPU of your server and another component. This other component could be an accelerator, or it could be memory, or it could be control network as well. And then you can imagine the number of topologies that we can implement. So we can implement this on computational storage.
Starting point is 00:32:11 So instead of implementing the NVMe interface for that, we can reuse the CCX implementation. That will be the same on this product. You will have storage and computing together on the same board. And this is the target, Cela sera le même sur ce produit. Vous aurez le stockage et la computation ensemble sur le même bord. Et c'est le but, réduire la latence entre le stockage et la computation. Ou nous pouvons utiliser C6 comme interconnect et la coïncidence entre les deux contrôleurs, ou les multiples contrôleurs du produit de storage de computation. Ou pourquoi pas,
Starting point is 00:32:50 toujours utiliser NVMe comme interface standard, et utiliser C6 juste pour l'interconnexion entre le produit de storage de computation. the computational storage product. With Gen Z, so another standard introduced two years ago. And if you have any questions regarding Gen Z, there is the right guy in the room, and he will have a talk on Wednesday morning. I invite you to assist this talk because Gen Z is very interesting.
Starting point is 00:33:24 And here, this is a way to use Gen Z as a new interface. So instead of using NVMe as an interface, you can use Gen Z. So you have still the compute and NVM. And in addition, the benefit of Gen Z is c'est que vous pouvez désagrégéter ce produit et l'utiliser de différentes façons en termes d'implémentation mécanique dans votre serveur. Nous utilisons Gen Z comme partage de données. Et ici, je dirais que nous pouvons comparer cela avec l'interconnexion NVLink pour le GPU de NVIDIA. Avec NVIDIA, il y a l'interface PCI,
Starting point is 00:34:07 mais il y a aussi le NVLink. Le NVLink vous permet d'interconnecter les différents GPUs ensemble. Mais c'est spécifiquement pour le GPU NVIDIA. Avec Genzy, c'est un standard plus ouvert. Vous pouvez partager des données
Starting point is 00:34:22 entre tous les éléments de la compute. Ou en utilisant NVMe, mais c'est plus intéressant. Voici la façon de partager les données pour le traitement. Si vous ne voulez pas avoir une copie de vos données sur plusieurs systèmes, the dataset for the training. If you don't want to have a copy of your dataset on multiple systems, you can have a dataset on the system on the left, and the training system on the right can have access to the dataset, which is stored on the left. Donc, il y a, et c'est juste par, je dirais, jouer quelques minutes sur ces slides et imaginer ce que nous pouvons faire. Il y a plusieurs, je dirais, infinites manières d'impliquer toute cette architecture.
Starting point is 00:35:21 Mais, encore une fois, le même concept que vous trouverez chaque fois, c'est que, dans le même endroit the same place you have computing and storage. So in terms of technology, roadmap, here the goal is to go more in details at a more higher level of integration in terms of technology. So today you have computing and SSD on different boards. Computational storage, the goal is to have compute and storage on the same board, SSD.
Starting point is 00:35:59 Here now, the goal is to have storage and computing on the same die, the same silicon. So this is for more long-term roadmap technology. So having kind of smart memories, so I know a few companies working on that. The challenge here, this is
Starting point is 00:36:21 the silicon process and how we can bring computing process on the memory process. It's not easy. Maybe a smart way to do that is to play with 3D integration, so using a 3D interposer.
Starting point is 00:36:39 In terms of silicon technology, this is by using the die of a CPU, the die of a memory, and having a third die just for the interconnect. And why not, more long-term view, using silicon photonics for very high bandwidth between all the parts. Or having the memory in the SSD controller. Here, of course, in terms of storage capacity, the size will be lower. But in terms of computing efficiency, yes, this is very excellent. I remember a
Starting point is 00:37:15 presentation from Crossbar, the Flash Memory Summit. We were talking about the re-RAM technology used for computational, for deep learning. I think it was for inference, not for the training. But the same, I'm sure that you can find the slides on the Flash Memory website. All the deep learning neural network weights are implemented locally with the processor, and then in maybe one or two clock cycles, the processor is able to read all the weights and to do the computation. So in terms of efficiency, that's excellent.
Starting point is 00:37:57 So as a conclusion, there is a demand for computational storage in terms of high-performance requirements, and there are some existing solutions. A standard is needed, but hopefully a few companies and the SNEA are working on that, and that will help to validate the market adoption. But I think that the standard must be enough in order to support new architecture or new interconnect like Gen Z, CCX, or other. Thank you very much. Thanks for listening. podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at sneha.org. Here you can ask questions and discuss this topic further with your peers
Starting point is 00:38:57 in the storage developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.
