Storage Developer Conference - #84: Deployment of In-Storage Compute with NVMe Storage at Scale and Capacity
Episode Date: January 7, 2019...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts.
You are listening to SDC Podcast Episode 84. My name is Scott Shadley. I'm a
storage technologist slash VP of marketing. I get to call myself a technologist because at one point in my career I was an engineer.
You can't blame me that I went for the frontal lobotomy to get my marketing half.
I work at NGD Systems. We're a pioneer in computational storage.
So today I'm going to give you an introduction to what we're doing,
but more importantly, since this is a developer conference,
I wanted to show real-world examples of how we're using the technology today with customers.
So yes, it's a tiny bit sales pitchy
because I am a marketing guy,
but I did want to spend a lot of time
more focused on what's causing the need for this.
So some examples of what's going on in the marketplace today.
This is one of my favorite cartoons.
For the first time, they've determined
it's actually hard to find a needle in a haystack.
Our data sets are growing to exponential sizes. We know that. We're getting exabytes,
zettabytes of data, as kind of the video clip I was showing talks to. So what if we had a way
that our storage devices, instead of just being like the storage lockers we all rent from Storm
or U-Haul or whoever else we rent them from, where we have to unload all the boxes to find the one
box we need in the back of the storage room, could just have that box pop out, walk out to us, and, you know, deliver itself to our house
by an Uber or something? So we worked with this company called ActualTech Media. Some of you may
know them. They asked a slew of storage professionals, where do you see the
most storage performance bottlenecks? And we got answers like, I don't know, array
controllers, networking, but the biggest storage performance bottleneck is still, aha, the storage.
And again, that's because we've done a lot in the market to improve on things.
We've introduced things like NVMe.
We've gone to things like the potential of open channel, open everything.
But at the same time, the storage device as a standalone product has not changed drastically from tape.
We put bits on it.
We pull bits off it.
We don't do anything else with it.
We may access it faster.
We may have better reliability in the way of ECC, things like that.
But we're not really moving the data around.
We've done things like software-defined everything.
We're moving to this whole new world.
But at the end of the day, these slides are available.
So I'm going to gloss over the glossy stuff
and get to the meat of it.
The idea is don't leave the storage behind.
There is opportunity to move compute into a storage device
in an effective way that's easy to implement
and also scales,
so that you're not dealing with just one added device.
If I have 24 of these things,
I'll show you what that can do to a system.
And how does this occur?
Today's architectures look very much something like this.
This is an example from one of our partners that
uses, in this case, a very simple
single CPU root complex with 16
lanes of traffic. They're going through a switch
to get all of their SSDs, and they're having
a problem with the fact that they're needing to scale the
capacity of this footprint, but they really
don't want to change the compute architecture they're dealing
with, because the CPU side of things is more expensive and they're space-constrained in their
particular architectural environment. So the devices get bigger, and the number of devices in
the system grows. We've got our friends at Supermicro out there with an EDSFF chassis that can
support half a petabyte in 1U. That's great, but now I have to figure out how to get access to
all of that. And if I'm talking about
this particular example, where I've only got 16 lanes of traffic available and I'm switching it
out, we see that the throughput the NAND and the flash can deliver
over NVMe versus what reaches the root complex is about a 50x or greater delta in performance. So I've got
really, really fast I/O over here and I've got limited I/O over there. I can switch through it.
I can make it expand out.
But at the end of the day,
the number of gigabytes per second here
versus the number of gigabytes per second there
are drastically offset.
So we need to work as an industry and an architecture
to figure out how to solve or implement a way
to make this solution a little more opportunistic.
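To put rough numbers on that mismatch, here is a back-of-the-envelope sketch. The lane count comes from the example above; the drive count and per-drive throughput are illustrative assumptions, not figures from the slide.

```python
# Back-of-the-envelope view of the fan-out problem described above.
# Lane count is from the example; drive count and per-drive speeds are assumptions.

PCIE_GEN3_LANE_GBPS = 0.985            # ~1 GB/s usable per PCIe Gen3 lane
root_complex_lanes = 16                # single-CPU root complex in the example
host_bw = PCIE_GEN3_LANE_GBPS * root_complex_lanes      # ~15.8 GB/s into the CPU

drives = 24                            # e.g. a dense U.2/EDSFF chassis
per_drive_nvme_gbps = 3.0              # typical NVMe SSD sequential read
aggregate_drive_bw = drives * per_drive_nvme_gbps       # 72 GB/s behind the switch

print(f"Host root complex: {host_bw:.1f} GB/s")
print(f"Aggregate drives:  {aggregate_drive_bw:.1f} GB/s")
print(f"Oversubscription:  {aggregate_drive_bw / host_bw:.1f}x")
# The talk's 50x-or-greater figure compares what the NAND inside the drives can
# deliver, which is higher still than the drives' external NVMe speed.
```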
And it's not just us that are talking about it.
So these are a bunch of different articles that have come out over the last several years.
One of the bigger ones here, from an IEEE conference, had about 16 people who
contributed to the concept that near-data processing would be very valuable as a solution for us
in the marketplace.
A lot of this also plays into the edge and IoT space, and I'll explain how that works out as well. If we think about platforms that are going on the edge, we've got things like 5G,
great, but we're still going to run into bandwidth constraints. Moving data is becoming a bigger and
bigger problem. Storing is easy. We've got that down pat. We've got big drives. We've got NVMe.
We've got all these other opportunities. But actually doing something with our data continues
to be a significant struggle for us.
So when we looked at it,
we looked at the value props of this concept
of computational storage.
And of course, in our case,
and what I'm going to be talking to you about today
is moving compute closer to the data.
Now, in this particular instance,
as I look at it from where I work
and from what we're doing,
it's at a drive level.
This does not have to be just at a drive level. You've seen other presentations from our friends like Stephen Bates
and other companies also in the marketplace that are going outside of just the drive level or
looking at it at a system level. So there's lots of opportunity to do it. In this case, we're going
to stay focused at what it is from a drive perspective. So a computation request by any
application can be much more efficient if it's near the data.
Greg Schulz from StorageIO wrote a blog article:
the best I/O is no I/O; the second-best I/O is the I/O I only need.
And that fits very well into the vision statement
of what computational storage is all about,
lessening the number of I/Os.
And when you talk about large data sets and even unstructured data,
being able to do what you can at the data set location is much more valuable to you than
having to constantly load, flush, and reload a DRAM footprint. So if I've got mega-terabytes or
exabytes of data sitting in a server and I've only got a couple hundred gigabytes of DRAM,
simply doing the math on how many times I have to reload that DRAM footprint to scour that data
should tell you that there's a lot of wasted time in today's
architectures. So there's a great opportunity
to look at ways of
eliminating that and helping
augment the CPU root complex.
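As a worked example of that math, here is a small sketch with illustrative numbers for a hypothetical server, not figures from the slides.

```python
# How many DRAM refills does it take just to scan a large data set on the host?
# All numbers are illustrative assumptions.

dataset_tb = 64          # flash capacity in one server
host_dram_gb = 256       # "a couple hundred gigabytes of DRAM"
ingest_gbps = 16         # roughly a PCIe Gen3 x16 link's worth of inbound bandwidth

refills = dataset_tb * 1024 / host_dram_gb       # full DRAM reloads needed
move_time_s = dataset_tb * 1024 / ingest_gbps    # time spent only moving bits

print(f"{refills:.0f} DRAM refills, ~{move_time_s / 60:.0f} minutes of pure data movement")
# 256 refills and roughly an hour before any useful compute happens; that is the
# wasted time in-storage compute is aimed at.
```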
The techniques
that we're talking about here are not in any way looking
to replace a GPU. I'm not
trying to get rid of an in-memory database.
There are applications where what I do
actually slows things down
in the particular implementation,
and I'll highlight those as we go through
some of the case studies.
What we're looking at here is something
that you have stored needs to be analyzed.
I'm going to make it faster to analyze that data
through this concept of in-storage compute.
So we took a dimensional look at this,
and we said there's multiple different ways to do it,
but we wanted to focus on things like the KISS principle,
keep it simple, stupid.
Again, I'm a marketing guy that used to be an engineer,
so they have to really dumb it down for me to be able to communicate it to you as a bunch of folks.
So from an operating system perspective,
bare metal, real-time operating system,
and it needs to be 64-bit OS.
Those are pretty straightforward, simple dynamics. When we get to the hardware side of it, there's
a lot of different opportunities. We started with 32-bit real-time processors. We talked
about hardware acceleration. There's lots of opportunities there around ways we can
use FPGA acceleration. And then we decided to take the leap and say it needs to be 64-bit
only. So for our particular platform, it's a 64-bit operating platform. When you look at the user applications, we had to write firmware
for this. This is something that has been looked at and been tried and then been presented by
others earlier this week in the way of using an existing storage device in these spare cycles
in that controller or that platform to do these types of processes.
They had issues with that because when you start having
contention between compute and
wear leveling, you can't fight it;
the wear leveling has got to take place. You've got to be able to keep
the data intact. So we looked at it
from that perspective. We're going to start from scratch
and we're going to write firmware from ground up that covers
flash management, ECC, data
placement, and then allow us to do
things like add application software at the device level.
In this case, we've actually been able
to do virtualization via container,
so Docker is my new best friend.
And then it turns out,
from talking to the number of customers
that we have over the last couple of years,
that the AI application space,
not just to use the buzzword,
but just the way those applications
do analytics in a storage environment, fares very well for in-storage compute. If you're running inline
in memory or doing stuff that requires you to access it as you see it before you store it,
that's going to be somebody else's product in this particular instance.
And in the future, we have the ability to do more true AI application acceleration. So right now, I'm just
running the app in storage and actually giving you an acceleration out of that fact. Now, if we start
rewriting the applications dedicated to these types of processes and products, we have the
capability to do more with that. When it comes to looking at this as an ecosystem, there's a lot of
things that can be done here. One of the things that we're working together with others in the
industry, as was shown by the Birds of a Feather last night, is the highest
level of this: how this drive identifies itself to the system, and what the API looks like
that would drive this particular system. I have unique versions of those today because
I have a new product in the market, if you will. But we're using the SNIA organization
to help work through a provisional TWG to
actually develop some standardization on how these products are identified to the system,
so you know whether it's my version of a product versus Stephen's version of a product versus
someone else's version of the product, and you're able to intelligently use an OS
and not have to rewrite code every single time you want to use a different version of
these particular products. So we call it in situ processing.
In situ is Latin for "in place."
So we're not being very creative.
Again, we're a small startup.
When you look at it from our perspective, the CTO said when he sat down with the team,
his number one focus was a seamless programming model.
Back to this concept of keep it simple.
If I can't make it easy for you to use, you're not going to touch it.
So I have to start off with something very easy.
Seamless is not easy,
but it's as close as we can get to making it functional.
Scalability was another one.
If you can't put this in process across multiple platforms
with multiple people figuring out how to use it,
you're never going to be able to actually implement this
for a long-term solution.
So scalability becomes very valuable.
And then
capacity growth. You need to make sure that whatever you're doing can support the capacity
growth inside an individual drive. Right now we see a lot of problems today, not with people
wanting bigger drives, but people being able to truly deliver a large drive at a reasonable cost
with the right performance metrics. And those are another key cornerstone of what we're focused on. This is the very basic
block diagram of
what computational storage is.
I'm not taking the CPU out of
the process. I'm simply adding compute
processing into each and every one of my drives.
It scales.
You add compute cores into the system
and you augment the CPU or even
free the CPU up to go do bigger and better
things that are more useful with its time. I've had a lot of questions in the room in previous conversations.
What about things like open channel? Open channel is taking a lot of the control of this drive
out into the CPU. That's not what this product is designed for. I can make you an open channel SSD
if that's what you like and you want to go write your own FTL. But if you take the FTL out of my
drive, you lose the capability to do in-storage compute. So there are definitely trade-offs in the market and there are
customers that are saying we don't like that idea. There's customers saying we really like that idea.
So I just want to be very open as far as a development perspective. This does require
you to have the drive act as a traditional NVMe target. So when we think about moving compute to data,
this is going to be a history lesson
for most people in the room.
So you write data in, it comes into the storage device,
we read it out of the storage device into DRAM,
that particular path,
and then using the host CPU to do the compute,
that's the focus of in-storage compute,
or what we call in-situ processing,
because you're going to sit there and repeat that process
multiple times over again with multiple drives
in multiple systems.
And if we have the ability to limit the number of times
you have to use that host CPU and the time it requires
to offload the entire storage device into the host DRAM,
we're going to be able to show you how we can save time and money.
So with an in-situ storage device,
we do the same thing.
We're going to store your data.
So again, this is back to the point,
I'm not an in-memory platform.
I'm not going to do things real-time.
I'm going to do near real-time.
Once it's in the data storage device,
you now have your internal buffers,
which in our case is a drive with DRAM on board.
And we have computational resources in the way of ARM cores outside of the NVMe data
path. So the data management of the read and write still takes place. We use an alternative path to
be able to process the data in place. The type of in-situ processing that we offer, it's an embedded
Linux platform. So you don't have to worry about having a specific app. If it can run an ARM64,
it can basically be dropped into the drive.
And we offer APIs and solutions
to help support that capability.
Then the results are fed back,
and you're getting a smaller packet size
out of the drive.
You're limiting the I.O.,
and you're freeing the DRAM and CPU
to go do other things
while you're running your storage compute.
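As a conceptual sketch of the two flows just described: the drive-side object and its submit() call are hypothetical names used only to show the shape of the programming model, not NGD's actual host API.

```python
# Conventional flow vs. in-situ flow, in pseudo-real Python.
# The "drive" object and its submit() method are hypothetical.

def parse_records(chunk: bytes):
    """Toy record parser: newline-delimited records."""
    return chunk.split(b"\n")

def host_side_filter(block_device: str, predicate):
    """Conventional flow: pull every block into host DRAM, filter on the host CPU."""
    matches = []
    with open(block_device, "rb") as dev:
        while chunk := dev.read(128 * 1024 * 1024):    # 128 MB at a time
            matches.extend(r for r in parse_records(chunk) if predicate(r))
    return matches                                     # all raw data crossed PCIe

def in_situ_filter(drive, predicate_source: str):
    """In-situ flow: ship the small predicate to the drive's ARM cores,
    let the drive scan its flash locally, and return only the hits."""
    job = drive.submit(code=predicate_source)          # hypothetical call
    return job.results()                               # only results cross PCIe
```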
Yes.
And please, if you have questions, stop me, because I tend to ramble pretty quickly and will be done really, really fast. So, data contention, if you will, from that example.
Yeah, the question in the room was, if I'm doing in situ processing on data in the drive and you
go and write and update that data, what happens in that situation?
Because we're always reading stored data,
it's as if you were doing the same thing
where you're writing to a drive
as you're exiting out into host DRAM.
There is a potential for contention
or for processing on old data.
Since I'm not doing it within the memory buffer
as it's going in,
you have exactly the same contention example
as you would just pulling it into the host system.
We don't have
an overlap issue from that perspective.
Yes?
How much extra DRAM do you need
for in-situ processing?
How many extra cores do you need for
processing? He's asking for the block
diagram. I like it.
Our in-situ processing is built on A53.
We have four 64-bit
ARM cores.
From a DRAM buffer perspective,
we're not using more than about 2 to 4 gigabytes of DRAM
depending on the size of the drive.
So that's an extra 4 gigabytes?
On top of the standard amount we have for the drive processing.
I'll show you in a bit.
We do have patents related to the DRAM on the NVMe side,
which limit that footprint requirement of space on the drive as well.
Yes?
So in terms of the capability of that process,
what kind of processing is that, encryption or compression,
or what kind of processing do you refer to?
The beauty is this is effectively the way we've designed this drive.
It's a micro server in a drive.
It's got a runtime Linux OS on the drive.
You can drop any application.
So encryption, we've already done with the customer.
Compression, we've done with the customer.
It's running as an application in the drive.
It's not a compression engine inside the drive.
So it's even more flexible than an FPGA.
I'll show six different ways we've already done things with customers
as part of my kind of real-world examples.
Yes?
You mentioned that you have an RTOS
as the choice of operating system.
Is that embedded Linux or an RTOS?
Yeah, so our choice for our general distribution
is a Ubuntu 16.04 core,
where they've stripped out all the drivers
for everything like external peripherals
and displays and stuff like that,
so it's a very small footprint, full-fledged Linux OS.
But not an RTOS. Is that a real-time
operating system? Yes, it is a real-time operating system.
It boots up just like a server would
be running a natural server. Effectively
what our drive does is that
Linux OS looks at the NAND placements
as if they're like drives in a system.
Just an FYI,
I wouldn't consider Ubuntu small.
It's 64 megabytes or something like that.
I would compare it to like Alpine Linux,
which is like over 2 megabytes.
Compared to Windows, it's teeny tiny.
And for the purposes of this product,
we can run Ceph or we can run any other operating system
as the core operating system.
Just for our development platform and
our release product, it's Ubuntu.
There's definitely options.
Another way to answer the question is to say, well, how does
60 megabytes compare to
how much you said you've got,
2 gigabytes or 8 gigabytes of RAM?
Then it's tiny. Exactly.
Yes, sir.
I have a couple questions.
This is awesome, by the way.
The fact that they run Docker containers,
you can literally write any algorithm you like,
and it can run down there.
I think it's amazing.
So, Scott, I think I have two questions.
One is, is there a concept of networking?
So if I'm running on this Ubuntu operating system on those A57s,
do I see a network
interface that I can use to talk
to the outside world?
Or is that something that's not available to me?
So that's kind of one. And then the other is
you mentioned that
the operating system sees
the Flash media as some kind
of device. Can you elaborate a little more on
how it's presented to the OS?
If I have an algorithm that wants to look for the largest value across these die, how does that look? Yeah.
So the A53 cores utilize the M7 data path.
We talk to and pull data from the drive
as if it's working through this as a pseudo host.
So the direct connection between those two
is an arm-to-arm NAND flash interface,
so it's much faster than even the PCIe Gen 3 bus.
So even though I have a smaller footprint, a smaller DRAM,
my application processors I chose for power versus performance.
An A73 or A74 would be much faster,
but the power budget tradeoff wasn't worth it for what I'm dealing with within the storage device itself
because I'm talking 8, 16,
32, maybe 64 terabytes of data, not petabytes of data behind those cores. And the amount of time I save transferring across that bus, as I hope to show you in a minute, highlights the capabilities
of the A53 being more than powerful enough to support that. To your other question around
network capability, for this
connection to this
migrated application or host processor, we do run
a small host API that
uses a TCP connection over
the NVMe protocol. So there's been
conversations here about NVMe over TCP/IP.
I'm already doing it the other way around.
So every drive has a pseudo
IP address that I can see from my
host side, and I can address all of those devices in the system,
which also allows it to extrapolate over a fabric-based platform,
and we're working with a couple of fabric sets.
And it'll still be fast.
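A minimal sketch of what addressing those pseudo IPs could look like from the host side; the addresses, port, and one-line text protocol are invented for illustration and are not the vendor's actual API.

```python
import socket

# Each in-situ drive shows up to the host with a pseudo IP address, tunneled
# over NVMe per the talk. Addresses, port, and message format are placeholders.

DRIVE_IPS = [f"10.10.0.{i}" for i in range(1, 17)]   # 16 drives in the chassis
CONTROL_PORT = 5555                                  # hypothetical service port

def send_command(ip: str, command: str) -> str:
    """Send a text command to one drive's in-situ agent and read the reply."""
    with socket.create_connection((ip, CONTROL_PORT), timeout=5) as sock:
        sock.sendall(command.encode() + b"\n")
        return sock.recv(4096).decode()

if __name__ == "__main__":
    for ip in DRIVE_IPS:
        print(ip, send_command(ip, "STATUS"))
```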
Yes, sir?
So, one question.
In terms of the input buffers or computation buffers,
do you use the SMPF on the device to divide?
So, crawl, walk, run.
Crawling is where we're at today as a startup,
so we have individual devices.
Walk is peer-to-peer.
Run is where I eliminate his problem of hopping, hopping, hopping. So yes, today it's just individual devices, but because they run concurrently and/or simultaneously, I may be redundantly running an application within multiple drives
and getting similar results. There's not going to be the exact same data file on any
drive that I'm looking for, or shouldn't be if you're a good storage architect. You're
going to be spreading your data across it, so your results will come back concurrently.
What is the package? How much is the package?
How much is the what?
Package.
The power?
So we designed this to be a low-power solution,
so it's 16 terabytes in a 2.5-inch U.2;
at full data rate and running an application, it's 12 watts.
So we have the lowest watts per terabyte
available on a drive today.
Yes, and then I'll get you.
What's the ballpark of the added cost for the SSDs?
Ah, there we go.
All right, I did not pay him for that one.
So we looked at it from that perspective.
There is what we classify as a licensing cost
for the support of the in-situ platform to support that.
It's on the order of magnitude of pennies per gigabyte
above a standard SSD product.
So we do it
in both a monthly, yearly, or
lifetime type of a license fee
and because we know some customers
like the idea of having a very large
very low-power drive as a standalone
product, we can sell it as a storage
device and be competitive in the market with any of the big guys,
because we're just going to buy their NAND.
Our BOM is less than their BOM outside of the NAND,
so we offset a lot of what we consider assumed cost.
And then we can add on the computational resources.
So it's an on-off switch capable product.
You can buy it with or without,
and you can turn it on later if you want to.
Yes?
My question is kind of a follow-up on the earlier one.
Depending on the application or the workflow pattern,
if you actually have write traffic going in,
you said there's no overlap,
but what is the limitation
on your on-device compute? That's going to be the latency; I can't finish the compute if the application writes keep coming in.
So based on the way that we're doing this,
it's not a high-write-performance product,
and we give read prioritization inside the drive
unless the customer wants a unique firmware
where we can shift priority to write performance.
So as a standalone NVMe,
it's 3 gigabytes per second read, 1 gigabyte per second write.
It's not designed to be a 3 gig, 3 gig.
And we read prioritize everything,
so if a write's coming in,
we'll actually stall the write in favor of finishing the reads,
even if it's from the application processor.
Yep.
All right.
Oh, I'm going to hate that slide later.
So we look at it from a perspective of today's SSD,
and I have a longstanding debate with people inside the company
of the use of animation.
As you can tell, I think they won today.
So you have the idea of a smart SSD with in situ processing,
which gives you abundant resources inside the SSD.
We've added those cores to give that to you
where others haven't had it in the past.
Because we are running a virtualized platform
with a 64-bit application processor,
we create this concept of a disruptive trend
in the marketplace within storage compute.
This slide I will not take credit for.
One of our partners that helped present some of this material for us at FMS created the slide, and this was
their interpretation of what in-storage compute can do for the marketplace. So, solutions. This is
the fun part. My first one, that previous slide, was from our friends at Microsoft Research. They
were kind enough to get on stage with us at FMS a couple months ago. So this is
a little more detail than we were able to present there because
it was a much shorter session. Basically
what we're doing with them is they use
this tool from Facebook called FAISS, the AI Similarity
Search. It takes images, converts
them to three-dimensional vector
identifiers, XYZ, effectively,
stores them on disk, and then does
comparisons of those. So it's an inference engine
that utilizes an approximate nearest-neighbor neural network.
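For reference, here is a toy host-side example of that workload using the open-source FAISS library; the vector counts and dimensions are made up, and in the in-situ case the same style of search runs on the drive's ARM cores against the vectors stored on that drive rather than on the host.

```python
import numpy as np
import faiss   # Facebook AI Similarity Search, the library referenced above

# Toy version of the workload: feature vectors are indexed, then queried for
# their nearest neighbors. Counts and dimensions are illustrative only.
d = 128                                                    # vector dimension
stored = np.random.random((100_000, d)).astype("float32")  # vectors already "on disk"
queries = np.random.random((5, d)).astype("float32")       # e.g. "show me cats"

index = faiss.IndexFlatL2(d)        # exact L2 search; IVF/HNSW variants do ANN
index.add(stored)                   # build the index over the stored vectors
distances, ids = index.search(queries, 10)   # 10 nearest neighbors per query
print(ids[0])                       # IDs of the closest stored images
```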
And up until 2017, with a billion images,
on their current platforms they just threw more compute at it.
So it's more storage, more compute, keep throwing servers.
We're starting to run into power problems.
And now with 2019 coming along,
the concept of a trillion images in a single data set
is becoming more challenging for these guys.
So the premise of this is you're going to Bing,
you go to the image search on Bing,
and you say, I want to see cats.
And you want to see how many cats it can come back with,
and we're all real-time people,
so how fast can I get those cats back?
Google used to put 0000012
to show you how fast they were responding.
That number's creeping the other way
because they can't see all the files
as efficiently as they want.
So by way of working with us on this concept
of doing queries inside the system,
I start with a balloon, I get balloons out.
Now I'm not finding the same image and I'm not trying
to. That's not the purpose of this particular app.
They use various different platforms.
So this tool is an open source application
that multiple different companies use,
including Microsoft in this case.
So how their architecture is put together,
there's multiple ways they do it too,
and we talked to them about that as well.
But the premise here is more just about the app
and what we can do with the application.
So if you look at the way the tool set's built today,
it uses what we're all used to seeing.
You load a training set, you index it,
you have to reload the database,
and if you add a new file,
you have to add it back to the database.
These are all multiple steps that go back and forth
between storage and host processor.
So this is an example of one way they put it together.
This is not by any means the only CPU structure you can use.
But what we're doing here from an in-storage compute perspective is as the file's written,
we're able to automatically update the index already on the drive.
We're able to create and modify the database, and then we're able to run the inference real-time on it.
So it takes milliseconds of time to do all that work versus seconds of time to get
it done in the quote unquote real world. So as we put this in actual numbers with our customer,
and they called it an intelligent SSD just to give reference. I didn't want to modify his slide.
With just running the system on the host, he was averaging, from a queries-per-second standpoint, or the metric
of this particular application, just shy of 500 queries per second. With 16
drives in the system, which he already needs
to hold the size of the database in that
platform, by turning on the in situ
processing, he got a 4x improvement with
no modifications to the app, and
all we did was port his existing application
into all 16 drives and run them
concurrently for him.
Yes?
So all these SSDs you're reporting here,
it's U.2 NVMe SSDs?
Yes, there are U.2 NVMe SSDs.
They're all, in this case, for this particular example,
they were 8 terabytes apiece,
and they were fully loaded, all 16 drives fully loaded
with 8 terabytes of image database.
What kind of bandwidth do you have
between the flash and your internal processors?
So we utilize the ARM-to-ARM communication.
So, sorry, the question was,
what do we use for internal bandwidth
between the application processors and the NAND?
Since we're talking ARM-to-ARM,
we're using the native interface there.
It runs around 16 gigabytes per second.
So ARXR?
You know, to be honest, I don't know the specifics
of it. I asked not to because then I end up
divulging IP to a room full of people.
Yes?
One confusion I have is
that in the storage systems
you're trying to serve, the images
from the client systems come in as,
say I upload an image as a Facebook user,
it finally hits the storage as a shard.
Yes.
It's a piece of that image.
It's not the entire image,
and it perhaps goes through some transform.
Will the individual SSD see the full images
so you can index things,
or is that artificial, or is that a real use case?
This is how the FAISS tool works:
by converting that image into a three-dimensional vector,
that vector is a much smaller footprint.
It's a couple kilobytes per file.
So I'm not actually sharding the physical image.
I am re-representing the image in this case
through that vector.
So it's a different implementation.
There are applications like that that I cannot accelerate.
So there are limitations to what we can do.
In the cases for what we've done so far today,
everything's a direct attach,
and the files are intact but spread across multiple drives.
There's no RAID cards involved either
because RAID virtualizes out my drives.
So did Microsoft...
I know this is maybe a little bit off topic.
Did Microsoft... You had indexing a year earlier on this slide; I don't know exactly what the premise is.
They're going to be presenting the final results of this
at a technical conference in December.
And it'll be public record.
And a bunch of this is already public on their site.
It's called the Microsoft Soft Flash Research Project.
S-O-F-T-F-L-A-S-H.
Did I get everybody's hands?
So this is queries per second.
This is what makes the application guy happy.
So we also thought about, well, what about other ways
to look at the way this analysis can be done?
Because I got to find TCO.
I got to make other customers happy. This is processing time. So not only can
I do queries per second, but how long does it take to get the processing done? Same 16 drives,
host in orange, host plus my drives in blue. If we take a closer zoom in, oh, did I not put the
zoom in on this one? I probably didn't. So if you look all the way
down here, you'll see that if I'm one drive to one host in this particular instance, this application
is actually slower running in situ on my drive because the database is not big enough and the
host is that much faster because it all fits into DRAM. But as you scale up to 16 drives,
two things happen. One, the amount of time it takes to process, to move
the data for processing into the host goes up.
And then you can see if we went beyond 16,
it's literally an exponential curve.
No matter how
big the database gets, no matter how many drives I
add, my results are consistent and
stable.
Is there any meaning that, like, around five or
six cores, is it a wash?
Is there any way to kind of, like to make a rule of thumb out of that?
So this result will be unique to every application you run.
So for this application, when you hit host plus four,
I just saved you four servers is the best way to look at that
because you don't have to have those extra servers in place to support that.
Because if I do host plus 16,
I'd have to throw, basically
to get from 42 seconds back down to half a
second, how many CPU cores
in new servers do I have to add?
At 4, it's 4. At 16, it's
about 26 servers is what we figured out
based on the way that this particular platform
is built. If you have higher performance
processors, more cores, all that kind of stuff,
this math will change. So this is
definitely a point, or a line in the sand, if you will, for this particular app.
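A rough reconstruction of that trade-off: the 42-second and half-second data points come from the talk, while the assumption that host-side throughput scales with added servers is illustrative.

```python
# Processing-time comparison at 16 drives, using the figures quoted above.
host_only_s = 42.0     # 16 x 8 TB pulled through host DRAM and processed there
in_situ_s = 0.5        # same query with the 16 drives computing in place

print(f"{host_only_s / in_situ_s:.0f}x faster wall clock at 16 drives")   # ~84x

# Buying that back with host compute alone means adding servers: roughly 4 extra
# at 4 drives and about 26 at 16 drives in the talk's estimate, with the exact
# count depending on how much compute each replacement server brings.
```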
On your previous slide when you said the performance is 4x,
was that sustained?
Yes.
That is sustained.
Do you have actual numbers?
Do you have, like, 4x times?
Because it depends.
How fast is the app that you're using?
Valid point.
In this particular instance,
this is the same data just represented differently.
These are all our drive,
and they are based on our prototype FPGA solution.
Our ASIC-based solution will show
even more of a delta to that system
because our ASIC-based solutions
are actually a faster processing drive.
And like I said,
the research project will be publicized by Microsoft
when it's final.
The current target's December.
Yes?
Why is the search time increasing super linearly
with respect to the number of drives?
From our perspective?
No, for the conventional one.
From the perspective in this case,
it's just because every time we increase the drive,
we're adding 8 terabytes to the database as well.
And so that 8 terabytes has to be moved from DRAM,
or into DRAM.
And so you're scaling it at 8 terabytes per step, basically.
So it takes that much longer to use the amount of DRAM,
in this case 32 gigabytes of DRAM,
to do that large of a data set through 32 gigabytes of DRAM,
it just takes longer.
So I mentioned we can also use this type of technology at the edge.
So this particular instance,
this is how we used to find images.
So if I wanted to find his face in this slew of things,
I'd have to start somewhere,
or I can use an AI algorithm to find them in one,
and then I have to figure out
where the timestamp matches in another.
We actually had some fun with the video clip,
and now it's not going to play for me, right?
So this is showing three individual cameras
directly connected to each of the individual drives,
so it's a one-to-one relationship.
You can see that we're tracking the image,
but we're slightly off because this is near real-time.
I'm storing the image, doing the object analysis,
and then sending that result to the host.
The relevant example for this particular case
is a 60-frames-per-second input.
I'm about 2.25 frames slower in my response time.
That's why you see a slight shift in some of the boxes. But I'm not losing the image, and I can
track from camera to camera. So the customer we're working with in this particular instance wants to
set up a chassis of 24 drives with 24 cameras in a circle, and have someone walk around the room,
and our drives can keep track
of that person and report back to the host while the host sits idle. That's in storage compute in
an edge style application. Steven was kind enough to point out the concept of a container.
So this is called OpenALPR. We literally took that and dropped it straight from the Docker
container store for ARM into the drive. You can see that it's an IP address that goes into our individual device. And
we're executing this license plate recognition application inside the drive. The host is,
again, sitting idle in this particular instance. So right now, he's basically clicking on an
image to get that result back in the confidence level. Native app, no changes to the platform.
We just wrapped a GUI around it. In this case, we're going to upload a brand new image
from a different drive into the existing drive
that we're talking about as a standard.
I'm uploading a picture.
And then he's going to go ahead and send it in.
It's going to get added to the database,
and then it can be, again, recognized
by this particular application.
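A sketch of what dropping a stock ARM64 container onto one drive could look like with the Docker SDK for Python, assuming the drive's embedded Linux exposes a Docker engine endpoint on its pseudo IP. The address, port, image tag, and paths are placeholders, not the vendor's documented interface.

```python
import docker   # Docker SDK for Python (pip install docker)

DRIVE_IP = "10.10.0.3"                                   # placeholder pseudo IP
client = docker.DockerClient(base_url=f"tcp://{DRIVE_IP}:2375")

# Run an aarch64 OpenALPR image against a picture already stored on the drive.
container = client.containers.run(
    "openalpr/openalpr",                                 # placeholder image tag
    command=["-c", "us", "/data/plate.jpg"],
    volumes={"/mnt/images": {"bind": "/data", "mode": "ro"}},
    detach=True,
)
container.wait()                                         # let it finish on-drive
print(container.logs().decode())                         # plate text + confidence
```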
So you have four A53 cores.
Yes.
Does the OpenALPR container automatically load balance,
so it's doing one-fourth of the work on each of them?
In this case, it's designed for 64-bit ARM,
that particular app, so we could drop it right in.
If there's not a native ARM 64-bit,
then it would not be able to do that.
Let me ask the question differently.
To do this demo, did you use one core or four?
It utilized all four.
And I kept that picture
because we're all developers.
We like our vodka.
So that's an example
of being able to run
a container inside the drive
with no changes
or modifications to the platform.
And again,
this is just that app running in place
on the drive, acknowledging and being able to read this.
So it doesn't matter what the container is.
I could grab a different app.
We've done things with TensorFlow as well
that we've shown in some of the instances.
So outside of this, we then started thinking,
well, what about if we go into the space
of our friends in biotech?
Because I've got to be friendly to a lot of different people. BLAST is an application that's doing protein sequencing based
on files in various different places and with different file sizes. This particular graphic
shows, across here, the number of cores in the system. Here it shows the number of drives with the
in situ processing turned on. And this is the percent improvement as you add drives to the system or as the database grows.
So I made a little animation of the slides to show you that as you build it out,
you can see that you continue to see an improvement as the database goes up, but it sits
somewhat idle. But as my database grows and I use more of my drives and more of my storage,
the shift in how much performance
gain I get, just by turning on the in situ cores inside the storage I already have, is
about 100% improvement. Yes? We could, yes.
So it was a conscious trade-off
to look at a reasonable processor
that still gives me benefit for an existing architecture
versus going for 100% acceleration.
A trade-off, if I went to the A7 class of application processors
or put more of them in there, my 12 watts per drive goes up.
And then I start becoming uncompetitive from an SSD-only perspective.
Yes?
So if I wanted to upgrade or patch the Ubuntu embedded OS,
how painful are the processes?
How long does it take?
It's all done through our software API.
I don't have the specific details on that,
but I can certainly follow up with you on it.
We support an upgrade path
through our host API;
it's able to push software updates to the drive.
Do we need to do it with the host?
No.
It doesn't require a reboot,
because at the level that we deliver the drive
that embedded Linux and that whole processing capability
is shadowed from the customer.
You don't actually know that it's there
as far as the drive being plugged into the system.
It's just executing the applications on your behalf.
So do you expose any vendor-specific
NVMe admin commands
to let the host manage this embedded OS?
Anything there?
Today, we don't open that option to customers
because that creates a lot more support requirements.
If you change something just wrong,
it breaks the rest of the IP in the system.
We are working to build a developer kit version of it
that would allow users to have more access
to the internal aspects of that processing.
What is the total internal flash bandwidth available?
And you may have answered this question.
So for the NAND itself,
we're agnostic since we don't make it.
We can run ONFI or Toggle,
so we run at those interface speeds.
Whatever the NAND choice is,
our net performance is relatively the same
because there's not enough difference between those.
Can you give us a ballpark?
I think the current NAND interface is like 400
megatransfers per second is how they quote it.
Across all the channels?
We're using a 16
channel controller in all of our different
versions. Last I checked
we're around 16 gigabytes per second just
at that interface. Per channel.
16 gigabytes per channel.
This might be a hard question
to answer, but did your CTO or somebody that designed the
chip say that, okay, if we're controlling 8 terabytes of NAND and we have 4 A53 64-bit
cores, is that a good mix?
As opposed to saying, let's just put 2 cores on there or 8 cores.
The 4 was a solid balance.
They went anywhere from two to 16.
They were thinking of doing a core per channel, for example.
And between a gate count limitation
inside of an ASIC and or FPGA
and the performance characteristics gain you get,
it made sense to do four as the first point
for this particular iteration.
Yes?
So adding these four cores,
does that add any cooling concerns
or thermal challenges?
So because my entire drive runs at 12 watts in a U.2,
I actually have less thermal constraints
than a lot of the drives on the market today
that are running at 25 watts.
I don't add substantial thermal
the way that this particular solution is designed today.
It doesn't get in the way.
In fact, I'm saving thermal budget
at the CPU level in the host
because the host is not actually running,
or it's off doing something else.
So I'm going to skip my nasty slide
that keeps transitioning on me
and go straight to this one.
So when we, so this is kind of the nuts and bolts of everything we did. So we had to start with a
new ASIC. So we did design an ASIC device. It's a 14 nanometer SSD controller that has those A53
cores embedded, single SOC. There's no two-part build to this particular platform. So you get
your NVMe SSD with the compute on the side if you want it, or you can just run it as an NVMe drive.
We weren't about to put it in anything
but the standard form factor,
so you get your U.2, your M.2, your EDSFF.
And if you really want size,
I have an add-in card that can support over 100 terabytes.
We had to make sure the management
of the product was correct,
so we wrote the firmware and algorithms
around reliability of the NAND from ground up.
We have about 12 different patents
on various versions of ECC, LDPC, and error recovery,
including one that allows for the drive
to have failed devices removed, replaced, and rebuilt
as a field-replaceable upgrade module.
And then we wrote the firmware,
and we focused on things like QoS. So this is
your five-nines window for my FPGA-based product, let alone my ASIC product, which is going to shrink
that. I will call out this is a very good marketing slide because I make it look like we're really
cool right in the middle of the other guys. This red window actually sits in the time scope ahead
of my drive. I'm not faster than they are. I'm more consistent than they are.
And half the customers I talk to
will take that consistency today
over the speed of the response.
And that's with computation running
while doing I/O to the drive.
Then on top of all that,
we put in the in-situ platform.
So the in-situ platform, to kind of recap,
has hardware accelerators alongside the ARM cores,
has the full-fledged, in this case, Ubuntu.
Again, we can run other OSs if desired,
and we added the Docker container capability to the solution.
Going into VMware, especially with now ARM being supported,
is a potential for us.
So at the end of the day, it looks something about like this.
Proprietary controller, so yes, I am another controller vendor.
No, I'm not going to sell you a controller.
I will sell you a drive in various different form factors or variants,
but I'm not going to sell you a controller.
And all these different form factors.
As I mentioned here, the AIC shows up to 64.
With QLC, I can do 128.
So for us, finding the needle faster,
having the in-situ processing as a core tenet of what we're up to, that's what we call computational storage today.
It's all about near-data processing. We're moving the compute closer to the data, and we're getting as
close as we humanly possibly can at this point by putting it inside the ASIC on the drive. There
are absolutely other ways to do near-data processing, and there are absolutely ways this
is not going to solve your problem.
So keep that in mind.
We wanted to make sure that we found people that can actually use it today
and help the rest of us figure it out,
so having the weight of Microsoft Research has been a great blessing to us.
We did have to do the flash agnostic controller.
It does support TLC and QLC today.
We've already characterized early QLC parts from one of our vendors, and we can
get half a drive write per day
on a full line write,
one gigabyte per second write to QLC
with our flash management algorithm.
Just a general
question.
In-situ processing,
you know, that's an idea,
and then NGD has a good execution
of that idea.
Yeah.
Are there other vendors who are executing on this idea?
Yes, absolutely.
We have friends in the marketplace.
So Stephen's sitting in a couple chairs over from me,
just shaking his head.
He has his own version of in-situ.
My friend that's hiding in the market around here somewhere and not presenting today has another solution
that's based on a host-based FTL
that can run this type of thing.
They run compute and storage,
but they use a host-managed drive.
So they're not an NVMe target.
They're a block storage target.
So they're similar.
They're as close to what we're doing as we have
in the marketplace as a like product.
So where ours is a standalone,
there's a host-based.
In the back.
Okay.
For the part of the application
that you showed
for the camera tracking images,
you mentioned that you'll first store the image,
which basically means you have to go through
the main system memory anyway.
In that case, is it because you're scaling? Because I have to pay the cost of in-memory anyway.
Right. Right. Right.
We're focused on in storage.
So in the example with the cameras,
it's a lot more about post-processing
than it is real-time processing.
So I'm not going for in-memory
or real-time data management
where it's sitting in memory first.
I have to store the bits first,
whether it's video image, file,
whatever the case may be.
Our product requires it to be stored
and then pulled back into application processors.
That's just a choice from our perspective.
Yes, right here.
How is the wear leveling handled on the flash?
Is it taken care of by the flash?
Yes.
Is it taken care of by the in-situ processors?
So the question was,
wear leveling, garbage collection,
the standard NVMe stuff,
that's managed by the NVMe half of our drive,
which is a separate ARM core, which is designed
for that, which is the M class.
The A class processors are only for
application execution.
Yeah, so, the question was,
has there come a point where networks
and interconnects
are so fast
that doing the compute off-storage
actually makes more sense?
Even as we double bandwidth, anything that's a stored bit,
if you have to move it off of that
into some host memory base that is not of like size,
I'll still have an acceleration factor.
Yes, you can make it faster and you have other ways to offload it.
You can offload it to multiple systems.
But if I've got 100 gig to one petabyte, I'm still going to be faster in some way.
The acceleration value will definitely drop.
So Gen-Z, for example, doesn't change that?
No. In fact, we're capable of doing a Gen Z interface on our devices if we wanted to.
So these are enterprise class drives.
There is power loss protection built into every single one of the drives.
One thing I didn't specifically call out, because I don't want to put too much IP on the thing,
our DRAM footprint, we use a thing called Elastic FTL.
It's a homegrown FTL
that requires less than the traditional
1 gigabyte to 1 terabyte that most
controllers use today. I use 1
gigabyte to control 8 terabytes.
So that way I have the room
for the DRAM I need in my application processors
and I still use less total DRAM than any other
drive of my like capacity.
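To make the ratio concrete for a 16-terabyte drive, here is a small worked example using the figures above.

```python
# DRAM needed for the flash translation layer at 16 TB, per the ratios above.
capacity_tb = 16

conventional_ftl_gb = capacity_tb * 1.0   # classic ~1 GB of DRAM per 1 TB of flash
elastic_ftl_gb = capacity_tb / 8.0        # ~1 GB per 8 TB with the Elastic FTL
app_buffer_gb = 4                         # upper end of the in-situ buffer quoted earlier

print(f"Conventional FTL:         {conventional_ftl_gb:.0f} GB")
print(f"Elastic FTL + app buffer: {elastic_ftl_gb + app_buffer_gb:.0f} GB")
# Even with the application processors' buffer included, total DRAM stays well
# under what a conventional mapping table alone would need at this capacity.
```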
So your power loss protection will ensure that both the application DRAM and the user data are protected?
Both the application DRAM and the user DRAM are protected.
The paradigm here is a post-processing paradigm,
but for streaming, live streaming,
if you wanted to process the data before storing,
could this architecture do it?
This architecture, no.
His architecture, yes.
I'll be honest.
There are so many different ways to do this type of architecture.
We're absolutely focused on in-storage compute.
I'd love to do both, but you've got to start somewhere.
Any other questions?
Yeah, very early.
Yes, in the back. So we use a host-based API library
that you use as your point as your storage target,
and then it pushes the application, a copy of the application,
into all the different drives.
And the CPU is sitting idle waiting for the application response from storage.
We're faking it out, effectively, by only sending up the results
and not sending the whole data set back.
So there's still some further quote-unquote post-processing required by the CPU,
but it's substantially less.
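The shape of that host-side pattern, as a hedged sketch: query_drive() stands in for whatever call the vendor library actually exposes, and the drive addresses are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

DRIVES = [f"10.10.0.{i}" for i in range(1, 17)]   # 16 in-situ drives (placeholder IPs)

def query_drive(ip: str, query: bytes) -> list:
    """Stand-in for the vendor host API: run the query on one drive, return its hits."""
    raise NotImplementedError("replace with the real host-API call")

def scatter_gather(query: bytes) -> list:
    # Push the same query to every drive, let them run concurrently,
    # and merge only the small per-drive result sets on the host.
    with ThreadPoolExecutor(max_workers=len(DRIVES)) as pool:
        partials = pool.map(lambda ip: query_drive(ip, query), DRIVES)
    return [hit for part in partials for hit in part]   # host only post-processes this
```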
Any other questions?
Thanks for listening. If you have questions about the material presented in this podcast,
be sure and join our developers mailing list by sending an email to developers-subscribe
at snia.org. Here you can ask questions and discuss this topic further with your peers in the Storage Developer community.
For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.