Storage Developer Conference - #112: Computational Storage Architecture Development
Episode Date: October 29, 2019...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org
slash podcasts. You are listening to SDC Podcast, episode 112.
Well, good morning, everybody. It's 8:30 on a Thursday after a long week of presentations, so we're going to try to have a little bit of fun this morning with some educational information and, of course, some product pitchy stuff because I am a vendor today.
I've spent the last day being the co-chair of the technical working group. Today, I'm going to give you a little bit more pitch about what my company is up to and what we see going on in the market. So with all good technical problems and technologies,
you have to have plenty of three-letter acronyms
to keep you busy for the day.
So today's learning opportunities, also known as TLOs,
are going to talk to you about things like the edge
and the need for...
Yeah, I know. There's a few.
Edge technologies, things like the...
It's going to do it to us again today.
I was doing that last night.
Same mic, yeah.
Yeah.
You should use my bad voice.
I should use my bad voice.
I'm going to talk to you like this today.
If there's not something wrong with my SDC presentation like last year,
it just wouldn't be fair.
So I'm going to give you kind of four highlights.
Edge needs CSDs.
So if you were here yesterday, you heard about a computational storage drive.
So my company manufactures computational storage drives.
And I'm going to talk to you about
the different form factor availabilities for those.
And my cohort in crime, Mr. Ellie,
is going to go grab my samples
that I left on the top of the room signage
because I was playing around out front.
And then I'm going to give you a little bit
about how we put our architecture together for this CSD,
and it uses a PCSS,
or a Programmable Computational Storage Service.
Again, that's a four-letter acronym to go with our three-letter ones,
because three wasn't enough for us.
And I'll give you some examples of how we use them
for artificial intelligence and machine learning
and show you some example workloads
that we've already demonstrated
for both customers and technical events. And then I'll give you a little bit of Hadoop and database
examples of what we're doing. So the idea here today is to talk about computational storage
deployments and how we've deployed them, how they would work if you would like to deploy them
yourselves, things like that. So it was brought up yesterday that we all have a lot of data, and we all know that we need to do a lot with that data.
And a lot of the times people don't want to hear me tell you why I think you need my technology,
because if I tell you why I need my technology, then I'm trying to sell you my Kool-Aid.
So what I wanted to do is spend a few moments and let you listen to a few industry experts talk about some of the problems in the market with data.
You can use their words for it, not mine.
Here's a quick video for you.
Going to be adequate
anymore to support the emerging
internet of things.
Oh, cool.
I'm telling you.
I'm going to have one of those days.
PowerPoint crashed.
There we go.
Internet of Things is approaching us faster and faster.
Things like your fridge, dishwasher, and coffee maker
will all have their own Internet connection.
And they will be able to gather data.
50 billion networked devices,
or 50 billion devices that can drive network traffic;
it's not the devices, it's the amount of data
that is just growing exponentially,
and at a certain point the network
cannot support any more data.
One of the problems that has plagued cloud applications is the latency required to get data back and forth to the cloud.
It just functionally wouldn't work.
There would not be enough bandwidth.
The servers themselves would get overloaded.
We need the network for more and more content.
You really need to have a device that can process the information that's coming in real time.
As these new use cases evolve, the autonomous car, the connected plane,
you've got this need for speed and latency and locality of compute
that's going to drive you to do some of these functions at the edge.
When it comes to making important real-time decisions,
edge computing significantly reduces the latency.
Instead of us adapting to computers to learn their language, computers are becoming more and more intelligent in the sense that they adapt to us.
It is estimated that 45% of created data will be handled at the edge.
That means storage, processing, analytics, and decisioning.
And that's going to drive a need for some new capabilities and new technologies.
So hopefully that gives you an idea. That is definitely a spin of the conversation related
to edge computing. And granted, there are a lot of different buzzwords we're using today.
We've got micro-edge, mini-edge, near-edge, far-edge.
And there's still the traditional data centers that still have to cope with some of these
problems around data.
So realistically, definitely spend a little bit of focus here from an edge perspective
because it does provide a perfect platform for technologies like computational storage
to be a value add to the ecosystem.
So one of the things that we want to make sure and highlight as we walk through the technology discussions
and hopefully throughout the course of the day as the other vendors and other people present is,
this is not a technology that we're designing or developing to replace existing architectures,
but to augment or to simplify, in some cases, the existing architectures.
Because there's been a lot of debate about what this is trying to fix or replace. It's not. It's
trying to look at it in a new way. And you can see here, this is an example from Gartner talking
about where they see all the different opportunities for data generation. And it basically means that
we have so many opportunities to expand the way compute is used, the way storage is used,
at various different locations in very
differing ways. The way these systems
are being built, the way they're being deployed,
the way people are using them,
is not just rack upon rack of
data. So we went from big
iron SANs to, I'm going to build a
rack of white-box whatever, dirt cheap.
Now we have to start architecting boxes that
are designed for hardened environments, for autonomous
vehicles, connected planes,
or just even the new POPs, points
of presence, that type of stuff. And those
architectures provide an
opportunity for something like computational storage
to come in at the beginning of those
architectural developments so that they're not
trying to retrofit, but they're able to actually
start off that way. And then you can look
at it as far as the retrofit ideas for the larger data centers, if you will.
So this is an example I tried to put together in the simplest way I could as a marketing guy
trying to talk to technical people. With all the data being moved to the edge, this funnel
represents the idea that I'm looking for the blue dots. And the blue dots are located somewhere, and I need to get them to the appropriate filter
within my funnel.
And as we move that data through the funnel,
you'll see that I have actually lost data,
but I've gotten the data I wanted,
but it took time to get there.
Some showed up sooner, some showed up later.
But if I go back, you'll see there's actually
seven down at the bottom, and only six come out.
That's actually a problem that exists
for one of our customers today: they're losing visibility into
some of their data because it's getting filtered in the wrong spot or it's not getting to them
fast enough to let the systems manage that. So, and I totally forgot my introduction pitch.
So as you've probably seen already, I've swapped slide templates a couple times today. I'm going
to continue to do that throughout the day. So I've got two or three more transitions that will take place through the deck.
I'm trying to keep you guys all awake and interested and paying attention, so if you can
come up to me either at the break, because I don't want to take away from the next presenter,
I have a table outside. I will happily pay you 10 bucks from Amazon if you can tell me how many
different PowerPoint templates I used, and they are posted online already, so you do
have a cheat sheet if you need to go out and look at it.
So, when we look at where
we can deploy these products,
the concepts of computational storage,
and specifically for us, these are all
workloads where we have engaged customers
today. But again, it's
not just the data center. So I can talk
to the hyperscalers and the second-tier
hyperscalers and third-tier hyperscalers all day long and get a lot of good business from those
folks. But really, as we move down this infrastructure, the edge devices and the
center infrastructure are really where we're starting to see more attention and more interest.
As mentioned in the video, connected planes, autonomous cars, we've got business there
today because they do see the value in this technology and what it can bring because their
infrastructure has power limitations. It has processing limitations. It has architectural
design limitations that can prevent it from being able to accomplish what it's really trying to do.
So when you think about an autonomous car, they want to be able to use one of these guys, your
M.2, because it's small and it offers an opportunity
to have some capacity attached to it.
But if I can put compute in something they're already buying
and give them some extra processing horsepower
for no extra millijoules of energy,
why wouldn't you want to try using that technology
and see if you can make it work?
There's going to be workloads it doesn't work for, I guarantee you.
From that point of view,
this is the different kind of ecosystems that we see as opportunities
for what we classify as our technology known as in-situ processing.
So processing within the device.
And in that case, you can see several different workloads or even infrastructures that we've looked at today.
TensorFlow's machine learning, I'll show you some examples of that a little bit later on. Last year, we introduced the FACE algorithm
and what we did with a co-project with Microsoft
around image similarity search,
which is an inferencing-like or even a search-like architecture.
We've got databases that we've played with.
We've got things that we'd throw in Docker containers
that we drop in the drive.
And you'll hear a little bit more about some other people's ideas
on how the Linux subsystem inside a drive will be of value later today
from one of our partners and co-sponsors of this event.
Content delivery is a great example
of where we can see these architectures work
because we're putting...
If you think about where the ecosystem is being built
and you read through the industry
about where we're doing edge deployments
and who's really building the edge data centers,
it has a lot more to do with the telcos
and the content delivery guys
because we're streaming everything. Disney Plus
goes live in a month or just over
a month. We've got Hulu. We've got Netflix.
All those guys have infrastructures that are now
existing further away from the data centers.
They're not just buying Amazon anymore. They're putting it
somewhere else where they need it. These types of
architectures can help them.
Machine learning, as I mentioned, we've
even talked about HPC as an
opportunity to offload. We had a nice conversation last night, and an HPC guy's like, I see a use case
for this. It's a great opportunity to look at this technology moving forward. So my one big
product pitch slide, if you will, is this guy right here. So I already showed you the M.2.
That's eight terabytes. This guy can run up to about 16 terabytes. This is our new fun EDSFF.
You may have seen the presentation yesterday
that talked about the new form factors.
And then we also have your standard 2.5-inch drive.
And I can make this guy fit up to 32 terabytes.
And that's where we get into an interesting conversation
is blast radius or ability to actually deal
with the gravity of the data on this device.
If I've stored 32 terabytes of data
and I only really care about a couple hundred gigabytes,
why do I want to pull all 32 terabytes
back into host memory to do some work?
Why not let the system that you already have
that has the capability to do it
search through, sort, rearrange, do whatever else,
not wear leveling and garbage collection,
but actual data management, data analytics,
data transformation at the device level,
it saves you the bandwidth problem
that is always going to exist.
No matter how many lanes we put on the front of it,
no matter what form factor we put it in,
no matter how much power we give it,
we will always fill the lanes of traffic we create for data.
And the data size is not shrinking,
so that's why these things are of value.
The fun part is things like this.
I think we started with one form factor.
There's now like 12.
So there is a lot of debate going on in the industry
what to do with form factors around these technologies
as much as it is what we can do inside of them.
So our architecture today,
we build an ASIC-based computational storage processor
that is built into our NVMe SSD controller.
We looked at it from a perspective
that there is enough of a market adoption
for products and technology today
that putting it in an ASIC format,
saving the hops,
giving you something that you can scale
in the right form factors was very paramount.
But we are delivering an off-the-shelf NVMe SSD,
so we have to do all your traditional data management,
wear-leveling, garbage collection,
flash characterization,
because I don't care which flash vendor you give me
because I'm not a flash guy.
That's one benefit that my solution offers
is I can work with any of the NAND vendors.
We put it in the right form factors
and then we add what we classify
as our startup value add,
which is this in situ processing stack.
We took a look at it and said
we want to make this as flexible
for our customers as possible
and there's absolutely opportunity where what we've done doesn't fit for all workloads.
But for the workloads and for the customers we've talked to, this is what they would like to see come about.
So we've got a full drive Linux running or an OS.
So the application cores that we've installed in our ASIC or built into our ASIC let us run an OS.
Does it have to be Linux?
No.
I've talked to folks about FreeBSD.
We had fun as an experiment in a lab.
We actually got Windows running inside the drive.
Why you do that, I don't know, but it works.
We can offer the virtualization concepts
by letting you drop containers in the solution.
This, again, is a flexibility play.
It gives you the opportunity to be more flexible
or easier to deploy your solution.
And then it's built off, in this case, an ARM quad core,
as I mentioned, a partner of ours.
And we've even got the ability to throw hardware acceleration in it.
So this first solution gives you quite the opportunity
to look at these devices as basically a Linux subsystem
within your particular platform.
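To make the container idea a little more concrete, here is a minimal host-side sketch, assuming the drive's on-board Linux runs a Docker daemon that the host can reach over the tunnel discussed later; the endpoint address and image are placeholders, not the vendor's actual tooling.

```python
# Hedged sketch: drop an unmodified container into a drive's on-board Linux,
# assuming its Docker daemon is reachable from the host. The endpoint and image
# are placeholders.
import docker

# Point the standard Docker SDK at the drive's Linux instead of the local daemon.
drive = docker.DockerClient(base_url="tcp://10.10.0.11:2375")  # hypothetical drive endpoint

# Run any ARM-compatible image; a real workload would operate on data the drive
# already stores rather than printing a message.
output = drive.containers.run(
    image="python:3.11-slim",
    command=["python", "-c", "print('hello from inside the SSD')"],
    remove=True,
)
print(output.decode())
```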
As we go through the course of the day,
you're going to hear various other ways
that these particular solutions are being designed and built
for the concept of computational storage.
Hopefully what we have to offer in some of the examples I'm going to walk through in the next few minutes
will give you an idea of what you can really do with it.
So as we look at it from more of a realistic implementation perspective,
we've taken some of the resources from a CPU, we've stuck them in the design,
and then we've built a solution stack around it to support you.
And I'm going to keep reiterating a couple of key points
because I want to make sure that they come across correctly
for at least our type of implementation
of computational storage.
It's an off-the-shelf NVMe drive;
it will do, look, act, and treat your data
as any other NVMe solution will today.
You have the opportunity to turn on
the additional ARM processors
within our product
that then can offload different types of workloads.
And I've got AI workloads.
I've got training ML workloads that I'll walk through, show you a Hadoop example.
But the other part of this is we have to pay very much attention to the ecosystem it's being plugged into.
So, for example, this form factor only gives me 8 watts.
So I can't blow that budget while turning on the compute.
So we had to write the architecture and design the architecture correctly
so that your data can come in and out as an NVMe drive,
and I can still do processing, and I don't overpower the system.
And this is where things like the connected plane, autonomous car,
or near-edge device platforms are really taking a hard look at this
because I'm staying in that envelope.
I've optimized the solution to give you the best of both worlds. And we've classified that as watts
per terabyte because it's a high density, low power consuming compute offload. Throw that into
a TCO model for the overall system and your overall system power comes down as well. And I
have an example of that in the deck. So when we looked at it, we said, here's your traditional
SSD. You've got a media controller, you've got some DRAM, and you've got some NAND.
And there were some companies back in 2012, 2013, that took this off-the-shelf design
and actually implemented a version of what we're now calling computational storage.
The trick that they ran into as a problem, if you will, is that singular media controller has to do too much.
They don't have enough processing power to do true compute offload and manage the device.
So we said, well, if we're going to do this right,
we're going to add into the solution
a secondary application core dedicated to compute offload.
And that's where these ARM quad cores come in.
We load the OS.
And then the next part of it is,
well, if I want my customers to be able to use this,
I can't add another interface.
I can't create another path
because it needs to be able to plug in and work
and be able to be operating from that perspective.
So we're able to take the application
or a version of it, or a part of it,
depending on the level of complication
and engagement with the customer,
and migrate the application
to actually execute user code inside the device.
So I'm not modifying the code.
I'm taking an instance of it, dropping it in,
and doing it in parallel across multiple drives.
That way you get this concept of parallel and distributed computing
with very little effort.
Today, the way that we're doing it, it is a custom library and a custom API.
That's one reason why we joined up with SNIA
and created this computational storage working group
along with about 40 other companies, because we realized that the way we're doing it is a little bit
new and innovative, but I can't make everybody adopt just my way of doing it, because
there are other people in the room that are doing it differently as well. But we all agree that if we make
at least the discovery and the way to plug it in and see what it can do common, then it'll work
better for everybody. And since I already had DRAM in the device,
I can share that DRAM between managing the drive, wear leveling, garbage collection,
data placement, and data manipulation in the way of transformation or any of the workloads
that I'm about to walk through. So with that, I wanted to get into a couple
actual architectural designs because you got to do a little bit of a mix and match of product pitch and technology and the natural execution
of it. So this first example is where we took the concept of using our on-drive Linux. We
loaded Keras APIs into it that now run a TensorFlow application known as MobileNet V2. Now this
MobileNet application is an object identification
or object recognition application.
The little video is a GIF file,
so I didn't bomb poor Brooke and the SNIA team
with a large file,
so that's why it's a little jumpy from that perspective.
But as you can see,
this application is taking those particular objects,
it's identifying them and telling you what they are,
and giving the four closest representations of it. Now, the trick to understand is I have not modified that application. That is
simply the application executing real-time in my device where the USB camera is passing information
through the CPU to my SSD. I'm running the code, and I'm replying back to the host with the answer.
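For reference, an unmodified Keras/TensorFlow loop of the kind described here looks roughly like the sketch below; it is illustrative only, and the camera handling and result reporting are assumptions, not the actual demo code.

```python
# Illustrative sketch of the unmodified MobileNet V2 workload described above,
# running under the drive's on-board Linux. Camera index, loop, and reporting
# are assumptions, not the demo's actual code.
import cv2
import numpy as np
from tensorflow.keras.applications.mobilenet_v2 import (
    MobileNetV2, decode_predictions, preprocess_input)

model = MobileNetV2(weights="imagenet")        # stock model, no modifications

cap = cv2.VideoCapture(0)                      # USB camera frames passed through by the host
while True:
    ok, frame = cap.read()
    if not ok:
        break
    rgb = cv2.cvtColor(cv2.resize(frame, (224, 224)), cv2.COLOR_BGR2RGB)
    x = preprocess_input(np.expand_dims(rgb.astype("float32"), axis=0))
    preds = model.predict(x, verbose=0)
    # Reply back to the host with the four closest matches, as in the demo.
    for _, label, score in decode_predictions(preds, top=4)[0]:
        print(f"{label}: {score:.2f}")
```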
This can be done at scale where we've done examples of where we plugged multiple cameras into a system. Each camera goes to an independent drive. They all run concurrently
and the host is sitting idle. The host is simply doing data pass-through. None of the actual
execution of that MobileNet V2 is being done on the host resources. So this is an example where
you can use it. In this case, it could be surveillance or it could be some other form of
useful machine learning type of workload
from TensorFlow. So another example is when we get into the conversation around neural networks
and weightless neural networks versus convolutional neural networks. Last year with our example with
Microsoft, it was a convolutional neural network. Weightless neural networks are starting to gain
traction, but some of the problems we're running into when you look at these, for example, this particular academic study, it's this concept of federated
learning or distributed learning. Google's brought it up in 2017. There's been a lot of papers on it.
They tried doing it using cell phones for purposes of image capture and stuff like that with the
Android platform, but it needs to be able to migrate it into more realizable and useful technology
as well. And so we took an effort and said, well, we need to look at it from a different point of view
and show that parallel and distributed training
can be done in our drive.
We can do training in our drive.
Because it's a federated or transfer learning process,
we're taking advantage of technology
that others have deployed and implemented
and being able to make the world, quote-unquote,
better, faster, stronger.
Again, you're not buying extra hardware. You're simply using a device that has this resource
available. So I'm not adding to a system. I'm just using existing platforms with a new piece
of technology that you'd buy anyway because you need the storage and making it do something a
little different. So this is going to be a quick kind of tutorial, if you will, of a walkthrough
of how federated learning works
and why it's valuable to the industry from that perspective.
So I've got two systems here.
On the left side is your traditional machine learning training algorithm path,
and on the right side is how you do it with our technology
or a computational storage solution.
So the first thing you always have to do,
and this is always going to be the case,
is you've got to put data in the drives. Data you're going to store in this case, storing
pictures, storing whatever you want. For example, the object tracking that was done in the GIF on
the previous slide. So that we don't change. You need the storage to work. It has to be common
storage, hence off-the-shelf NVMe. But then it starts to get interesting when you get into the
next step, because what we then do is we migrate a copy of the existing training model into each one of our compute resources on our drives.
Now, each of these drives have independent data being stored on them.
They're not identical copies.
It looks like I'm replicating the same image, but they're actually storing massive amounts of data in parallel across different devices.
So now each of these training models are going to start doing some work,
and when you do the training inside the device,
each of those products, or each of those drives,
are doing real-time training,
while on the other side you're re-migrating all that data
back into the host resources and using the host
to do that model training.
So this is your traditional versus your opportunity
to save a lot of bandwidth
and host power and host resources. The trick that then becomes is as you start to evaluate and
update that training model, you'll see that on the left-hand side, I've got a model that the
host CPU has managed through all of that data. On the right side, I've got a slightly different
variation of that model, because each of the individual devices has done a sparse model update. They've transferred that sparse model update up to the main model, and it's created an even
stronger model because I've distributed it across so much more resources, and I've done it on a
more localized set of data. And since that data set is smaller than the models being trained on,
it's actually more efficient. The trick was to be able to get it back up to the host and recombine it, if you will, into something in the way of a new or innovative workload.
And that's kind of what's been going on with this particular focus.
So then what you do next is, on the left, you have to continually repeat these steps.
Train the data, load a whole bunch of information, evaluate the data, create a new model. And you're doing that by migrating data back and forth
as I was showing with my fancy green arrows.
On the right side, I've simply migrated that new model down
and I'm going to reiterate again,
but I'm going to continue to be a level of model value
to the customer or to the person using this workload
because I'm creating a more distributed
and useful example of this training. And it saves
host resources to go off and do what it needs to do in a way of gathering new data to put into
those devices because you're constantly updating the storage with new information to create the
need for an updated or new model. So it gives you the ability to walk through this path over and
over again where I'm always going to consume the data, but I don't always have to return the data. I can return the value of the data in the way of this, in this case, a training
model. So I'm doing useful work on the data. The data has never actually left my device, but yet
the value of that data has been presented back. And that's really where this concept of computational
storage starts to gain even more net value, if you will, to the market.
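As a rough illustration of that loop, the sketch below mimics the flow just described with NumPy: each drive trains only on the data it already stores and returns a model delta, and the host merges the deltas. The training function and shapes are toy placeholders, not the actual on-drive code.

```python
# Toy NumPy sketch of the federated update loop described above. train_locally()
# is a placeholder for whatever training the on-drive application really does.
import numpy as np

def train_locally(model, local_data):
    """One round of on-drive training; returns only the (sparse) model delta,
    never the raw data."""
    updated = model + 0.01 * np.sign(local_data.mean(axis=0) - model)  # toy update rule
    return updated - model

def federated_round(global_model, per_drive_data):
    # Host pushes the current model down; each drive works on its own local data.
    deltas = [train_locally(global_model.copy(), d) for d in per_drive_data]
    return global_model + np.mean(deltas, axis=0)   # host merges the returned deltas

rng = np.random.default_rng(0)
model = np.zeros(16)                                        # toy model weights
per_drive_data = [rng.normal(size=(1000, 16)) for _ in range(4)]  # data already on each drive
for _ in range(4):                                          # a few federated iterations
    model = federated_round(model, per_drive_data)
```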
So what does that look like in reality?
So this is an example of a database that was run, and it is a somewhat small database,
but the concept here is I want to get to the most efficient level possible.
And this particular training algorithm shows that after only doing four iterations with the federated model, I've reached a 94% accuracy of the data.
So you can see that it asymptotically gets up to 98% over a very long period of time in the
existing model. But by doing it in a federated fashion, it only took four iterations within the
device to get to the same level of model training that it took an entire ecosystem of GPUs and other
things that we're using for machine learning. So these are the values that we're bringing.
I've saved power.
I've left the host alone.
I've stopped moving data,
freed up the data pipe to ingest more data
because I'm not moving data back out
because you always end up with a two-way street problem.
And yet I'm still providing what the customer really needs
at the end of the day, which is the value of that data.
And that's what our computational storage
and what computational storage products are working to provide. So in real time, as an example, this particular video
clip is showing that we're going to do a model training of that object. So as this person moves
the object around, the camera is tracing, following, and learning what that object is from all the
different angles. You can see up there, in the model update counter, that there have been 18 model updates in the short period of time
the video loop has been running
because this particular box here at the top,
this is the quad core in my device.
This is not the host CPU.
Those guys are the ones doing all the work off of the different drives
doing those model updates.
I don't have the host actually executing this code real time
on that image file or that video stream.
The devices themselves are executing those model updates for that particular device.
And this can go on.
We kept it short, again, to keep file size down.
But you can see that the ability to run real-time compute algorithms with no modification to the code,
this WiSARD machine learning algorithm was just copied into the multiple devices
that are used to do these model training updates.
So that's where the value of computational storage
and what we're offering our customers provides.
So AI and ML is great.
It's awesome.
We know there are big buzzwords,
and I'm going to continue to ride that wave as long as I can
because it's fun, and I'm a marketing guy
with a little bit of techie.
But there's also some real-world big data stuff like Hadoop that are useful to talk about, as well as some databases,
and different ways of looking at how you can manage these things.
So you're going to hear a lot about different database implementations, different Hadoop implementations of this product.
So I wanted to give you our spin on it for today.
So this represents a Hadoop cluster that was built, and down here kind of represents what we're trying to do with our data,
which you see that we've basically taken
a portion of the Hadoop workload,
the data management node,
and we've migrated into multiple NGD SSDs or CSDs.
And up at the top, you can see the two flat lines
that are called the 16-core host.
We've dedicated 16 cores of this Xeon processor
in the baseline of this product to normalize that result.
So no matter how many drives are in the system,
this particular workload on this set of 12 drives,
the performance is the same
as it's doing this particular application,
which is a sort application.
So we're like, well, that's great.
We can make it faster, but let's make it more efficient.
So we turned off eight of those,
or in this case, yeah, 12 of those cores,
three quarters of the processing power.
So if I turn those off,
of course, with no computational storage drives turned on,
it's going to run slower,
and it's going to consume more energy
because it takes more time to complete the task.
But then I start turning on the computational resources
within just a couple of the drives,
which manage part of the data.
And as you can see, as we start turning on these drives, at 2, 4, 6, right around 9 drives,
I am now processing this application on this data set at the exact same rate of speed as
the host was doing with 12 additional host Xeon cores versus using the computational cores inside my drive.
And again, my drive is consuming no additional power
to provide that performance benefit,
and I'm saving you power because the Xeon is not running as fast,
or it's off doing other things.
So over here you can see that we looked at it from a power perspective
because that's part of the TCO model.
And you can see, again, as I start to turn on the drives,
my crossover point on power savings is actually before I reach a performance benefit.
But then as I get significantly more power consumption savings, I've also gained 40% in execution at just 12 drives.
And I'd challenge anybody to tell me someone that does a Hadoop workload with only 12 drives.
Now, will that asymptote off? Absolutely.
It's not going to constantly be a forever better improvement.
But the simple fact that at 12 drives or half a server,
I can provide you execution 40% faster on a given workload,
that provides value to what this technology
and what our products do for our customers.
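To show the arithmetic behind that crossover argument, here is a back-of-the-envelope sketch; the wattages and runtimes are illustrative placeholders, not the measured numbers on the slide.

```python
# Back-of-the-envelope energy math behind the crossover argument above.
# All wattages and runtimes are illustrative placeholders, not measured results.
def job_energy_wh(host_watts, drive_watts_each, n_drives, runtime_hours):
    return (host_watts + drive_watts_each * n_drives) * runtime_hours

# Baseline: full 16-core host doing the sort, drives acting as plain storage.
baseline = job_energy_wh(host_watts=150, drive_watts_each=8, n_drives=12,
                         runtime_hours=1.0)

# Offloaded: host throttled to 4 cores, 12 CSDs computing inside their existing
# 8 W envelope and finishing, say, 40% faster.
offloaded = job_energy_wh(host_watts=60, drive_watts_each=8, n_drives=12,
                          runtime_hours=0.6)

print(f"baseline : {baseline:.0f} Wh")
print(f"offloaded: {offloaded:.0f} Wh "
      f"({100 * (1 - offloaded / baseline):.0f}% less energy for the job)")
```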
And then if I look at how you may build one,
and this is definitely a singular representation
of how you may consume this particular technology,
but I'm using SAS HDDs in this case
because that's dirt cheap
because that's what people want to do
when it comes to building these big data platforms.
It takes nine rack servers
to get me to 864 terabytes of data storage
or three quarters of a petabyte.
And it has a single Xeon processor in each box
because I'm keeping it dirt cheap.
I don't want to put a lot of effort into it.
On this side, we take our 8 terabyte M.2s
and some fancy new platforms that support 36 of those.
And in three single 1U chassis,
I can give you that same exact density.
So now here's your, I can shrink it and make it better play.
But what this shows up here
is I've now added 432
capable processing cores
to that subsystem.
So I went from 9 Xeons
with a whole bunch of cores
within the Xeon
to 432 additional drive cores
that can be used to manage data.
Now, TeraSort works great
because, again, data's consumed
and I'm just simply looking through it.
WordCount is another great example.
I'm counting information through the data.
I'm not looking to move the data. I'm not even looking to transform the data.
I'm simply looking for the value of the data. And that's really what comes into play for this
particular example. So another way to look at it, we did a MongoDB example, and we took a different
spin on this particular workload. So this is an example of a retail website that's running in
Mongo, and you can see the
data being generated by the different websites
kind of scrolling through this simple little video clip.
So from this perspective, I'm not
necessarily looking at computational
storage as an acceleration
of an architecture, but an ability
to scale the architecture.
I turn those things off three, four times, I swear.
So this is
going to probably slip on me again, but
let's see if
I can get it to stop.
But the idea here is I can provide scale
because as I add a storage device to increase
the size of this particular
website or a retail footprint
that's using a MongoDB platform,
I provide them the opportunity to not have to
add more processing, simply more
storage. And that's really what it comes down to value.
Again, sorry folks for that.
So,
scalable computational storage.
The ability to drive
what I classify as the new cloud, because we have
a cloud, we have a fog,
we have somebody at one point called
it mist to get closer to the edge.
However you look at it, our data
is moving all over the place. It's no longer
in just one location. It's no longer in just
one type of architecture. It's no longer
in just one type of workload.
And being able to provide
this flexibility of workloads and
deployment is very much a key
for what we're doing from this perspective.
And that's what this edge data growth is
doing when it's challenging these platforms. If you look
at some of these new servers, we have a partner of ours, Lenovo, that has built an edge server.
They were running around at the event when they first launched it by pulling it out of their backpack.
It can plug into a wall outlet.
It's got 5G engines on it, and it pulls right out of their backpack.
Well, they need storage in that server.
How are they going to get enough storage in that server?
And there's compute in that server, but if it's just plugged into any Joe Schmo outlet, it's not going to be
multi-core dual processor Xeons in there. It's a smaller, lower power, lower cost solution,
but it still needs to provide value. So if we put our particular drives in with it,
we give them some extra boost at no additional cost to the system because it's just storage.
They have to have it anyway. And if you can give them a high-density storage solution, that's even better.
And then, as I've tried to illustrate
this concept of in-situ processing
or our version of computational storage,
it has a wide range of opportunities
to engage with our customers.
You can SSH right into our drive
if you want to have that capability
and treat each drive as a Linux microserver.
You can just simply use the GUI we provide and recompile on ARM.
That's the extent of some of the software changes required for some customers.
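A minimal sketch of that Linux-microserver view might look like the following, assuming the host can reach each drive's on-board Linux at some address exposed by the vendor tooling; the addresses, credentials, and command are placeholders.

```python
# Hedged sketch of treating each CSD as a Linux microserver over SSH. Drive
# addresses, credentials, and the command are placeholders; how the drive's
# Linux is exposed to the host is vendor tooling that is not shown here.
import paramiko

DRIVES = ["10.10.0.11", "10.10.0.12"]           # hypothetical per-drive endpoints

def run_on_drive(addr, command):
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(addr, username="csd", password="csd")  # placeholder credentials
    try:
        _, stdout, _ = client.exec_command(command)
        return stdout.read().decode()
    finally:
        client.close()

for drive in DRIVES:
    # e.g. check what is running on the drive's quad-core ARM complex
    print(drive, run_on_drive(drive, "uname -a && nproc"))
```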
Our particular workload example that we did last year,
I didn't want to reuse the same slide two years in a row,
showed that we could save about 500 times the load effort
for the algorithm that Microsoft was using for their AI workload
because they're not loading
all the images, they're not loading all the data into the host memory, and they're not loading
memory, processing on it, flushing it, reloading. They're simply getting the value of the data out
of the storage devices at scale. So I'm running a little bit ahead of schedule because I talk way
too fast when I get excited about my technology. So I've got a few minutes for some questions in the room. If people have any thoughts, I see some
curiousness on some people's faces. So please feel free to ask a question. I can go back over
anything or I'll let you free for a few extra minutes.
How does that communication scheme work between your onboard Linux and the host? So for our particular implementation of this,
if I go back to my fancy little drawing here,
we use, in this case, a tunnel over the NVMe bus.
So we actually embed TCP packets within the NVMe transfers
to move the compute resources over.
So I've heard people say that TCP can be a little slow
or a little odd as a choice,
but it's a tried-and-true way that exists within Linux
as a platform.
Again, we're reusing things that are already known,
not trying to reinvent the wheel.
So yes, thank you for the question.
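To give a feel for what that tunnel buys you, here is a hedged host-side sketch: once the tunnel presents the drive as an ordinary TCP endpoint, plain socket code is enough. The address, port, message format, and the on-drive agent are all assumptions, not the actual API.

```python
# Hedged sketch of host-to-drive communication once the TCP-over-NVMe tunnel
# presents the drive as an ordinary network endpoint. Address, port, and the
# JSON request format are placeholders, not the vendor's actual protocol.
import json
import socket

DRIVE_ADDR = ("10.10.0.11", 5000)   # hypothetical tunnel endpoint for one drive

def ask_drive(request: dict) -> dict:
    with socket.create_connection(DRIVE_ADDR, timeout=5) as sock:
        sock.sendall(json.dumps(request).encode() + b"\n")
        reply = sock.makefile().readline()       # one JSON line back from the on-drive agent
    return json.loads(reply)

# Ask the drive to search data it already stores and return only the matches,
# never the raw data.
print(ask_drive({"op": "grep", "pattern": "blue_dot", "path": "/data/logs"}))
```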
Cool.
Yes, sir?
You talked on a previous slide about your 16-core system versus four cores.
You talked about power on that.
And what you used as a reference was you did not have the additional processing power for the computational services?
So the question he was asking is, in this example here, this particular case, we did compare both with our drives on and off, which would be simple.
I don't just turn my computational resources on.
He asked, did we also do a comparison
of using non-NGD or just traditional drives?
We did do that as well.
It shows a similar type of performance challenge,
but I wanted to just...
We didn't put that particular data in this slide,
but we have done that work.
If you're interested in knowing more about that,
I can certainly provide it.
This was our drives on and our drives off
for this particular representation, for sure.
Yes, sir?
You talked about the architecture
having a processor with a hardware acceleration.
For the example,
was it largely just using the RISC cores of the CPU,
or did you actually offload something to the hardware acceleration? So right now, as far as the products we provided,
we do have a hardware acceleration engine inside the CPUs.
We have not actually had to turn it on yet.
We're actually working right now with a customer
to turn on that hardware acceleration
to further advance some of the AI workloads that we're using.
Because we do have workloads today that we've tested, that customers asked us to look at,
that are not optimized yet, and so we actually slow the system down.
I'll admit that we can be slower in some workloads.
We are in the process of turning that on, and there's a lot of opportunity
because it's a very open-ended engine from a perspective of what we can do.
But it's part of that ARM subsystem that we've embedded inside the drive.
How are you balancing the power of the 8 watts between the NAND and the ARM?
So the question,
and I should have repeated the last one,
so I apologize.
So the question is,
how do we balance the power
between the overall subsystem of this?
So when you look at my particular solution
as an NVMe off-the-shelf drive,
I'm not quoting a million IOPS.
I'm not quoting three gigabyte writes.
I'm not the fastest drive on the planet because I don't need to be.
So we've challenged and taken an effort looking at it as a random read,
heavy read-centric-like device, which is what everybody's calling a read-intensive drive.
There's overhead in those devices if you build the rest of the ASIC correctly.
So we did an ASIC from the ground up;
we're on 14-nanometer process technology,
which gives us a bunch of power savings.
And then for the workloads, we've optimized the writes
so that we're not conflicting with that particular 8-watt limit
for the M.2.
And then when you get into the EDSFF and U.2,
they have a much larger envelope,
so the most challenging one we have
is definitely the M.2.
And you'll see that in the raw performance numbers.
But again, as I've shown,
raw performance versus compute,
I don't have an issue with the speeds of the drive
when you're using it in a computational example.
Yes, sir. Sorry.
You first. Your hand went first.
You're showing a direct connection there
between the NAND and the host, right?
Does that work?
So this is an oversimplified marketing diagram.
No, so the trick here is
the data comes in over the NVMe protocol,
the transport, PCIe transport NVMe protocol.
Once it's in the media controller,
we're transforming that data to be stored in the NAND
using the NAND, in this case,
toggle mode or on-feed type of architectures.
This really shouldn't be touching that.
So we do use the media controllers attached to the ARM processors
to do that so that we can see the
data structure. Because we have to know the data layout,
we have to know which LBAs are where, that kind of stuff.
But the trick is I'm not
using anything in the way of a storage protocol,
I'm simply creating a bus
between the application core and
the media cores to do that management, which is still
significantly faster than using
the NVMe protocol overhead.
So you're going through the media controller, but you're sharing the RAM and then you're accessing it?
Basically, yes.
There's a couple of additional internal buses that reroute where the application cores
talk to the media,
but the trick is we did not put,
in our case, we did not put the application processor
in line blocking the NVMe
so that we don't create a contention between
writing data and managing data.
Yes, sir?
So,
does your host API
stack support getting data
from a different CSD
into the application processor
on another one?
So the question came from my friend who loves peer-to-peer,
asking if I support peer-to-peer.
Or CMB.
Yeah, or CMB.
So this first version of the product
is not designed for that particular workload.
It is fairly new.
It's your host memory, right?
Yes.
So the API isn't true peer-to-peer,
but if I want to get input data from a couple of my drives
and bring it to one of the ARM cores, you can do that through host memory.
Correct.
Any time you want, right?
Exactly.
So there are multiple ways to deploy this in an environment.
Today, our view of it with our current solution is I'm going to have a whole bunch of these in a system.
I'm going to push an exact copy of whatever I want to do to every single device that's in that particular workload environment. If the information I'm interested in is not located on one of those drives, I'm simply going to get nothing back from that drive
because it's looked through it and has no value added. It comes back with a null or
whatever you want to call it. Longer term, now that things like peer-to-peer, CMB, and
some other new features are coming out within NVMe and the PCIe architectures,
they will certainly be enabled in the next solution from that perspective.
Yeah, absolutely.
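The deployment pattern just described, pushing the same task to every drive and letting drives with nothing relevant return null, looks roughly like the toy sketch below; submit_to_drive() is a stand-in for the vendor's host-side call, and the "drives" here are just in-memory dictionaries.

```python
# Toy sketch of the scatter-gather deployment pattern described above. The
# "drives" are plain dictionaries and submit_to_drive() stands in for the
# vendor's host-side call; a real CSD would run the search in-situ.
from concurrent.futures import ThreadPoolExecutor

DRIVES = [
    {"id": 0, "records": ["blue_dot_17", "red_dot_3"]},
    {"id": 1, "records": ["green_dot_9"]},
    {"id": 2, "records": ["blue_dot_42"]},
]

def submit_to_drive(drive, pattern):
    hits = [rec for rec in drive["records"] if pattern in rec]
    return (drive["id"], hits) if hits else None   # nothing relevant -> null back

def scatter_gather(drives, pattern):
    # Push the exact same task to every device; combine results in host memory,
    # since there is no drive-to-drive peer-to-peer in the current product.
    with ThreadPoolExecutor(max_workers=len(drives)) as pool:
        results = pool.map(lambda d: submit_to_drive(d, pattern), drives)
    return [r for r in results if r is not None]

print(scatter_gather(DRIVES, "blue_dot"))   # -> [(0, ['blue_dot_17']), (2, ['blue_dot_42'])]
```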
Another one I did have.
What are your thoughts on Ethernet?
So, good question.
Is Ethernet a good place to put in line with the NVMe?
So, we've had a lot of discussions with customers about that.
We have no qualms with necessarily doing that.
As of today, there's no immediate plans to do it on this solution.
There's always the dongle or the attach point from certain other vendors in the networking space that can provide that solution for us.
So, there's opportunity to do that, but it also creates a lot of challenges that a lot of people may or may not know about around drive management and things like that.
So for today's products, we're going to stick with an NVMe solution.
So, yes, sir.
You already asked a question.
Yeah.
No, you.
SW.
Sorry.
So there's some implied data structure in there.
You're processing video, which is not necessarily blocks. So you're talking probably about this guy right here, right, for example?
Any of those where you're...
So this particular application is being used to capture and store information,
and in the host memory version,
a non-computational version,
you're streaming that video into the application.
It's doing the data manipulation,
identifying the object and sending it out.
And it's being shipped to a drive
to be stored as the drive stores it.
I'm copying that instance into our device.
So all I'm doing is moving that ability,
that exact application that already knows that file structure,
already knows how that file system's working,
it's attached to the device just a little closer.
I'm not recreating that structure from that perspective.
It could be file, it could be block,
it's whatever that particular application
and this API are expecting,
there's a replication of it in the drive.
So you're using the drive's file system
or the host's file system. Is that true?
How does the Linux kernel
on the drive that has
a file system, I assume,
and then you have the host's file system.
Which file system is actually
used to
sort of create the structure of
any...
It's based on the host file system.
And you're able to pass
the sort of LBAs to that file
and the details to the...
Right, which
our friend over here in the corner, I didn't get
his name, made the comment about my data path
drawing being wrong. I am talking through
the media controller that's
understanding the media structure
from my on-drive OS. So it
understands that because, again, if I were to turn it on or off, the host will still do the same work.
I'm just moving an exact copy of it in there to allow that application to run closer. The data
path for my internal controller is talking to the same engine that is in line of the data path from
the host engine. So I don't have that overlap problem.
So I may have drawn it incorrectly,
but it functions correctly.
So you're saying that the ASIC
is issuing NVMe read commands
using LBAs provided to it by the host engine?
In a manner of speaking, yes.
There's a little bit of a difference
because we're not using NVMe as such.
We're using an internal bus,
and that's some of the IP that I've created,
or we have created, if you will.
So it's not a specific way of doing things?
Yeah.
It's on a separate internal bus through that process, yes.
It does create...
It creates a lot of questions.
I'm happy to talk about them one-on-one,
potentially under NDA.
There's a very good reason they don't put the CTO in the room
because he would give all those answers.
But yeah, so again, the architectural structure of it,
this is a very simplified version of it.
There are definitely ways we can show
how the programming model works. This is more of a
hardware look to it, if you will.
We have talked a little bit in the past about
the software and how the data
actually flows within the device.
But again, that tends to be a little bit more
low-level conversation.
Sure.
Sure. So typically then, your customers, is the host API assigned directly to the block device,
like in namespaces?
Or do customers put a file system
on the host,
on the block device
and then the host API
talks through the file system?
So...
Or are we getting to
another area you want?
That would be an area where
I've chosen to be
slightly dumb enough
that I can't answer it or I can say I don't know.
Every once in a while I get into that conversation with the team and they're like, you don't want to know that because you don't want to have to lie to people saying you don't know what it means.
There's definitely some secret sauce there that I am not comfortable with sharing.
I think the idea is there's a lot going on, and there's definitely a lot of questions, but at the same time, there's a lot of examples
of where it's already working.
I think that's an interesting point in general.
Yeah.
Something like that.
Flexible, programmable computation.
The one that you're providing with your ARM cores
is linked to this as well.
What does that mean?
The file system, the key values, whatever they are. Just remember, because we're recording this. I know, they're not hearing him.
So Stephen, my friend in the TWG
and also co-worker in the computational storage products department
was discussing the concept of the file systems
and how they work and aspects of what we need to do
in the way of work within the computational storage TWG
to make this more understandable for
our customers. So there are aspects
he was asking about how the interaction between the
API and the file systems work that
I was admitting that I don't know the answers to.
And that is definitely some
of the secret sauce and some of the patented stuff that
they've done inside the company itself.
Cool.
Yes, sir.
How do you offload your
operations? Do you use a... Yeah, so we have multiple different models.
We have a GUI-like model for simple engagements
that use the API, where you do a storage call to our API,
and it does the work of migrating
an ARM-compiled version of your application
so that as a user, if you're doing it at the highest level,
you have to recompile whatever application you want
into an ARM instance so that we can migrate that
into the ARM core.
So if you're working on x86...
Yes.
So the host side will have to do the compile of the code
into ARM to allow it to be migrated into the device.
Beyond that, it's not changing the code
or rewriting the code.
It's simply recompiling for a different architecture.
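The "recompile, don't rewrite" flow could look something like this sketch, using the stock Linux ARM cross-compiler; the drive endpoint, paths, and the copy-and-run steps are placeholders standing in for whatever the GUI or API does for you.

```python
# Hedged sketch of the "recompile for ARM, don't rewrite" flow described above.
# aarch64-linux-gnu-gcc is the stock Linux cross-compiler; the drive endpoint,
# paths, and the scp/ssh steps are placeholders for the vendor's GUI/API.
import subprocess

SRC, BIN = "wordcount.c", "wordcount.arm"
DRIVE = "csd@10.10.0.11"                         # hypothetical drive endpoint

# Same source code, different target architecture.
subprocess.run(["aarch64-linux-gnu-gcc", "-O2", "-o", BIN, SRC], check=True)

# Ship the ARM binary into the drive's on-board Linux and run it there against
# data the drive already stores.
subprocess.run(["scp", BIN, f"{DRIVE}:/opt/app/{BIN}"], check=True)
subprocess.run(["ssh", DRIVE, f"/opt/app/{BIN} /data/input"], check=True)
```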
Yes, sir?
With the ARM core
versus trying to do the
application,
are you saying that the ARM cores
are more efficient, or just that
the whole, that it's more efficient because it sits
closer and you've got a better bandwidth?
So the question came
out, is it using the ARM cores
more efficient, or is it just because
it's closer that it's creating a benefit
to the customer? Is that a proper repeat?
Okay. So from that perspective,
the ARM cores we chose
definitely have some trade-offs. We could go
much more powerful cores. We could add more
cores. We chose, for this particular
case, a quote-unquote less
efficient as far as processing core,
but a more power-efficient core
because we didn't want to engage that
power offset. That's why I said earlier
in the presentation, there are absolutely workloads
where I slow things down today.
Because it's closer, because I'm
not requiring a storage call of actual
media out of the device,
at the size of our devices they are today and the
speeds that we have within the flash, the cores we've chosen are efficient enough for 90% of what we've seen today
in the market. So the idea is if I have to pull media from here over NVMe more than two times to
fill a host memory buffer so the host can process on it, I can probably provide you some sort
of either acceleration or equal performance at lower power. Because the more you have
to pull that data out and put it into host memory, flush it and repeat and keep doing
that iteration, the more efficient this subsystem becomes. Our goal is to limit the IOs coming
out while you're allowing continued IOs to go in. And that's really where our focus is. So it's a storage-centric computational storage solution.
As you can see in the server complex over there,
we still show a GPU.
Real-time data coming in, say, for example,
front-end LiDAR and cameras from an autonomous car,
you're probably not going to use my solution for that.
You'll have some other form of high-performance NVIDIA
or something doing that,
because that's true, in-line, real-time,
I can't hit something.
But all of the rest of the surrounding data
and all the other information that that car has
from an ecosystem perspective
can easily be processed by this particular solution.
Yes?
Today, the drive presents itself as a block device.
Do you see, for the application use cases going forward,
that it is more beneficial, perhaps,
to be a file or an object device?
Sure.
So the question is, today it's a block storage NVMe device.
Do we see value in potentially having it be file or object-oriented type of a solution?
So for today, the size of the company, the efforts we put forward,
it's going to stay as an NVMe device.
As the ecosystem evolves and the needs for products in that space continue to evolve,
we'll certainly look at adding those other variants of the product to our portfolio.
But right now, it's strictly an NVMe block storage device.
All right.
If there aren't any other questions,
we're at the actual end time,
so I'm glad I left some time for some questions from you guys.
Thank you very much for your time and attention.
And appreciate it.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further with your peers in the storage developer community. For additional information about the Storage Developer Conference,
visit www.storagedeveloper.org.