Grey Beards on Systems - 155: GreyBeards SDC23 wrap up podcast with Dr. J Metz, Technical Dir. of Systems Design AMD and Chair of SNIA BoD

Episode Date: October 9, 2023

Dr. J Metz (@drjmetz, blog), Technical Director of Systems Design at AMD and Chair of SNIA BoD, has been on our show before discussing SNIA research directions. We decided this year to add an annual podcast to discuss highlights from their Storage Developers Conference 2023 (SDC23). Dr. J is working at AMD to help raise …

Transcript
Starting point is 00:00:00 Hey everybody, Ray Lucchesi here with Keith Townsend. Welcome to another sponsored episode of the Greybeards on Storage podcast, a show where we get Greybeards bloggers together with storage and system vendors to discuss upcoming products, technologies, and trends affecting the data center today. We have with us here today Dr. J Metz, Technical Director of Systems Design at AMD and Chair of the SNIA Board of Directors. Dr. J has been on the show before, and he and I were at the Storage Developer Conference a couple of weeks ago, and SNIA also presented at Storage Field Day 26, which I attended. So, Dr. J, why don't you tell us a little bit about yourself? What were your most interesting takeaways from the conference? Sure. Thanks. Appreciate it. Yeah. So, my name is
Starting point is 00:00:55 J Metz. I am a storage and networking guy, predominantly storage networking, I should say. And I basically have worked in the no man's land of storage and non-storage companies for pretty much my entire career. But I am also the chair. It's pretty much the IT universe, right? I guess so. I guess that's one way to put it. Okay. Yeah. I mean, you know, the thing is that storage is often the redheaded stepchild's redheaded stepchild, as I like to say. And so, you know, the funny thing is that storage is one of those things that people tend to think about as kind of a necessary evil, you know, that all you have to do is store a bit for me. And
Starting point is 00:01:36 you're, you know, what do you have to do? It's not a big deal. What's the whole point? What's the problem here? You know? So what I do at AMD is I was hired, and I'm working with the team that I work with, to kind of coordinate across the company to move beyond the component side and into a systems approach where we talk about the relationships between the different components. And storage is a wonderful method to do that because data is, you know, the lifeblood of any data center or any kind of system whatsoever. And so having these pieces work together and those relationships between the components is where I tend to thrive. So that's what I do for the organization. And then I also work as the chair for SNIA, the Storage Networking Industry Association, where we primarily focus not only on standards development, but also on education and tradecraft in helping people understand the futures of storage,
Starting point is 00:02:44 working on different new developments. And at the Storage Developer Conference, which SNIA hosted, we saw all kinds of new technologies, everything from, you know, memory movers, something called the Smart Data Accelerator Interface, which was highlighted at Storage Field Day, as well as DNA storage, which was also highlighted at Storage Field Day. And CXL, computational storage, the work being done for SMB. We had plugfests.
Starting point is 00:03:16 It was just all things storage. And a lot of the people that were involved are just true storage geeks, and it's very much developed for that kind of personnel, that kind of audience, to take the Flash Memory Summit and just really go deep into the depths of the content that's there. I would say, you know, because I was at Storage Field Day I didn't attend a lot of sessions, but it seemed like they were a little less technical than normal. Maybe it was just the sessions that I looked at. You know, in the past when I've been
Starting point is 00:03:50 there, and quite frankly, it's been quite a while since I've been there. So maybe that was the conference for me. A lot more technical stuff was discussed. I mean, you know, like file systems and things of that nature. But maybe I missed all that, right? Tell me I missed that, Dr. J. You definitely missed it. Okay, well, that's good. That's good to know. So, you know, a couple of things that you guys talked about at Storage Field Day. Let's talk DNA storage. I mean, where the heck is that? I mean, the guys seem to think that they're at a point where they can store gigabytes of data and they're scaling up to terabytes. Oh, yeah. I mean, so this is a long game, right?
Starting point is 00:04:37 There's no short-term thing here for this kind of work. And they've got a lot of stuff that they're working on that are still players to be named later. But they've really only started recently in the grand scheme of things. It's only a couple of years old and they only joined SNIA last year, where they became a technical affiliate.
Starting point is 00:05:01 So the issue here is that if you're going to be thinking about these things and you're going to be working on the different hardware and software layers, not to mention the different protocols for this, it's one of those elements where the actual capacity is pretty irrelevant. Whether you're talking about gigabytes or terabytes, that's not nearly as important as the persistence over time. Right. So it's a starting point. Maybe. I mean, it's certainly, you know, persistence is certainly a key to any storage, right?
Starting point is 00:05:39 I mean, but nothing's going to persist. Well, nothing besides DNA is going to persist for thousands of years, let's say. But access and, you know, gigabyte, you know, we're talking terabyte. I got a terabyte on my damn laptop, let alone my iPad. And, you know, it seems like 600 gig is gone without even thinking about it. Well, I mean, capacity is important. Yes. Capacity is important for the actual implementation and the functionality of putting it into practice without question. Absolutely. But in terms of where the focus is right now, the capacity will come, right? It's one of those
Starting point is 00:06:19 things where if you do this right, the capacity is going to come. The methods and the technology are pretty blunt force trauma at the moment. There's not exactly surgical precision when it comes to the capacity side, because the real focus is being placed on the level of persistence over time. And so I didn't mean to minimize the fact that capacity, you know, does have a significant role to play over time. But the purpose here specifically is, you know, the fact that they're looking at creating a method of encoding, decoding, searching, moving, and prioritizing DNA-based permanent storage. And the capacity will eventually start to come over time because the archiving part of it is
Starting point is 00:07:01 obviously key. So having massive amounts of capacity is sort of an assumption at this point. So Dr. Jay, is the primary use case mainly for the resilience and archive being that primary use case? Or, you know, we had someone on, I think, Ray, last year, we were talking in detail about this. And is replication more of a a use case for a dna dna based storage i think well you're so replication is part of the process by which data is made resilient in dna storage so can you be a little bit more specific which when I say, when I think of replication, I think more of application availability and the ability to, you know, long-term, let's, you know, abstract
Starting point is 00:07:53 like DNA and the protocols from it. When I'm talking to customers about replication, it's mainly to put data where it needs to be for either business continuality or from a performance perspective. So, you know, this is why we love object storage, because replication is inherent in object storage. And it also benefits for resiliency, et cetera, et cetera. So from, I guess the better way to state the question is, what's the in application for the resiliency that DNA storage provides? I believe is probably smart to think of it as sort of a write once, read almost never. Yeah. Archive, deep, deep archive. Yeah, we're talking like glacier levels of access. And the only thing that you really need to make sure
Starting point is 00:08:51 is that the same bit is going to come back when you need it, right? So from an application perspective, you're not going to see this as a means of mission-critical storage at all, right? That's not what this is. We're not talking about application access on a real-time basis. So as long as the replication is there for the security of the bit, I should say.
Starting point is 00:09:17 That's intrinsic to the storage medium. It's not necessarily replicating data across vast distances or anything like that. You're replicating within the DNA storage cell, I guess, in order to make sure you've got sufficient copies to do reads and last a long time. Yeah. And I think the other part of it, too, is that the locality of it may actually have some sort of influence on the replication policies. And when you start to tease that out a little bit, the thread will draw you into, okay, well, how do we find the data that needs to be put into any one particular place? So the searching becomes a critical element. And then, so how do we do the
Starting point is 00:09:57 data movement? Where do you actually start to have the process of reconstructing that data somewhere else? And what are the methods by which that happens? And those are all things that the DNA Storage Alliance are working on. You mentioned search. Why do you think search is such an—most archives don't have a search requirement per se, I guess. I guess it's there all along, but it's not something that's specific to the archive. Well, there is no concept of things like inodes in DNA storage, right? There could be.
Starting point is 00:10:38 It doesn't necessarily have to be nothing like that, but there could be. Oh, well, whether we call it an inode or we call it something else, the structure of the way that inodes work in drives is not the same way that they would work in DNA. So the method has to be recreated and reexamined. So you're saying that they have to have some sort of way of indexing the data that's sitting in this DNA, I'll call it device, for lack of a better word. So that they know where the data is or at least what the data is and that sort of stuff. I mean, the sessions at Storage Field, they talked a lot about, I'll call it ECC, and having informational tags in front of every segment, I guess, which might be considered a block in a normal storage device. It says this is block number 25 of 2018
Starting point is 00:11:42 of this particular file entity or something like that. Yeah. And I guess if we're talking archive, let's bring it back to kind of my level in understanding of storage and how storage works. If you're as old as any of us and you've had to restore anything from tape and you didn't have a catalog of what was in the tape, like I've had projects that I've been called in and literally just given a box of tapes and said, restore the exchange data from these tapes without knowing what backup software that was used to back it up.
Starting point is 00:12:15 Just simply, here are the tapes. That is, with hard drives, it's quite different. Connect the SAS drive to almost any system, it is going to be able to read the files. It's going to be able to identify the file system on the device. And then, you know, I just have to install the drivers for that file system in my preferred OS. When we're talking about deep archival, you know, think 20 years from now, 30 years from now, someone's given a proverbial box of DNA and said, read the data that's on this DNA device. Restore my exchange data from this DNA device.
Starting point is 00:12:57 Yeah, yeah. What I was trying to say is that we're not talking about removing the functionality, right? We're not talking about that at all. We're talking about because of the nature of the medium involved from dna there is the processes we use today the track where the data is in a dna strand is not effective for this medium correct correct yeah so so i and i do appreciate you giving me the opportunity to try to clarify what i'm trying to say here um you know because well, let's face it, the propeller spins the other way once you start getting to this level.
Starting point is 00:13:49 You know, we're talking about biochemical engineering here, not just storage. Right. And all, hey, I am not a biologist and I'm not a chemist. So, you know. I don't think anybody on the call is actually. Yeah, exactly. So I have to be very careful what promises I'm making.
Starting point is 00:14:07 Right. But I mean, you know, all of the things you're talking about, you know, is where we, you know, where we have to try to get everything right. Because, you know, as you know, in storage, we only have one job and that's give me back the right bit that I asked for when I asked for it. Right. That's the job and the the nature and i think one of the lessons that we learned you took you bring up you know tape keith you know i think one of the lessons that we learned is that we don't treat tape exactly the same way way that we treat um you know spinning drives and we don't treat spinning
Starting point is 00:14:39 drives exactly the same way that we treat nand and we don't treat nand the same way that we treat NAND. And we don't treat NAND the same way that we treat NOR. So nobody uses NOR for a reason, right? Because the nature of the medium actually affects what happens with the storage itself. So all of these systems, all these processes that have to be put in place, and that's what makes storage actually quite interesting and fascinating to me because of the fact that, you know, the storage is a system. It's not just the medium. It's not just the way that the, you know, the bits are laid out. It's a process by which all of those things are kind of packaged together so that you can do your one job. And that means finding the right bit and in a DNA, you know, chemical, biochemical solution.
Starting point is 00:15:23 That's beyond my ken. I got to be honest with you. I couldn't sit here and tell you how a searching function for the correct DNA strand was supposed to work with any confidence whatsoever. I just know that that's one of the things that they have to work on for creating it in a way that anybody can do data structure storage using DNA and adhere to the same principles for what this is supposed to mean, right? Because if you've got vendor A and vendor B and vendor C, and they've got different ways of encoding and finding data, that's just not going to work too well. Same thing as any other storage device. Yeah. I think they have some way of tagging the block with some information that can be matched
Starting point is 00:16:05 and that's to some extent how the searching works? I don't know. It's all kind of a fog in my biochemical soup up there someplace. You have our own biochemical soup? Exactly. Maybe neurological soup. I don't know what you'd call it.
Starting point is 00:16:23 Now we start getting into behavioralism. It was a pretty interesting, it was a pretty deep discussion about how, you know, they're going to try to protect the data. They didn't actually tell us what the ECC code was going to be, but they did mention that ECC was going to be existent and there was going to be like a prefix and a suffix to every segment or block of DNA. And, you know, it's kind of like, Ethan, that to some extent the packets get lost, but they're somewhere in the soup, so you can find them, I guess. It's pretty interesting. And, you know, replication of the actual packets was an important aspect of the DNA storage. Yeah.
Starting point is 00:17:11 I kind of felt bad for them because they had to compress so much of that information down into the 20 minutes that they had to talk about it. Yeah. And I guess, as they say, it's kind of a good problem to have because people were very interested in it. They were extremely curious about what was going on. They wanted to find out more. I guess it falls into the always leave them wanting more. Exactly.
Starting point is 00:17:27 I thought it was a good session. They did a good job. The guy from Western Digital was pretty sharp. He understood the storage aspects of it. He understood, to some extent, the biochemical aspects of it. So, I mean, it was good. The guy we had on a couple years back, Catalogic, was yet another DNA storage vendor that was going after this thing. What surprised the hell out of me was
Starting point is 00:17:48 they're at the gigabyte level. They can store and retrieve a gigabyte of data pretty easily today without too much of a problem. They seriously think that terabytes on the horizon and it's not that hard from what they're telling me,
Starting point is 00:18:09 which is pretty damn impressive from my perspective. I agree. I agree. And I think one of the things that kind of scares me overall is that I'm rapidly getting to the point where I will not be able to understand how storage works anymore. I think some of us are already there, quite frankly, Dr. J. I mean, we understand somewhat the protocols, but, you know, how it actually works under the covers, not as much. Optane is, you know, one of the things that, you know, it looked pretty interesting, but it wasn't clear what it actually did under the covers. Yeah. recovers but yeah and the the funny thing is is that the optane story itself is uh is a kind of a
Starting point is 00:18:49 a unique case study i think we're going to find something out of a harvard business review at some point in time oh yeah yeah to try to figure out you know what what went right and what went wrong there because i thought the technology was really quite um quite amazing so yeah it was it was interesting you know how micron and and intel kind of split up the technology end of it and you know it's it i i don't know if there was some restrictions on what micron could do with the technology whether they could you know offer it to other vendors or not probably not but you know it could have been a little bit more open and maybe it would have made a different sort of effect or impact on that.
Starting point is 00:19:28 Yeah, we had got a terabyte of the stuff in our lab, and we played around with it. It was quite amazing. Finding a terabyte of Optane is pretty impressive, Keith. Well, I know a couple of people at Intel, so it was really fun to play around with. It was really interesting to play around with. It was really interesting to play with. The concept, and for later days, like everything else is cache. It's memory
Starting point is 00:19:56 tearing at the lowest level. And it kind of has me curious about some of the other topics at SNEA. And, you know, as we're kind of coming out of the fog of Optane, CXL is starting to become kind of the... That was, yeah, it was a very prominent discussion. I mean, the other discussion at Storage Field Day 26 was the SDXI. You want to tell us a little bit about that, Dr. J? Oh, sure. So SDXI stands for the Smart Data Accelerator Interface, and it is a software-based DMA engine, data movement. And the reason why this becomes really important is that other hardware-, DMA engines are unfortunately not portable, right?
Starting point is 00:20:49 You have very specific hardware-related engines to, you know, to move data from one location to another. I mean, it was always hardware-driven. I mean, it was the speeds and stuff like that required were pretty impressive, right? I mean, we're talking memory access, right? Yeah. I mean, and I think one of the things that's really kind of cool about SDXI is that not – well, it's software-based for one thing. But the other issue is that you could run it on any device that could run SDXI, run the software-based solutions.
Starting point is 00:21:25 There's no instruction set required. There's no specific hardware set required. Basically, if your hardware vendor creates something that has SDXI in it, you could basically have your application address it and attach to it to do some memory movement. Now, as far as functionality is concerned, where things get really cool is that what it allows us to do is affect the application performance when memory access is required that you can't currently do. So let me give you a better example. You are very familiar with containers, with virtualization, with the abstraction layers, and we're getting further and further and further away from the hardware, right? And that becomes a problem because what happens is the application developers,
Starting point is 00:22:15 they don't like the way that things are working, so they basically build another abstraction layer on top of it. I don't know if you've ever really traced how many different abstraction layers they are for metal as a service, but it gets really deep. Yeah. I've seen a couple of these things. It's painful. Yeah, exactly.
Starting point is 00:22:31 So by the time you actually do all these translations to get memory to move from one location to another, let's say, for example, you're doing storage virtualization just as a nice simple one, right? You're talking a pretty hefty latency element, you know, to get that stuff coming back. And so what winds up happening is that SDXI allows the application to directly call the hardware to move the data from one memory region to another memory region, and effectively bypasses all of those abstraction layers. So you don't wind up having to worry about having to go through layer after layer after layer of memory abstraction and create all these different buffers. You really start to think about all the different buffering that has to go on for each of these
Starting point is 00:23:17 different abstraction layers. It's just, it's enormous. So by being able to just simply say, move this data from this location to another location and have it accessible via a privileged software thread, you bypass a whole bunch of this latency-inducing movements to these layers. You see this operating across something like GPU Direct to move data from a storage device to a GPU? I mean... So, no, it's not quite like that. So the way that the way that GPU direct for storage works is it effectively avoids from one device to another device to bypass the round tripping that goes through multiple, multiple software layers of each of these different... Let me rephrase that.
Starting point is 00:24:06 Let me try that again. It's a little bit different with GPU Direct because the way that GPU Direct works is that it doesn't require the GPU to call the CPU for controlling the movement between different devices. In other words, the CPU doesn't have to... I thought I'd move data directly from storage to GPU without having it be buffered in the CPU.
Starting point is 00:24:29 That was my understanding of GPU direct. Correct. Yep. And that winds up being, you know, the big advantage to GPU direct because you're not going up into the kernel for that. So that has to do with avoiding the use of the CPU as a buffer for that memory movement. Right. What STX-9 allows us to do though, is it allows us to use the CPUs, the application that's on the CPU to basically move into a different memory region as required without having to go
Starting point is 00:24:57 outside of the CPU or without going through multiple memory buffers. Okay. So from a practical perspective, I have, I have, you know, I'm a virtualization guy. I'm a traditional virtualization guy and I'm, let's do a simple form of virtualization. We're accessing storage via CNI from Kubernetes and that storage is being provided by a VM in your hypervisor. So I'm going through the Linux kernel to access that and Linux drivers is yet on storage or on a VM that's provided by storage to another Linux system that's using the file system to provide the storage. So, you know, this is three layers of abstraction that the application developer has to go through to call memory. So this capability doesn't necessarily bypass the Linux kernel, but necessarily bypasses
Starting point is 00:26:14 the drivers so that I can call the storage directly from the hardware device, the ultimate hardware device that's providing the capability? Well, this is more of a memory data mover than a block data mover. So we're not talking about LBAs in the way that a traditional storage device might be accessed via any bare metal or virtualized environment. So this is really how do we handle the bypassing of the different boxes. Dr. J, tell me what a memory mover, who would do a memory mover? I mean, shared memory kinds of structures, is that what you're thinking of? I mean.
Starting point is 00:26:51 Yeah. So, I mean, Keith's example of the virtualized environment is a really good one, you know, the container environment. So, in many different storage environments, you've got, you know, virtualized elements of the controllers that require, you know, the data to be moved from one virtualized environment to another virtualized environment. So let's say you just got two containers, right? And you need, you basically want to move from an application container into a storage container, or, you know, from an application, which in a VM to a storage VM, right? So what you'd have to do is you'd have to have the hypervisor do all that memory movement,
Starting point is 00:27:29 which in and of itself is a virtualized environment from the actual abstracted actually from the storage. What we're looking to do here and what we're talking about doing is saying, instead of talking to the hypervisor, talking to the memory buffers, we're going to talk to the hardware because we're really just moving to these protected memory spaces, so that the information is the data is accessible directly by the by the application and the virtualized environment. So for instance, instead of writing zeros with the CPU into this environment, you can basically just call a SDXI routine to zero out that memory space right so the work is actually done by the cpu uh naturally without having to have all these different these different component
Starting point is 00:28:11 parts you know uh continue to move data from one buffer to another buffer so if i'm building a system a tier memory system and i need to move memory from different tiers of storage this would be a conduit for that and enables me to build my application across more platforms just as long as they support this protocol correct as a matter of fact the applications don't know any of this i mean although basically what they're going to do is they're going to call the, you know, the calls and that's it. Yeah. So if I'm, if I'm, if I'm working a Kubernetes project or Kubernetes related product project, this is where I would build, I would build this capability into the platform that I'm
Starting point is 00:28:56 building. So let's put some names behind this. If I'm OpenShift and I want to say, you know what, you want to run your applications faster across more of our cloud providers? We'll support this protocol so that you don't even, as a developer, you don't even have to worry about it. You just, memory and access and IO is just faster on something like a OpenShift across different brands, whether it's AWS, OCI, or Azure, it's at that level. So the application developer isn't calling, may not call it, but the platform that they're using may call it. That's a good point. Go ahead. I'm sorry. Go ahead, Ray. No, no, no. Answer, Keith, please.
Starting point is 00:29:42 I think you're spot on. I mean, we haven't had that conversation yet with something like OpenShift, but I think part of that is simply because of the fact that the SDXI protocol is so new, right? But having one of the abstraction layers in between the storage and the application that can do this as high up to the application as possible
Starting point is 00:30:05 would definitely make life a lot easier for application developers. Absolutely. So just to be specific, this provides an API to moving memory within one system. It's not across systems. That's correct. I mean, CXL and stuff like that could start having some sort of shared memory sorts of structures and maybe your SDXI could potentially plug into a CXL memory tier that could be shared between systems.
Starting point is 00:30:39 Yep. Where do you think we're working? Okay. So in that case, if you've got a shared memory structure or device out there on this PCI bus, it's connected to two different systems or four, then you could use SDXI to move data from the shared memory to the local memory or vice versa. Yep. Yeah. This gets to be really interesting as if done correctly in the integration
Starting point is 00:31:02 is, is done correctly. You know, we have, you know, think of as a data center as a node and the interface to, you know, the app in application developers doesn't really care if the interconnects between the components of that data center as a node is CXL or any other protocol. All they know is that they're interfacing with, you know, let's again, pick on OpenShift. And then my friends at Red Hat are worrying about the SDXI interface and how that leverages the underlying hardware,
Starting point is 00:31:41 which may be CXL based or some other network protocol. Yep. And so, you know so it's worth noting that VMware and a couple of others are part of the SCXI working on the protocol. But it's a memory protocol. It can't be an IO protocol. It can't be NFS or SMB or... No, no, no, no. You're right. It's a memory protocol. As a matter of fact, What the hell is the Storage Networking Industry Association doing in memory, Dr. J? Well, memory and storage have been on a collision course for a very long time. We talked about Optane earlier, right? And there are a lot of initiatives going on in both the memory world as well as a storage world that get into that fine layer of granularity. So we talked about the CXL just a moment ago, right? So the impact of computational storage put on as a node on a CXL fabric, where you're going to have a memory pool that's shared by multiple computational storage processors, for example,
Starting point is 00:32:39 that's a collision between storage and memory. You can't develop each of those different things completely in isolation. It's really silly to try to think. Yeah, and I love there's an organization that's trying to tackle the complexity of abstraction. You know, I've talked about this a lot, that we're building abstraction upon abstraction and upon abstraction.
Starting point is 00:33:01 And these abstractions are moving way faster than the actual standards of the underlying hardware and the capability. And it feels like we're building castles on sand. Yeah, exactly. And we were talking earlier, making a slight joke about not knowing how things work anymore. I do think that's actually kind of a risk. And one of the things that we're trying to do at SNEA is help with the education of this because it gets pretty convoluted pretty quickly if you're not, you know, living this day in and day out. And that's what the Storage Developer Conference is really supposed to be able to do.
Starting point is 00:33:42 But, you know, we're now getting into the area that I find to be, personally, I find very fascinating, right? Where we are now, you know, for years, I've been talking about the, you know, the accordion, the expansion and the contraction of technology and how it kind of collapses upon itself and gets pulled out like a Holberman sphere and readjusted and brought back together again. You know, what CXL allows us to do is, you know, what the promise of CXL allows us to do, I should say, is it allows us to think about, you know, the relationship between each of these different devices in not so strict a fashion. For years, it was compute, network, and storage, right? And compute and memory were all part of the same thing. Well, network's got memory, storage has memory. And then we started talking about networks
Starting point is 00:34:33 inside of the compute, right? We've got, you know, AMD's got the Infinity Fabric, NVIDIA's got the NVLink, and so on and so forth. And then you've got, you know, the compute, in-network compute that's going on there, which has its own memory functions. And you have the rich and robust virtualized storage environments that we've gotten for decades now that has been proved over time. It was only a matter of time before someone said, you know what, why are we doing all these things in isolation? Why can't I prevent the
Starting point is 00:35:02 IO from happening altogether if it doesn't need to go there? And why can't I do more processing on the storage in situ? You know, why can't I do that? Well, we can. Okay, well, how do we get them to know that this stuff is happening over there so that I don't have to do it? Oh, well, that means we have new communication, you know, protocols that we have to do. What if I don't have to pull it out of memory altogether? What if I just process it right there in the memory, as opposed to having to kind of store and then forward and buffer and all these kinds of things. So all this work is being done in NVM Express, it's being done in SNEA, it's being done in CXL. And the same people are saying, well, we know that these things have to be done here, but they're not done in isolation. So the work on NVMe over CXL, well, what does that mean? Well, at first glance, it means block storage over CXL. No, it's not what
Starting point is 00:35:51 it means because NVMe is more than just block storage. NVMe has controller memory buffers. It has some host memory buffers. It has being able to process the NVMe commands on the local drives themselves. Well, that already exists. It's existed for years. Well, what if I were to do more processing than just NVMe commands? And so on and so on and so on. If you've got a computational storage drive, for example, and that's a very limited, that's a memory bound and compute bound function that exists on a drive. What if I were to pool storage, pool memory across a CXL fabric so that that small drive, which has that bound resources, can now access additional resources and memory?
Starting point is 00:36:32 How do I access that? Well, I have to be able to call those commands using NVMe. Are you saying compute computational storage would have a shared memory? Yeah. Yeah. I'm thinking, you know, if you look at it, I look at my, I have an HPE 6000 with dual controllers. It has a couple of AMD Epic processors in it.
Starting point is 00:36:56 And I'm thinking, I'm looking at this thing and I'm looking at my external servers to it. And I'm thinking, why am I sending IO requests to this extremely capable pod to come back to compute just to run on processors that's actually slower than what I actually have in my storage array? So it gets to the point, and this becomes a question of orchestration of networks and IO, et cetera. How do we build intelligent enough networks where I can orchestrate where the compute runs?
Starting point is 00:37:34 Why can't I have just have a couple of GPUs in my storage array? And if I'm not using it, if the process doesn't need to use a ton of CPU, but it needs a ton of GPU, why send that stuff back out to my distributed compute and now I have to worry about how to orchestrate that IO and the application logic? Why can't I just run it
Starting point is 00:37:57 closest to where the data is at? Because data has gravity. It is a fascinating computational compute for me or storage is a fascinating area. He's not just talking computational storage. He's talking computational memory. Well, and again, this is why it's NIST NIA, because it's the same problem, in my opinion. How do you get the I.O. to the compute as fast as possible or as efficient as possible at the
Starting point is 00:38:27 right IO profile? Data. We're talking data. Well, and that's where, you know, that goes back to your original question, Ray, which is, you know, SNIA is about data. You know, we're about protecting data, moving data, storing data. It's about data. And we're not so arrogant as to think that we have all the answers. On the contrary, we know that there are many different facets that have to be addressed. So we're focusing on working with these other organizations to ensure that the industry itself can solve these particular problems without having to, you know, reinvent the wheel every single time. Because quite frankly, you know, the CXL group is now dealing with concepts that storage has addressed for decades, oversubscription, fan-in ratios, those kinds of things. And working together allows us to learn from each other's past experiences,
Starting point is 00:39:21 because, you know, that's what's going to allow us to put the right tools for the job in the right place. I would say the other thing that, you know, and I was thinking about this as long as we started talking about this, there's security implications of SDXI are pretty intensive. I mean, you have, you know, these guys out there snooping cloud memory, you know, to try to find passwords and stuff like that. Having something like SDXI in this space, is that going to increase the security
Starting point is 00:39:51 exposure? I guess that's the real question. That's a really good point. So security in SDXI was built in from the ground up. I didn't cover it here because it's one of those things... Actually, hang on a second. Things are starting to get a little noisy. Let me start over. I'm really glad you brought that up because we didn't talk about it because it gets a little bit more involved, but security was built into SDXI from the ground up. So all
Starting point is 00:40:21 of these things about data movement have to go through a pre-process of establishing privileges for being able to handle this. And then there are checks and balances along the way to ensure that that security hasn't been breached. But it goes into some detail that's a little bit more involved that I feel comfortable talking about. And it also would take a little bit more extra time. But it is built in from the word go. So I have to assume stuff like encryption even before the data is transferred, et cetera, et cetera. And the authentication of the even ability to pick up data and put it somewhere is a consideration. Yeah. So it's a multi-stage process, right? So you've got the setup of the DMA movement, and then you've got the actual movement itself.
Starting point is 00:41:19 As it turns out, when you have the privileges set up in advance, that's before any data gets moved at all. And then when you have the data movement, one of the things that could happen in the process is the data could be mutated. It could be encrypted along the way. It could be decompressed or compressed along the way. There's a whole other element of STXI
Starting point is 00:41:38 that we just haven't had a chance to get into. I just gave one example about bypassing all those layers of abstraction, but STXI is a very robust data mover with security implications taken into consideration. So, Ray, this is just all to go back to my premise. All
Starting point is 00:41:55 storage is just a bunch of networking. No, it's not. Networking can drop packets. Storage cannot drop bytes. I'm going to let you guys fight that one out. There's nothing to fight. He knows as well as I do.
Starting point is 00:42:13 You can't, you know, well, that's a different discussion. I just want to say for the record, I coded a DMA interrupt handling program back in the 70s, 80s maybe. It took approximately 127 microseconds to do the interrupt. And the DMA interrupt occurred every 128 microseconds. So doing DMA in software is tough. Yeah, it is. It's not easy. Time constraints are very severe here. Yeah. And as it turns out, it's not a perfect tool for everything, right? Nothing is. I don't want to give the wrong impression that we have suddenly come up with a panacea for data movement, right? Because every time you do something in software,
Starting point is 00:42:57 you have increased the overhead all the time, right, for every virtual bit, you have to have a physical bit. And in software, a software-based approach is going to be useful for certain things like portability, chaining functionality, that kind of things that may wind up, you know, be more advantageous in software if you've got large transfers and then small transfers, small data movements, small DMA might be much more appropriate for hardware-based engines. So don't get me wrong. I mean, we're not trying to say that this is a solution that's going to do away with all hardware-based solutions. That's just silly. It's like the tape is dead argument. Anybody who says that doesn't know anything about tape. And this goes back to kind of my serious argument
Starting point is 00:43:48 that the higher level abstractions are actually moving faster than even something like this infrastructure focused protocol and data mover. I have friends who have just been hired on to AWS to continue to add features to S3. So, you know, we're dealing,
Starting point is 00:44:10 and then the industry as a whole has to catch up to S3 as a capability across these new capabilities across their products as well. So, you know, it's a, it's, it's an exciting area that, that,
Starting point is 00:44:24 that just continues to, to move faster than any of us can individually follow. Yeah. And I think that's one of the reasons why SNEA is so, you know, so valuable, you know, because of the fact that it's a place where you can find those people who are passionate about the stuff that you may not be, but you need. So I am nowhere near as impassioned about the DNA data storage from a professional perspective. Intellectually, I am. I'm curious. I'm a curious person. But when you talk to the people in the DNA Data Storage Alliance, for example, wow, they're really into it. They're really, really into it. It's the same way for green, right? Energy efficiency.
Starting point is 00:45:11 It's more of an intellectual academic pursuit for me. But the people who are really involved in it, and thank God they're there. Because if I need to ask the question, I can go ask a question. That's what having this kind of collegiality is so critical and why the Storage Developer Conference was so important because people who are really, truly, honestly curious can go talk to the people directly who are passionate and understand what's going on.
Starting point is 00:45:42 And the level of conversation at the conference is just phenomenal. You can go on YouTube and you can see the SNEA videos. It's the actual channel, right? SNEA video is the channel on YouTube. And you can see the specific presentations that are done and give you a taste of it. But the birds of a feather conversations, the hallway conversations, the being able to talk
Starting point is 00:46:06 to the people who write the code you know uh that's that's intense man i mean you know you you sit down and you talk to the people who are writing the actual code the protocols and they can tell you why these things were done and you know what the what the threats are that you never even thought of in a million years yeah I did not know that that was going to be an issue. Ah, yes, but that it is. Alright, listen, Jens, this has been great. Keith, any last questions for Dr. J? Not that I would actually understand the answers, so we have him on for a specific reason.
Starting point is 00:46:44 Dr. J, is there anything you'd like to say to our listening audience before we close? Actually, yeah, there is. There's one. First of all, I want to thank you both. Our pleasure. Honestly, you know, whenever I do my short takes on the blog, I always, always, always look to your blogs, your presentations, your podcasts. And I know that you've done a phenomenal service.
Starting point is 00:47:10 And just wanted to truly thank you both for all the hard work you put into, you know, opening up this stuff to the audience. We appreciate it. Thank you. Wow. Okay. That's it for now. Thank you very much, Dr. J, for being on our show today. Thank you for having me.
Starting point is 00:47:29 And that's it. That's it for now. Bye, Dr. J. Bye, Keith. And bye, Ray. Until next time. Next time, we will talk to the system storage technology person. Any questions you want us to ask, please let us know.
Starting point is 00:47:44 And if you enjoy our podcast ask, please let us know. And if you enjoy our podcast, tell your friends about it. Please review us on Apple Podcasts, Google Play, and Spotify, as this will help get the word out. Thank you.
