Storage Developer Conference - #135: SmartNICs and SmartSSDs, the Future of Smart Acceleration

Episode Date: November 4, 2020

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, Episode 135. Hello, this is Scott Schweitzer, and welcome to the Storage Developer Conference. Today I'm going to talk about SmartNICs and SmartSSDs and the future of smart acceleration. This is the abstract of the talk. I'm not going to dwell on this.
Starting point is 00:00:55 You all can read it on your own time. So let's just jump right on in. So today we're going to do some quick framing of the discussion, go over what SmartNICs are, go over SmartSSDs and computational storage, and then talk about the rise of accelerators. So accelerators aren't new. Mainframes have had them since the 1950s. These were expensive tube-based systems, came out in '57, '58, real beasts of a mainframe weighing in at over a ton. But the interesting thing about this particular model, the IBM 709, is that it supported overlapped IO, and that allowed them to bring in intelligent controllers in the form of the IBM 7607.
Starting point is 00:01:38 And that was probably one of the first examples of an accelerator, an extra processor designed to offload the main host CPU from doing stuff. Coincidentally, this was actually the first system to support Fortran, for those out there that are old-time programmers. On the server side of things, this is something we want to keep an eye on: CPUs have scaled out extremely well. You know, we're up to 128 cores per CPU.
Starting point is 00:02:08 We're probably going to go above that sometime in the near future. But we haven't scaled up much above 4.5 gigahertz. We do overclock, and there are examples of overclocking up to 7 gigahertz on some platforms. But for the most part, these aren't, you know, production clock frequencies. I mean, we're kind of stuck in the three to four and a half gigahertz rut. And so scale-out is really the only thing that we have available to us.
Starting point is 00:02:32 About 21 years ago this month, NVIDIA introduced the GeForce 256 chip, shown on the bottom right there, in October 1999. And for the past two decades, GPUs have kind of proved that there's a thirst for accelerators in the market. They've pretty much dominated in the high-performance computing space, and they're starting to bleed into other markets as well, crypto and others. You know, for the past few years, everything has kind of become software-defined. It's all about software-defined composability, orchestration. And the desire is to build larger, more generic systems so that these can then be orchestrated to leverage the entire infrastructure to provide better overall solution performance. One last slide in this framing space, around speed, is that there are three components
Starting point is 00:03:31 to speed, and that's, you know, essentially the CPUs, the disks, and networking. And there is memory too, and this, I think, kind of tracks along with CPUs, so we'll talk about this as well with CPUs. So CPU densities have soared. You know, it's an undisputed fact that Moore's Law has gone extremely well for us. You can see the chart on the bottom right. The y-axis is logarithmic; that's why it's kind of a linear slope to that, and that represents densities of transistors for different processor architectures. And if you have the slide deck, you can expand that out. But what this essentially means is that we're well
Starting point is 00:04:10 into the tens of billions of transistors on a single die, which has allowed us to do some amazing things. Disks, in the meantime, have gone from spinning media, which we've had since the '50s, to solid state drives. And in doing so, we've jumped three orders of magnitude in performance. Spinning disks were pretty much at the end of their evolutionary curve. We've played every trick in the book that we can to get the most performance we can out of them. But when something is spinning, you can only spin things so fast before you run into issues. And that affects access times.
Starting point is 00:04:56 Whereas with solid state drives, you know, you can have 64,000 I/O queues, each queue with 64,000 entries. And you can have performance that just blows away spinning disks. Three orders of magnitude for access time is kind of the convention in the space. Finally, we have networking. Networking is probably the least discussed for the most part, but probably one of the more interesting ones. We've kind of struggled along with gigabit Ethernet for quite a long time. It had been around before the turn of the century, but for the most part, it still dominated the data center until about 2010, when most folks were moving over to 10 gig Ethernet. Since then, we've had 25 gig Ethernet, and that's pretty much picked up on adoption around 2018 or so. And then with 2020, 2021, we're actually starting to see
Starting point is 00:05:40 pickup on 100 gig Ethernet. So that's growing, or starting to grow, exponentially as well. And these things are all factors that are pushing hard on this space of performance. Now, one last framing slide, and that is the corporate players in the space. If we look across the space, there are the three big CPU guys, and a year ago I wouldn't have thought of NVIDIA as a CPU guy. They do GPUs and all, but general purpose CPUs weren't really their game. They're $300 billion in market cap compared to Intel's $210 billion and AMD's $90 billion.
Starting point is 00:06:19 But the interesting thing is, over the past weekend, they announced their intent to acquire ARM, which puts them right in the fold of a CPU player, from a general purpose CPU as opposed to a specialty CPU. On the electronics side of things, we have the other players that kind of sweep along the bottom. Samsung is kind of the big dog in that space, but they make a wide variety of different products that go beyond chips: refrigerators, appliances, things like that. You've got Broadcom and Qualcomm in the $127 billion range, along with TI, and then you also have Marvell and Xilinx.
Starting point is 00:06:59 Okay, so SmartNICs. What's the big deal with SmartNICs? So SmartNICs are to networking what GPUs are to high-performance computing. SmartNICs arrived much later on the scene, and that's allowed them two decades of time for chip integration to increase geometrically. Now that we're down to 7 nanometers, we can put an awful lot of transistors on a chip. We have things like silicon interposers that allow us to place multiple chips within the same package; some folks call them chiplets. So that allows us to do very tight integration of things.
Starting point is 00:07:39 And then there's also the software angle on all this. The software for integrating coprocessors or accelerators has gotten much better. So arriving later to market gives them access to a wider portion of the market: SmartNICs are going to apply across the entire data center. Another big advantage of SmartNICs is that they introduce the concept of a separate computational domain to the server that didn't really previously exist. If you look at a server, it generally has something like an iDRAC card, you know, Dell's Remote Access Controller, or some sort of autonomous little SoC for doing system management that sits on its own dedicated gigabit Ethernet port.
Starting point is 00:08:24 But for the most part, the server is one big computational domain. And that becomes a problem when you get around to security. And so if you can actually use the SmartNIC itself as a computational domain, and from there launch applications into the host CPU complex, you can keep those domains separate, and if someone were to violate one, they wouldn't naturally be violating the other. And also, you could use it for orchestration: you could actually have the SmartNIC be the controller, the cluster controller, to dispatch pods or containers on the x86 complex.
Starting point is 00:09:13 Also, the SmartNICs are now starting to rival the host CPUs in power. Granted, they're not up to a 128-core CPU yet. That's probably still a little ways off, but we're getting there fast as we're making the jump to 7 nanometer. You're going to start to see some very impressive core counts, impressive amounts of programmable logic, AI engines, and other computational resources available. Also, SmartNICs benefit from advances in PCI Express
Starting point is 00:09:49 and the protocols attached to PCI Express, and we'll talk about that shortly. So who's pushing for all this stuff? Obviously, there has to be a consumer for these products. Otherwise, there wouldn't be a market and nobody would be interested in buying them. Well, the big two are financials and hyperscalers. Financials are your big banks, your high-frequency traders, and your exchanges.
Starting point is 00:10:19 And why these guys are pushing for SmartNICs is fairly obvious. What they're looking for out of it is to dollarize time, right? The faster they can come to a decision on something and trade on it, the more money potentially they can make. So time is money to them. If you look at electronic trading today, if you're trading using a normal accelerated NIC and you're doing everything through the host CPU complex, you're talking in the microsecond range, several microseconds,
Starting point is 00:10:55 basically a microsecond and a half round trip to get the data in and out, and then a couple of microseconds to run through all the host code to handle the feed normalization and, you know, trade book and all that, and decide what the trade is going to be, and then execute it back into the exchange. If you do this all on a smart NIC, or an FPGA-based smart NIC,
Starting point is 00:11:18 you can be executing trades in under a hundred nanoseconds. We've seen examples of that. Worst case, you could be doing things like feed handlers or risk checking and all that kind of stuff within the NIC. So the financials have really been pushing for SmartNICs for the past half dozen years or so. Also, there are the hyperscalers. These are the big data center managers, the Googles, the Facebooks, the LinkedIns, the Microsofts. So the hyperscalers
Starting point is 00:11:47 are pushing for this. And the reason why they're pushing for it is the data center tax concept. When you are a hyperscaler, you're making extensive use of virtualization and containerization. And when you use those technologies, you're also virtualizing the network. And in virtualizing and managing that virtualized network, you sometimes consume up to a third of the host CPU cycles. So by moving that workload off to a SmartNIC, you can reclaim those host CPU cycles and apply them back to customer workloads. Finally, there are streamers. In this work-from-home climate that we all live in today, we're making extensive use of streaming video, either for Zoom chats like this one
Starting point is 00:12:36 or for things like Hulu and Netflix and Amazon Prime for watching videos. And all of that streaming takes an enormous amount of computational workload because what happens is a video is delivered in one format. It typically needs to be decoded back to a raw format and then transcoded into half a dozen different adaptive bit rate formats so that the video can be watched on a variety of different displays. You'll notice, you know, if you're a Netflix user, for example, they have 4K content, right? And then the default content is 1K. And then what you don't know is they also have
Starting point is 00:13:17 content at 720, 480, and I think they even support 320 still. So that's all adaptive bitrate stuff and transcoding. And, you know, programmable logic can do transcoding at 20 times the speed of normal CPUs, and do it at roughly the same power budget for the same workload. So, you know, 20 times the throughput, but the same amount of power as it would take on a host CPU. So let's look at the architecture of SmartNICs, right? There are three planes used within the SmartNIC architecture.
Starting point is 00:13:53 There is the management plane, and that's what normal humans like you and I would use, your sysadmins. This is where you're going to execute command line instructions or applications that use a RESTful API into the NIC, even some older protocols like SNMP, Simple Network Management Protocol, for managing the control plane, right?
Starting point is 00:14:20 And the control plane is what tells the NIC what to do, essentially. In the control plane, it's more program-to-program exchange of data. You know, so this is OSPF and BGP, OpenFlow, SDN, that kind of stuff. And so the control plane is what twiddles the bits for the data plane, and the data plane is where the real action occurs: data is transformed, or packets are dropped if there's a security threat. And so, you know, packets come in on the data plane and they exit on the data plane, and in between they go through the PCIe bus to the host complex and come back through the PCIe bus into the NIC. Here we are executing P4 instructions, potentially iptables, OVS, DPDK-type stuff, BPF, routing tables; all that kind of stuff can be offloaded into the data plane.
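To make that data-plane offload idea a little more concrete, here's a minimal sketch in the XDP style of BPF program mentioned above — drop traffic from one blocked source, pass everything else. This is my own illustration, not from the talk, and the blocked address is a made-up example (TEST-NET-1); a real deployment would attach this to a NIC that supports XDP or BPF offload.

```c
/* Minimal XDP sketch of a data-plane rule: drop packets from one
 * (hypothetical) blocked IPv4 source, pass everything else. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_endian.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_filter(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;                 /* truncated frame: let the host decide */
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    if (ip->saddr == bpf_htonl(0xC0000201))  /* 192.0.2.1, example address */
        return XDP_DROP;                     /* dropped in the data plane   */

    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";
```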
Starting point is 00:15:13 Now, if we look at the building blocks that make up a SmartNIC, there are essentially three categories of building blocks. We've got cores, CPU cores like ARM and MIPS64. We've got hard logic, stuff that's instantiated in actual circuits within the ASIC. We've got programmable logic like FPGAs. We've got memory. And we've also got interconnect logic, which kind of falls into the hard logic category. Also, we have protocols, right?
Starting point is 00:15:45 There's PCI Express. And then with PCI Express come some protocols that sit on top of that: CXL and CCIX. And then there are the networking protocols, right? Ethernet, InfiniBand, UDP, TCP. And then higher-level application protocols like HTTP/3, which includes QUIC, and you can do TLS offload
Starting point is 00:16:10 or kernel TLS offload, and we'll talk about that in a bit as well. And then there's the ecosystem stuff, right? This is where the rubber hits the road and where developers and third parties get involved. What languages are you supporting on the SmartNICs? You know, what type of software development kits are you providing? For apps that are developed, do you have an app store? How do you make those apps available to customers? Are customers allowed to write their own apps? You know, so there's a whole ecosystem around SmartNICs,
Starting point is 00:16:35 or will be an ecosystem around SmartNICs. So let's jump in and actually take a look at the six most popular SmartNICs that are available today. We'll start with Broadcom's Stingray. Broadcom is kind of the old standard in the market. They've been doing NICs forever.
Starting point is 00:17:10 And what they have here is a single-chip implementation, probably one of the leanest and meanest out there. It's got its own hardware flow classifier that's probably third or fourth generation by now. It was used in prior products and has evolved fairly well over time. They have eight ARM cores for doing control plane stuff for the most part. And they've got some IP accelerator blocks in there for encrypt/decrypt, PHY management, and that kind of stuff, Ethernet, PCI Express.
Starting point is 00:17:41 And then they've got two banks of DDR that can be used by the ARM cores for control plane stuff. And you can see all of that in the chip diagram to the right here. But for the most part, the whole thing happens within the BCM58802H. So that's the Stingray by Broadcom. Newest kid on the block, and newest as in brand new, can still smell the new NIC smell on it, is Fungible. Fungible came out at Hot Chips last month, middle of last month.
Starting point is 00:18:18 They came out of stealth mode. Essentially what Fungible has is a smart NIC that is based on MIPS64 cores. They're the only ones that are using MIPS64 at this point. And they arrange multiple cores into data clusters, and they've got eight data clusters. They've got a high-speed on-chip network that ties all that together. They've got scheduler and control logic, as well as DDR4 and HBM memory
Starting point is 00:18:46 that's available to all of this. And then they've got hard logic for encrypt-decrypt in the network units, as well as the PHY logic. And then they've got PCI Express controllers, multiple PCI Express controllers. These guys are the new kid on the block.
Starting point is 00:19:07 Like I said, they just came out of stealth mode. It'll be interesting to see how they move forward over the next year or two. They're doing some interesting stuff with UDP acceleration that requires their technology on both ends. So it'll be interesting to see how customers view that and how adoption goes. Another standard in the industry is Intel. Intel's obviously been making NICs for quite some time. And their Vista Creek, or their N3000 product, is their current SmartNIC generation. It is by far the most complex SmartNIC that's out there today.
Starting point is 00:19:47 It's a seven-chip product. So they've got six ASICs and an FPGA, all on the same board. Fortunately for them, five of the ASICs are theirs. The one that's not theirs is the PEX8747 PCIe switch chip.
Starting point is 00:20:10 And then the FPGA is theirs; it came through the Altera acquisition half a dozen years or so ago. And then they've got two banks of DDR4. So,
Starting point is 00:20:21 like Xilinx, they're one of the only players with FPGA logic at the heart of their SmartNIC approach. And so what they do is they bring packets in from the QSFP28s through retimers into the FPGA, and then they steer them at the FPGA, where they can do any sort of packet analysis, or transforms on the packets, or actions on the packets. And then they could route the packets over to an Intel XL710 for additional NIC processing,
Starting point is 00:20:53 and the Intel XL710 then is connected to the PEX switch, and that goes out to the PCIe bus. The FPGA could also just handle the packets directly and put them on the PCI Express bus. So there is a tremendous amount of flexibility in the implementation that Intel has chosen here with the FPGA. This is a different approach. FPGAs afford a very wide and deep pipeline architecture for executing on packets. So we'll move on. Next up is Bluefield 2 by NVIDIA, formerly Mellanox.
Starting point is 00:21:45 Bluefield 2 is an evolution of Bluefield; Bluefield was obviously the predecessor to Bluefield 2. Before that, it was called Tilera. It came in through an acquisition that Mellanox had done years ago; I think it was EZchip at the time. But what NVIDIA has done here is they've taken the ConnectX-6 logic, which is extremely well proven in the industry, and replaced the Tilera mPIPE technology on the network side of things.
Starting point is 00:22:17 Brought that in as the packet classifier, and then they can forward that back out to PCI Express. And along the way, they have eight ARM cores for doing control plane management of the packets and some DDR4 memory for those control plane cores.
Starting point is 00:22:55 They also have hard logic for encrypt-decrypt, including TLS, and they recently announced support for kernel TLS, so they can offload the TLS encode/decode into the NIC through kernel TLS. So these guys are going to be the ones to watch from an install-base point of view. Mellanox has done an extremely good job of pushing Bluefield 2 into the data center, and there are a number of wins in that space, and it will be interesting to see how they move forward.
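Since kernel TLS offload keeps coming up, here's a hedged sketch of how an application opts into kTLS transmit on Linux, using the upstream TLS ULP interface. The key material would really come from a TLS handshake done in user space and is omitted here; SOL_TLS may need defining on older headers.

```c
/* Sketch: switch an established TCP socket to kernel TLS for transmit.
 * Once this succeeds, plain send()/write() data is encrypted by the
 * kernel, and a capable NIC can pull the crypto into hardware. */
#include <linux/tls.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <string.h>

#ifndef SOL_TLS
#define SOL_TLS 282
#endif

int enable_ktls_tx(int sock)
{
    /* Attach the TLS upper-layer protocol to the socket. */
    if (setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls")) < 0)
        return -1;

    struct tls12_crypto_info_aes_gcm_128 ci;
    memset(&ci, 0, sizeof(ci));
    ci.info.version = TLS_1_2_VERSION;
    ci.info.cipher_type = TLS_CIPHER_AES_GCM_128;
    /* ci.key, ci.iv, ci.salt, ci.rec_seq: filled in from the handshake,
     * omitted since this is only a sketch. */

    return setsockopt(sock, SOL_TLS, TLS_TX, &ci, sizeof(ci));
}
```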
Starting point is 00:23:33 Okay, the second new kid on the block, and new in the sense that they've been around only for about a year or so now, is Pensando and their Naples chip. Pensando is using P4 as the programming language for their data plane processing engine. And P4 is the trendy hot new language for managing network packets. A lot of people are moving towards it, if not necessarily everybody. What Pensando has done here is they've got a P4 processing engine. They've got some ARM cores on the side for control plane stuff.
Starting point is 00:24:07 They've got obviously packet buffers and memory. And it'll be interesting to see how this takes off. These are a bunch of ex-Cisco guys. They've still got strong ties back to Cisco. And this may, who knows, someday become, you know, Cisco's SmartNIC of choice. But these guys are going to be interesting to watch over the next year or so. And finally, we have Xilinx.
Starting point is 00:24:32 With Xilinx, we've got the U25 SmartNIC. And here what Xilinx chose to do is similar in some ways to Intel's N3000. They've got an FPGA chip at the heart of their architecture, but it's a single chip, and they've got a single-chip Ethernet controller. It's Solarflare's previous X2 chip, so that's an extremely well-proven chip in the marketplace,
Starting point is 00:25:03 and that's coupled up with a Xilinx Zynq chip that is extremely well-proven. The Zynq chip has four ARM cores on it. It's a complete system on a chip. Those ARM cores are for control plane activity. They have access to two banks of memory, 2 gig and 4 gig; one of them is available to the FPGA, and the other one is available to the ARM cores. And what that allows them to do is, much in the same way as the Intel N3000,
Starting point is 00:25:36 packets can be transformed directly off the SFP28 interfaces and sent down to the PCIe bus, to the host CPU complex, or they can be shunted over to the X2 controller, let the X2 controller process them, and then put them on the PCI Express bus, and then it obviously works the other way as well. This approach is extremely flexible because it allows you to put logic into the FPGA to manipulate packets and then shoot them over to the PCI Express bus to either the host CPU complex or potentially another adapter on the PCI Express bus,
Starting point is 00:26:12 and we'll talk about that in a couple of minutes. Okay, some of the evolving issues in the SmartNIC space. All this stuff is very important, and it'll be interesting to see how it plays out over the next couple of years. The first one was brought up on the IEEE Hot Interconnects panel that I was fortunate enough to host a couple of weeks ago, and that was the separate computational domain, right? The smart NIC can be viewed as a separate computer within the server itself,
Starting point is 00:26:47 much like, as I mentioned earlier, the iDRAC, the remote access and control little processor that runs on the server motherboard that allows you to power the server up and down and do some basic BIOS configuration and see the console, that kind of stuff. You could actually put that, you know, logic into the SmartNIC such that you could use the SmartNIC as a separate computational domain to launch other applications within the host, right? And if the SmartNIC were to get compromised, the host would not be compromised, right? Also, you could use the smart NIC for orchestration and have controller requests come into the smart NIC and dispatch containers or pods directly into the host CPU complex from the smart NIC, right?
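As a hedged sketch of that dispatch-from-the-NIC idea, here's roughly what it could look like for the SmartNIC's ARM complex to ask a Kubernetes API server to place a pod on the host. The apiserver address, token, and manifest are hypothetical stand-ins, and libcurl is just one convenient way to make the call; the pod-creation endpoint itself is the standard Kubernetes one.

```c
/* Sketch: a SmartNIC-resident controller dispatching a pod onto the
 * host CPU complex via the Kubernetes pod-creation endpoint. */
#include <curl/curl.h>

int dispatch_pod(const char *manifest_json)
{
    CURL *curl = curl_easy_init();
    if (!curl)
        return -1;

    struct curl_slist *hdrs = NULL;
    hdrs = curl_slist_append(hdrs, "Content-Type: application/json");
    hdrs = curl_slist_append(hdrs, "Authorization: Bearer <token>");  /* placeholder */

    /* Hypothetical apiserver address reachable from the NIC's domain. */
    curl_easy_setopt(curl, CURLOPT_URL,
                     "https://10.0.0.1:6443/api/v1/namespaces/default/pods");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, hdrs);
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, manifest_json);

    CURLcode rc = curl_easy_perform(curl);
    curl_slist_free_all(hdrs);
    curl_easy_cleanup(curl);
    return rc == CURLE_OK ? 0 : -1;
}
```

The point of keeping this on the NIC is that the host never holds the controller's credentials, so compromising the host doesn't compromise the orchestrator.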
Starting point is 00:27:41 So those are interesting concepts that we're going to see come up in the next couple of years. P4 and PNA: PNA stands for portable NIC architecture. P4 is obviously the programming language of choice right now for networking. And so P4 is being adopted, as I mentioned before, by Pensando, and Xilinx is using it extensively in its new line of smart NICs that are coming out.
Starting point is 00:28:13 And then this whole portable NIC architecture is a way to define one or more NICs, soft NICs, within a system that could do different tasks. So you can almost carve up a physical NIC into multiple soft NICs and then assign those soft NICs to different applications and steer traffic through that methodology. Cores and planes, obviously, you know, some of the vendors are scaling up the number of cores. Broadcom and NVIDIA are both doubling their core
Starting point is 00:28:47 count as they make the jump to seven nanometer. You know, it's questionable whether they're going to start applying some of those cores to the data plane. Everybody knows it's not the right thing to do, but on the other hand, you know, they need computational resources sometimes to manage those packets. So it'll be interesting to see how that turns out down the road. I'll touch on protocols in a minute, but these are ways to tie multiple NICs together. And then there's security and orchestration. Security being, you know, managing the security domain from within the NIC itself,
Starting point is 00:29:23 using the NIC as a secure enclave that is separate and apart from the host complex. You can use the NIC as a hardware key manager if you wish to. And then orchestration, obviously, for launching pods or containers directly from the NIC into the host CPU complex. So let's talk about smart SSDs and computational storage.
Starting point is 00:29:50 FPGAs have been used in storage for years. They've been used in controllers to allow the controllers to do a wide variety of things. You know, early implementations of ASICs are often done in FPGAs. So things like RAID and erasure coding have been done in the past in FPGA-based flash controllers, as they were sometimes called.
Starting point is 00:30:17 This allows us to do a wide variety of things with the storage array, from cache offloading to RAID to data reduction and things like that. The benefit that FPGAs bring to computational storage, and this is something that most people may not be aware of, is it's a different computational model than normal CPU-based processing, as we talked about before with cores on the SmartNICs, right?
Starting point is 00:30:54 With FPGAs, you take a program, and instead of executing the program instruction by instruction, having to go out to memory to fetch each instruction as you crunch through the instructions, and going out to memory to fetch data and all of that,
Starting point is 00:31:24 you actually take that whole instruction flow and boil it down into gate logic, the actual ANDs and ORs and logical operations that would make up that program once it's been boiled down and reduced to its core. And what that does is it gets rid of all those memory accesses and allows you to dramatically pipeline the execution of data, so that, in a sense, the FPGA acts like a black box that represents your program. You basically put your data in and you get your data out, transformed the way you want it to be transformed. And that's why these things are so blindingly fast doing things like transcoding or genomics, things like that. And they have been applied, as I mentioned before, to storage for the last two decades.
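As a rough software illustration of that dataflow model, the sketch below stands in for what the FPGA does in gate logic: the "program" becomes a fixed pipeline of stages that records stream through, rather than instructions fetched from memory. The stage functions are invented placeholders, not any real storage workload.

```c
/* Software stand-in for an FPGA dataflow pipeline: data streams through
 * fixed stages instead of a fetch-decode-execute loop. In real gate
 * logic all stages run concurrently on successive records. */
#include <stddef.h>
#include <stdint.h>

typedef uint64_t record_t;

static record_t stage_decrypt(record_t r)   { return r ^ 0xA5A5A5A5A5A5A5A5ull; }
static record_t stage_transform(record_t r) { return r * 2654435761ull; }
static record_t stage_compress(record_t r)  { return r >> 8; }

void pipeline(const record_t *in, record_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = stage_compress(stage_transform(stage_decrypt(in[i])));
}
```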
Starting point is 00:32:01 So if we look at it from a more storage-specific example, what we have here is the ability for the FPGA to handle a wide range of tasks, storage tasks, and be fully reconfigurable along its product lifecycle while delivering ASIC or hardware-based performance, and do all this in a massively parallel way, right? That's another thing about ASICs: you know, an ASIC is what it is the day it ships.
Starting point is 00:32:35 With an FPGA, it's reconfigurable during the lifetime of the product. And so you can add more functionality over the lifetime of the product just by flashing in new firmware into the FPGA. And you can have multiple parallel pipelines within the FPGA to do different things. In this example, you know, we've added an encryption accelerator, you know, after the computational storage was developed. And then we added a decryption accelerator. And then finally, there's an analytics engine that was added, right? And each one of these takes up additional FPGA space,
Starting point is 00:33:13 And each one of these takes up additional FPGA space, but allows us to pipeline execution and deliver performance at the same speed that you would normally expect to come from hardware. It also allows us to adapt to changing compression algorithms along the way. I mean, today's compression algorithm is not tomorrow's compression algorithm. You know, GZIP was popular one day, and then Brotli, I believe is how it's pronounced, was popular shortly after that, leading up to the Zipline accelerator. So, you know, just like video transcoding, we've had various different video algos along the way.
Starting point is 00:33:52 We've had various different, you know, compression algorithms along the way. And doing this in FPGA logic allows us to update those algorithms as they are improved. So, computational storage drives: this is where the controller becomes an FPGA controller that is actually put into the drive itself, right? And we can put programmable logic within that FPGA to offload different tasks. If we want to take a drive and encode or decode video as it enters or leaves the drive, or if we want to do compression of data as it enters or leaves the drive,
Starting point is 00:34:48 or if we want to change the addressability of the data within the drive, we can do all that through a computational storage device. And we have a couple of great partners in this space, Samsung and ScaleFlux. You can buy computational storage drives from them, and that's kind of where these are going. So these are accelerators, but they're accelerators that are buried into the actual drive hardware itself, whether they look like NVMe drives or classic SSDs. You know, it's buried right in there with it, and it sits alongside the controller.
Starting point is 00:35:23 And then there's also computational storage processors. These are, you know, separate boards that are managing the storage, much like a controller. We've got partners in this space in the form of BittWare and Eideticom that put these products out. This allows us to accelerate writing the data to storage, do some of those transforms we mentioned earlier in storage, and do peer-to-peer transfers between storage.
Starting point is 00:36:04 And then there are computational storage arrays, right? And that's where we put the FPGA in front of the whole array and let the FPGA manage the array and do, you know, RAID and other erasure coding, other tasks within the FPGA. We've got partners in the space in the form of BittWare. Okay, so let's get to the rise of accelerators. So in the accelerator space, we've got a couple things pressing in on us. We've got PCIe Gen 5, PCIe peer-to-peer, and then two new protocols that you may not be aware of, CXL and CCIX,
Starting point is 00:36:40 and we'll talk about those in a second. So, PCIe 5. With every turn of the PCIe crank, we double the data rate, so with PCIe 5 we'll get double the speed bump of PCIe 4. Now we're up to roughly 64 gigabytes per second in each direction for 16-lane Gen 5. We'll have support for PCIe peer-to-peer, which kind of exists in PCIe 4; it showed up with a kernel enhancement that came out last year. And then we've got new protocols, CCIX and CXL, which I'll go into in a second, that are supported on PCIe 5.
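As a back-of-the-envelope check on those Gen 5 numbers, assuming the usual 32 GT/s per lane and 128b/130b line encoding:

```c
/* PCIe Gen 5 x16 throughput, ignoring packet/protocol overhead. */
#include <stdio.h>

int main(void)
{
    double transfers = 32e9;           /* 32 GT/s per lane     */
    double encoding  = 128.0 / 130.0;  /* 128b/130b line code  */
    int    lanes     = 16;

    double gbps = transfers * encoding * lanes / 8.0 / 1e9;
    printf("PCIe Gen5 x16: ~%.0f GB/s per direction\n", gbps);  /* ~63 */
    return 0;
}
```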
Starting point is 00:37:22 PCIe peer-to-peer came out with a Linux kernel update in 2018. This allows two PCI Express devices to talk to each other directly on the PCI Express bus, so accelerators can talk to NVMe storage, for example; that's probably the best use case. It used to be, before the kernel mod, that you kind of needed to control both ends of things.
Starting point is 00:37:51 After the kernel mod, there's been some standardization, and this has allowed you to have an accelerator from one company talk to an NVMe drive from another. But it only handles acceleration between two devices, so it's a good point-to-point solution.
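For the curious, here's a hedged sketch of the kernel-side pci_p2pdma interface that standardization introduced. The function names are the real upstream ones, but the driver plumbing around them is simplified to the point of caricature — error paths, DMA mapping, and the hand-off to the peer's driver are all omitted.

```c
/* Sketch: exposing accelerator memory for peer-to-peer DMA, then
 * carving out a buffer an NVMe drive could write into directly. */
#include <linux/pci.h>
#include <linux/pci-p2pdma.h>
#include <linux/sizes.h>

static int accel_setup_p2p(struct pci_dev *accel, int bar)
{
    /* Register a BAR region as peer-to-peer capable memory (probe time);
     * size 0 means the whole BAR. */
    int rc = pci_p2pdma_add_resource(accel, bar, 0, 0);
    if (rc)
        return rc;

    /* Allocate a chunk of that memory; its bus address can be handed
     * to another device (e.g., an NVMe SSD) as a DMA target. */
    void *buf = pci_alloc_p2pmem(accel, SZ_1M);
    if (!buf)
        return -ENOMEM;

    /* ... hand pci_p2pmem_virt_to_bus(accel, buf) to the peer ... */

    pci_free_p2pmem(accel, buf, SZ_1M);
    return 0;
}
```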
Starting point is 00:38:02 Compute Express Link has been around for the last couple of years, and it's an open consortium of companies. And the objective of this was to come up with a better way for accelerators on the PCI Express bus to work with the host.
Starting point is 00:38:31 CXL is a master-slave model. You know, the host is the master and the adapter is the slave for the most part. It's cache coherent. This allows the accelerator to have access to the CPU's cache and main memory and it allows the host CPU to have access to the accelerator's memory and so what that
Starting point is 00:38:54 does is it enables a high degree of data exchange between the two computational platforms. The problem is, though, it's only a one-to-one type of thing, one accelerator to one host. So it's fine if your solution just needs a GPU card or just needs an FPGA card
Starting point is 00:39:20 or, you know, one device, SmartNIC, whatever, smart storage. But if you want to aggregate multiple devices together, you can't use this technology, right? So along came Cache Coherent Interconnect for Accelerators, CCIX, pronounced "C6." This is a peer-to-peer model. And there are actually a couple of different use cases where this can be utilized. It can be utilized in a master-slave kind of processor-to-accelerator model, as shown in the first diagram on the bottom there.
Starting point is 00:40:00 It could also be processor-to-memory, likewise master-slave. But it can also be processor to accelerator and to memory, right, in kind of a star-like configuration. Or it can be done in a daisy-chain approach, where it can be processor to processor, and also processor to accelerator to accelerator to accelerator. And that's where things begin to get very interesting. Also, it supports a NUMA model. This is non-uniform memory access, which allows you to take all the memory in all these different accelerator devices and drop it into a single virtual addressing space.
Starting point is 00:40:42 And that allows different programs to access different data on different devices, you know, directly through your normal conventional programming routines. So let's talk about accelerators from, you know, kind of a higher-level point of view. We talked about the separate computational domain a couple of times already, but I think that's going to be one of the biggies that we see cropping up over the next couple years: implementations where people are using an accelerator or a smart NIC as the controller in the macro sense, where it's controlling all the resources of the server, not just basic power on, power off of different things, but allocating computational tasks out to different accelerators within the box, or orchestrating applications that utilize multiple accelerators through a uniform or a NUMA memory architecture.
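To give a feel for what that single-address-space model buys you in software, here's a small user-space sketch using libnuma. The idea that an accelerator's memory shows up as NUMA node 2 is purely an assumption for illustration; the point is that once device memory is a node, ordinary allocation and plain loads and stores reach it.

```c
/* Sketch: treating device-exposed memory as just another NUMA node.
 * Build with -lnuma. */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0)
        return 1;

    int accel_node = 2;                  /* hypothetical accelerator node */
    size_t len = 1 << 20;

    char *buf = numa_alloc_onnode(len, accel_node);
    if (!buf)
        return 1;

    memset(buf, 0, len);                 /* ordinary loads and stores --
                                            no explicit DMA calls needed */
    printf("1 MiB resident on node %d\n", accel_node);

    numa_free(buf, len);
    return 0;
}
```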
Starting point is 00:41:46 We're going to see the rise of KTLS, kernel-based TLS security, getting offloaded to these accelerators. TLS is computationally intense, and if we can move that off the host CPU to a dedicated accelerator that's got hard logic for doing TLS encrypt/decrypt, that would dramatically improve the overall performance of any solution. Another biggie that we haven't talked about yet is the Confidential Computing Consortium. And this is an open consortium of folks that are looking at how to improve secure enclaves
Starting point is 00:42:35 within a computing architecture, so that if one enclave is compromised, it doesn't affect the other enclaves on the computing device, right? Today, one application could compromise the host CPU such that it gains access to another application's data and storage and keys and other content. And this is just something that needs to get locked down. And so Intel and AMD both have architectures around this. But what we want to do is be able to extend those architectures to other computing devices like accelerators.
Starting point is 00:43:33 So the accelerators can either share a secure enclave, much like a SCIF, you know, within a building where the whole building is a SCIF, so you've got multiple conference rooms that are part of the same SCIF, right?
Starting point is 00:44:14 you want to be able to run that orchestration in a secure environment that's trusted, and you want to dispatch those applications in the most efficient manner possible. And if you were to do that within the host CPU complex, and the host CPU complex gets compromised, you could potentially compromise other items, other namespaces within that host. So what you want to do is be able to orchestrate those from afar and limit the compromise capability. Finally, there's the Xilinx Manager.
Starting point is 00:44:54 Finally, there's the Xilinx Manager. The Xilinx Manager is a way for dispatching code into FPGAs through a containerized model over the network. And that'll be something interesting to see over the next couple of years. So let's look at an example of SmartNICs and where we're going with this kind of technology and where we can expect CCIX and CXL to be applied.
Starting point is 00:45:29 This is actually a CCIX example. So what we have here is a normal server that's got two sockets. And we have an application that theoretically runs almost entirely within the accelerators, right? We have a video feed coming in from the network somewhere, from a smart camera somewhere, that's been encoded. That video feed comes into the SmartNIC. The SmartNIC has in it a video decode for that camera. It decodes the video in real time, and it puts the frames of the video into a
Starting point is 00:46:06 frame buffer such that another accelerator card on the PCI Express bus can access those frames. Now, the frames could be stored in HBM on the SmartNIC. They could be stored in, you know, storage on another auxiliary card. It really doesn't matter that much. But anyway, the SmartNIC is doing the networking component and the video decode and putting the frames in a buffer. Then there's another accelerator running an AI or ML application that, say, is looking for a face or a license plate or, you know, some sort of theft at a cash register or whatever the case may be.
Starting point is 00:46:52 It's looking through that frame buffer for specific things, right? Once it's done and it's looked through that frame, it marks the frame as consumed or used, right? And there would be another accelerator application that would then pick that frame up and begin the encode process, and encode that frame such that it can be stored into a smart SSD and also be viewable in, you know, a number of different bit rates.
Starting point is 00:47:22 So that way, you know, a security guard or whoever could be watching it, you know, on a laptop or his phone or a computer or a series of monitors, right? So you're going to want the data available in multiple bit rates. So what you've got here is essentially three different accelerators that are operating on essentially the same data and then storing it off to a smart SSD or other storage for future use. And the benefit of that is that none of this data is getting bopped around, you know, back and forth through the host CPU, right? It would exist in the NUMA architecture on one of the devices,
Starting point is 00:48:09 whichever device has, you know, adequate space to handle it, probably the SmartNIC. And then it's getting moved around as it's needed, between the PCIe device and the accelerator that is working on it, through the NUMA memory architecture. And then in the end, it gets stored off to the smart SSD. So each one of those transactions should be as efficient as possible, and each device will only be doing, you know, what it needs to do. So you're minimizing the amount of data movement over the PCI Express bus. You're really not getting the host CPU involved
Starting point is 00:48:51 in the actual process. And the net result is that you'll be able to process much more video, much faster, and store it out more quickly than you can do today, with the host CPU involved in everything, right? So this will allow you to meet real-time requirements for things like, you know, managing hundreds of cameras at the Olympics or within a big, vast complex, and doing all this computational work in real time.
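To sketch how those three accelerators might hand frames off through shared memory, here's one hypothetical shape for that frame buffer. The struct, states, and ring size are all invented for illustration; the real hand-off mechanics would depend on the CCIX/NUMA plumbing underneath.

```c
/* Sketch: a shared ring of frames handed between decode, inference,
 * and encode accelerators, using the state field as the "consumed"
 * marker described above. */
#include <stdatomic.h>
#include <stdint.h>

enum frame_state { FRAME_FREE, FRAME_DECODED, FRAME_ANALYZED, FRAME_ENCODED };

struct frame {
    _Atomic int state;        /* hand-off marker between accelerators     */
    uint32_t width, height;
    uint8_t *pixels;          /* lives in HBM on the SmartNIC or another
                                 card; NUMA makes it directly addressable */
};

#define RING_SIZE 64
static struct frame ring[RING_SIZE];

/* Inference side: claim decoded frames, scan them, mark them consumed. */
void analyze_frames(void)
{
    for (int i = 0; i < RING_SIZE; i++) {
        int expected = FRAME_DECODED;
        if (atomic_compare_exchange_strong(&ring[i].state, &expected,
                                           FRAME_ANALYZED)) {
            /* run the face / license-plate model on ring[i].pixels here */
        }
    }
}
```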
Starting point is 00:49:15 So, more resources. If you're really interested in SmartNICs, I've only touched on a lot of the concepts today because I only had a few minutes to do that. We have a series of articles written on SmartNICs: you know, what makes a SmartNIC smart, what's the difference between a SmartNIC and a regular NIC,
Starting point is 00:49:41 you know, the shift from SmartNICs to FPGAs and how they may dominate, as well as what's going on with CCIX and CXL and SmartNIC architectures. So we've got those articles. These are clickable from the PowerPoint or PDF. And then also we had an IEEE Hot Interconnects panel at Hot Interconnects 2020 a couple weeks ago. You can click on that link, and that will take you to the YouTube video of the panel for more information. I appreciate your time today, and if you have any other questions,
Starting point is 00:50:13 please feel free to reach out to me. Thank you. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe@snia.org. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.
