Grey Beards on Systems - 174: GreyBeards talk SDN chips with Ted Weatherford, VP Bus. Dev. & John Carney, Dist. Eng. at Xsight Labs

Episode Date: February 20, 2026

Xsight Labs talks their latest SDN X2 network switch and E1 DPU chips with the GreyBeards...

Transcript
Hey, everybody, Ray Lucchesi here. Jason Collier here. Welcome to another sponsored episode of the Greybeards on Storage podcast, the show where we get Greybeards bloggers together with storage and systems vendors to discuss upcoming products, technologies, and trends affecting the data center today. We have with us here today Ted Weatherford, VP of Business Development, and John Carney, distinguished engineer for software architecture at Xsight Labs.
They were both at a recent AI Infrastructure Field Day event, which I attended, and they were the hit of the show, in my humble opinion. So Ted and John, why don't you tell us a little bit about yourselves and what's up at Xsight Labs. Oh, great. Thanks a lot. This is Ted Weatherford, VP of BizDev, as you said. We're going to just do some good exploration and Q&A and kind of continue from where we did on Tech Field Day. I want to pass over to John Carney, our distinguished engineer and architect with a focus on the E1 and many, many, many decades in this exact area. Hey, John, give him a hello. Hello.
Hey, Ted. This is John Carney. I'm a distinguished engineer, software architect at Xsight. I've been at Xsight — I'm in my seventh year at Xsight. I was the first employee in the U.S. And I came in to start the DPU product, which I think we'll talk a lot more about as we go. I'd like to try to talk about half the time on the DPU
Starting point is 00:01:46 and half the time on the switch product. And maybe we could start with the switch and follow up with the DPU. And then maybe we'd talk about some customer wins if that makes any sense. Anyway, so the X2 product is a chip. You've been shipping for a while now, and it's your switch chip. Somewhere in there, you mentioned 3,000 plus cores of parallel processing and a parser core. and all this stuff. What the heck is, what do you do?
3,000 cores sounds like a GPU, not a switch. Doesn't it? I mean, it's great. I think I'll give you like the inside story on the whole thing, because there's always history and philosophy to all this, right? What has the team done in the past? What's their belief system in terms of what the market needs, right? So our team comes from the network processor world.
There's a lot of folks from EZchip, Motorola Freescale — and if you follow back far enough, you know, Freescale was dominant in communication processors. These were really network processors. And they did a lot of multi-core stuff early. Gosh, they had C-Port, they had their other product lines. So the multi-core, you know, Harvard architecture RISC core thing, it's been going on a long time, you know, and so this team is really experienced in both building and utilizing these kinds of products. For instance, John and his team on the East Coast, they did a lot of work not only designing this kind of stuff, like at Broadcom, but also, at Juniper and Cisco, deploying and software coding
the low-level, you know, very interesting data paths, right? So the team is this veteran team, both in Israel — more on the EZchip side, but a bunch came from Mellanox, and a bunch came from Annapurna Labs recently, from the Nitro team. So there's sort of been this blending over the years of network processing now turning into DPUs, okay? Communication processors, the network processors, kind of became DPUs. So if you really pan back from it, you've got sort of like an eighth generation, you know, EZchip-type thinking masquerading around as an Ethernet switch. And so, I mean, this is just the way it is. There are 3,072 Harvard architecture cores. They're little, efficient cores designed specifically
for packet processing, specifically really layer 1.5 to layer 4. So it's not — you don't look at this product as a stateful product. That's where the DPU comes in. But for forwarding, it just does everything under the sun. So it's really this eighth generation network processor. That's the long answer. Yeah, yeah. So maybe we should preface this with discussing that Xsight Labs is a chipless fab — is that, I'm not sure how to say it. We call it a fabless semiconductor company. Sorry, John. Fabless semiconductor. So you design chips, and you've got these two chips out there, one for the network switch and one for the DPU. And you're actually shipping both chips, right? I mean, as far as I know. Yeah, let's run through the schedule. So the X2 is our second generation switch. Our first generation switch was a 7 nanometer TSMC 25.6T. And the second generation, the X2 — so there was an X1.
Now there's an X2. The X2 is a 12.8T. It has 128 100-gig SerDes on it. So it's going after the edge type of applications — so inference at the edge, or actually extreme edge. We're actually up in space; we'll talk about that one later. That product is a 5 nanometer design that taped out in November of 2023 and came up in the lab in April of '24.
And then it went general availability on the first spin — no metal spins — in November of 2024. And we're in mass production. I mean, here we are well into 2025, so we're in mass production on that product. And then the E1, which we'll talk about later, is this 800 gig DPU with 64 Neoverse N2 Arm cores — an edge server. That product taped out in November last year, November 2024, arrived in the lab in March, and then we started sampling it early June. And that product is general availability now, as of January, and should go to mass production,
with a 4,000-hour qual, in the summer, you know, May-June timeframe. So the products are both stable and are out in many customer hands. And the X2 goes to scale this year — by scale, I mean, you know, 100,000-plus units. Yeah, yeah, yeah, yeah. It's pretty impressive. Not a lot of companies can do these sorts of chips anymore. I mean, it's pretty rare to be producing these sorts of high parallel processing,
Starting point is 00:06:59 high core count chips in the world today, right? I mean, it's not a lot of companies do this sort of stuff. I want to let you know, like we really want to build a company and our dream and our trajectory is to take this thing public. But if you pan back from the problem that the world's in right now, we have an energy issue, especially in the United States, China less so. But we have one, right? And we're scrambling to catch up. And then you've got the chip problem. Like it's all TSMC or the memory vendors and it's a choke point, right? but you go the third one down of like, well, what is slowing us down? What is going to ultimately slow us down? There's only 10 teams in the world that can do this kind of stuff with the kind of quality
you need, and the kind of cadence. And there are six principal ASICs that you need for AI factories, whether it's inference or training. And so if there are 20 entities — companies or governments — on the planet that want to control AI, they have a hard time doing it, because you've just got six chips you need to build at an Nvidia cadence or better. And you just have only 10 teams that can even do it. And we're 260 people, 45 or 50 of which are contractors.
And there's about 10 of us, 14 of us, in the apps and commercial team. So it's a very efficient team doing these two products. And if you look at our schedules, which I just laid out, we're on a tick-tock: we put out a complex chip — DPU, switch, DPU, switch — kind of every year to every 14 months. And that's really hard to do. Yeah, it's almost Apple-scale kind of stuff. By the way, Apple — when I list the 10 teams,
Apple's up there. It's Avago, Apple, and Nvidia, and frankly, to be arrogant, us. You know, it's hard. It's a really hard game. We'll talk a little bit about some of the Apple comparisons when we talk about your DPU chip, which has got more cores than
any Apple product ever put out. But that's another coin — different category. Very different category. I'll let John take that. Hey, John, when we get to that one, you take that one. Very different from what Apple's doing. Oh, yeah, but similar in a way. I mean, well, we'll talk about — no, well, we'll dice and slice it. Well, it's a very interesting approach from a switch perspective, given the fact that, you know — so I'm assuming that you're really targeting kind of these AI workloads in basically the in-data-center and kind of the edge data center use cases as well, because, I mean, you know, edge, period, all the edge. That's what would be all the Harvard architecture — you've got that being classic, you know. Basically it's like that parallel
fetch and store that eliminates that classic von Neumann bottleneck, right? And we've got deterministic timing, and honestly better power and security isolation as well. So, I mean, almost everything's Harvard now. I don't even know why I threw that out there, because almost everything's got a separate instruction and data bus. But I just want everybody to know that you can code it like you normally code it. And we do have an assembler for it. But we actually have all the libraries done, which is all the table management and hard stuff. And then you actually have a Python wrapper to call everything. So it's actually pretty, pretty easy to program. And you're not talking about a lot of lines of code. A forwarding data path for — you know, beating a Tomahawk or a Trident or, you know,
a Teralynx by Marvell — going after those products, you don't have a lot of lines of code, you know. But the beauty is you have the future-proofing, and you can do all kinds of things that AI does need. You know, there's combining, there's filtering — there's all kinds of things you want to do to save energy and to save latency. That's — you bring up a really good one too with the energy. You know, kind of one of the things when I was reading into some of the specs on your switching architecture is basically the power reduction from that method. John, this — yeah, this is what's never been done. So if you say,
okay, what are you really doing different? What have you done that's never been done before that directly maps to benefit for the end user? Nobody's built a programmable product that's low latency and very high packet performance at a low power per bit, at a low joules per bit. And we have. And that's the feather in our cap. Because you can go back through all the programmable switch architectures, the ones that are out — the Tridents, the Silicon Ones,
the StrataDNX or the Jericho, you know, the Qumrans from Broadcom — they all carry a huge power tax, huge latency tax, huge jitter tax. But we have the performance of Tomahawk. You know, our latencies can be at 450 nanoseconds, and the real numbers on Tomahawks are 600 to 800 nanoseconds. So one of the beauties of having a parallel architecture is that you can decide to trade off latency for, you know, for the application.
Starting point is 00:12:12 So if you have a heavy lift, which none of the data center does, by heavy lift, I mean, you know, layer two, a bunch of ACLs, layer three, maybe some fancy tunneling, you know, like this kind of stuff. you have a really long list of instructions to get all that packet service terminated or created or replicated, counted, blah, blah, blah. But for a data center application, it's just a simple routing application. You need very little. And so at that point, you can turn the latency down and be parallel at 450 nanoseconds
Starting point is 00:12:46 with our product. And so all the competing products are pipelines, sorry, are pipelines. They have a fixed latency through that pipeline. So whether it's Tomahawk or Trident or whoever, it's a fixed latency, and it's generally a little bit higher. Yeah. And, you know, it's like it's really interesting too because it's a, you know, I do like kind of the, you know, price performance per watt is going to be what's going to measure a lot of these AI factories that are getting built at this point. And the reality is you're solving a very, very parallel problem. And basically the parallel architecture that you've got as far as, you know, like I said, basically doing the Harvard architecture.
That many cores — yeah, it seems like a pretty compelling product to basically help with a very, you know, parallel issue that we've got in building these AI factories. I mean, nobody is — yeah, oh, go ahead. That's pretty impressive. It's all software-defined architecture. In a way, it's OpenFlow networking come of age, to some extent. You mentioned — that's it. Nobody's ever done a layer 1.5 to layer 7 OSI switch that's truly software defined. And you put our two parts together, and for the first time you've got a full OSI stack that's software defined.
I'm going to ask about this smart switch thing later. But in the meantime — the architecture is extremely parallel. You've got these 3,000 cores of Harvard architecture. You've got parser cores. And somehow this is split up across 64 packet engines? Is it dynamic? Is it fixed?
No, there's a lot of dynamism, and there's a lot of interconnect in the design. So when you look at the published slides about the 64 cores, it's a 400 gig building block. And what's sexy about that is, for us, when we're going up or down market and building a switch that's 200 terabit or 400 terabit, or down the other way, you can use that basic building block to scale up and down. So in that sense, think of the tiling, the natural symmetry, that an FPGA has. Not that there's anything FPGA about this — it's an expensive tape-out.
But you have this basic building block. There are other basic building blocks — like the port groups come in SerDes groups of, you know, I think it's eight SerDes. So there's the given speed of the SerDes, you know, and the PLLs that create the physical layer SerDes speed, because you can run 100, you can run 50, you can run 25, you can run 10, and every group of four ports can be dialed, for instance. But that 400 gig building block, for the throughput, with the cores — yeah, it has 48 cores. So if you take the 64 times 48, you'll get 3,072. That's the core count.
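To make that building-block arithmetic concrete, here's a minimal Python sketch using only the figures quoted in the conversation (64 packet engines, 48 cores each, port groups of eight SerDes); the variable names are ours, not Xsight's.

```python
# Building-block math as quoted in the conversation; the names are illustrative only.
PACKET_ENGINES = 64          # 400-gig building blocks
CORES_PER_ENGINE = 48        # Harvard-architecture packet cores per engine
SERDES_PER_PORT_GROUP = 8    # "groups of, I think it's eight SerDes"

total_cores = PACKET_ENGINES * CORES_PER_ENGINE
print(total_cores)           # 3072 -- the 3,072-core figure quoted for the switch

# Every group of four ports can be dialed to a lane speed (Gb/s) independently:
for lane_speed in (10, 25, 50, 100):
    print(f"{SERDES_PER_PORT_GROUP} SerDes x {lane_speed}G = "
          f"{SERDES_PER_PORT_GROUP * lane_speed} Gb/s per port group")
```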
I got it. I got it. All right. It's really for us — like, the customer doesn't have to think about that, okay? Yeah, yeah, for the record. Yeah, no, they don't. But it's for us. Yeah. No, it's a silicon
tape-out velocity thing. Because look, when you pan back from this, you've got to have a team that can keep up with Jensen, or you're not going to be part of the solutions that displace Nvidia. So the other thing that a lot of the networking stuff these days that's focused on AI talks about is, I don't know what I'd call it — I guess advanced load balancing capabilities, to try to distinguish between inferencing traffic, training traffic, regular traffic, that sort of thing.
Starting point is 00:16:38 As a software architecture, you seem like you'd be able to do all that, but there is, you know, in my mind is a question of how much time do you have to make these sorts of decisions and the sort of switches that are operating at 400 gig and things of that nature. Yeah, so the way that works is there's two different things you're referring to that are very critical. One is differentiating one kind of traffic from another so that when you do have congestion, the really important traffic gets through. Everybody's got a story for that in switching. Why our narrative is different and powerful is we're not locked into a particular hardwired,
Starting point is 00:17:19 configurable version of that. So when it comes to classification and then the scheduling part of that, like do I let the packet through, do I delay it, do I drop it, you know, do I send some information like instrumentation or telemetry to somewhere else to tell them what's going on? You don't have any constraints. We can code all of the features that everybody else has and then you can bring in some refinement to that. So it's future-proofed, okay? Or if you want to do something that nobody else has. And the larger CSPs do. They do want to build their own congestion management, their own classification schemes, their own scheduling algorithms. They do want to tweak it and tune the system.
But the other one you mentioned was the hashing or the spreading. That one's very straightforward. Like, when you have an open loop system, every switch in the fabric is just talking to itself, okay? It's getting information via packets at a slow rate from other places in the network to figure out where the congestion's developing. But you don't have a control loop that's scheduled over the whole thing. And that's good, because that's cheap.
And so if you want to go to a scheduled fabric, you're into the Silicon Ones — you're paying five times per port, per bit. If you're into the Jericho, Ramon — you know, Broadcom StrataDNX — you're paying for that. It's a centralized scheduler, called a pull scheduler. And it's complex. It's complexity. And the thing about things that are complex is they don't scale. So back to this: the Ethernet switches like the Tomahawks that are dominant in the fabrics at the highest volumes, in the data centers and the AI front ends —
those switches are cheap and they're open loop. So to get the utilization of the fabric, you have to spray over all the ports. And then all those flows take different directions, and you don't want the flows to get out of order — and the packets are different sizes, so they can get out of order. So all that out-of-orderness has to get fixed in the DPU or in the NIC, and that's rocket science.
Starting point is 00:19:31 But what I'm trying to say with our product, it's different. We can do any of the spreading or hashing, and that's called entropy. We can add any amount of entropy and look at anywhere in the packet to create that entropy, whereas all the configurable products are fixed. So they have great entropy, maybe,
but there's this thing called hash polarization, where as you have more and more tiers of switches creating this giant fabric — like you'd find at an Amazon or Google or Microsoft or Oracle — you know, you get hash polarization because the hashing algorithm sucks. It doesn't have the right entropy for the exact traffic patterns, which relate to the workloads that the CSP is doing, right — the Amazon or the Microsoft or Google. So that hash polarization is just simply inefficiency. It means there are links that are underutilized and other links that are overutilized and maybe dropping packets. Because if links get overutilized, they'll get congested. When you have congestion,
you can get dropped packets. For AI, you can't drop packets. So, as these Ethernet switches go into AI fabrics — what you said, to summarize — the hashing is all-important: being able to tweak it and add the right amount of entropy to avoid hash polarization, to avoid low utilization and congestion. And then the other one you said: you need the QoS, the quality of service. You need to know what every packet is, where it's going, how important it is. And when things do get ugly — congestion — and you approach drops or some dramatic activity, you need to be able to avoid that. And we have all the knobs, and everybody else is fixed function, unless you go to the router chips. Yeah. Sorry for the rant.
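As a rough illustration of the hash-polarization problem being described, here's a hedged Python sketch of a toy ECMP spreader. The hash and header fields are stand-ins, not anything Xsight or the CSPs actually use; the point is just that if every tier hashes the same fields the same way, flows that collided at tier one keep colliding at tier two, and adding per-tier entropy re-spreads them.

```python
import hashlib
from collections import Counter

def ecmp_pick(flow, uplinks, salt=b""):
    """Toy ECMP: hash a flow tuple (plus an optional per-tier salt) to pick an uplink index."""
    digest = hashlib.sha256(repr(flow).encode() + salt).digest()
    return int.from_bytes(digest[:4], "big") % uplinks

# A handful of synthetic flows: (src, dst, sport, dport, proto).
flows = [(f"10.0.0.{i}", f"10.1.0.{i % 4}", 1000 + i, 80, 6) for i in range(64)]

# Tier 1 spreads the flows over 4 uplinks.
tier1 = Counter(ecmp_pick(f, 4) for f in flows)

# Tier 2 with the *same* hash: every flow that took tier-1 uplink 0 re-hashes to the
# same value, so they all pile onto a single tier-2 uplink -- that's polarization.
tier2_same = Counter(ecmp_pick(f, 4) for f in flows if ecmp_pick(f, 4) == 0)

# Adding per-tier entropy (a salt) re-spreads those same flows.
tier2_salted = Counter(ecmp_pick(f, 4, salt=b"tier2") for f in flows if ecmp_pick(f, 4) == 0)

print("tier 1 spread:      ", dict(tier1))
print("tier 2, same hash:  ", dict(tier2_same))
print("tier 2, salted hash:", dict(tier2_salted))
```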
The software-defined nature of the product makes it very flexible in that respect. All right, so we're about halfway through the talk here. I want to move on to this DPU thing you've got. So, John, talk to me about — what does a DPU need 64 cores for? I mean, I've got Linux servers that don't have 64 cores.
I've got a Mac — I've got an M5 Mac — and it doesn't have 64 cores in it. What do you need 64 cores in this thing for? John? Yeah. Let's talk about sort of the traditional way this has been done, and then we'll talk about how we got to 64 cores in a DPU.
Starting point is 00:21:52 So the traditional DPU really started from the NIC and added some compute off to the side of the NIC, you know, integrated some small number of cores. And really the idea of the DPU in that architecture was that the NIC portion of it would do, you know, 90% of the processing, and most packets would not touch the cores. But what we've seen is that that's not flexible enough, for one. Two, is that it creates a very difficult kind of programming model
where you kind of have to split the programming between, you know, a standard programming model on your CPU cores and sort of a proprietary or much more difficult programming model in the NIC portion of the DPU. So when we architected the E1, we made a decision that we wanted the compute system to be on-path. In other words, we wanted the ability for every packet to be able to be processed by CPU cores, to allow for the standard programming model, but also to allow for the full packet rate and the full capacity to be programmed with cores. So what we did is we scaled the compute system
and we optimized it for energy efficiency. So although we have 64 cores, we don't run them at the highest frequency possible; we run them at a little bit of a lower frequency. That optimizes for performance per watt. We've also done some optimization of the cache sizes, again really focused on data plane workloads. So we have a very capable server-class compute system,
but optimized for data plane workloads and for energy efficiency. Yeah, I mean, server class. I think I saw someplace you have four DDR5 memory buses. I mean, it's enough memory to — it's more memory than the world needs. Well, I don't know if I'd go that far. It's the world's only 800 gig DPU. It should surprise you — like, how are they doing that?
Yeah, and let me just speak to the memory interfaces. Yeah, we have four DDR5 memory interfaces. And for certain use cases, you know, the architecture of the chip is such that when a packet or a PCIe transaction arrives at the chip, it goes to our system level cache, so it doesn't touch the DRAM. And for many workloads, the goal is to process that transaction and forward it out of the chip, or terminate it, before it ever needs to go to DRAM.
Starting point is 00:24:52 And so in that case, the DRAM is available really for your databases, for your tables, for your stateful flow tables, connection tracking, for your billing, for all of those things. And there are use cases for which, there's no locality, meaning for every packet that arrives at the chip, there are database accesses that are not hot in your cache. You have to go to DRAM for that.
And so we architected the chip to try to keep the packets on chip, while the databases, which you have to access at very high scale, remain off chip in the DRAM. What do those cache layers look like on your chip? Yep. So we have a system cache — we have a system level cache that's shared by all the cores. That's 32 megabytes of system level cache. And then we have a dedicated layer two cache for each core that's half a megabyte.
And then we have L1 caches — data and instruction caches — that are 64 kilobytes. So again, the cache hierarchy is optimized for data plane workloads. The compute system on the chip is not going to compete at the highest levels of server-class CPUs. But for data plane workloads, it's going to give you about twice the performance per watt that you can get on any other kind of compute CPU chip.
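To put those cache numbers side by side, here's a small sketch that just totals the hierarchy John lists; reading the 64 KB figure as applying to the instruction and data L1 caches each is our assumption, not something stated outright.

```python
# Cache hierarchy figures quoted in the conversation; the totals are simple arithmetic.
CORES = 64
SLC_MB = 32                   # shared system-level cache
L2_KB_PER_CORE = 512          # "half a megabyte" of dedicated L2 per core
L1_KB_PER_CORE = 64 + 64      # assumption: 64 KB I-cache + 64 KB D-cache per core

total_l2_mb = CORES * L2_KB_PER_CORE / 1024
total_l1_mb = CORES * L1_KB_PER_CORE / 1024

print(f"SLC: {SLC_MB} MB shared")
print(f"L2 : {total_l2_mb:.0f} MB aggregate ({L2_KB_PER_CORE} KB x {CORES} cores)")
print(f"L1 : {total_l1_mb:.0f} MB aggregate")
```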
So, I mean, DPUs kind of started out by, you know, offloading some of the functionality from the host to the DPU — to do networking activity and things of that nature, to accelerate networking connectivity, to accelerate networking performance and those sorts of things. This has gone a whole other level up the stack, it seems. I mean, you're running Linux on these things. Is that correct? Yeah, yeah. So we're running, you know, a standard Linux distribution. You know, internally we're using an Ubuntu 24.04 distribution. We have customers that are using other Linux distributions. Yeah, I would say you're right. I mean, the DPU started — really, you know, first you had the NIC, and then you had sort of the SmartNIC that started to offload some of the functionality of the host.
Starting point is 00:27:34 then it really got to DPU. And it's about offload, but it's also in a sort of cloud SDN use case. It's also about, you know, security, isolation, virtualizing the services of the infrastructure so that that's all done off of the host. And it also provides the ability for the host to be bare metal so that there's no you know, there's no cloud service provider software at all potentially running on the host, and so that the tenant can have the entire host for their use, and then the DPU is providing all of the virtualization of the infrastructure.
Now, am I also reading this right — so I was looking at your spec sheets — are you offering basically the 800-gig data plane bandwidth, basically in 2x400, in that card, and you're also doing that 800 gigabit for 75 watts off the PCI bus? So we have different systems that we have available. One thing I'll say is that we're a semiconductor company. We're not a system company; we sell chips. We're structured as a semiconductor company.
Starting point is 00:28:58 I've worked for system companies. I worked at Cisco and Juniper. And I would say at a system company, you probably have 20 to 1, 50 to 1 software engineers to hardware engineers. We're not structured that way. We're structured as a semiconductor company. And so our goal is to sell chips to partners. To integrators, yeah, yeah, that are building.
Starting point is 00:29:21 Or to the hyperscalers. And they're the ones that are integrating their own software and solution on it. So what we're providing are the infrastructure pieces of software that enables our customers to do this. Now, in terms of the power, it really depends on the use case and depends on what IOs and what interfaces and whether you're using all the memory channels or not. So what we're shipping today to customers, we have what we call an E1 server platform. It's a 1U server where the E1 is self-hosted. In other words, it's not connected to a host. It is the host.
Right. That has 2 by 400 gig Ethernet. And in that case, the PCIe can be used in a root port mode to connect to other things. Most of the time it's connected to storage, but it could be connected to any other kinds of devices or accelerators. That's sort of a reference platform that we provide for our customers to do evaluation. We have another platform, which is our PCIe add-in card, which you would plug into a host. And that add-in card, again, on the networking side is 2 by 400 gig. On the PCIe side, it has an edge connector that's Gen 5 by 16, so the edge connector can handle 400 gig.
And then we have an MCIO connector that gives you another 400 gig of PCIe, which can be used to plug into another Gen 5 slot, or it can be used to connect the DPU directly to local storage, for example. And again, depending on the use case and depending on the SKU — we have different SKUs for the chip — you could be anywhere from over 100 watts to under, you know, 75 watts. So it really depends on your use case and your SKU of the chip as to where you fall within the power range.
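For context on the Gen 5 by 16 edge connector "handling 400 gig" mentioned above, here's a quick back-of-the-envelope using the standard PCIe Gen 5 figures (32 GT/s per lane, 128b/130b encoding); this is generic PCIe arithmetic, not anything specific to the E1.

```python
# PCIe Gen 5 back-of-the-envelope: why an x16 slot comfortably carries 400 Gb/s of Ethernet.
GT_PER_LANE = 32.0            # Gen 5 signaling rate, GT/s per lane
ENCODING = 128 / 130          # 128b/130b line coding
LANES = 16

raw_gbps = GT_PER_LANE * ENCODING * LANES
print(f"Gen 5 x16 ~ {raw_gbps:.0f} Gb/s before protocol overhead")   # ~504 Gb/s

# Even after TLP/DLLP overhead, ~400 Gb/s of payload fits with headroom,
# which lines up with the figure quoted for the card's edge connector.
```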
Right. No, I mean, it's just — it is impressive, the range of power options that you've got there, and the fact that, you know, that does fit very well into a lot of edge AI use cases. Yeah, because power is a huge constraint, as you well know. I mean, you take a standard AI factory single box — like, that's effectively 4U — and those things usually suck like 10,000 watts of power. Most of it goes into the GPUs, right? And anything that you can do to basically reduce —
And those things also, by the way, you know, have on the back end of them — you know, they've got like 10: there's basically two system NICs that are 400 gig, and then basically each GPU has its own NIC, which is a 400 gig NIC coming out the back. And anything you can do to reduce the power on that is always highly desired, because you stick a rack full of those things in there. And I mean, the current shipping generations are doing, you know, between 80 and 150 kilowatts a rack, and the future generations of stuff, you know, as everybody knows — between, you know, like Nvidia and AMD, what you're shipping — I mean, you're talking like a quarter to half a megawatt per rack. Yeah. A new world. New world. So I mean, Google published a one — yeah, one million watts per rack. Yeah. Wow. Yeah. All right, John. Talk to me about this Sonic Dash benchmark thing. I look at these numbers and they're just unbelievable. Yeah. Yeah. So,
Starting point is 00:32:52 you know, as everybody knows between, you know, like invidion and AMD, you know, what you're shipping. I mean, you're talking like a quarter to half a megawatt per rack. Yeah. A new world. New world. So I mean, Google published a one, yeah, one million watt per rack. Yeah. Wow. Yeah. All right, John. Talk to me about this Sonic Dash benchmark thing. I, I look at these numbers and they're just unbelievable. Yeah. Yeah. So, Let's just talk a little bit about what Sonic Dash is, and we'll talk about the benchmark. So Dash is an open community initiative.
Starting point is 00:33:37 It was started by Microsoft. And the purpose of Dash was to really define the host services that are used in the cloud. And I think if you go back a decade or more than a decade, You mentioned OpenFlow earlier. You know, open flow and, you know, other kinds of sort of primitive SDN frameworks, you know, allowed the cloud provider to build the services on top of these primitives. And what's happened is over the, those decade or decade and a half,
it's converged to the point where, like, the services in the cloud are known and can be well defined. And so the idea of the DASH initiative was to actually define these services and to define the APIs for those services within the bigger Sonic project. And defining those enables technology providers like Xsight to create implementations that really optimize for performance and for power. And so that's what the DASH initiative is.
One of the services within DASH is called the VNET-to-VNET service. And it's a very, very heavy workload. It has to do connection tracking, and it has to do it at very high scale — hundreds of millions of connections. It has to have very high numbers of active connections, as well as new connections.
Starting point is 00:35:19 It has to have very high numbers of active connections, as well as new connections. And so there's a benchmark test that's created for Dash called the Hero, and it's defined at different throughputs. And there's one called the Hero 800 that requires you to do 12 million connections per second while having over 120 million background connections running. And so the E1 has been tested. We have an independent third party that tested our dash implementation, and we not only passed that benchmark, but we exceeded it by about 20%. And we still see optimizations that we can do to improve it even more. But we're the only DPU that's been able to pass the Hero 800 test with a single device.
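The benchmark arithmetic is worth spelling out. Here's a minimal sketch using only the figures John gives (a 12 million connections-per-second requirement, exceeded by roughly 20%, with 120-plus million background connections, over a roughly 90-second run):

```python
# Hero 800 figures as quoted in the conversation.
required_cps = 12_000_000            # new connections per second the test requires
margin = 0.20                        # "exceeded it by about 20%"
background_connections = 120_000_000 # active connections held open during the run

achieved_cps = required_cps * (1 + margin)
print(f"~{achieved_cps / 1e6:.1f}M connections/sec")  # ~14.4M, in line with the
                                                      # "more than 14 million" quoted next

# Over a minute-and-a-half run, that's on the order of:
print(f"~{achieved_cps * 90 / 1e9:.1f}B connections set up in 90 seconds")
```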
This is a very, very heavy workload — just an example of what you can do with the E1. So a single DPU connected to a single server is handling more than 14 million connections per second for a minute and a half or longer. Not to mention the fact that you're doing, you know, I don't know, 120-some million other connections that are sitting in the background of this thing.
What in the world requires 14.3 million connections per second for a minute and a half? I mean — so I think the key to this is the word disaggregated, which is the first word of Dash: disaggregated. And so,
Starting point is 00:36:58 really the way dash is deployed within the cloud provider is in the switch. And you said, we'll talk about smart switch later. And so by combining
Starting point is 00:37:09 the DPU with the switch, these workloads can be disaggregated from the host. And I think, if you think about it, you know, there's all different kinds of workloads within the cloud environment. Many of them don't need anywhere near that kind of capacity.
And those often will be performed just locally, right on the host, within the NIC that's on the host. But there are some very network-heavy workloads within the cloud, and those will be tunneled to the smart switch to have the services applied in the switch, rather than right on the host. And this is the architecture that they're after with Dash. I was just going to say, I can basically tell you that I have actually seen those levels of workloads, but it would require the entire viewing audience to sign an NDA. All right, well, we won't go there, Jason.
Starting point is 00:38:15 All right, so talk to me. So we got a couple of minutes left. Talk to me about some of your customer deployments here. I mean, you've got some big names, right? Yeah, no, it's exciting. And we've got the ones we can, you know, publicly talk about and the ones that we can't. But we're officially off to the races. The marquee one, because it's just the sexiest thing going, is we're in space.
We've won the third generation of SpaceX's satellites. So this is really, really material, and we're allowed to say that we have two or more chips in space per satellite, the Starlink Gen 3. And the two chips — or the one chip that we can talk about — is the X2. So there's at least two X2s in every satellite. And then in terms of their exact forecasts of how many satellites are going to go up, that's something you can kind of YouTube for yourself, but it's very substantial. And they chose us because they needed the programmability. And we were able to pass the radiation, the vibration, the temperature, and especially the low power.
So we came in lower power than every other switch at 12.8 terabit. So, yeah, they have their own protocols — we can't talk about what they are — so they had to code packet services directly. They were able to code and demonstrate their unique packet services, you know, in three weeks, which is amazing. There's no other programmable switch product where you can write a service in three weeks. So we're really excited about that. And no, it's big. And we expect to have more programs, you know, won over time.
And, yeah, that's the sexy one. And then we also have announced a partnership with Hammerspace. And Hammerspace is a software company. And we're building a reference system together under the openflashplatform.org philosophy. And it's the densest warm storage platform ever created. You can do one exabyte per rack. Now, nobody's going to want to do that. But you can get four times five or four times six petabytes inside one 1U, with a standby power of 600 watts.
And where that's going to sweep the market is in the context caching and tiering of that context as the context gets so big — right inside the rack or the row of every single, you know, Vera Rubin or anything competitive to that. So that's a big one. And we're getting interest from all the CSPs globally on that, because we've got the only DPU that can fit in the space and has the 800 gig connectivity.
So in that — and that's a serious situation — you're running not only DPU software, let me call it, but also Linux software, file systems, NFS. You're running all this. It's server stuff, right? John, just chime in. I mean, this is the perfect application for us. And the only competition we have won't fit. I mean, the BlueField 4 is not here yet — we're about 18 months ahead of them, and it will show up Q3. And that's a 300 watt, you know, multi-die, super expensive, don't-want-to-be-in-the-chip-business-anymore part. So, yeah, we're positioned to sweep the warm flash storage market for all of AI, or all applications.
Yeah. And I think, you know, the integration we have with the networking,
Starting point is 00:42:12 with the compute system, PCIE, just maps perfectly to this use case. I think if you were to look at the platform, it has either five or six E1s in it, depending on whether you're in a 19-inch form factor or an OCP form factor. And I think if you were to build that with any other, you know, with any other solution, you just wouldn't get to the power efficiency and density.
You know, that platform has four terabits per second of networking capacity, as well as four terabits per second of PCIe bandwidth to the drives. And, you know, you mentioned Linux, and, you know, this is really the beauty of the E1: it's all standard programming. So for these use cases that are traditionally built, you know, in servers, those use cases can be run on the E1 without any modification. So, you know, Hammerspace as our partner is able to take their software and run it on the E1, you know, immediately, without having to do any kind of special development.
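A quick sanity check on those platform numbers, assuming five or six E1s each with 2x400G of Ethernet and roughly the same again of PCIe (per the add-in-card description earlier in the episode); the platform-level extrapolation here is ours.

```python
# Platform-level aggregates, assuming the per-E1 figures quoted earlier in the episode.
E1_ETH_GBPS = 2 * 400         # 2x400G Ethernet per E1
E1_PCIE_GBPS = 2 * 400        # edge connector + MCIO, roughly 400G each

for e1_count in (5, 6):       # 19-inch vs. OCP form factor, per the description above
    eth_tbps = e1_count * E1_ETH_GBPS / 1000
    pcie_tbps = e1_count * E1_PCIE_GBPS / 1000
    print(f"{e1_count} x E1: {eth_tbps:.1f} Tb/s Ethernet, {pcie_tbps:.1f} Tb/s PCIe to drives")
```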
Technically this is a storage server we're talking about. It may not be a sophisticated, multiple-gigabyte-cache kind of storage server, but you're running — It doesn't have to be. Yeah. It's a distributed scheme. If you look into what we're doing, it's a scatter-gather over every single little cartridge — five cartridges per rack, six cartridges per rack if you're Meta-style, 23-inch rack. It's distributed. You take the x86 centralized bottleneck out of the whole thing.
Starting point is 00:44:12 There's no server in the middle of every transaction between the block storage and the other side. It's just a flat distributed NVME network. You don't need all that horsepower. Yeah, and for the specific application like target that that's for, I mean, that actually, you know, that works really well and it's really efficient. Yeah, we see other, we're seeing a lot of other, you know, appliance, you know, I think of this as an appliance. That's the way I think about it. And we see a lot of other appliance kind of use cases that aren't necessarily storage, but are more networking oriented.
Starting point is 00:44:53 And getting the kind of, you know, network throughput, you know, by putting multiple E1s in an appliance is, is, really, really powerful for workloads that can be scaled out. You know, the E1 really does well, you know, for those kinds of workloads. And we're seeing lots of use cases. Every day we're seeing customers with, you know, new use cases. And they map very nicely onto the E1. And this is not ending here. Obviously, X2 implies that it is going to be an X3 sometime.
Down the line. And E1 says an E2 is not, you know, not outside the realm of possibility as well. I mean, obviously more bandwidth is a key to these sorts of things. But what else are you going to do in these sorts of new chips that are coming down the line here? You don't even have to talk about them if you don't want to. No, let me take this one. If you want to be a processor company like a Cavium and Intel or an AMD — if you want to be in that game, the multi-core game — you've got to,
at the end, have three, maybe even four power-price-performance areas, right? And so we came out where we are — you can decide where that is in the stack-up — and we'll go up and down. You're going to see something that's higher bandwidth. You're going to see something that's lower. You need a street fighter, you know — you need something low end that can address control plane and all kinds of other stuff. And you need something high end, because AI is marching on at rocket speed, right? And then on the X side, yeah, I mean, we're going to do a single tape-out where a single die attacks, you know, Broadcom and the others. I mean, the others are almost irrelevant.
I mean, there's Spectrum, I guess. But we're going to attack Broadcom where we hit three of their generations of products with one die. So that implies some creative use of the die and stuff. But, yeah, you'll see a higher bandwidth X series coming soon. Can't wait. I need one of these X1s for my home lab, though. That's a different discussion. Hey, you can get those at Edgecore immediately.
Yeah, yeah, I can — I can't. I can't do it. Don't do it. Don't do it. It's tempting. All right. Listen, this has been great, guys. Jason, any last questions for Ted or John? Well, now I know. So basically, with the whole thing of doing the stuff with SpaceX, putting stuff in space — now I know why your company starts with X. Yeah.
Yeah. You know, I hired a guy today named Xavier, and he asked for the email x at xsightlabs.com, and I'm like, yeah, you can have that. But it's like, wow, why does he get that? That's the coolest email. Look, we hope the SpaceX thing flowers into many, many design wins. We certainly can't say that's the case yet. But I think that relationship's going to be amazing because, you know, there's no truck rolls in space. No. Yeah, that's exciting. And that's like —
Starting point is 00:48:10 And then, yeah, no, so that's an impressive feat. So congratulations on that one. Good win. Yeah, it's a good start. And I think it'll grow like crazy. You know, with the whole Elon marketing lately about putting data centers in space, I want to temper that a little bit from my perspective. I think that'll take some time.
robust for that environment, and I think we're perfectly teed up if eventually, someday, we get to the data-centers-in-space vision. I think we'll be right there. Yeah, yeah, yeah. All right, Ted and John, anything you'd like to say to our listening audience before we close? Oh, yeah, just reach out at sales at xsightlabs.com and I'll get a hold of it. And even if you want John, you can do it that way, and I'll make sure you get connected. And yeah, we're here to stay. So come partner, whether it's our ecosystem or our products. And we'll be at a couple of trade shows coming up. You know, I'll be over at the Nvidia GTC, but the teams will be at OFC, which is exciting.
The Optical Fiber Conference — it's an amazing place to go because everybody really talks. It's very approachable. It's not too big of a show. And then the biggest show — we'll be at Mobile World Congress. So, yeah, come out and see us that way, or just reach out at sales at xsightlabs.com. We'd love to hear from you. All right. Well, this has been great. Ted and John, thanks again for being on the show today.
Starting point is 00:49:40 I appreciate it. Yeah, it's been great. Thanks so much for putting us on. John, love it. Thanks. I learned a lot today from John listening to you guys to talk. Thanks. That's it for now.
Bye, Ted. Bye, John. Bye, Jason. Bye, Ray. Bye, Ted. Bye, John. Until next time. Good night, John. Good night, John-Boy. Good night.
Bye, guys. Bye. Until next time. Next time, we will talk to another system storage technology person. Any questions you want us to ask, please let us know. And if you enjoy our podcast, tell your friends about it.
Starting point is 00:50:16 Please review us on Apple Podcasts, Google Play, and Spotify, as this will help get the word out.
