Grey Beards on Systems - 157: GreyBeards talk commercial cloud computer with Bryan Cantrill, CTO, Oxide Computer
Episode Date: November 21, 2023
Bryan Cantrill (@bcantrill), CTO, Oxide Computer was a hard man to interrupt once started, but the GreyBeards did their best to have a conversation. Nonetheless, this is a long podcast. Oxide is making a huge bet on rack-scale computing and has done everything it can to make its rack easy to unbox, set up and …
Transcript
Hey everybody, Ray Lucchese here.
Jason Collier here.
Welcome to another sponsored episode of the Greybeards on Storage podcast,
a show where we get Greybeards bloggers together with storage and system vendors
to discuss upcoming products, technologies, and trends affecting the data center today.
We have with us here today Bryan Cantrill, CTO at Oxide Computer.
Oxide's made quite a splash this fall with the release of the world's first commercial cloud computer.
So Bryan, why don't you tell us a little bit about yourself and what Oxide's been up to?
Yeah, well, thanks for having me.
You know, I feel that I am certainly in spirit a gray beard, although I am biologically not really capable of growing a beard.
You know, there was this kind of moment in my early adulthood when I remember I recall asking my father, you know, I'm not saying that I want one now, but out of curiosity, when am I going to be able to grow a beard?
He kind of gave me that slow nod no, like this is not going to be happening.
So I'm not actually bearded, but I definitely have been an industry veteran.
So I was at Sun Microsystems for 14 years back in the day.
Spent a lot of time at Sun.
That's where I effectively grew up as an engineer.
Started a new storage division inside of Sun called Fishworks to develop a new storage appliance, the ZFS Storage Appliance. Then Oracle descended upon Sun. I loved working for Sun, did not like working for Oracle quite so much. And I went to a cloud computing
company, Joyent. And Joyent was a cloud computing pioneer. I was there for nine years. And it was there that
I met my co-founder, Steve Tuck, who had been at Dell for 10 years prior to going to Joyent.
And so Steve and I were together at Joyent and it was a fascinating, informative experience, we had our own public cloud competing with Amazon and so on. And really
deploying Elastic Infrastructure and then taking the software that we were building
and making that available for people to run on their own hardware. And this is all on commodity
hardware. So I had kind of come up at Sun with these designed systems, and now I was running instead on commodity hardware, thinking, well, you know, how bad can it be? I kind of got an answer to that question. Things went pretty far sideways,
especially when Samsung bought Joyent in 2016 and really pushed to totally new and different levels of scale.
And when they did, everything started breaking.
And things that had been kind of minor issues at lower levels of scale
just became absolutely devastating.
And as we were beginning to kind of contemplate what was next and what we wanted to go build, we really realized that the problem we wanted to solve was the problem that we were really suffering with at Samsung, which is: if you look at the way the hyperscalers build machines, Google, Amazon, Facebook, Meta, I guess, and so on, the way they architect those machines, those machines are not available for purchase. It doesn't matter what your size is; when you're outside of those hyperscalers, you can't buy these machines. And as we were having all these problems and looking at
the way they had done it, we were asking ourselves, why can't we? I mean, to take a super concrete example: anyone running at scale is running with a DC bus bar. So you're doing power conversion. You've got a shelf with rectifiers that take AC in and convert it to DC that's running up and down a copper bus bar. And then you have your compute elements, compute, network, storage, all mating into that DC bus bar.
And everyone runs that way.
Google runs that way.
Amazon runs that way.
Microsoft runs that way.
Facebook runs that way.
You cannot buy a DC bus bar-based system from Dell, HPE, Supermicro.
They will pre-rack a system for you, but you actually can't buy that.
And that is the most basic element of running at hyperscale. There were so many other innovations
that the hyperscalers had developed for themselves that one couldn't buy in a commercial product. And our big belief, and I dare say a big belief of your listenership and of you all,
is that on-prem computing is not going away.
That Jeff Bezos, or I guess now Andy Jassy, are not going to own and operate every computer on the planet.
There's a reason to run your own compute.
And that may be for regulatory
compliance. It may be for latency. And it may be for economics. You know,
as it turns out, it's really expensive to rent all of your compute. And especially if you are a
company, a modern company for whom software information technology is the differentiator, do you really want to be renting all of that in perpetuity?
Because, you know, famously, Jeff Bezos: your margin is my opportunity.
And I think folks are beginning to realize that cloud computing is this incredibly important innovation, and we want to have that elastic infrastructure, but there are reasons to run on-prem.
And that is the gap that Oxide is seeking to address, with the outsized ambition of really rethinking the entire computer and getting out from underneath
the racked and stacked personal computers and really designing this rack scale machine for
purpose, hardware and software together. So true hardware-software co-design. And I think one of the most persistent themes in my own career is the value of that hardware-software co-design, that we develop our best products that way. We know this from the device that's in our pocket. We know this from the car that we drive.
We know this from, you know, the airplane that we fly.
We know this from all of the products that surround us that are delightful and have changed our lives, have this element of hardware and software that have been co-designed.
But, you know, it's very iconoclastic in Silicon Valley. I mean, God, the number of times I heard, like, oh, we don't do hardware. It's like, oh, okay, so you would not invest in Apple or NVIDIA or AMD, right? You just don't want to participate in that part of the economy, okay. And so, you know, it was definitely an interesting experience, because we had to find investors that shared that vision. Fortunately, we were able to raise money and really start on this very ambitious build. And that took us, you know, three years and change to go actually build it. And, you know,
the bar is very high, because it's not enough to just build it. You need to think about your supply chain. You need to design for manufacturing. You need to actually get FCC compliance, right? And, you know, I learned a whole lot about the importance of FCC compliance and the way that whole process works, because our rack has to have the same emissions as a 2U server. You don't get to, it's not like emissions per gram, you know, or what have you. And so you have to go through all of that, and it's very involved. But we did it.
We pulled it off. We shipped the thing.
We shipped our first racks in June.
And we got them landed in customer data centers and running.
So it's pretty exciting.
And we did our formal launch of the cloud computers a couple weeks ago.
And it's been really exciting to see that.
The reason I think there's a lot of attention on what we've done is because so many technologists have seen this problem. Like we had at Joyent and later Samsung, they are enduring
all of this complexity that feels unnecessary. And they are asking themselves, why? Why do I have USB ports and serial ports and AC cabling and all of
this and BMCs to go deal with and a BMC management network to go deal with? And then a separate
switch for that. And then, oh, by the way, when this whole thing misbehaves, everyone points
fingers at everyone else or they all point fingers at me. I mean, that's the other thing that I, and I'm sure you all have had this experience too, but God There's a tweet that I loved about like I'm dealing with a Dell VMware support issue and I feel like I'm dealing with an issue with my divorced parents.
And it feels that way frequently.
And part of what I realized is there was a long time when I felt when we had some of these problems with Dell in particular, because we're a big Dell customer.
And, you know, we had a bunch of machines in a single data center that kept dying with uncorrectable memory errors.
And there was a long time when I felt like I need to get to the right person in Dell who can help me understand this problem.
And I began to slowly realize there is no person at Dell that's going to help me understand this problem.
Well, it's split across four different companies, maybe.
It is four different companies.
It is so many different layers of software. And the folks that actually understand or have the ability to truly understand that issue technically have got so little exposure to the way these systems are actually deployed that they actually were not able to be useful on the problem.
So this is kind of when that slow realization that like,
yeah, you can stop escalating now
because there's nobody to escalate it to.
Yeah, exactly.
You've got a company's undivided attention.
I mean, you know you're on a Dell call
when it's like they keep adding people to the call,
but no one knows how to solve your problem. So, you know, I guess I'm picking on Dell a little bit, but look, Dell, if you didn't want me to pick on you, you should have solved this problem.
It's not a Dell problem, though.
That's right. It is not a Dell problem. No, you're exactly right. It really isn't. And I think that's the other thing that really should be said.
The people involved earnestly wanted to solve the problem.
Oh, absolutely.
But they couldn't because they didn't actually build the whole system.
And, as you said, it's the four companies. You've got another company that's responsible for the BIOS. You've got another company that's responsible for the BMC, for the software that runs there. And this is software that is running at the absolute heart of the machine.
And so when we started Oxide,
it was really with the vision of
we want to take responsibility for this entire system.
And in order to do that, it really required us to revisit some of these lowest level assumptions.
But Brian, if you're responsible for hardware, software, firmware, and everything in between, this is a major, major endeavor.
Major endeavor.
So you know what's funny is that as we were raising, venture capitalists, God bless them, don't always see the right risks. So VCs in particular, as we were doing that initial raise, were like, look, technically, we know you can build this. And I'm like, then I am not explaining what we're doing very well, because I really, really don't know if we can build this. And because the other thing is, we have also done our own switch, by the way.
Right, so we did our own switch as well. And in fact, the question that I was absolutely terrified of getting on Sand Hill was, what are you doing about the switch?
Nobody asked about the switch.
I'm like, okay, great.
I'm not going to volunteer it because I don't want to.
But there was this idea of like, oh, technically, we know you can do this.
We don't see technical risk here.
And I'm like, you don't understand what we're doing.
This thing is loaded with technical risk.
And in fact, quite arguably, the riskiest thing we did as a company, we did
our own switch. We did our own compute sled. We did our own service processor, getting rid of the
BMC. We did our own system software on all of that. But the riskiest thing that we arguably did
is that we also do not have an AMI BIOS on the system. So we do our own very lowest level platform enablement. We don't use the AMI software. We've got a great partnership with AMD, but we don't use AMD software. When that first x86 instruction executes coming out of the PSP, it's not an AMI instruction. We don't have UEFI in our system. We don't have AGESA in our system.
We actually do that very lowest level of platform enablement and we
boot the entire system. So we turn on the PCIe engines and so on.
Are you talking Hubris now? Is that what you're talking about?
Hubris, yeah. Just for clarification, that's Hubris with a capital H. That is the operating system that we did on the service processor. The problem that Hubris is solving, capital-H Hubris, is replacing that BMC with a service processor that is a microcontroller. One of the problems we've seen with BMCs is that they are running their own general-purpose operating systems on the BMC. So you've got some, like, downrev Linux sitting there. And so, well, it's like, we want to be able to, you know, use OpenBMC and Python or whatever. You're just like, oh my God, no, no, no. Because
then, you know, I actually saw a really interesting presentation from HPE at the Open Source Firmware Conference in 2022, last year, where they describe, with the iLO, one of the problems they have with the iLO: they can't find DDR2 anymore. So they have to actually use DDR3 on the iLO. So now, literally, your BMC has to train its DIMMs. This is the thing that runs before the operating system. It's like, we're training DIMMs so we can train DIMMs. I mean, this is like madness, and you cannot
have, so we've got a much more slimmed down service processor that does not have DRAM.
There is no DDR for our service processor.
We've obviously got DDR4 for the AMD Milan part.
We'll have DDR5 when we're on an SP5 part.
But you don't have to have that on the service processor. When we were looking at an operating system for that microcontroller, we were looking around with the idea of like, let's find kind of the best microcontroller operating system.
And we weren't really finding one that was doing what we really wanted to go do, which was going to allow us total attestation, security, transparency, robustness, a bunch of things that we were looking for.
And we wanted the thing to be small.
We also wanted the thing to be in Rust. So that's why we did Hubris, because we felt it was aptly named for the hubris of doing our own operating system there. And then the debugger for Hubris, of course, is called Humility.
Yeah, right.
And Humility has been interesting because it has ended up being load-bearing in a lot of ways for us. And Hubris can go into such a small footprint that we use it in lots and lots of places. So we run it on an STM32H7 for the service processor, which is a Cortex-M7 class part. It will also run on an M0+ class part. So we also use Hubris on the manufacturing line to program the actual boards, to drop the firmware images onto the boards. And that is running on a system that has got 8K of SRAM and 64K of ROM, which is smaller than the first computer that I had. I mean, I don't care how long your gray beard is, that is a legit small footprint. And the fact that we've got a memory-protected multitasking operating system sitting in 8K of SRAM and 64K of ROM is pretty neat. So that's all on the microcontroller side.
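For a sense of the shape being described, here is a minimal, purely illustrative Rust sketch: the task names and structure are invented for this example and nothing is taken from the actual Hubris source. It shows the kind of statically allocated, heap-free task table a tiny kernel like this is built around; real Hubris adds MPU-enforced memory isolation and message-passing IPC between tasks, which a host-runnable sketch can't show.

```rust
// Purely illustrative, host-runnable model (not Hubris source): a statically
// allocated, heap-free task table driven by a cooperative round-robin loop.

struct Task {
    name: &'static str,
    run: fn(&mut u32), // each task is a function over its own fixed state word
}

fn thermal_loop(state: &mut u32) {
    *state = state.wrapping_add(1); // pretend: read a sensor, adjust a fan
}

fn power_sequencer(state: &mut u32) {
    *state = state.wrapping_add(1); // pretend: step a power-rail state machine
}

// The whole "kernel" configuration is fixed at compile time: no heap, no
// dynamic task creation, just a const table and a fixed state array.
const NTASKS: usize = 2;
const TASKS: [Task; NTASKS] = [
    Task { name: "thermal", run: thermal_loop },
    Task { name: "power", run: power_sequencer },
];

fn main() {
    let mut state = [0u32; NTASKS];
    for _cycle in 0..3 {
        for (i, task) in TASKS.iter().enumerate() {
            (task.run)(&mut state[i]);
            println!("{} ran, state = {}", task.name, state[i]);
        }
    }
}
```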
The software that I'm mentioning that does that lowest-level platform enablement, that is the operating system that runs on the host CPU, the host operating system. That's what we call Helios, which is our illumos derivative that acts as the hypervisor. It also does that earliest, lowest-level platform enablement.
So you've got one operating system that you boot
coming out of the PSP.
And when you've got that kind of one holistic system,
there are a bunch of things you don't need anymore. You don't need ACPI, you don't need UEFI.
So those just don't exist in our system. Our system is not designed to run Windows on the bare metal, or ESX. You run Windows as a guest, and you don't run ESX at all. That host operating system functions as the hypervisor for the control plane.
So yeah, is it a KVM-based hypervisor?
It's not, it's bhyve-based. It is kind of similar to KVM in spirit. You know, both KVM and bhyve are effectively bringing the microprocessor its cup of coffee with respect to virtualization. It's really amazing, right? All of the work the microprocessor does. And part of the reason we liked bhyve: KVM's origins really started with a pure user-level machine model, QEMU, and KVM was effectively the accelerator for QEMU. And there's just a lot of hair on that. QEMU itself, I don't know if you've been into the QEMU source code, but it will burn your eyes. We like bhyve because bhyve kind of post-dates the hardware support for virtualization. But we also wanted to do a new machine model. So we did our own new user-level machine model called Propolis, and that's a de novo Rust implementation.
So, kind of in the vein of, you know, Firecracker from AWS: we took a long look at Firecracker, but it's really not designed to do what we needed to do, which is to run multi-vCPU guests with multiple gigabytes of DRAM and high-performance I/O and so on.
So we did our own machine model there.
But yeah, certainly leveraging the microprocessor support for hardware virtualization.
Right, right, right.
Let's talk a little bit about storage, Bryan.
Yeah, sure.
So it's a ZFS-based system?
Yeah, so what we have done is we really wanted to – well, so first of all, there are a couple things.
One, we wanted to have kind of the right hardware as a building block.
So each compute sled has – and the system is managed holistically by the user. So you get the rack, you plug it in, and the objective was you're provisioning VMs within effectively hours of this rack arriving in your data center.
So in order to be able to do that, it's important that the storage be completely self-configuring.
We also wanted to be sure that we had the ability to have
a robust, reliable, distributed storage service built into the rack. So each compute sled has got
10 U.2 NVMe drives. And on top of that pool of storage, we have a robust storage service that we've developed that we call Crucible, also again in Rust, that allows for a volume to be stored in triplicate.
So this is like an EBS-like storage service using three-way mirroring.
And then, of course, as you all have done storage and you know that the challenge with storage is not storing the bits and reading them back.
It is dealing with all of the issues that come up in that path.
And, you know, the ability to re-silver data and the ability for that system to operate in a degraded state and catch up and so on.
So the architecture that we've got for that service is we create single ZFS file systems
on top of each of those U.2 drives.
The reason we use ZFS in that,
we don't just use the raw device,
is that ZFS gives us a lot of capabilities
that we really like from a snapshot perspective.
I like having file systems as it turns out,
but then also ZFS does end to end checksumming,
which is really, really important.
And device-level checksumming is really fraught with peril. So you've got a ZFS file system on each of those 10 drives per sled, 32 sleds in the rack.
And then the storage service lives on top of that, mirroring three ways.
So your data is going to hit three different ZFS file systems effectively.
And that's an architecture that has allowed us to deliver high performance while still delivering total absolute robustness.
And really the robustness obviously is the constraint
and performance is the objective.
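To make the write path just described concrete, here is a minimal sketch in Rust; the types and function names are invented for illustration and are not Crucible's actual API. The idea is simply that a volume write is fanned out to three replicas, each standing in for a region on a different sled's ZFS dataset, and is only acknowledged once all three have accepted it.

```rust
use std::collections::HashMap;

// Illustrative model only: each "replica" stands in for a region that the
// storage service places on a ZFS dataset on a different sled's U.2 NVMe drive.
struct Replica {
    name: &'static str,
    blocks: HashMap<u64, Vec<u8>>, // block number -> block contents
}

impl Replica {
    fn write(&mut self, lba: u64, data: &[u8]) -> Result<(), String> {
        self.blocks.insert(lba, data.to_vec());
        Ok(())
    }
    fn read(&self, lba: u64) -> Option<&Vec<u8>> {
        self.blocks.get(&lba)
    }
}

// A volume mirrored three ways: a write is only acknowledged to the guest
// once every replica has accepted it.
struct MirroredVolume {
    replicas: [Replica; 3],
}

impl MirroredVolume {
    fn write(&mut self, lba: u64, data: &[u8]) -> Result<(), String> {
        for r in self.replicas.iter_mut() {
            r.write(lba, data)
                .map_err(|e| format!("replica {} failed: {}", r.name, e))?;
        }
        Ok(())
    }

    // Read from the first replica that has the block; a real implementation
    // would also verify checksums and repair ("resilver") lagging replicas.
    fn read(&self, lba: u64) -> Option<Vec<u8>> {
        self.replicas.iter().find_map(|r| r.read(lba).cloned())
    }
}

fn main() {
    let mk = |name| Replica { name, blocks: HashMap::new() };
    let mut vol = MirroredVolume {
        replicas: [mk("sled-a"), mk("sled-b"), mk("sled-c")],
    };

    vol.write(0, b"hello, oxide").expect("acknowledged by all three replicas");
    assert_eq!(vol.read(0).as_deref(), Some(&b"hello, oxide"[..]));
    println!("block 0 = {}", String::from_utf8_lossy(&vol.read(0).unwrap()));
}
```

A real implementation also has to handle partial failures, degraded operation, and catching a lagging replica back up, which is exactly the hard part alluded to above.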
We will in time also have ephemeral storage kind of offerings, where an instance can have an ephemeral volume that will be effectively provisioned onto that same compute sled, but won't have any of the guarantees.
So this is to allow for software that is doing that replication at a higher layer.
And there you don't have the constraints of having to go to three different machines.
You're not going over the network.
So you get some performance back, but then obviously you are now responsible for the robustness of that.
So just so I'm clear.
So a VM that runs on one of these sleds, its volume storage is not necessarily allocated to that sled's SSDs?
Yep, that's right.
Its storage is, and this is frankly much more like when you go to provision in the public cloud.
Yeah, who knows where it's going.
That volume exists in the cloud, right?
Yeah, yeah, yeah.
And so you've got this serious backplane networking, the top-of-rack switch and stuff like that. Between racks, is it just a normal Ethernet cabling kind of thing?
Yeah, so we've got QSFP ports coming out the front that will go to, you know, your networking core. So yeah, it is Ethernet coming out the front.
We have definitely learned a lot.
I mean, I guess this makes sense
and I shouldn't have been surprised,
but you would think that like a QSFP module
is just kind of like an industry standard component.
And like, I don't know, it's like an Ethernet jack.
It's like, no, as it turns out.
And it's like, no, no, sorry. If you're
going to have an Arista switch, you need to have an Arista certified QSFP module. And so, you know,
it feels a little bit like a racket. Can't really tell how much of this is, you know,
quality assurance versus like revenue assurance. I'm not sure which of those,
but so you needed the right QSFP modules, right? It's actually been vindicating of a bunch of our other decisions, because one of the problems that we had when we were developing software, elastic software to live on commodity hardware, is that the validation problem becomes unsolvable at some level, right? Because you've afforded folks so much flexibility and freedom that it becomes very hard to validate configurations, especially when things operate at cross purposes.
And this is part of like,
when vendors are operating at,
pointing fingers at one another,
it's like, they're not necessarily wrong at some level, right?
It's like, well, it actually is kind of, and often like they're both right. And like,
you know, you are each pointing fingers at the other and you're both right. It's both of your faults. But part of the challenge is that they don't necessarily, it just becomes impossible
to kind of do that integration. So the, and part of the reason I talk about the QSFP
modules, because with the QSFP modules, we have had to do that much more traditional, like I,
you know, we need Arista certified, Cisco certified, you know, we need the Intel QSFP modules,
all of these QSFP modules, and then we need to verify that they all interoperate. And it's a
real challenge. I'm like, boy, thank God we are only doing this here
and not at literally every layer of the stack
because it is so tough to go validate it.
And because part of validating it,
it's not, I mean, validation is easy when it works.
It's when it doesn't work,
you're gonna be like, okay, let's go figure this out.
And QSFP modules, like every element
in the computing stack, are wildly complicated.
And there's a whole lot of sophistication going on there.
But that's kind of how you connect to your broader network.
We do have the capacity for the racks to directly connect to one another. One of the kind of neat things is that because we've got these 32 ports on the front
and you're very unlikely to need all,
you're not going to need all those 32 ports
to connect into a broader network.
You can use some of those ports
to connect to another oxide rack.
So we've got a model that will allow someone to grow to,
I mean, I think in principle, you could grow to a larger number of racks.
But we're thinking on the order of four to eight racks without really requiring new dedicated core networking, which is kind of neat.
But that network is the interface.
The QSFP modules are the interface.
Yeah, yeah, yeah.
So, I mean, if you're competing against the cloud
and all these kinds of things,
I mean, they've got so much marketplace services
that make them very interesting
to most development organizations and stuff like that.
How do you feel that's going to develop with an oxide-based cloud, let's call it?
Yeah, so we actually don't really view ourselves as competing against the cloud for whatever it's
worth. We are really competing against the extant on-prem infrastructure. We are competing against Dell, HPE, VMware, Cisco, Arista.
We are competing against ODM Direct.
We are competing against the Supermicro.
That's who we are competing against.
It's even worse.
Well, maybe not, but I mean, at least the cloud
is only like four or five of these guys.
I mean, you're talking about every compute vendor
in the world. Yeah. Yeah. I mean, we are, but like part of the opportunity for Oxide is that,
that each of them only is only delivering a single slice and then they're trying to monetize it.
You know, the, it's funny because one of the things that like, we don't even think about
just because of like, of course the software is all built in. And of course, like there's no separate licensing
for the software. So, you know, one of the, one of the early reactions from customers is like,
oh my, thank God there's no license manager. It's like, no, of course there's no license manager.
But it's like, because we've had, you know, customers walk us through the licensing, and it's like, it ain't simple. And you've got a bunch of different,
you know, at every kind of new kind of software enablement,
you need something else.
And you need to, you know, it's a PO or a license.
It's like, it's, so yeah, on the one hand,
we're competing against a bunch of different folks.
On the other hand, their numbers are not necessarily
a strength from an end user's perspective because
it's making it really hard for them to actually deliver a service.
And then when you think about it too, you're not just competing against the hardware vendors but also the software vendors, like a VMware.
Yes, yeah.
And I have doubled your competition in one fell swoop.
Yeah, you have. I mean, yeah, again, you have. But, you know, people are pretty frustrated with the state of affairs.
And, you know, I've not encountered anybody who's like, God, you know what I really love is that I get to go to seven different companies and I get to integrate it.
Like that's what I love.
Said no one ever, right?
Yeah, exactly.
Said no one ever.
And I think what folks want is a robust offering internally for their cloud, and they're not seeing a way to do that right now. And, you know, there've been various efforts, and they've all been a real challenge.
And so I think, you know, one of the things that,
just to get back to, sorry to your initial question,
but like AWS or public cloud services,
we do want to allow for a platform
for those kinds of services,
but that starts with an EC2-like service, an EBS-like service, an ELB-like service, a VPC-like service, which is what you find in the Oxide rack.
And we believe that those services around elastic compute, elastic storage, elastic networking, these are the services that we should expect in a modern computer. What we are doing at the most
fundamental level is changing the definition of the server-side computer. What does a computer
look like? What should we expect to be included? And, you know, AMD and other microprocessor
vendors have done a terrific job of changing our expectations about
what we get out of the socket. I mean, there's so much now that's been pulled on package and that
so much that we don't have to deal with anymore because it's been pulled on package. We get that
when we buy that AMD Milan or that AMD Genoa. We believe that the computer, the rack scale computer
deserves that same treatment and that you should expect that when you buy a computer, it knows how to provide these elastic services, that you're not having to provide that elsewhere because it's part of your expectation about what the machine includes.
I agree with all that, but, you know, I mean, the challenge with on-prem computing these days is they're competing against this cloud, which has all this capability, all these services, and all these functionality that people can just tap with a credit card.
And so you want to do the on-prem service, which means it's going to be cheaper and it's going to be more reliable and more latency sensitive, et cetera, et cetera.
But without those services there, I mean, arguably VMware is successful because it's got all this other software surrounding it.
Well, and so, yeah, that's why, I mean, part of the way we really judge the product is what is the latency from hardware arriving to provisioning VMs?
And because provisioning a VM requires provisioning network, provisioning storage,
it requires all that. It requires that metaphorical credit card swipe. And the latency today, in the existing on-prem world, is pretty bad because of all that integration. Weeks, months.
Right, right.
And, you know, it starts from the very, very beginning. Just to get really physical about it, you've got to de-box this thing, right? You're going to rack and stack, so the first thing you're going to be doing is, like, someone is doing a lot of de-boxing of these 1Us and 2Us or whatever you're going to use, cabling...
Oh, shit.
And, you know, there have been some really important moments for Oxide that were kind of key validation moments, where we all kind of held our breath. And we obviously designed the system to be able to pass these milestones.
But there are a couple of them.
One is, well, one actually, just as a brief aside,
one of the gutsier things we've done
is we have eliminated the front side cabling.
So in a hyperscale system,
you are going to blind-mate power in the back, and you're going to have network cabling out the front of the cold aisle.
And as we were designing the system, one, and kind of talking to folks, surveying the state
of the art, some of the connectivity folks were like, out of curiosity, why are you not
blindmating the networking as well? And the hyperscalers don't do it that way.
So I was like, I didn't think that was possible.
I mean, is that possible?
Can you get it reliable?
It's like, oh, that's definitely possible.
And in fact, the hyperscalers will tell you, this is the connectivity vendor telling us this, the hyperscalers will tell you
that's a better way to architect it,
but they can't take a clean sheet of paper.
So we're like, okay, that's exciting.
So that's what we did. We blind-mated networking, and these compute sleds slide in without any cabling. When they slide in and lock in, they are blind-mating at once onto DC power, onto the high-speed networking, and onto the management network, and that all happens when it slots in. It attests to the other compute sleds and is immediately available
to provision VMs. So the validation of that was a big hold your breath moment.
Another big hold-your-breath moment that is mundane but very important for this point of the latency from rack arriving to provisioning VMs is, again, super mundane: can we safely ship with the sleds in the rack?
So can we design, because that requires you to engineer a crate, actually, to make sure that it can absorb shock and so on, and that you are not endangering those connectors and
that you can actually ship with this thing with the connectors and the sleds in place.
And the reason that was so important is like, you got to get rid of these other boxes.
You can't have the rack come in one box and the sleds come in 32 other boxes, right?
So, uh, Bryan, just so you know, I worked at Storage Technology before the Sun acquisition, and we worked on the Iceberg project.
And we had a crated storage system, effectively.
And a couple of these guys fell off the back of the trucks and stuff like that.
And they came back.
They were sort of bent, but they continued to work, of course.
But yeah, yeah, yeah, yeah.
You've got a major challenge to try to.
And in your case, it's a full rack, right?
It's a serious size.
It is a full rack.
Yeah, definitely.
Actually, I didn't realize you worked at Iceberg.
Total shout out to Iceberg.
That's awesome.
I was definitely a system.
I grew up in Colorado.
Oh, great, yeah.
Followed StorageTek from a young age.
I have now passed that on. My kids are now Broncos fans. That's what I've passed on.
Good for you.
So, yeah, well, they've not even been to Denver,
and they're still as disappointed on Sunday as I am.
So, no, exactly.
That crate is its own feat of engineering. But fortunately we're blessed with an absolutely terrific operations team and, you know, have folks who understood that element of the problem.
It was funny because we had a,
we were doing these iterations of the crate design and I'm like the first
crate. I'm like, wow, this is like amazing. And Kirsten,
who was leading up this particular effort on the ops team,
was like, no, no, no.
This one, we've got a lot of work we can go do on this one.
And then the second iteration came like, wow, this one is, okay,
this one is really amazing.
She's like, no, no, we've got some improvements we're going to make.
And we, you know,
I just was not really appreciating how important packaging is and how much
you can go do in packaging to make it robust.
But that's all really important for that,
that in order to be able to deliver that value of rack comes in,
rack is plugged in,
rack is powered on and VMs are provisioned.
That's an important element of it.
You also need.
So Bryan,
how long does it take from let's say a rack arriving on dock to provisioning VMs on something like that?
Literal hours.
In fact, we can get that down actually to – I think we can get that down to actually under an hour, but it is literally hours.
God.
That's amazing. And we have spent, that is, it's like a very concrete embodiment of all of the design decisions that we have made.
And so, because I mean, it's like, it's hard to bootstrap a network, right?
You've got to have, we have technician ports out the front of the switch.
And so you connect into the technician port. It's funny, we've got the install software for the rack, and the folks that developed that weren't really coming from the on-prem world. And in particular, one of the engineers, prior to coming to Oxide, was just like, I don't know, I'm just developing some awesome software to install this thing. I'm not really thinking about what the state of the art is. I assume that everyone else has something similar. Like, no, no, we don't. And so the install experience
is just eye-poppingly good um where, I mean, you've got this constraint, right? Like you're not going to a browser-based config
when you are direct connected into this.
So the way this works is you actually SSH in
to the technician port,
and you kind of do a very minimal configuration.
And then you're in a captive terminal install experience that is a gorgeous experience.
And it's super tight and clean and debuggable if it goes wrong, right? So it's been really,
really fun to deliver these kinds of pieces that- Now I'm flashing back to IBM 360, right?
No, and actually, it's always amusing when you kind of watch the Hacker News crowd.
You know, whenever you have a lot of attention on something you've done, that is both gratifying and enraging.
And one of the comments that we thought was particularly funny is someone called it a mainframe for Zoomers.
Which, you know, it's like not all wrong, right?
I mean, those mainframe systems, there are certain aspects that are valuable, the reason that they have persisted.
Now, we are not, it is not a mainframe at all.
And I've got lots of, there are lots of problems with those systems too. I'll tell you,
like a big problem that I have with not just mainframes, but also with other tightly integrated
hardware software systems. So like phones, I mean, right, your iPhone, right, is that those systems
are well integrated, but also completely opaque. So one of the things that's been very important
to us is everything we've developed is open source. And we have been very transparent about this entire system top to bottom.
So you don't have to wonder, and this is one of the frustrations that I had when a customer of
Dell, HP, Supermicro is the complete opacity into what software was actually being delivered to me. Taking a BIOS upgrade: what's in this? It's like, well, you know, you need to take this BIOS upgrade. You're running strings on the binary, and there are a bunch of URLs in your BIOS. It's like, why are there URLs in this software, out of curiosity?
Trust us. You trust us, yeah.
It's like, okay, so clearly this thing is hitting some service. It's like, who secured that service, out of curiosity? My trusted compute base has now expanded to your website. I mean, that seems crazy. And it's been really important to us to deliver this really high-quality integrated experience, but in a way that's completely transparent. So folks can see
exactly what we've done and why. And we're not trying to tell you that we are someplace where
we aren't, or we've done something that we haven't done. And I think we're-
Okay. I only have like 15 more minutes or so. We're actually over time.
I think it's important to talk about the way you've come up with Root of Trust.
Yeah.
Because I think that's pretty unusual in this environment.
I mean, everybody else has kind of tried to do it, but not necessarily done well with it.
Yeah.
So we've got a true Root of Trust.
And we run Hubris on the root of trust as well. We're using the NXP LPC55 as our root of trust. And one of the challenges that you have is how do you attest to,
how do you have, you're going to have some module that needs to be signed, verified, and so on, and then it needs to attest to some secure hardware element that can't be physically tampered with, and then it needs to attest to the next thing in the chain, which needs to attest to the next thing in the chain, right? And how do you do that? And part of the challenge is, you know, this kind of idea of the TPM: the TPM is kind of like the security guard that everyone's kind of walked past.
It's like, well, yes, if you ask the security guard, they can ask you a question, but you just ignore them.
It's like, you know, just ignore them.
And, you know, there have been other roots of trust, Microsoft Cerberus, for example, Titan as well from Google, that attempt to solve this for the server-side computer by interposing on SPI. So there's going to be a SPI payload that is going to be retrieved as part of booting the host CPU. Let's interpose on SPI. And SPI interposition is a real challenge. SPI is too fast to really interpose on in software, and yet not really fast by any modern standard. And then with SPI, sorry to get kind of deep in the weeds here, but unlike with I2C, or now eSPI, you have to provide data on the clock. There's no way to say, wait a minute, let me clock stretch, let me get this data for you, let me buy myself some more time to attest this payload. You can't do that. So we were not hugely in love with SPI interposition.
It was going to, there were going to be a lot of problems. So what we did instead,
which I think is pretty neat. So there is, when you are developing on a microcontroller,
one of the things that ARM has done really well
over the years is develop this very rich facility
for debugging a target microcontroller.
This thing is called Serial Wire Debug, SWD. And you might know where this is going.
Okay, yeah, yeah, yeah.
So, yeah. Okay.
So we ran the SWD line from the service processor to the LPC55, to the root of trust. And that allows the root of trust to act as a debugger of the service processor, which means you can control heaven and earth about what that SP does. You can control what it executes. So in particular, what this means is when the root of trust comes out of reset, it can hold the SP in reset. It can then pull the SP out of reset, but direct it to an instruction of its choosing, namely code that it injects to verify the SP's payload. And there's no way for the SP to get out from underneath that. The service processor can then attest its payload back to the root of trust.
Yep, looks good.
Now I can let you run completely.
And the service processor now is running trusted software
and it can now attest to the bundle
that we're actually going to run on a host CPU.
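To lay out the sequence just described, here is a small illustrative model in Rust; the names are invented and a toy FNV-1a hash stands in for real measurement and signature verification, so this is only the shape of the chain, not Oxide's implementation: the root of trust holds the SP in reset, verifies the SP's payload before releasing it, and the attested SP then verifies the host OS bundle.

```rust
// Illustrative model of the attestation chain described above; all names are
// invented and the "measurement" is a toy FNV-1a hash, not real cryptography.

fn measure(payload: &[u8]) -> u64 {
    // FNV-1a: stands in for hashing/verifying a firmware image.
    payload.iter().fold(0xcbf29ce484222325u64, |h, b| {
        (h ^ *b as u64).wrapping_mul(0x100000001b3)
    })
}

struct RootOfTrust {
    expected_sp_measurement: u64,
}

struct ServiceProcessor {
    payload: Vec<u8>, // the SP firmware image (e.g. a Hubris build)
    held_in_reset: bool,
}

impl RootOfTrust {
    // The RoT comes out of reset first, holds the SP in reset via the debug
    // (SWD) interface, measures the SP's payload, and only releases the SP
    // if the measurement matches what it expects.
    fn release_sp(&self, sp: &mut ServiceProcessor) -> Result<(), String> {
        assert!(sp.held_in_reset);
        let m = measure(&sp.payload);
        if m != self.expected_sp_measurement {
            return Err(format!("SP payload measurement mismatch: {:#x}", m));
        }
        sp.held_in_reset = false; // SP now runs only attested software
        Ok(())
    }
}

impl ServiceProcessor {
    // The now-trusted SP extends the chain: it measures the host OS bundle
    // before the host CPU is allowed to boot it.
    fn attest_host_bundle(&self, bundle: &[u8], expected: u64) -> Result<(), String> {
        assert!(!self.held_in_reset, "SP must be released by the RoT first");
        if measure(bundle) == expected {
            Ok(())
        } else {
            Err("host bundle rejected".into())
        }
    }
}

fn main() {
    let sp_image = b"sp firmware image".to_vec();
    let host_bundle = b"host os bundle".to_vec();

    let rot = RootOfTrust { expected_sp_measurement: measure(&sp_image) };
    let mut sp = ServiceProcessor { payload: sp_image, held_in_reset: true };

    rot.release_sp(&mut sp).expect("SP attested and released");
    sp.attest_host_bundle(&host_bundle, measure(&host_bundle))
        .expect("host bundle attested");
    println!("chain of trust established: RoT -> SP -> host OS");
}
```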
Yeah, yeah.
The other thing that I think is of extreme importance is your use of Rust as a source language, as a code compiler, I guess. I don't know.
Yeah, all of our software's been in Rust,
and that was actually, I mean, in many ways,
it was like the very first design decision we made. The name of the company is very much a tip of the hat to Rust.
Oxide. Okay, I didn't get that till now. Okay.
Everything, you know, you get the big reveal at the end. Rust, you know, I had been a C programmer my entire career. I had ventured up stack into Node.js, and things had not gone really well. And so, you know, I was kind of contemplating what was next. I was really looking around for
what am I going to spend the back half of my career implementing in?
And I was beginning,
I was really disappointed with everything that was out there.
I was disappointed with Go, disappointed with Java.
I didn't want to have garbage collection for a bunch of reasons.
And I started experimenting with Rust in 2018.
And it's like, wow, this is really pretty amazing.
This is the power of C and of being able
to completely control my system. But getting that memory safety and that integer safety and that
robustness, I just would not have guessed a decade ago that a new programming language would have
such an outsized impact on my own career. I kind of resigned myself to being in C for the rest of my career. But it's been extraordinary. So yeah,
and we've used Rust. I mean, it has been remarkable. We didn't want to use Rust by
fiat. We wanted to make sure it was the right tool for the job, obviously. But it's been the right
tool for essentially every job we've done, which is really remarkable.
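As a generic illustration of the memory-safety and integer-safety point (standard Rust behavior, not Oxide code): arithmetic can be made explicitly checked rather than silently wrapping, and the borrow checker rejects dangling references at compile time.

```rust
fn main() {
    // Integer safety: overflow is an explicit, checkable condition rather
    // than silent wraparound in release builds.
    let free_blocks: u32 = u32::MAX;
    match free_blocks.checked_add(1) {
        Some(n) => println!("free blocks: {}", n),
        None => println!("overflow detected and handled"),
    }

    // Memory safety: the borrow checker rejects dangling references at
    // compile time. The commented-out lines below do not compile:
    //
    //     let dangling;
    //     {
    //         let sled = String::from("sled-17");
    //         dangling = &sled; // error[E0597]: `sled` does not live long enough
    //     }
    //     println!("{}", dangling);
    //
    // In C the analogous code compiles and reads freed memory.
    let sled = String::from("sled-17");
    let serial: &str = &sled; // a borrow whose lifetime the compiler verifies
    println!("sled serial: {}", serial);
}
```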
We've used it at the very, you know, 8K of SRAM, 64K of ROM, the service processor.
We've used it in the host operating system.
We've got Rust components that are in kernel.
We've been using it for everything at user level.
And then all the way up to distributed systems, and going up to, you know, the web console that allows you to provision, and there it's TypeScript at that very edge to run in the browser. But basically everything in between is in Rust. And that's been really, really important for us.
Yeah, yeah. It'll pay dividends long-term. All right. Well, this has been great.
We can go on like this for another couple of hours, but I think we have to stop at some point. Jason, any last questions for Bryan?
No, it's good to see the progress. I've been tracking you guys for quite some time and
congratulations on kind of the first shipping and the release of the product. It's
really exciting what you guys are doing. Very ambitious. Basically the software and hardware, this is definitely no light undertaking. And, you know, the companies that have done this level of integration have been exceptionally successful. We mentioned IBM before; when you're talking Z series mainframes, that's, you know, the perfect marriage of hardware and software, and it being this symbiotic relationship that creates really a system. And what I like is that you're using it to solve actual customer problems, right? And, you know, Apple, yet another one, Sun, you know, there's a lot of examples, and the companies that have done it have been highly successful.
So I wish you guys all the best.
Yeah, thank you so much.
It's exciting.
Bryan, is there anything you'd like to say to our listening audience before we close?
Yeah.
Well, first of all, Jason, I can't resist on your point, just because when we were doing that initial raise in 2019, a VC firm did ask us, what is the closest historical analog to Oxide? And, you know, a wiser person would have reflected on that for half a beat, but my mouth was just immediately in gear and blurted out the AS400. And they're like, what? And I'm like, this is one of those moments where the brain is like, did you just say AS400? It's like, oh yeah, let's not say iPhone, pal. Let's say AS400.
It's like, okay, let's give them like an IBM history lesson. This will be really interesting.
But just to your point about the Z series too, right? That kind of, that fully integrated hardware and software. And we're very much students of history
and have kind of lionized those systems.
And how can we deliver the advantages of those systems,
but in modernity and with respect to cloud computing?
And I mean, just do what the AS400, frankly, did for databases.
Do that for cloud computing.
Where the AS400 really democratized the use of industrial compute by being in a form factor that can get in a lot more places.
I don't know that we're going to run the oxide rack in car dealerships, but we definitely see a lot of that same kind of opportunity.
In terms of other sources of information, I know you say we could go on for hours.
We literally could.
There's so much that we've done here.
And I would really,
we love your podcast.
We've actually got our own,
Oxide and Friends,
which has been a lot of fun.
And I would really encourage folks
to check out the back catalog there.
I mean, you, dear listener,
are obviously a podcast listener,
but check out the back catalog
of Oxide and Friends,
where one of the things that has been exciting for me, and I'm a kind of an oversharer to begin with,
so this is very natural for me, but it's been really fun to share some of these things that
companies don't typically share. So when we did board bring up, companies don't talk about what happens in bring up. It's like,
you know, you're in like the labor and delivery room, right? Where
gory things happen in bring up that nobody wants to talk about.
And nobody wants to look at again.
Nobody wants to look at it. It's super fascinating. And the reason companies don't talk about it is they don't want to actually imply
the true terror of bring-up, because if you can't bring it up successfully, you don't have a product. And to me, that terror is actually pretty exciting and interesting. So we did a couple of episodes on our tales from the
bring-up lab, where we talk about everything that went wrong in bring-up and all the challenges. When we did compliance, we have the Oxide and the Chamber of Mysteries episode, where we talk about going into the chamber. And I mean, if people rarely talk about bring-up, bring-up you can at least get engineers
to talk about over beers. Compliance is like, no, no, no. What happens in compliance stays in compliance. The truly horrific things happen in compliance. And we wanted to talk about
that. So, to the best of my knowledge, and I would love to be proven wrong because I would love to
listen to other companies' experiences about this. We are the only company to ever talk about,
on the record, about what we did for compliance, for bring up. And I think from the perspective of,
you know, folks that are, you know, listeners of this podcast, you've deployed all this
information infrastructure. You're going to find this stuff super fascinating. You're going to
understand, I think a lot better, some of the problems that your vendors have not been able
to resolve. And we're excited to take people along for the journey.
So definitely check out Oxide and Friends.
You'll find it wherever you can find podcasts.
And we also do it, we record it as a live Discord
so people can kind of join in as well.
I'm impressed.
Yeah, and encourage people to join us.
All right, all right.
Well, this has been great, Bryan.
Thanks very much for being on our show today.
Thank you so much for having me. It's a lot of fun and really love the content you're putting out there and the demographic you're after.
You are my people and vice versa. So exciting stuff.
That's it for now. Bye, Bryan. And bye, Jason.
Bye, Ray.
And until next time.
Until next time.
Next time, we will talk to another system storage technology person.
Any questions you want us to ask, please let us know.
And if you enjoy our podcast, tell your friends about it.
Please review us on Apple Podcasts, Google Play, and Spotify, as this will help get the word out.