The Changelog: Software Development, Open Source - Voices of Oxide (Interview)

Starting point is 00:00:00 Well, friends, it is your favorite podcast, the Changelog. Yes, Jared and I have a special episode for you. Jared and I and team went to Emeryville, California at the invitation of Oxide. We went to Oxide's HQ. They have an annual internal conference called OxCon. And they invited us out to celebrate and to peel back the layers to have a good look at the inside of Oxide. So they recently raised a Series B round of $100 million, and they also just signed a purchase order, a massive purchase order that is truly helping them cross the chasm.

Starting point is 00:00:41 Now, we also have three awesome conversations for you. First up is Cliff Biffle. He's in charge of all things Hubris. Hubris is Oxide's operating system, and Cliff is also in charge of pretty much everything that happens before the CPU powers on. Next up is Dave Pacheco. He's in charge of all things update. Update is the system by which they update the system. It's really important. And last up is Ben Leonard. Ben is in charge of all things design and brand for Oxide. And if you like how

Starting point is 00:01:08 Oxide looks, I do. Well, that's Ben. A massive thank you to our friends and our partners over at Fly. Check them out at fly.io. All right, let's do this. What's up, friends? I'm here with Kyle Galbraith, co-founder and CEO of Depot. Depot is the only build platform looking to make your builds as fast as possible. But Kyle, this is an issue because GitHub Actions is the number one CI provider out there. But not everyone's a fan. Explain that.

Starting point is 00:01:45 I think when you're thinking about GitHub Actions, it's really quite jarring how you can have such a wildly popular CI provider, and yet it's lacking some of the basic functionality or tools that you need to actually be able to debug your builds or deployments. And so back in June, we essentially took a stab at that problem in particular with Depot's GitHub Actions runners. What we've observed over time is that GitHub Actions, when it comes to actually debugging a build, is pretty much useless. The job logs in the GitHub Actions UI are pretty much where your dreams go to die. Like, they're collapsed by default. They have no resource metrics. When jobs fail, you're essentially left playing detective,

Starting point is 00:02:28 like clicking each little drop down on each step in your job to figure out like, okay, where did this actually go wrong? And so what we set out to do with our own GitHub Actions observability is essentially build a real observability solution around GitHub Actions. Okay, so how does it work? All of the logs by default for a job that runs on a Depot GitHub Actions runner, they're uncollapsed. You can search them.

Starting point is 00:02:50 You can detect if there have been out-of-memory errors. You can see all of the resource contention that was happening on the runner. So you can see your CPU metrics, your memory metrics, not just at the top-level runner level, but all the way down to the individual processes running on the machine. And so for us, this is our take on the first step forward

Starting point is 00:03:10 of actually building a real observability solution around GitHub Actions so that developers have real debugging tools to figure out what's going on in their builds. Okay, friends, you can learn more at depot.dev. Get a free trial, test it out, instantly make your builds faster. So cool. Again, depot.dev. The cliff. I don't know why we started having this stuff. What is it you do here, man?

Starting point is 00:04:04 I'm responsible for parts of the really low-level firmware on the computer, so everything from the machine turning on up through fans and power management and all of that. Basically all the stuff that happens before the thing our customers think of as the computer turns on. Right. That's all, well, me and my colleagues now, but originally it was me. Is that like BIOS stuff? Yeah, well, before then, actually. Before BIOS.

Starting point is 00:04:26 So, like, from... What's before BIOS? What indeed? Okay. So these big AMD and Intel processors, and even the big Arm ones nowadays that you see in something like an iPad, there's actually a lot of work that has to go on to allow them to turn on. They need a bunch of different, like, voltage supplies stable, they need a bunch of clock signals set up and devices set up, and you've got to get their flash ready for them before

Starting point is 00:04:46 they wake up so they can get their code out of it. And so there's almost always one or more small processors in the machine that are responsible for doing all of that dirty work that nobody ever thinks about, because it happens in the moments between, like, when you hit the power button and when the screen wakes up. Yeah. So that's our job. Pretty fast. Like 20 seconds? 15 seconds?

Starting point is 00:05:05 Ours is milliseconds. Okay, millisecond. Yeah. To me as a user, it's like 15 seconds. Sure. They mentioned PCB, so you're building your own boards. Yeah. So you're going first principles on a lot of this stuff.
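[Editor's sketch] The bring-up Cliff describes (stable voltage supplies, then clocks, then flash, all before the big CPU wakes) is essentially an ordered sequence with a check after each step. The Rust below is a minimal illustration of that idea only. The step names, the `Board` trait, `power_on`, and `SimBoard` are all hypothetical and are not taken from Hubris or from Oxide's firmware.

```rust
/// Hypothetical bring-up steps, in the order they must happen.
/// The real rails and signals on any given board will differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Step {
    Rail3V3,
    Rail1V8,
    CoreVoltage,
    Clocks,
    FlashReady,
    ReleaseReset,
}

pub const SEQUENCE: [Step; 6] = [
    Step::Rail3V3,
    Step::Rail1V8,
    Step::CoreVoltage,
    Step::Clocks,
    Step::FlashReady,
    Step::ReleaseReset,
];

/// Hypothetical hardware interface the sequencer drives.
pub trait Board {
    fn enable(&mut self, step: Step);
    fn disable(&mut self, step: Step);
    fn is_good(&self, step: Step) -> bool;
}

/// Enable each step in order and poll until it reports good.
/// On failure, undo everything enabled so far in reverse order
/// and report which step failed.
pub fn power_on<B: Board>(board: &mut B, max_polls: u32) -> Result<(), Step> {
    for (i, &step) in SEQUENCE.iter().enumerate() {
        board.enable(step);
        let good = (0..max_polls).any(|_| board.is_good(step));
        if !good {
            for &done in SEQUENCE[..=i].iter().rev() {
                board.disable(done);
            }
            return Err(step);
        }
    }
    Ok(())
}

/// Simulated board for experimenting: one step can be marked broken.
pub struct SimBoard {
    pub enabled: Vec<Step>,
    pub broken: Option<Step>,
}

impl Board for SimBoard {
    fn enable(&mut self, step: Step) {
        self.enabled.push(step);
    }
    fn disable(&mut self, step: Step) {
        self.enabled.retain(|s| *s != step);
    }
    fn is_good(&self, step: Step) -> bool {
        self.enabled.contains(&step) && self.broken != Some(step)
    }
}
```

Calling `power_on(&mut board, 3)` on a `SimBoard` with `broken: None` walks all six steps. With `broken: Some(Step::Clocks)` it returns `Err(Step::Clocks)` after switching the earlier rails back off.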

Starting point is 00:05:16 Yeah, absolutely. And you're a first principles kind of guy? Yeah. You like that? I think I actually designed the first board we made. Back before we hired actually qualified electrical engineers. Okay. But, uh...

Starting point is 00:05:27 It was an MVP or was it a... It was just for testing out some of our, like, circuits we proposed for the bigger expensive board because it's nice to make cheap things that if they're wrong, you can throw them away. Right, right. You know. And so Oxide has this writing culture.

Starting point is 00:05:38 I assume that you were kind of attracted to that or kind of helped formulate it. No, it was pretty well established when I got here. Okay. But like the fact that we make people write a bunch of stuff during the application process, the materials, and like that whole packet of like, why do you want to work here, what's an interesting problem?

Starting point is 00:05:56 Because it really saves time in the interviews. But, like, for a lot of people that's daunting. For me, it was like, oh, you want me to write a bunch of stuff? I like writing. I can do writing. Nice. Yeah. Was that a process you liked, this request for discussion, right? RFDs. Had you done that before? Is this something that was... Previous companies I'd worked for had a process around design docs that kind of was similar, but not exactly. We are way more writing focused here. And in particular, most of the docs are living. So like, as we learn what we're doing, we go back and fix the doc

Starting point is 00:06:28 that we wrote when we were done to try to better reflect. So that way they can also serve as documentation for the next person that comes through. So it's not perfect, but I honestly couldn't really point to what I would change. Like, it's working pretty well, as long as you've got people that are invested in the process and that are comfortable expressing themselves in writing.

Starting point is 00:06:46 Would you call it document-driven? Like, do you think you start with docs or an idea, not so much of spec, but like, fleshed out first before you do... I actually feel like one of the ways people sometimes have a hard time starting here is if they treat it too much like that. So, like, it's not like you need to write a thousand paragraphs in English before you can write a line of code.

Starting point is 00:07:06 A lot of things here start as a prototype. But then, like, if you want to build consensus, if you want to get other people involved, if you want to try to get feedback, that's when you need to write everything down and share it. Okay. So when you're writing on your blog, this is about Rust, right? Lately, yeah. But back then when they found it.

Starting point is 00:07:22 Since, like, 2015. It's been mostly Rust. So how'd you find Rust and why'd you like it? So I was working in firmware at Google doing high-altitude balloon tracking and communication stuff, and we were using C, and I've been using C since I was a kid, because my dad was a photo in it. But it's really hard to produce correct software that doesn't contain bugs, particularly on a team with different experience levels, working in C. And I was bringing people in at intern level, up through experienced developers, and trying to get everybody working together and productive as a team. And just, like, the problems, with a lot of work and process,

Starting point is 00:07:58 you can manage it, like car companies do this all the time, but it takes them a tremendous amount of overhead. And I just wanted some way out of that. So there were a couple of different alternative languages I was watching at the time, and Rust was the one that matured at about the right time and got enough things right to be worth spending time on. This is 2019, you said?

Starting point is 00:08:18 This would have been 2015 originally. Okay. Way back. So you said recently, but that was a decade ago. Yeah. I'm old. Just want to point that out. All right.

Starting point is 00:08:28 So you're writing Rust blogs. You're writing Rust. Here comes Oxide. You start working on Oxide. You're writing more Rust. I think Oxide was an interesting opportunity because, like, I don't super care about our product. Ooh. Tell me more.

Starting point is 00:08:44 I care that it's good. Like, I want to make a good product. Sure. I'm never going to buy one of these. You're probably not ever going to buy one. I can aspire to buy one. Yeah, you could aspire to buy one. A home lab version, we can set that aside for now.

Starting point is 00:08:56 Yeah, yeah, yeah, yeah. But it's like a fancy McLaren sports car or like a Ferrari, like I'm probably never gonna buy one. Right. So this product isn't for me. So I had to find other ways to really get motivated around it. And the main things are like, this is a team that really wants to try to do things

Starting point is 00:09:14 right from the ground up, which I can get behind. Like that sounds like a hell of a challenge. The team is amazing, like my coworkers are amazing. You should talk to more of them. We will. Good. And so like you have to look for all the other ways to do this. And sort of the how do you build an engineering org from a team that fits around a small restaurant table to this size and be able to bring in people that don't have relevant experience and be able to bring in people that maybe had a career change but are enthusiastic.

Starting point is 00:09:45 And like building a framework in the software but also in the processes and the documentation to support expanding the team like that, that's the thing I got really passionate about. In the firmware world, is it written in Rust, or is it written in C? Our stuff's all Rust. All Rust. So there's no...

Starting point is 00:10:04 There's like one thing in C, right? There's like some operating system you have. It's not Hubris. It's still in C. illumos. Helios is our sort of version of the illumos operating system, which is descended from SunOS. Solaris. That's mostly C.

Starting point is 00:10:18 It's also older, pretty well tested C. It's not new, potentially buggy C, so we think it's a lower risk. Does it change much? It doesn't change that much, although we've obviously had to extend it a bunch, but we've been doing a lot of the extending and changing in Rust. But other than that, like all the stuff on top of that, all the stuff below that is all Rust. That's a good thing. Yeah. We think it's good. I was talking to somebody that said, if they had to not write Rust here, like, let's say Go, for example. All right. They were like, nah. Nah, I can't. I can't do that.

Starting point is 00:10:51 How do you feel about that? I don't love Go specifically, but there's other languages. Not that Go's bad, but like compared to Rust for some of the things you solve. Not for what I'm doing. Go's not in the picture there. No, it's not. I mean, there really aren't a lot of options on the systems I work on, which are like the 50 cent microcontroller that's, you know, inside of your credit card. Like, there's just not a lot of resources, so...

Starting point is 00:11:10 I never thought of that. Yeah. Computer in my pocket. What makes Rust uniquely... There's probably a bunch of computers in your pocket right now, actually. At least four. What makes Rust uniquely positioned for firmware?

Starting point is 00:11:21 So Rust... Or your lower level things. The thing that Rust took from C and the C family is, like, the C family gives you really fine-grained control over what the computer's resources are being used for at any given time. So you have tight control over how much memory is being used. You have tight control over whether memory is used at all or if you try to solve a problem through some other ways.

Starting point is 00:11:43 Or even size. Or size. Yeah. Size of code, size of flash required. Strings, numbers. Wrists or smaller. Yes. Rust replicates that control pretty well.

Starting point is 00:11:53 Languages like Go are less focused on that and don't come out of the box with as much help in that area. Yeah. Do you mess with Go at all? I have a little bit, yeah. How do you feel when you do that? It's okay. I mostly just kind of feel like a foreigner.

Starting point is 00:12:08 Like, it's not really my native territory. I could get more comfortable with it if I needed to. What are you doing with Go when you do play with it? I do some, like, periodically I'll do, like, they have like the Advent of Code exercises or like programming exercises people put out annually, and I try doing them in other languages just to kind of keep my brain stretchy. Yeah.

Starting point is 00:12:26 So I've done stuff like that, but I've never used it in anger, never for anything real. Tell us about Hubris. Yeah, so I tried really hard not to write Hubris. When I got here, they were, there's this other operating system called Tock. It's also in Rust. Okay. Targets the same sort of very low-level, deeply embedded, T-O-C-K, like you'd expect. Like tick-tock.

Starting point is 00:12:47 Yeah. But not like TikTok. Not the platform. The way clocks work. Yeah, the original tick-tock. Yeah. They were trying to use TikTok, or... Sorry.

Starting point is 00:12:57 He fried my brain. So they were trying to use Tock when I got here. And, like, I had some previous experience with Tock. I know the people that wrote it. Okay. So I got in line and tried really hard to make it work for our application, but we just kept hitting areas where their design intent and the things we needed didn't really overlap.

Starting point is 00:13:15 Like, Tock was mostly at the time being written with educational use cases in mind, so they wanted kids to be able... kids, university students, who are adults. Okay, I'm old. To be able to, like, dynamically reload programs on it as they're working. And, like, nice use case, stuff like that. But we really don't want that for security reasons. We want any code that runs on this better be what we shipped when we shipped it. So we put a bunch of work in trying to work around that, and then finally in, like, May of 2020, I think, I wrote an RFD that was like, guys, I think we're going to have to do our own thing, and, like, here's a rough sketch of how it might look. And there were enough people here that had been

Starting point is 00:13:53 involved in operating systems work before that they all kind of, we pressed our heads together and they said, okay, this might work, like, take a week and see how much we can prototype. And we got a thing working, and then it seemed compelling enough that now there's, what, 64 to 70 computers running it inside every rack. All the little service processors you don't think about? They're all running that. So how big or small is Hubris in terms of like line count or whatever? The core kernel is like 1,000 lines of code. But there's a bunch of other stuff you want to make it useful.

Starting point is 00:14:24 But it runs right now on everything from like sub-50-cent microcontrollers that you wouldn't even spot on a printed circuit board because they're just a tiny fleck of silicon up through the big service processor that we use to run the Oxide rack, which is like a... It's basically a computer you would have been really excited to own in 1999 or 2000. But now it's $3 and inside of another chip. And does that job. Yeah.

Starting point is 00:14:48 Are there multiple instances of Hubris on a given full rack? Yeah, so every sled has at least two. There's the service processor that is responsible for basically care and feeding of the big AMD chip. Then there's a root of trust that handles security and crypto. That's a separate copy. And so there's those two on every compute sled. There's two in every switch. There's two in every power supply.

Starting point is 00:15:11 And that's everything in the rack. But then a bunch of our manufacturing tools are also running Hubris. So all the little boards we plug into a thing to program it or, you know, to interpose in an interface for testing and be like, you know, I need to remove this fan and simulate the fan controller. All of our tools for that are all Hubris-based. Why so many copies?

Starting point is 00:15:32 Is that hard to manage? It's like. Yes and no. Multiple updates, different versions potentially even. That's true. It has pluses and minuses. So like the SP versus root of trust split, which is the main source of the many, many copies.

Starting point is 00:15:47 There has to be at least one of these on each board because the board's got to be able to power itself on, because it might be the first one powering on and responsible for turning on all the others. So one makes sense. Why two? And the honest reason why two is we can buy one chip with the features we need to do the service processor. We can buy one chip with the crypto security features we need to do the root of trust.

Starting point is 00:16:07 We can't get them both in one chip right now. Okay. And we can't afford to make our own chips. So when you can, we might have one. We might not. There's advantages to having a thing. Are you alluding to making chips in the future?

Starting point is 00:16:20 I mean, we'll probably have to. Yeah. Would it be a collab with like AMD or an existing? AMD returns our phone calls now. Yeah. It's very exciting. Is that new? Yeah, it's pretty new.

Starting point is 00:16:31 Like, hey, Oxide, we'll take that phone call. That's cool. No, it's nice. Probably not. So we have some FPGAs on the newest generation server board, which are basically... What's FPGA mean? Yeah, so it's a field programmable gate array

Starting point is 00:16:44 is the full nerd expansion. But the purpose of the chip is, it's basically a bunch of, it's a Lego set for integrated circuits. You've got a bunch of generic logic circuits that you can then program to act like another chip, and it's slower and more expensive than making, than the other chip would be. But if you can't afford the million plus dollars to get started making your own chip, this is like a way to fake it, essentially. We have one of those on the next generation server kind of playing around with some things we would do

Starting point is 00:17:14 if we made our own chip. Mm-hmm. And Hubris is open source. Yeah. Does anybody else use Hubris? I've heard from about five other companies that are using it in production. Really? Really?

Starting point is 00:17:24 Yep. And I can't remember which one of them let me say that publicly. Also, I can't remember them anyway right now, but I can check my notes. Sure. Volvo is really interested, but we don't have the certifications that they would need as a car company. But I've been talking to somebody about what that would take. They can contribute, though, right? It's open source.

Starting point is 00:17:44 They could. The thing they would need to contribute, unfortunately, is a bunch of money for consultants to, like, go through the certification process, which they're not excited about. So, that's so fine. That's fine. Well, they probably have, maybe they have more money than you all now? They just got some money. But maybe, I don't know how well. I think so. I bet they do. I have no idea.

Starting point is 00:18:04 It's Volvo. Yeah, no. Car companies, they burn money, don't they burn money? Yeah. There's, like, three other startups I've heard from that are using it in products. That's pretty cool. I'd like to get more people using it, but, like, there's some work we need to do to make it more friendly to people that aren't Oxide. Right.

Starting point is 00:18:19 Because right now, if there's a tradeoff we have to make, and, like, one thing would make us ship faster and the other thing would make it more general for other customers. Yeah. We almost always have to pick the we ship faster option. Right. One of the fun things about being here in the building with everybody is every once in a while, a fan just goes crazy. This is not a person who enjoys, like a literal fan just starts to woo, like a... What's going on when that happens? Yeah, it's a great question.

Starting point is 00:18:45 So, one of two things is going on. Okay. The good thing is somebody just ran something on that machine that's like boosted all the CPUs. It's like when your laptop starts trying to take off. Sure. So something made the CPUs go really fast, everything ramped up, the machine got hot,

Starting point is 00:18:59 fan turns on, cools the machine back down. That's sort of the working as intended. What's probably happening here is that something's crashing. So we have a chip on the board that's a hardware watchdog that, if it doesn't receive regular instructions on what to do with the fans, it assumes the worst, and it ramps the fans up to make sure we don't overheat. To avoid, okay.

Starting point is 00:19:20 This means that if you're doing a firmware update on the service processor, the Hubris-based service processor that's responsible for sending those messages, and it's gone for more than, whatever the watchdog setting is, seconds, then the chip times out, ramps the fans up, and then it wakes up and finishes doing its update, and then the fans go back down. So this is why, I don't know if you saw earlier, but whenever those fans go up, like, I'm like, you're like the watchdog for the watchdog. I get yelled at in chat. But like, one of the computers is crashing.
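[Editor's sketch] The fan watchdog behavior described above can be modeled in a few lines: keep the last commanded fan speed while commands keep arriving, and fall back to full speed once the controller has been silent for longer than the timeout. This is a simplified software model for illustration only. The real device is a hardware chip, and `FanWatchdog`, its fields, and the percentage scale are hypothetical names, not Oxide's interface.

```rust
use std::time::{Duration, Instant};

/// Hypothetical software model of a hardware fan watchdog.
pub struct FanWatchdog {
    timeout: Duration,
    last_command: Instant,
    commanded_pwm: u8,
}

impl FanWatchdog {
    /// Fan duty cycle (percent) used when the controller goes silent.
    pub const FAILSAFE_PWM: u8 = 100;

    /// Start in the fail-safe state until the first command arrives.
    pub fn new(timeout: Duration, now: Instant) -> Self {
        Self {
            timeout,
            last_command: now,
            commanded_pwm: Self::FAILSAFE_PWM,
        }
    }

    /// Called whenever the service processor sends a fan instruction.
    /// Each command also resets the watchdog timer.
    pub fn command(&mut self, pwm_percent: u8, now: Instant) {
        self.commanded_pwm = pwm_percent.min(100);
        self.last_command = now;
    }

    /// The duty cycle actually driven to the fans at time `now`.
    /// If no command arrived within the timeout, assume the worst
    /// and run the fans at full speed.
    pub fn output(&self, now: Instant) -> u8 {
        if now.saturating_duration_since(self.last_command) > self.timeout {
            Self::FAILSAFE_PWM
        } else {
            self.commanded_pwm
        }
    }
}
```

A firmware update that keeps the service processor quiet for longer than the timeout shows up here as `output` jumping to 100, then dropping back as soon as `command` is called again, which matches the fans ramping up and settling down in the office.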

Starting point is 00:19:47 You're one of the people that works out of the office generally. Yeah. And like, I might have wrote the code that's messing up. So you better get out of your seat and go do something. So frequently I'm like, it's not me. It's not me. This one's not me. So as being somebody who's regularly in the office, but most folks aren't,

Starting point is 00:20:03 what does OxCon do for you in terms of your camaraderie with your colleagues or the excitement level? Like, how do you feel about it? I get to find out how tall everybody is. Tall? You can't tell that on the computer. That's true.

Starting point is 00:20:16 Right? Like Aaron? Aaron's like damn near seven feet tall, but he looks normal on the computer. So that's been interesting. Other than that, gosh, I don't think the company would work without this, honestly. I mean, how would you even, it doesn't even feel like you're at a real company. It's like you're watching a TV show of a company. Good point.

Starting point is 00:20:37 Like the office. Like the office or Silicon Valley. Oh gosh. Are you a fan? I worked there, so I couldn't really make it through the show. Like, my boss was personally parodied on the show. I worked there. And he loved it.

Starting point is 00:20:51 Did he? Yeah, so Astro Teller at Google X was the inspiration for the Hooli XYZ guy in the show. And, like, he was, he thought it was hilarious. He had a showing of the episode. With the monkey guy. The guy. Okay. Yeah.

Starting point is 00:21:03 So, he didn't have a monkey, the real guy. Oh, man. That was good. But, yeah, so I was watching the show a little bit. And I'm like, why are you making me watch this? Like, this is my day job. Like, I'm gonna watch Game of Thrones or something. Yeah, I can understand that.

Starting point is 00:21:15 So you've never gotten past season one, or even one episode? No, I've made it like three episodes in. How about now? You feel better about maybe since you're free of that world? I'm good. You're in a different world now? No? I'm good.

Starting point is 00:21:30 Sadness. He lived it. Yeah, I'm all right. I was listening to Brian tell one of his stories, and I was like, oh my gosh, that was literally in the show. Like something he described from the stage today. regarding money, funding, and I was like, that was literally copied from the... I mean, it's real life, but it's there, and so I can understand that. Yeah.

Starting point is 00:21:50 For me, it's entertainment because I haven't worked for Google and I haven't done your life. Yeah. For you, I can imagine how it's PTSD. Yeah, I also don't watch a lot of TV, so I'm kind of picky about what I spend my time on. Right on. Yeah. But, no. So hopefully we don't do anything to get a show made about us.

Starting point is 00:22:08 Or you do, and it's good. Or that could happen. Yeah. That could happen. Like Severance. Yeah, like Severance. I just started watching that. That's a hell of a show.

Starting point is 00:22:15 That's worth your time. One thing I've been thinking about is churn. Is there any churn here at all? We've had people leave. Yeah? Yeah. Without being, you know, TMI, what are some of the reasons? Have they been negative? Or has it just been mutual separation?

Starting point is 00:22:32 I actually really like basically everybody who's left. So, like, for some people, this work environment doesn't work. Honestly, it barely works for me. Like, the all-remote thing, I actually took this job because I turned down two other offers that were fully remote because Oxide at the time wasn't.

Starting point is 00:22:47 Yeah. And this was in February 2020, so you can guess what happened next. Yeah. So, oh, well. Literally next. It's a good thing my coworkers are great,

Starting point is 00:22:54 but, like, the whole remote thing doesn't work for some people, and sometimes you can't ever really get in the swing of things. Yeah. We've had folks where, like, this turns out to set off other, like, past work trauma.

Starting point is 00:23:07 Like, we've all got, like, work PTSD from some shitty former boss. And if things are happening here that are too much like that, you might get freaked out and decide you need to leave. Which I totally respect. We've had folks like Arjen. Arjen joined before me and was involved in bringing me over here. And he just left last year because he's like, I've been here for five years. I feel like I've done all the startup-y stuff I can do here.

Starting point is 00:23:30 I'm going to go do a new startup. Just time. Yeah. So mostly good reasons, not like, this place sucks, I'm out of here. Because there's a lot of upside. You can probably find people who think this place sucks.

Starting point is 00:23:43 Yeah? Yeah. I'm not totally sure who, but I'm pretty sure you could. One person? Probably one or two. Two people. Yeah.

Starting point is 00:23:49 I don't know. Out of 80, that's pretty good odds. It's not too bad. How do you feel about the uniform compensation and stuff? I think it's amazing. It's worked this long. It was part of the reason I joined. Because I came in all like knives out,

Starting point is 00:24:00 like expecting to negotiate. Yeah. And Steve's like, so what we're doing is we give people stock according to this formula and we pay everybody the same amount of dollars. I'm like, well, That saved me a lot of stress. Like, sold.

Starting point is 00:24:11 Yeah. Probably the opposite of your time at Google. Yeah, 100%. Oh, my God. And you've been doing this. My only real concern there is that one of the things I really like doing is bringing people in who either don't have a lot of experience in the industry or, like, are just out of school or just out of some other job.

Starting point is 00:24:28 Right. Mentoring them up. Right. And like, are we comfortable bringing in people that are basically interns and paying them pretty good Bay Area salaries? Maybe we are. Yeah. But like I do want us to be able to bring those people in because that's how we get the next generation of us.

Starting point is 00:24:43 That's a good point. But if we can do that while keeping the compensation uniform or at least fair, that would make me really happy. Yeah. What does it do for your personal ability to show up to not worry about compensation? A lot. As much as you had to before. Oh my God. So like not having to worry that some of my coworkers are getting screwed over by having not like hardball negotiated in their interview.

Starting point is 00:25:04 So I became a manager at Google of a team that I had previously been on, and at that point, at the next promotion cycle, I was able to see everybody's salaries for the first time, and that was how I found out that there was like a hundred thousand dollars a year difference in salaries among people at the same level on my team. Wow. It's a lot of money. Yeah. And it was mostly us guys that had the higher numbers. Yeah. It's kind of crap. So, like, I don't have to worry about that here, which is great. And I feel like people are a little more comfortable talking about both job conditions and also kind of like

Starting point is 00:25:34 financial stress. Like, some people have been really open about, like, you know, my husband lost his job and, like, we've got the new kid, and so it's kind of rough right now. And I feel like people are a little more comfortable sharing that because we all know what each other makes. So it's not like you're going to reveal that, like, oh, wow, you're being really overpaid and now all your colleagues are mad at you, right? So it's got its perks. This may not be accurate, but one thing I thought about was the fact that it seems like you all are owners of the company. Like everyone that works here is a small owner. Right.

Starting point is 00:26:03 You all have equity. Some may have more, some may have less, and the compensation is the same across the board, but what changes is you have a different job. One person has a CEO job. Yeah. Now, that person may have more equity, but that's because they also started the company. It's because they were here early. The compensation is a little different.

Starting point is 00:26:18 The day-to-day, the check, the reason you show up, it seems like an even playing field, and y'all are sort of owners in a way, as much as you can be owners, equitably. I really don't like hierarchy. So, like, when I'm managing people, I view manager as just a job. Like, it doesn't mean that I'm above you or, like, in control of you or better than you. It's that I'm going to do the manager things. You can do the engineering things, and we're both happy. And, like, I'll develop skills here.

Starting point is 00:26:44 You develop skills there. But I kind of feel like we've got the same thing going up through Steve. It's like, Steve is the CEO. He does the CEO work that we don't want to do. Who wants to do that job, right? That job he described on the stage, I was like, wow, that is a hard job. Like, our sales team, they do the sales that I don't want to do. They're good at it, clearly.

Starting point is 00:27:02 They actually make different amounts of money. They're the one corner case, because salespeople, we have an incentive thing where the more they sell, the more money they make, which they were excited about and we don't have a problem with.

Starting point is 00:27:13 So yeah, I kind of, I do think it helps with the sort of sense of, like, we're all in this together. Yeah. Which is good. 100%. Yeah.

Starting point is 00:27:21 Well, we'll let you get back to it. All right. Thanks for chatting with us. It's been awesome. Yeah, my pleasure. Thanks so much. Appreciate it. Thank you for

Starting point is 00:27:33 Well, friends, the news is out. Our friends over at CodeRabbit, coderabbit.ai, they've raised a massive series B, and they've launched their CLI reviews tool. It is now out there. I've been playing with it. It's cool. The bottleneck is not code. The bottleneck is code review.

Starting point is 00:27:52 With so much code happening, so many people coding now, so much code being generated, and so many things competing for developers' time and attention, code review still remains a bottleneck, but not anymore. CodeRabbit, CLI code reviews, code reviews in your pull requests, code reviews in your VS Code, and more. Teams now have a true answer to what it means to code review at scale. Code review at the speed of AI, and CodeRabbit is right there for you. You'll learn more at coderabbit.ai. We'll link up their latest blog announcing their series B and their announcement of their CLI review tool.

Starting point is 00:28:33 Again, coderabbit.ai. Okay, friends, next up is Dave Pacheco talking about Oxide's update. Here we go. What is the update on the update? Well, let's see. We're working on shipping the first version of what we call a self-service update for the Oxide rack. So today, the process for updating the Oxide system, including the control plane and everything, involves a support process where our support engineers are getting on the system through a debug interface, and that works pretty well for a lot of things, and it's very simple, but it doesn't work for a lot of customers to have an Oxide person involved in the whole support, I mean in the whole upgrade operation. And so we want that to be something that they can do through the API, just like they can do all the rest of the infrastructure stuff. But that is kind of a big deal because that means that the control plane is driving the update,

Starting point is 00:29:37 which means the whole control plane is online during the update. And so, you know, people use different metaphors for this. It's replacing all the parts on the car while you're driving down the freeway, replacing all the parts in the plane while it's in the air or whatever. But it's just a lot of work to make that work. How long have you been working on that? I've been working on it for about two years. How long has Oxide been working on that?

Starting point is 00:29:58 We needed an update process for our MVP. And so that started before that and finished before that because we launched before that. But that's the process that we use today in the support-based process. This is like an upgraded rewrite. Kind of, yeah. So the idea behind the first version of update

Starting point is 00:30:14 was what we called the minimum upgradable product, which was Muppdate. And so the idea here is that, you know, it's an MVP. We're shipping it really as fast as we can, but there's a lead time between when you deliver software to the factory and when you actually get it at the customer site.

Starting point is 00:30:29 And that's a period in which we can continue working on the MVP. And so what we needed was for there to be enough in that first thing that we could update it to whatever software we wanted once we got to the customer site. And that became the minimum upgradable product, which is Muppdate. And that's the procedure that we have today. So the priority there was about having a robust support procedure for recovering the software on any one of our compute sleds. And that's something that we knew we needed even separately from update. And so that's why it came first was this idea that like we need the ability to recover a sled, it's one of our compute sleds, no matter what state it was in, and we can use that

Starting point is 00:31:03 to do our initial updates. And that's what we've been doing for the last two years. So in that sense, we've been working on it for a while, but in terms of being able to have the control plane do a more operator-friendly update, that's been about two years. Gotcha. And actually, even that, there are a bunch of building blocks involved in that, like what I've come to call dynamic reconfiguration of the system. So having the ability for any component to come and go while the system is running is kind of a prerequisite for that, but also allowed us to deliver other important things that customers would expect to be there, like the ability to remove a sled, replace a sled, add a new sled to the system. So the first year of the update project was really

Starting point is 00:31:40 building this foundation that we use for these other support procedures as well. Gotcha. What exactly is an update? Describe, is it big? Is it small? Yeah, no, that's a really good question because you think of the control plane is like, it's the control plane, it's just like one big piece of software or something like that. But actually, every single one of our, so we have, in a rack, we've got 32 sleds, we've got two switches, and we've got a couple power shelf controllers. Every one of those has a service processor, a root of trust, and the root of trust has its own software and the bootloader software.

Starting point is 00:32:13 Then on all the 32 sleds, we also have a host OS, which comes in two parts for, you know, historic reasons around bootloaders and stuff. Then we have all the control plane software on top of that, including storage software, which is one per disk. What all this means is that when you update the software in an oxide system, you're updating literally hundreds of components, and we're kind of doing it one at a time, and you're also going through all these intermediate states

Starting point is 00:32:35 where you're running some of the old software and some of the new software. So you're asking, like, what is an upgrade? We're replacing all of the software running on everything in the system, and it's a lot of different things. Simple deal, then. Yeah, and that list I just gave doesn't even include a lot of stuff that for us gets bundled, like CPUs have microcode. NICs have their own firmware.

Starting point is 00:32:52 For the update system, that's simplified because it all becomes part of the host OS, but there's a lot of software in the system, and it's updating all of those things. But the whole idea of what we're doing is that operators don't have to think about any of that stuff. So our release process puts together a giant zip file, it's like two or three gigs of data. You download that from us.

Starting point is 00:33:12 You can look at it if you want. You can validate it, whatever. But then you upload it to the API, and you hit Go, and then the system goes and churns for probably a couple hours up front, and then you come back, and then the whole thing's updated.

Starting point is 00:33:24 So the idea is that the operator is only thinking about this policy. They're not thinking about all those other things that are involved. That's nice as a zip file. I mean, several gigs, though. It's a lot of software, yeah. I mean, even the bandwidth costs on that, do you measure that? Does it matter? As you grow your customer base?

Starting point is 00:33:39 I mean, obviously not. I mean, because you're getting paid lots of money, but you got to worry about those things, right? In the limit? Speed to get it. It's a good question. You know, it's not something we've been focusing on for the most part so far. It's the kind of thing where the customer is

Starting point is 00:33:53 currently going to be responsible for getting that from whatever our download site is, whether that's GitHub to the rack. So that, yeah, so they might get that on their laptop, downloading it from GitHub or something like that. And then they'll upload it to the rack. Nice. Yeah, that part's a release. All right.

Starting point is 00:34:07 That makes sense. And then uploading it to the rack is over their network. And we've kind of been assuming it wouldn't be an issue. That's a fair question, whether they would consider that issue. I just say about file size, I guess most people can download a couple of gigs pretty easily without it being a major problem. Some people can't. It's definitely a lot.

Starting point is 00:34:24 I'm like, I can't. I guess if you're- Most of their customers are not. Yeah, they're probably not having this problem. Yeah. I'm solving different problems. What about air gaps? So some people opt for air gap.

Starting point is 00:34:32 Yeah, that's really important too. And this model works pretty well for air gaps because you're downloading it to, you the customer are downloading it to your laptop and then uploading it to our thing. No connection. The rack definitely doesn't care. It doesn't even know if it's connected to the internet. That's fine. It doesn't care. Right.

Starting point is 00:34:47 You could imagine a nicer experience where the rack was connected to the internet and could see, oh, Oxide's just published a new thing and I'm going to download it. Maybe I only download certain parts that I need or I download it one part at a time. Maybe that helps some of the bandwidth stuff. But actually, it's not something most of our customers are interested in right now. Most of them are actually more interested in the, I'm really not connected to the internet. And I don't want you to, and I'm doing this because I care about my security and my privacy and my data.

Starting point is 00:35:11 I definitely don't want the rack talking to the internet. So that's why we've done it the way we've done it. We talked to Cliff earlier about Hubris. So when you look at Hubris, that's the operating system. You mentioned a couple chips, different devices on there. When the update comes through, is that on top of Hubris, the API of Hubris? Or is that, how are they compared to each other,

Starting point is 00:35:30 this update and Hubris? Yeah, so Hubris is a sort of, it's an operating system. We use it in a couple different components in the service processor and the root of trust on all of these systems, on the sleds, the switches, and the power shelf controllers. So it's one of the things that we update.

Starting point is 00:35:46 It is also true that in order to update everything, we talk to the service, or in order to update much of the system, we end up talking to the service processor, which is talking to that hubris thing. So we end up using the current version of hubris through the service processor to be able to update the service processor itself, the root of trust, the root of trust bootloader, and the host OS as well. All the control plane stuff is on top of that and doesn't go through hubris.

Starting point is 00:36:11 Is this a novel problem that you all invented given your architecture? That's a good question. Yeah, a lot of the details are specific to our architecture and pretty novel. It's the sort of thing that I expect, you know, cloud providers today have their own bespoke software for, and in fact, large deployments of on-prem stuff will have their own bespoke software to do a lot of this stuff. But a lot of it is also stuff that people kind of just don't update. Like, how often do you update your BIOS? How often does a company running on-prem software update their BIOS?

Starting point is 00:36:39 Probably not all the time. But, you know, our model involves like delivering a lot of value through stuff like that, and we do need to be able to update that stuff. I remember back at one of my past jobs, we did have to go update the BIOS on like 64 systems or something like that. And you've been in the BIOS thing, right? You're like clicking through the thing and like how do you do that on 64 systems? And like at the time, this wasn't a productionized thing. But at the time, I didn't know that iTerm or whatever has this mode, which is like send all my keystrokes to all the other panes.

Starting point is 00:37:12 You just taught me that right now. Someone just like opened up 64 panes and was just like enter, enter, tab over. Oh, my gosh. That would be that person. That's a 64-X developer. It's a real problem, right? When you have this software, it's such a low layer that's not really designed to be interacted with by automation. Everyone that's had to do this has had to come up with their own way to do it, basically. Well, one reason why you don't update your BIOS very often is because you've got to reboot your machine.

Starting point is 00:37:36 And I know that ultimately your guys' goal is like no reboot update, right? That's right. That's not what you're working on now, though. That's going to be the next phase, and it's going to be a much smaller part of the problem for us. At least, you know, we expect. It's going to be easier? You never know until you're done.

Starting point is 00:37:48 It's not going to take you two years. Right. So we've done a lot of the pieces involved in that. Most of all the stuff we've been doing so far is like the orchestration, it's foundational stuff. So like our system is based on what's called the plan-execute pattern, which means that before taking any action, the system generates a new intended state of the world, which we call a blueprint. And then it goes and executes that blueprint.

Starting point is 00:38:08 And like all that was really important foundational work for building a system that can be operated autonomously, which is also really important for the air gap thing. Yeah. We can test all kinds of things that can happen with just the planner part without even worrying about the execution stuff. Right. And then we can go test all the execution stuff given whatever plan we want without having to have gotten a system into exactly that state.

Starting point is 00:38:29 And it also lets you do all kinds of things like ask the system, what are you going to do next before it does it? And why are you doing that thing? You know what I mean? So these are really important operational things that we just need to have. That's the kind of stuff that's taken the first two years. So you laid a lot of groundwork. That's right. And so now we're talking like doing what we're calling nondisruptive update. So this is doing updates without rebooting the customer VMs.
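A minimal sketch of the plan-execute split described above, assuming a toy model in which a blueprint is just a map from sled name to intended software version. The names `Blueprint`, `plan`, and `execute` are illustrative assumptions, not Oxide's actual types or API.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified blueprint: the intended state of the world,
/// expressed as the software version each sled should be running.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Blueprint {
    pub target_versions: BTreeMap<String, u32>,
    /// Human-readable reason, so an operator can ask "why are you doing that?"
    pub reason: String,
}

/// Planner: a pure function from observed state and desired version to the
/// next blueprint. It changes at most one sled per step and performs no I/O,
/// so it can be tested without any real system.
pub fn plan(current: &BTreeMap<String, u32>, desired: u32) -> Option<Blueprint> {
    let (sled, old) = current.iter().find(|(_, v)| **v < desired)?;
    let mut target_versions = current.clone();
    target_versions.insert(sled.clone(), desired);
    Some(Blueprint {
        target_versions,
        reason: format!("sled {sled} is at v{old}, target is v{desired}"),
    })
}

/// Executor: applies whatever blueprint it is handed. Here "applying" only
/// overwrites an in-memory map; a real executor would drive actual updates.
pub fn execute(state: &mut BTreeMap<String, u32>, blueprint: &Blueprint) {
    *state = blueprint.target_versions.clone();
}
```

Because `plan` returns a value instead of acting, the caller can inspect `reason` before calling `execute`, and the executor can be tested with a hand-written blueprint that no planner run produced.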

Starting point is 00:38:50 We're still rebooting the sleds. So do you move the VMs? Exactly. We're going to live migrate the VMs. And now that's a question, that's, I don't mean to oversimplify it, but, you know, that should be just a question of policy, which is like we're flipping a bit in that blueprint that says this sled needs to be evacuated. So let's say you have an Oxide rack, 16 sleds. Is that typical? Yeah, 16 to 32.

Starting point is 00:39:09 All right. So I got 16 sleds and we need to run an update. This is in the new world when this exists. And I got, each sled has, I don't know, 30 VMs on it. Okay, now we're doing math in our head. Uh-oh, he's gonna do a math problem. No, I'm not going to do

Starting point is 00:39:23 and it's a couple times to run update. So I go put it on my thumb drive off my laptop or whatever, plug it into the rack, and it's going to run. It's going to live migrate VMs off of a sled one at a time, update that sled, reboot it, and then move some stuff back to that one,

Starting point is 00:39:43 probably something like that. Yeah, that's where actually... So, distribute that load evenly across the other ones in the meantime. That's where a lot of the complexity does come in with the non-disruptive update is, first of all, how do you like mechanically move these things around? It's like a bin packing problem. But then there's also like, how do we make sure that we have the capacity to do that, right? If we're going to start doing it up, if you've totally filled every sled.

Starting point is 00:40:06 If you're running your thing at max, there's no place to put it. You have nowhere to move them. Right. And so... Buy another rack. Right. Well, that's the thing is like, at scale, People actually don't care about this problem.

Starting point is 00:40:16 Because, like, keeping a couple of sleds capacity free when you've got 100 racks is, like, a very small fraction of your cost. And it actually makes sense for a lot of reasons. It also allows you to sustain failures and, like, put that stuff over there. But when you've only got one rack, that might be more a problem. So then there's a question of how do you create an experience for the operator that communicates clearly what the tradeoffs are, but also gets this input from them, which is like, what do you want to happen? Do you want me, the rack, to prevent you from using all your capacity so that you can update it? Or do you want to have the possibility that you go do an update and we just say,

Starting point is 00:40:47 sorry, we're paused right now until you can tell us to just reboot all these VMs or whatever you want to say. Right. That reminds me or that makes me think of like failed updates, you know, in the self-service world when this version's out, not the, what do you call it, Undisrupted? Non-disruptive, yeah, non-disruptive. And with the current iteration you're working on now, can you guarantee that an update will finish? That's not what I thought you were going to say. Finish? No, because there's things outside of our control.
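The capacity question raised above can be sketched as a simple check. This is an illustrative model only: it assumes VM slots are interchangeable units (which sidesteps the bin-packing problem) and that VMs are moved back after each sled is updated. The `Sled` struct and both function names are hypothetical.

```rust
/// Hypothetical model of one sled: how many VM slots it has and how many
/// are in use. Real placement would track CPU, memory, and disk separately.
#[derive(Clone, Copy, Debug)]
pub struct Sled {
    pub capacity: u32,
    pub used: u32,
}

/// Returns true if the VMs on `sleds[evacuate]` fit into the free space on
/// the remaining sleds, treating VM slots as interchangeable.
pub fn can_evacuate(sleds: &[Sled], evacuate: usize) -> bool {
    let free_elsewhere: u32 = sleds
        .iter()
        .enumerate()
        .filter(|(i, _)| *i != evacuate)
        .map(|(_, s)| s.capacity.saturating_sub(s.used))
        .sum();
    sleds[evacuate].used <= free_elsewhere
}

/// A rolling update takes one sled out at a time, so under this model it can
/// avoid rebooting VMs only if every sled can be evacuated in turn.
pub fn can_update_without_disruption(sleds: &[Sled]) -> bool {
    (0..sleds.len()).all(|i| can_evacuate(sleds, i))
}
```

With 16 sleds of 32 slots each, 30 used per sled leaves 2 free slots on each of the other 15 sleds, which is exactly enough to absorb one sled's 30 VMs. At 31 used per sled the same check fails, which is the point where the operator would have to choose between reserving capacity and accepting VM reboots.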

Starting point is 00:41:19 Like, for example, one of the biggest challenges in self-service update, because the control plane is running, we have these intermediate states I mentioned where you've got new version of software talking to old version of software. And how do you avoid that becoming a problem on our ability to change our own software because you have to do backwards compatibility like forever, right? And one of the ways we've addressed that is to say that there will be an order for the updates. So we will always update, for example, the host OS before we update the control plane that talks to it. Because the reverse never happens right now. And so that's fine,

Starting point is 00:41:51 but that means that if you're doing an update and one of the sleds is like out to lunch and we can't talk to it, we don't know if we've been able to update it. We can't actually keep going and update the rest of the control plane. Yeah, so we got to tell the operator, look, you either need to what's called expunge the sled, which means to remove it from the control plane and we'll pretend like it's just caught fire, like it's failed, we've moved everything else elsewhere, or you've got to figure out what's wrong with it and bring it back. And that would be a support call, probably a support call. And then the only thing go, they just unplugged it and then just plugged it back in or
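The ordering rule and the unreachable-sled case can be sketched as a small decision function. This is an illustration under assumed names (`HostOs`, `NextStep`, `next_step`), not the real control plane logic: host OS updates come first, and the control plane update is held back until every sled is either confirmed updated or expunged.

```rust
/// Hypothetical view of what the control plane knows about a sled's host OS.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum HostOs {
    Old,
    Updated,
    /// The sled is not answering, so its state cannot be confirmed.
    Unreachable,
    /// The operator has removed the sled from the control plane.
    Expunged,
}

#[derive(Debug, PartialEq, Eq)]
pub enum NextStep {
    UpdateHostOs(usize),
    UpdateControlPlane,
    /// Waiting on the operator: repair the sled or expunge it.
    Blocked(usize),
}

/// Decide the next action. Reachable sleds with an old host OS are updated
/// first; an unreachable sled then blocks the control plane update, because
/// new control plane software might end up talking to an old host OS.
pub fn next_step(sleds: &[HostOs]) -> NextStep {
    if let Some(i) = sleds.iter().position(|s| *s == HostOs::Old) {
        return NextStep::UpdateHostOs(i);
    }
    if let Some(i) = sleds.iter().position(|s| *s == HostOs::Unreachable) {
        return NextStep::Blocked(i);
    }
    NextStep::UpdateControlPlane
}
```

Expunging the sled changes its state to `Expunged`, which is what lets the same function move on to `UpdateControlPlane`.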

Starting point is 00:42:18 whatever. Those sorts of things are always outside of our control. Yeah. Now, what did you think I was going to say? I thought you were going to say an update that explodes, that like you start doing the update. That was my next question. And the control plane is now down.

Starting point is 00:42:30 Right. What do you do? And like, that's my nightmare. That's been my fear for the last couple years. That's why I'm going to... That's why it's taken two years and I'm not done yet. Yeah, but seriously, that is why we have spent so much time on this, like, having the automation take these careful steps where every one of these steps we know is safe. We've taken, you know, as an example, we've got a CockroachDB cluster that's, like, storing all the control plane database data.

Starting point is 00:42:52 We've got five nodes. We definitely don't want to bounce a sled that's hosting a cockroach node while that cockroach cluster is already unhealthy, right? That's just like a thing we want to make sure we never do because that increases the risk that we actually lose quorum on the cockroach cluster and it's dead. and we're in trouble. So we have all these kinds of safeties built into the automation, all this testing, this whole pattern, and all this stuff. So that's like one angle. Obviously, testing is another example,
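One way to picture the safety check described above is as a precondition the automation evaluates before rebooting a sled. This sketch assumes a simple majority quorum and a single cluster-wide health count. The function names are hypothetical, and a real check would consult the database's own health reporting.

```rust
/// Majority quorum size for a cluster of `total_nodes` (3 of 5, for example).
pub fn quorum(total_nodes: usize) -> usize {
    total_nodes / 2 + 1
}

/// Returns true if it is safe to reboot a sled under a simplified rule:
/// a sled hosting a database node may only be bounced when every node in
/// the cluster is currently healthy. Sleds without a database node are not
/// restricted by this particular check.
pub fn safe_to_bounce(hosts_db_node: bool, healthy_nodes: usize, total_nodes: usize) -> bool {
    !hosts_db_node || healthy_nodes == total_nodes
}
```

With five nodes and one already unhealthy, bouncing another would leave three, which is exactly `quorum(5)` with no margin for a further failure. Refusing the reboot in that state keeps the automation from being the cause of lost quorum.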

Starting point is 00:43:17 but that's kind of a given. Sure. But it's a hard problem, really, because part of what we do, part of what is involved in an upgrade is making backwards incompatible changes to data formats, like database schemas and things like that. And once you've done that,

Starting point is 00:43:34 the old software can't read the new thing. So rollback is really not possible. So what a lot of software does is they'll have like a point of no return, whether some call it a finalizer or a deferred update or something like that, where you basically get the whole thing kind of working before you've committed in that way, and then you ask the operator like, does everything seem to be okay? And then they hit the button and it's like, okay, fine, you go. But even then there's like, there's still risk there because whatever it is you're activating

Starting point is 00:44:00 by taking that last step hasn't been tested before that. And there's kind of no way to get around that. And that's kind of a future problem for us right now, but it's something we're going to have to deal with. So it is your nightmare. Has it ever been your nightmare? Meaning it's happened? Like in my career, yeah. Well, like specifically for Oxide and updates.

Starting point is 00:44:20 No, no. Any updates gone wrong? No, but we haven't started doing that in production yet, so there's still time for that. How do you do it in production now? We're using that. The process we used when, yeah, the manual thing when we, so the way the manual process that I was talking about at the very beginning works, it basically shuts down the whole control plane, replaces all the software, and brings it back up again. How long is that process?

Starting point is 00:44:41 It's actually not that. It's shorter than the other way. I feel like that's safer. Why? I mean, I get the whole reason. Down time, man. Can't have it. It's downtime and it's the self-service aspect. Although you could imagine a self-service version that looked more like that. But then the thing is like if it's self-service, you're talking to an API, what's running that API while the thing is down? You don't have, there's nothing, right? Yeah.

Starting point is 00:45:02 So that's why you kind of, that's why you got to do what we've done. It's got to be done. Yeah. So downtime's required now. Future is, you said, non-destructive? Non-disruptive, yeah. Meaning the VMs get migrated around versus shutdown, but you still reboot sleds.

Starting point is 00:45:21 That's right. You're still rebooting controllers and control planes and stuff like that. Yep. But that's not visible to them because we have enough. Is there a world in which that isn't even required? Which part? What if you didn't have to reboot anything?

Starting point is 00:45:36 Oh. I think that's pretty dicey. It's definitely a thing that people have done. Yeah. Hot swap? Yeah, there's like, there's types of updates or like patches to like a running legacy kernel where you kind of like write the new one over here and then you jump over there.

Starting point is 00:45:52 But then the state... It's just a pointer. Right. It is. It's all software, right. Like who says we have to reboot, you know? Yeah. That is definitely harder to do with stuff like the SP and the ROT.

Starting point is 00:46:05 Part of it is that the ROT's job is to attest to the software that's currently running. Yeah, if you're going to change it out from underneath it, it breaks the attestation. Yeah, and then there's also this risk that, like, you're now in a different state than you would have been in if you had actually bounced that thing? So have you created a time bomb for yourself where if that thing loses power and actually does power back on, is it going to do the same thing that it was doing? That's the thing I always, like, bifurcated code paths where you're like, this is the thing we do sometimes and this is the thing we do other times is totally the kind of thing that results in something failing at runtime. Catastrophic failure. Yeah.

Starting point is 00:46:39 All right, fine, bad idea. There's a, the thing you asked about, like, my nightmare, right? The upgrade just, like, explodes and we're toast. Another thing that we've done there is try to create a lot of guardrails around the types of changes that we can make to the software so that we know if we're making a change that's going to break things. So, for example, if you're changing the database schema, we know that you're changing that, and we've operationalized that one, so that's kind of fine. if you're changing like an internal API, we make sure that you're doing it in a way

Starting point is 00:47:09 where none of those intermediate states will expose us to a situation where those components don't speak a common version. And like that's something I imagine other organizations do do. I haven't actually really seen that before, but I think it's really important. So, because that's the way that I've seen this fail in the past. It's like, someone goes and makes a change to the API.

Starting point is 00:47:27 They're like, I'm not changing upgrade. So like, they test everything, everything seems to work. And you go deploy it and like it blows up in the middle. It's like we've tested the end point, we've tested the beginning point, and we just got unlucky in one of these intermediate states that wasn't tested. And so we've tried to identify the kinds of changes that would cause those problems and then detect those at CI time for us. And at build time if we can.
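The failure mode described above, where the start and end states work but an intermediate state does not, can be illustrated with a small compatibility check. This is a sketch under assumptions: each release of a component is modeled only by the range of API versions it speaks, and the server is updated before the client. `Versions`, `overlaps`, and `safe_rolling_update` are hypothetical names, not the tooling Oxide actually runs in CI.

```rust
use std::ops::RangeInclusive;

/// Hypothetical description of one release of a component: the range of
/// API versions it can speak.
pub type Versions = RangeInclusive<u32>;

/// True if the two ranges share at least one version.
pub fn overlaps(a: &Versions, b: &Versions) -> bool {
    a.start() <= b.end() && b.start() <= a.end()
}

/// Checks every client/server pairing that can occur during a rolling
/// update in which the server is replaced before the client:
/// (old, old), then (old client, new server), then (new, new).
pub fn safe_rolling_update(
    old_client: &Versions,
    new_client: &Versions,
    old_server: &Versions,
    new_server: &Versions,
) -> bool {
    overlaps(old_client, old_server)
        && overlaps(old_client, new_server)
        && overlaps(new_client, new_server)
}
```

If the new server speaks only version 2 while the old client speaks only version 1, the first and last pairings pass and the middle one fails, which is the case that testing only the endpoints would miss. Keeping version 1 in the new server (`1..=2`) makes all three pairings pass.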

Starting point is 00:47:48 That's cool. Any new or novel testing strategies that you've had to come up with as far as this? I don't know, fuzzing or deterministic testing or anything that's... That's probably the biggest one. The other thing I would point to is that sort of distinction between the plan and execute stuff. We haven't actually gotten to this kind of thing, but one of the things we want to do with that is like property-based testing on the planner

Starting point is 00:48:10 where you're basically like sending all kinds of different inputs at it and putting constraints on what kinds of outputs can happen and see, you know, make sure it never does anything crazy. Yeah. You're supposed to give a talk today. Yep. They got rescheduled. That's true.
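As a rough illustration of property-based testing on a planner, the sketch below throws many generated inputs at a toy planner and checks constraints on what it does. Everything here is an assumption made for the example: the planner is a stand-in, and the small hand-rolled generator exists only so the code needs no external crates. A real test would more likely use a property-testing library.

```rust
/// Toy planner (hypothetical): given which sleds are already updated,
/// return the index of the next sled to update, if any.
pub fn plan_next(updated: &[bool]) -> Option<usize> {
    updated.iter().position(|done| !done)
}

/// Tiny deterministic pseudo-random generator (a linear congruential
/// generator), used only to produce varied test inputs.
pub struct Lcg(pub u64);

impl Lcg {
    pub fn next_u64(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }
}

/// Properties checked for each generated rack state: the planner never picks
/// an already-updated sled, it takes exactly one step per pending sled, and
/// it stops only when every sled is updated.
pub fn check_planner_properties(cases: u32, seed: u64) -> bool {
    let mut rng = Lcg(seed);
    for _ in 0..cases {
        let n = (rng.next_u64() % 32 + 1) as usize;
        let mut updated: Vec<bool> = (0..n).map(|_| rng.next_u64() % 2 == 0).collect();
        let pending = updated.iter().filter(|done| !**done).count();
        let mut steps = 0;
        while let Some(i) = plan_next(&updated) {
            if updated[i] || steps >= pending {
                return false;
            }
            updated[i] = true;
            steps += 1;
        }
        if steps != pending || updated.iter().any(|done| !done) {
            return false;
        }
    }
    true
}
```

Because the planner is a pure function over plain data, a loop like this can run thousands of cases without any real system, which is the benefit of separating planning from execution.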

Starting point is 00:48:25 We're not going to be here for it. Right. So you have to spill the beans. Yeah, so let's see, what was I going to talk about? So I've been doing this for a couple of years. We have a surprising number of new faces. And so part of that talk is literally the stuff we were just talking about, like, what is update? What is it?

Starting point is 00:48:42 Well, we've got a couple hundred components that we've got to replace and, you know, pick your metaphor or whatever. And so part of it is just like laying that out. And like what we do today, the stuff we've been talking about, why this is a problem for customers, what we're doing, current status of that, which is like we're planning to ship the self-service part very soon now. And then non-disruptive is coming after that. And then the rest of it was probably, I don't know if that's really interesting to a broader audience, but it's kind of reflections on what it's been like to run a project for such a long term. And I don't know, maybe it is more generally interesting. But like, I have a lot of fears about update at runtime, but my big fears about

Starting point is 00:49:20 update as a project was that it would feel perpetually a year away. And we would make decisions day to day and week to week that ensured that it continued to be a year away. Because when something's like a month away and somebody asks you to do something else, you're like, Sorry, I got to do this thing. Like, we're shipping it in a month. But when something's a year away, it's very easy to be like, well, here's a really important problem over here. And, like, it's hard to know what the next thing is to do on update.

Starting point is 00:49:44 That's been really a challenge the whole time is, like, what's the next step? There's so many steps, and there's so many circular dependencies in those steps. They're like, well, I got this other important problem over here. Maybe I'll just, like, kind of solve that. And that's fine. Sometimes that's the right call. But if you make that call, over and over again. Even 20% of the time, right?

Starting point is 00:50:02 Exactly. you just never get there. Your timeline stretches. How close is it? How close? You said very soon, didn't you hear him? Yeah. Update.

Starting point is 00:50:09 I'm just kidding with you. Yeah, I mean. Within the next month or? That's the plan. That's not a year away, though. No, no, no, no. This was two years ago I was worried about that. Two years ago, I think I was like perpetually a year or two away.

Starting point is 00:50:22 I saw the roadmap this morning. Yeah. I saw last year's roadmap. It's not a roadmap. Did you diff them to see like how, you know, where the bodies are buried? One of the update. Yeah. And then the priority is.

Starting point is 00:50:32 this year. That's what I was top left, which was first. It's ongoing. Accurate. But almost ready. Yep. Almost ready. That's cool.

Starting point is 00:50:40 Why you? I know, so we had a side conversation. Yeah. He's been with Brian almost his whole career. That's true. To Oracle, the acquisition, and then Joyent. Okay. And then what makes you uniquely positioned for this task, this quest?

Starting point is 00:50:56 I don't know if I'm uniquely positioned, but right after we shipped the MVP, this was one of my big worries about the product. And I was like, this is something I think I have a lot of experience with in terms of building distributed systems and reliable automation and things like that. And so I thought this is a good opportunity for me to swing in and try to create the vision that I want us to get to and then be able to execute that. So I was interested and I asked, and that's how a lot of stuff works around here. So what has it been like casting the vision for it and getting feedback from the team? Because I'm thinking culture, I'm thinking the process of getting feedback on any idea

Starting point is 00:51:33 because you're laying out all this groundwork and you're doing all the work to kind of get to a direction. What is it like to put that idea out there, that vision out there, and get that feedback and start moving on it? It's great. I don't know if you all talked about the RFD process, but... The RFD process, yeah.

Starting point is 00:51:49 Yeah, so when I started on this, the first step was writing an RFD, I think it's 418, if you want to find that. And it's basically... Is this public? These? Some of them are and some of them aren't. I don't know if that one is.

Starting point is 00:52:01 It probably could be if we want it to be. There's nothing particularly sensitive in it. But it's kind of laying out where we are. Lay of the land, this is where we are. These are the problems. Here's where we're trying to go. And like it was very specific in some ways. It was like this idea of plan execute pattern and the automation has to be safe and all this stuff.

Starting point is 00:52:19 But it was also very like, we had a lot of stuff to do and I don't know what all the pieces are yet. So that was the first step is getting everyone aligned on the vision. And that RFD itself was a team effort. I, you know, I drafted this first version, but people are looking at that. And I think broadly, there were no surprises there. Everyone was like, yeah, this all makes sense. And then that process just keeps happening. You know, you get more and more specific designs and say, okay, let's get some feedback on this. And I do enjoy that part of it. And I enjoy the collaboration. And it goes pretty well. Like, it's not the sort of environment where you're, like, worried about what so-and-so is going to

Starting point is 00:52:53 think about this and is someone going to like be unproductive about it or something. Sure. We know that Rust is a foundational language here, obviously. Yeah. We talked about this being somewhat of a novel problem. How is Rust uniquely positioned to help solve this problem? What about Rust makes this problem easier than another language that you may choose to do this with? Yeah, so the big thing for me about Rust that I really love, and I think that's been huge for Oxide, is its ability to help us ensure things, especially at build time, that need to be true of the system. And so that sounds very vague. But what I mean is you can catch so many problems early. And everyone talks about the obvious ones. We talked about this earlier, but everyone

Starting point is 00:53:36 talks about the obvious ones, like it's the borrow checker will help you find memory safety problems. But it also allows you to create abstractions for the rest of the team that can't be misused, right? So you can say, I'm creating this thing. Like, maybe you represent it with an object and you say, you can't do these two operations concurrently on it. Well, that's the thing that we can enforce in the type system. And, like, you literally just can't compile the code that would do that. That's awesome.
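
To make the point above concrete, here is a minimal sketch of an abstraction where two operations cannot run concurrently on the same object. It is an illustration only: `Device`, `Operation`, and `run_sequentially` are hypothetical names, not anything from Oxide's code. The trick is that an in-flight operation holds the only mutable borrow of the device, so the borrow checker rejects a second `start` until the first operation finishes.

```rust
// Hypothetical example: none of these names come from Oxide's code base.

/// A resource on which only one operation may be in flight at a time.
pub struct Device {
    log: Vec<String>,
}

/// An in-flight operation. It holds the only mutable borrow of the device,
/// so no second operation can start until this value is finished or dropped.
pub struct Operation<'a> {
    device: &'a mut Device,
    name: &'static str,
}

impl Device {
    pub fn new() -> Self {
        Device { log: Vec::new() }
    }

    /// Starting an operation mutably borrows the device for as long as
    /// the returned `Operation` is alive.
    pub fn start(&mut self, name: &'static str) -> Operation<'_> {
        self.log.push(format!("start {name}"));
        Operation { device: self, name }
    }

    pub fn log(&self) -> &[String] {
        &self.log
    }
}

impl Operation<'_> {
    /// Consumes the operation, which releases the borrow on the device.
    pub fn finish(self) {
        self.device.log.push(format!("finish {}", self.name));
    }
}

/// Runs two operations one after the other and returns the device log.
pub fn run_sequentially() -> Vec<String> {
    let mut device = Device::new();

    let update = device.start("update");
    // Uncommenting the next line fails to compile (error E0499), because
    // `device` is still mutably borrowed by `update`:
    // let reset = device.start("reset");
    update.finish();

    let reset = device.start("reset");
    reset.finish();

    device.log().to_vec()
}
```

Nothing here checks for overlap at runtime. The overlapping version is simply not a program the compiler will accept, which is the property being described.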

Starting point is 00:54:04 And it sounds so low level, but that's what a lot. Like, it's extending that same idea and applying those same tools to do it that allows us to say, if you try to evolve the API in a backwards incompatible way that won't work at runtime, at upgrade time, you might get a, depending on how you do it, you might get a build failure, you'll at least get a CI failure. And so, I don't know if we have a minute to talk about it. I mean, so one of the first things that I built here with Adam Leventhal

Starting point is 00:54:32 is something called Dropshot, which basically lets you write an HTTP server and then generate an OpenAPI spec from the code. And then we feed that into something called Progenitor, which generates the clients for it. But that alone means that if you make an incompatible change to an API, even before we've done any of the versioning stuff, your client fails to compile now, which is, like, hugely valuable.

Starting point is 00:54:53 And, like, that's true, not just because of, like, it doesn't have the operation in it, but, like, you passed in an enum with three variants, and it now has four variants, and you need to, like, accommodate for that fourth one or something like that. I call me Enums. Yeah.
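
The enum case can be sketched in a few lines. This is a hypothetical illustration, not generated Progenitor output: `DiskState` and `describe` are made-up names. Suppose the enum originally had three variants and `Faulted` was added later. Because the `match` has no wildcard arm, code written against the three-variant version stops compiling (error E0004, non-exhaustive patterns) until the fourth case is handled.

```rust
// Hypothetical types for illustration; not taken from any real API.

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DiskState {
    Creating,
    Attached,
    Detached,
    // Imagine this variant was added in a later API revision.
    Faulted,
}

/// A caller that handles every variant explicitly. With no `_ =>` arm,
/// removing the `Faulted` arm below is a compile error, so a newly added
/// variant cannot be silently ignored.
pub fn describe(state: DiskState) -> &'static str {
    match state {
        DiskState::Creating => "being created",
        DiskState::Attached => "attached",
        DiskState::Detached => "detached",
        DiskState::Faulted => "faulted",
    }
}
```

The design choice that matters is leaving out the catch-all arm: a wildcard would compile against the new variant and quietly take the default path.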

Starting point is 00:55:07 Yeah. So, anyway, we ended up extending that. So now, like, that's how we do this versioning stuff is we have a bunch of these OpenAPI specs that are the ones that this thing supports, and then we know if you've changed it because we know if it generates a different thing, then you've changed it incompatibly. So Rust has facilitated all this stuff.
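
As a rough sketch of the idea just described (keep the supported spec documents around, regenerate, and flag any difference), the check could look something like the following. This is an assumption about the general shape only, not how Oxide's tooling is implemented: `CheckedInSpecs`, `SpecCheck`, and the plain string comparison are all hypothetical.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the checked-in spec documents, keyed by
/// API version string.
pub struct CheckedInSpecs {
    by_version: BTreeMap<String, String>,
}

/// Result of comparing a freshly generated spec against the stored one.
#[derive(Debug, PartialEq)]
pub enum SpecCheck {
    /// The generated spec matches the stored one for this version.
    Unchanged,
    /// No spec is stored for this version yet.
    NewVersion,
    /// A stored version now generates a different document.
    Changed,
}

impl CheckedInSpecs {
    pub fn new() -> Self {
        CheckedInSpecs { by_version: BTreeMap::new() }
    }

    /// Stores the spec text for a supported version.
    pub fn record(&mut self, version: &str, spec: &str) {
        self.by_version.insert(version.to_string(), spec.to_string());
    }

    /// Compares newly generated spec text with what is stored.
    pub fn check(&self, version: &str, generated: &str) -> SpecCheck {
        match self.by_version.get(version) {
            None => SpecCheck::NewVersion,
            Some(stored) if stored.as_str() == generated => SpecCheck::Unchanged,
            Some(_) => SpecCheck::Changed,
        }
    }
}
```

In a CI job, a `Changed` result for an already-supported version would be the failure case. A real tool would presumably compare parsed documents and not raw text, so this sketch only captures the "did the generated thing change" test.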

Starting point is 00:55:24 And sorry, the last thing I want to say about that with the Dropshot thing is, like, you've got your rich, structured types in Rust, and you're like all happy because I'm in Rustland and I've got my strong types. That carries all the way to the client because of the way it goes, like Dropshot just takes those types and puts them in the OpenAPI spec, and then the client generates faithful things on the other side. So you basically get that strictness all the way through there. Let's gush a little more on this, then. Speak to your confidence in the code you write because

Starting point is 00:55:53 it's Rust. I'll just, easy shot you that one. Yeah, I mean, I'm reluctant to make any bold claims about the correctness of the code I write, lest I immediately walk back in there and there's some horrible thing happening. The fans spin up, that's why. But I mean, I'll say this, there's a lot of changes I've been able to make where I'll go work on the code for like four hours and when it compiles, I know it's already correct. I know I haven't broken anything.

Starting point is 00:56:16 And it's not as simple, you know, I know there's the cartoonish version that's like if you, if it compiles, it works. And I'm not talking about that. I'm talking about something that's like, it's either a refactor, this happens a lot with refactors, or I'm building a new thing in terms of these things that already exist and I'm plugging into the middle of it. And it's like, there's no way for this to be wrong at this point because like it's, it's, it fits neatly into the narrow interfaces on both sides of it. It's correct. And that is huge. It's so huge. No nightmares.

Starting point is 00:56:47 My previous experience was Node.js, and it was the complete opposite of this. And part of the reason I love Rust is by the time we got to the end of the road at my last job, when we were using Node.js everywhere, every single JavaScript function we had started with like 30 assertions about the types of all of its arguments. And I'm like, why are we doing this? The computer can do this. It's what the compiler is for. And so I have so much more confidence in those things now.

Starting point is 00:57:11 And that gets back to why I think Rust is so valuable for this problem space: these things are so complicated. Allowing us to encode all these constraints into build-time constraints allows us to evolve the software so much faster and with so much more confidence. Someone can come in here and make a really big, complicated change, and you're not wondering, like,

Starting point is 00:57:32 well, what about, I wonder if they missed some caller or something like that. It's like, no, you haven't. You've covered every single case. And it's huge. It's huge. Right on, man. What else? Anything?

Starting point is 00:57:44 Tell us about OxCon. OxCon. OxCon is awesome. So this has been something we've been doing for a while. And it's very similar to something we did at Joyent, these sort of company-wide meetups, engineering-wide meetups. You know, for a remote company, it works really well to be remote basically all the time, but it is also so valuable to have that time in person together. There's so many conversations that don't happen if you haven't scheduled to meet for it. And we have other ways of trying to have those conversations anyway, but it's just really nice to have the face time with people. And to also, like, the small talk and, like, you're going to get dinner and you're just, like, talking to people about whatever it is. You learn more about them. It's a really good time.

Starting point is 00:58:24 So there's a mix of the structured time that we have, you know, Brian and Steve talking about all the exciting stuff over the last year. It's inspiring for people. It gets everyone ginned up to talk about everything that we've got to go do. And then, you know, there are other good company-wide sessions talking about important projects and stuff like that. But then there's also all this breakout time that I started by talking about, which, as with any conference, the hallway track is almost, it's at least as important as the best. It is the best. Yeah, right? So that's huge.

Starting point is 00:58:54 Not really. Hallway track only. Keynotes only, and then hallway track. That's our move. This is so much more, there's a lot, it's getting more, like, mature and professionalized. We never had a stage before. No. We never had, like, pro A.V. before, so this whole week has felt like a little bit of a dream.

Starting point is 00:59:11 It's like, OxCon already feels different. And then we've got the stage there. And then it, like, rained in the Bay Area in September this morning. And I was like, am I dreaming? What's going on here? Yeah. But it's a good time. Thanks, Dave. Thank you.

Starting point is 00:59:22 Yeah, appreciate it. Awesome. Well, friends, last up is Ben Leonard, in charge of all things design for Oxide. Here we go. One of the things that stands out about Oxide is its design. I think you can have a good company, successful company, but every company that's successful is set apart by its design. Can you talk about how you came to be here at Oxide and the design story

Starting point is 00:59:50 behind the roof? Yeah. So I was working at a branding agency, Pentagram. Oxide was my last branding project. I was planning on leaving to work as a freelancer. I'd been at Pentagram for four or five years. I was ready to move on. And it just so happened that Oxide were looking for a designer just as I was leaving, and it felt like a unique opportunity. What would happen is you'd work on these brands, and then you'd throw it over the fence, and then you'd check in on it, like, two years later, and you're like, what have they done to my baby? They destroyed it. My boy, my boy. And so it was, yeah, a unique opportunity to continue it on. And I'm kind of, I was chatting to you about this

Starting point is 01:00:32 before. I'm, like, relentless in that I like to fiddle with things. And it, I get, working on it. Yeah, I kind of get bored of it as well, so I just want to tweak something. So we, I mean, recently, we made a modification to the logo that most people wouldn't even notice. It's slightly thicker, but it's something that's been bugging you this entire time. Yeah, yeah, yeah, yeah. When Steve was up here, I was like, Steve's shirt has the thinner logo. Yeah, yeah, yeah.

Starting point is 01:00:56 And on stage here, it's much thicker. I noticed that. You, you, right away, right away. I noticed turning. Oh, yeah. Yeah, yeah, yeah, for sure. So I'm always kind of, yeah, always fiddling with stuff. But then, um, the beauty of Oxide is I get to work across everything.

Starting point is 01:01:11 So generally most of what I do is working on the product. So in my case it's the web console. So you're working on the design system, the UI. I'm also working on the brand and the marketing and the sales assets and all this stuff, and industrial design. The beauty is that those all inform each other. What tends to happen is you have your product team and your creative team and they're very distinct. So, this is a bit in the weeds, but we have the same UI design system that drives every piece of design, so that we

Starting point is 01:01:47 have the same colors, we have the same UI elements on both the website and the web console. They're kind of continually informing each other. And, I mean, it's difficult because I think there's been a shrinking of creativity in tech design. That is, branding stopped being what it was. So with the proliferation of SaaS,

Starting point is 01:02:16 the product itself became the design. So the UI language of the product became the design. So Linear is a great example of this, where the design of the product and the design of the creative are one and the same.

Starting point is 01:02:28 what that can mean is you have the world of design within product is much smaller and so what that can mean is you have everyone's always complaining that all tech looks the same now and that's probably because product is much smaller and if your brand looks like your product then your world is much

Starting point is 01:02:43 smaller. So I'm always figuring out how wide I can go with the branding while still keeping it like it comes from the same world as everything else. And yeah, there's the industrial design, the creative, the product. There's so much to work on. So you have this rack right behind us. Would this rack behind us look like that at all if you weren't here? It'd look a bit like that. Same shape. It'd be tall.

Starting point is 01:03:15 Same shape, yeah, it'd be similar. If I wasn't here, it'd be really, it'd be squat, it'd be half the size. No, I think... Less green, probably. It might be less green. Or even more green. We went back to, so we consulted a little bit with my old agency, Pentagram, the industrial designers there, and we worked together on that a little bit.

Starting point is 01:03:37 One thing that I'd learned from working on a hardware product previously is the first version was, it was a small run that they distributed to people, right? And so they produced this thing which is initially really magical. And then as soon as they mass produce it, they think, okay, we're going to have to make this cheaper, easier, more practical to make. And then the next version looked really, really bad. Because essentially they were just trying to recreate what they'd done and, okay, we can't do this complicated tile system, so we'll just print it directly on it. It was a PCIe card, right? And what it was is it was compromised in the worst way.

Starting point is 01:04:16 So I think like going into it from the beginning thinking like what's the, what are our limitations like we're going to make, we're going to make thousands of these like how from the beginning. I don't want to do things that just exist in this kind of first, this initial run. And so figuring out what the, what those compromises are. But working with hard, like industrial design, working with hardware is hard. I haven't done very much of it, but yeah, you're, as soon as you're touching materiality, you're, you're dealing with. Things are different.

Starting point is 01:04:46 Yeah, I mean, we were speaking about, like, color. Color is the bane of my life. Like, we have painted drive bays, we have powder-coated metal, we have, like, plastic pieces, and color matching the green, I think, is. Yes, that's probably got to be your thing, right? Yeah, and even. Is that a little off sometimes? Oh, it's a little... You're like, get out of here. It's a lot off all the time.

Starting point is 01:05:09 And some of it's outside your control, so you have to think about how to handle that. So what you do is you avoid putting elements next to each other that might be different. Yes. Like, if there's a little bit of separation, then visually you can get away with a little bit of difference between those things. We had, so one, I mean, one example is originally the rack, it was like, it was a black, but it was a slightly bluey black. Like Bluey as in the cartoon Bluey? Exactly. Oh, okay, blue and black.

Starting point is 01:05:40 It was like a blueie black. This regard. It was a cool black, cool as in like color temperature, not like cool black. I thought you liked it. Yeah, yeah. But the, my, my thought was as soon as we start integrating multiple components, then I'm trying to make sure like the, I'm using the exact right black. Oh, because, uh, and yeah, so.

Starting point is 01:06:04 That's hard. You're trying to, yeah. Although I don't know how much of it is me trying to avoid upsetting myself. Like these things that no one else noticed. Someone else like, they're like, whatever. Ben's upset. Wow, I don't know. This is like trauma-driven design.

Starting point is 01:06:20 Right. Yeah, yeah. You guys have TDD. We had the same. Do you get into the PCB design then too? Like how much or is it simply this? I don't want it simple. Simply.

Starting point is 01:06:31 But like, do you get. Just the easy stuff back there. You get to step into that world where, like, How is this, because I think about Apple. Apple has done a great job of branding everything. Yeah, yeah, yeah, yeah. Look at their CPU, right? Their latest M1 chips or the latest M-series chips.

Starting point is 01:06:43 Yeah. There's a design to everything they do. Yeah. You get into that as well. I mean, Apple's a good example because they're stunning inside as well. A little bit. Occasionally I dip my toe in and I get shouted at to leave. So I, the PCBs are currently there.

Starting point is 01:06:58 The PCBs are green. And I think once I came into a channel and I was saying, hey, guys, can we make the PCBs black? I'm thinking, okay, you change the color of the PCB. You just, I don't know, you just order black ones instead of green. Right. I was told that's not the case. I think it does have an impact on, like, thermals and stuff like that. What did they tell you? They told me some information, none of which I retained. You're like, sorry I asked. Maybe they said it. Maybe they, maybe they weren't telling the truth.

Starting point is 01:07:27 Maybe they were just saying it because they were like, yeah, yeah, yeah. Black was cool. I've seen black PCBs. But there are issues with, yeah. There are manufacturing ramifications, which mean that it's, yeah, it's a bit more complicated than that. I think with the rack, there's a balance. We can't go full Apple and just invest so heavily in details which cost money for little benefit. But what we can do is take an item which is underappreciated. I like taking boring objects and boring designs and boring things and making them

Starting point is 01:08:06 better. I think it's easy to work with Nike and do some beautiful design for something that's already, like, really compelling, but I think it's much better to take something in which design does not usually feature and try and elevate that. So Oxide is big on values. You know, you go to the website, there's values, I'm sure you know all the values.

Starting point is 01:08:30 They inform all the decisions, how you act. Do you have like a design language or value system or something that drives your design for oxide that's separate or maybe even congruent with that? I think there are

Starting point is 01:08:45 ideas. So I think my approach to branding is you have one idea which should then filter through into everything, and that's how you make a holistic thing. It's not necessarily that this looks like this. It's that it comes from the same place. So with Oxide,

Starting point is 01:09:01 Oxide is an old idea, like, brought new, right? So it's this, it's like this old crazy idea that you own, you own your own hardware. And so you see that in the design language, you see the ASCII. I mean, the green is like this kind of terminal green. Right. Like it looks, it has like this kind of retro edge to it.

Starting point is 01:09:21 And so the design language is like old, meets new. And yeah, so I think that's, like nostalgia is a key part of what i mean you see over there we have those the longest nostalgia is a key thing but you don't want to you don't want to be so nostalgic you don't want to lean so much into nostalgia it's kitsch um so yeah i think that sort of referential um referential thing to kind of old computing is a is a maybe that's why it speaks so well to us because we're all about that yeah yeah yeah yeah where the old meets the new is interesting precisely what gets you excited

Starting point is 01:09:55 about what you do here? Like, what makes you be like, hell yeah? I like to... Or do you say hell yeah? I don't know what the British equivalent of hell yeah is. I don't think with that... Yeah, I don't think with that...

Starting point is 01:10:10 Yeah, we're not that emotional. Yeah. What gets you properly excited? Now you're talking my language. I'm properly excited. No, I... I just... I love to do a bit of everything.

Starting point is 01:10:26 And it's the variety, I think, that really drives me. I think it's why, I mean, I've been here more than four years now. I think variety is what keeps it interesting. And, like, Oxide is the company where you get to do that. You have hardware, you have software, obviously, and everything kind of in between. And yeah, that's the thing that keeps me interested, you know.

Starting point is 01:10:46 I would expect the opposite. Because at an agency, you think that's where the variety comes because you're on to the next project. Like new company, new design language, new brand, whereas you decided to settle down with one brand and do that for years and years. I would expect it to be less variety, but you found that it goes wide. I mean, I think certainly working at an agency, it's always going to be much more varied than working at a startup, but I think for me, there's this

Starting point is 01:11:14 interesting tension between what's staying the same and what's different. And so there are things in the Oxide brand which have been retained: the logo, more or less, the colors. But I think there are things that are kind of constantly changing. And obviously, I think Oxide as a company is changing, so its design needs are changing. As an example, I think we are growing, and so, like, marketing and sales needs are growing. And so previously there was very little collateral

Starting point is 01:11:42 outside of the rack, outside of the web console and the website, and that was it. So you had that kind of small world, but then when you're kind of entering the world of sales and marketing, then that kind of opens the door to a bunch more design. More stuff. That thing at the door, when we came in, was a card with like a chip on it. Did you design that?

Starting point is 01:12:04 The coin, yes, yeah. Does it come off the card? It does, yeah, it has a little, yeah, yeah, yeah, yeah. It's got two sides. You're afraid to. One that I can actually use the coin for the coin. Yeah, one to sell on eBay, yeah. Yeah, exactly.

Starting point is 01:12:16 Yes, yeah, yeah, yeah. That's fun stuff, right? Yeah, I mean, that's what, I think those are the palate cleansers. So the big pieces of merch tend to get made just as I finish something that's, like, drained my creative energy. Website is the big one because, I mean, I'm usually coding the thing too. So I've designed it. I've coded it. It took so long that I'm ready for something else.

Starting point is 01:12:40 So we have, I don't know if you saw those, the little kind of rack stickers. Those I designed right after I'd done like a run of the website. And so those things are a nice little, they're a nice little. cherry on top. They're a way to kind of refill my, like, creative cup a little bit. Are you excited about the growth? I imagine with new footprint, we're doing video. We talked to a little fourth wall breaking here behind the scenes.

Starting point is 01:13:03 We talked to Toro about being able to put motion graphics into place and collaborating on that. Is that exciting to you to see the different areas you can do? Yeah, for sure. I think, like, I have a list as long as my arm of things that, like, I will eventually get to. Like, motion is one of those things. There are things that I want to do, but you don't necessarily have the reason to do.

Starting point is 01:13:22 Yeah. So I think you can, I was chatting about this before, which is like the size of the gamut of, like, creativity, of things that you can do. Product design is much smaller because it needs to be functional. People are using it every day. There's also, like, the zeitgeist of the way that people use products, and you need to design in a way that's expected. Yeah.

Starting point is 01:13:43 And websites a little larger and then the kind of creative space is even larger than that. And then you just think about what are these like one-off excuses to do something different? So like product launches, all those things that you can you can get a bit more experimental around. We love the work you do. Yeah, it's great work. It's beautiful work. Thank you very much. I think the company's awesome.

Starting point is 01:14:06 I think the design really, in my opinion, is like it's the glue. You can have a great product. You can have great software, but it's the final piece that says this is just super awesome. Basically, the design to me is what sets it apart. Yeah, I mean, the design is what says you care about every... Yes. Yeah. It shows intention, it shows trust, it shows all these things. Yeah. It's just really good work. So getting to go to Oxide's headquarters for this conference, OxCon, was years, I would say, in the making. It didn't happen overnight. And over the years we've become closer and closer

Starting point is 01:14:46 friends with Brian and Steve and the rest of the team there behind Oxide. And just getting a chance to celebrate with them this moment, this Series B moment, this massive new order moment, this crossing-the-chasm moment, I know for Jared and I was quite awesome. So glad to share that with you, glad to get this on the pod, glad to peel back the layers with Cliff and with Dave and with Ben, these integral, important, and different parts of Oxide, to truly see what's on the inside of Oxide. Now the good news is, I guess even better news, really, is I think there's more to come. They like working with us. They like what we've done so far for them. And I think you're going to as well. We're producing a 10-minute Inside Oxide documentary for their YouTube channel. So it's kind of cool.

Starting point is 01:15:32 In the works, in the edit booth right now, in motion as we speak. But man, I've seen previews. I've seen the teaser. It's epic. A massive thank you to our friends who sponsored this podcast, our friends over at CodeRabbit, our friends at Depot, and of course, our friends and our partners over at Fly. Check them all out. They love us. You should love them, if you will. We appreciate that. Of course, the beats are by our good friend and longtime friend, brother, Breakmaster Cylinder. Love those beats. Okay, that's it. This show's done. We'll see you soon.
