Microarch Club - 110: Rick Altherr
Episode Date: April 24, 2024
Rick Altherr joins to talk about working on hardware performance analysis tools at Apple during the PowerPC to x86 transition, building flight control software for internet satellites at Google, discovering vulnerabilities in baseboard management controllers, and much more. We also spend an extended portion of the conversation on Rick's current work in quantum computing, including comparing and contrasting with classical computing, and examining some of the challenges of interfacing with these machines today.
Rick's Site: https://www.kc8apf.net/
Rick on LinkedIn: https://www.linkedin.com/in/mxshift/
Rick on Mastodon: https://social.treehouse.systems/@mxshift
Rick on GitHub: https://github.com/mx-shift
Rick's Mentoring Sign-Up: https://calendly.com/mxshift
Detailed Show Notes: https://microarch.club/episodes/110
Transcript
Hey folks, Dan here. Today on the MicroArch Club podcast, I am joined by Rick Altherr. Rick has
worked at nearly every level of the computing stack, and we touch on many of them in this
conversation. We start off with Rick's upbringing in the Midwest and how they learned about
how machines work by tinkering on a theater organ. From there, we get into Rick's career,
beginning with working on hardware performance analysis tools at Apple during the PowerPC
to x86 transition,
then moving to Google to work on everything from system software in the data center to flight control systems for internet satellites to open source FPGA toolchains.
Rick's exposure to the hardware-software interface subsequently led them to security
research, including their discovery of a widespread vulnerability in baseboard management
controllers, and their collaboration with Laura Abbott to identify multiple vulnerabilities in an NXP hardware root-of-trust device while
working at Oxide Computer.
We round out our conversation with a deep dive into Rick's current work on quantum
computers at IonQ.
Rick does a great job of explaining quantum computing in relation to classical computing,
and outlines some of the current challenges of interfacing with these powerful machines. As evidenced by their willingness to talk with me
for nearly two and a half hours, Rick is passionate about opening doors for other folks in technology,
and they regularly make time to do resume review, mock interviews, and mentorship sessions with
anyone who is interested in talking with them. If you're interested, look for the link in the
show notes. With that, let's get into the conversation.
All right. Hey, Rick, thanks for joining the show today.
Yeah, glad to be here.
Absolutely. Well, we've obviously chatted a little bit before the show, and I've also followed some of your work for some time, including some of your talks and appearances on other podcasts. Actually, earlier today, and I think I mentioned this before, I was listening to your On the Metal episode, which definitely had a lot of good tidbits in it. But also, one of my co-workers in my day job, Chris Gammell, has a podcast,
The Amp Hour. And you were on that, I think, a few years ago now. So I've definitely run into you in a few places.
But super glad to have you on the MicroArch Club. And I think we'll have a lot of fun stuff to get
into today. Yeah, sounds great. Awesome. Well, I know you've listened to a couple of the episodes,
so you know a little bit of my preference for kind of starting off in the beginning with folks.
I'd love to learn what your introduction to computing was
and maybe what some of the first machines that you interacted with are.
Yeah, I mean, that is a little tricky
in that it kind of depends on how you define computing.
So like my family has a history of auto mechanics.
And so you get into things that are like, not quite computers, but kind of are computers, you know, like how a transmission, like an automatic transmission, works is effectively a hydraulic computer. That mixed with learning about electronics and mechanical systems through my grandfather. Before I was
ever born, like when my mom was a teenager, he ended up buying a Wurlitzer Hope Jones
unit orchestra, which is a theater organ, and installing it in their house. And so I learned
growing up, like, this was just the thing we had.
It was there, and we had it for demonstrations to local school groups and things.
And I was small enough that I could climb through to go in and fix all of the stuck electromagnets and do all sorts of adjustments and things.
And learned a lot about how to build relay logic type things and that stuff.
So I learned a lot from my grandfather, who was very much a perpetual tinkerer in electronics,
mechanical things, all sorts of stuff.
And then really get on to computing.
Jeez, I was probably like six or seven.
My dad ended up getting a job as a computer salesman because that was an actual job at one
point. And so I got to periodically go into the computer store after school and just kind of hang
out and learn a lot. We had a 286 at home, got to play the video games and things like that,
and really just kind of took to it. And so I spent a lot of time, you know, learning about programming languages and how the system
worked and as much as I could. And that really progressed up through high school and everything
else. You know, I eventually got my own computer so that I would stop breaking the family computer. And my first job actually ended up being working at an ISP locally. So, they were, like, in the late 90s, in the rural areas, you'd have these combination computer store slash web development firm slash ISP. And so I was more on the computer repair side of things, but I was also doing the ISP side and got a lot of exposure to just how the Internet works and that kind of stuff.
And yeah, so that's a lot of where I started.
And then, you know, ended up going to college for originally for a computer engineering degree, but then decided pretty quickly that I
didn't like doing all the math for the electrical work. So switched over to computer science. And
yeah, a lot of it just kind of kept going from there, career-wise.
Gotcha. With the earliest machine that you had for yourself, what model was that?
Uh, I believe it was an 8086 clone. I don't remember exactly what it was. It was not new, right? It was definitely like a hand-me-down that nobody wanted because everybody had 386s, kind of thing.
Absolutely. So you kind of got started in maybe programming your family's computer and then eventually your own computer. Was that mostly in, um, you know, a higher-level language like BASIC or something like that, which it may be comical to call BASIC a high-level language now, but as opposed to getting down into assembly and the lower-level parts of the hardware?
Yeah, I mean, growing up, like, I grew up in rural Ohio, Northwest Ohio. I didn't have a lot of people around me who knew this stuff. I had the documentation that came with DOS, and that was it.
I didn't really have access to a C compiler or any of those things. That really didn't come around
until I was a teenager when finally that kind of thing was available. And by that point, I
already had my own computer and was working in an ISP. So I had internet access. That was
a very big change. But before that,
yeah, I was writing complicated basic programs, trying to understand more of how the hardware actually worked, but was very limited just by what was available. I mean, even going to the
local library, there wasn't any information, like programming books were not a thing that
the local library was going to carry.
Right. And at your high school, was there any sort of computer science education, or was this strictly kind of like outside of school?
There wasn't until like my junior year, at which point, um, like, they had had a computer lab, but it was very focused on learning to type, using Word and Excel,
kind of more basic computer user skills than development stuff.
So there were no computer science classes.
But as we got into my junior year, the person who ran the IT staff knew that there were a couple of us
who actually had a lot of interest in computers and things. And so he worked with the local community college to have us do the Cisco
CCNA courses. But we had to show up at 7am, before school. And like, we did CCNA before we actually started our regular school day. And so we did all that. Neither one of us actually ended up taking any of the
exams for it. But we definitely got all of the exposure to like, here's how you set up routers
here, you know, all that kind of stuff. And, but that was the extent of it, you know, that we
didn't do any AP courses or anything else. Mostly, it was just a small set of people that happened to have computers and that kind of interest sharing skills and things behind the scenes.
At that stage, like in high school, I probably was more known for running the server that hosted TI-Net and TI-News for the calculator programming stuff than anything else.
Yeah, that's really interesting. I feel like one of the things that maybe gets discounted by folks who don't have that experience. So I grew up in the Southeast United States, and there is just a really big geographic difference in terms of computer science education.
I graduated from high school in 2015, and there was no computer science at the school that I went to.
And it was a fairly good academic high school, but it wasn't really top of focus.
And I do think in the past five years or so, there's been even more of a shift.
But, you know, I think there is a pretty big geographic difference, potentially between
the coasts and somewhere like the southeast or the Midwest.
Oh, absolutely. I haven't looked at what they're doing at my school recently, but certainly I
wouldn't expect them to have much of a formal computer science curriculum. It's just it's not
top of mind. You know, they graduate maybe a hundred students a year, and that's not
like a private school or anything. That's just the county school system like that. Right. Right.
Well, so, okay. So then you go to college and you decide to study computer science. What was that decision like, especially coming from, you know, a background where there weren't a lot of folks that were into that, or it wasn't necessarily, I don't know if it was seen as a kind of like viable career path?
Well, my parents saw that. From, you know, my dad's job
working in computer sales like it was obvious that computers were an up-and-coming thing
and my dad actually had interest in computers going back to when he was graduating from high school and going into college.
He just couldn't afford college at the time.
And so he never actually got a college degree, but he got a little bit of exposure to it through kind of the math to computer science path that was more common back then.
And so they understood that there was utility there and that it was an up and coming area, that it was probably a good place as far as
career-wise, and that I had a lot of interest in it.
It also helped that my brother had decided when he was going to go to university that
he also decided to go to computer science.
And later, my sister also went for computer science.
So all three of us went for computer science.
But it was really just a, I like doing this, and I don't know what else I would do.
And my parents were fairly supportive of that.
I was going to a state school, so tuition and everything was pretty affordable.
And it was just like, sure, why not?
We'll see how it goes and see what happens.
And it turned out it worked really well for me.
I knew I already enjoyed computers. I didn't actually know what working in computers would look like. But it seemed like I was asking a lot of the right questions to go ask more and figure out where I wanted to work in that space.
And did you find that the computer science
curriculum was fairly strong? And was that kind of a foundational part of
how you went on in your career? Or did you feel like that a lot of the kind of skills
and things like that that you gained were mostly outside of the classroom? And
those were the biggest contributing factors to what you ultimately went ahead and did?
It's a mix. So I went to University of Cincinnati, which is a pretty well-regarded engineering school.
Certainly not Ivy League or anything, but reasonably well-regarded.
And their computer science program is decent.
I felt the curriculum was... When I started, I was ahead of the curriculum.
There was definitely a lot of areas where I learned more through the curriculum.
But just like Nathan, your past guest, talked about, his school did co-ops.
Actually, University of Cincinnati is where co-ops were created.
So all of the engineering schools do a, usually it's a one quarter on, one quarter off. Computer science tended to do more of a six-month on, six-month off. And so I ended up through that getting an internship at Apple. And so I went from like,
I really like computers and I'm learning about programming and some of the details of how the
computer works to I'm working in a group that actually does performance analysis of up-and-coming
processors for Apple computers. And so I ended up just kind of leapfrogging everybody else at the school in
terms of what I understood and everything. So it was this mix of like, would I have done well just
being in the curriculum there? Yes. But honestly, the co-op really brought a lot more to the table
in that regard. That makes a lot of sense. How did you end up getting
into the performance analysis kind of aspect of Apple?
So my friends and I dispute exactly the exact sequence of events here. But essentially,
what happened is there was a centralized office that helped me with co-op placements.
And they were generally well known as being absolutely terrible at their job. So often it was you had to negotiate your own internships or co-ops
with a lot of companies. And so they often were in the area, you know, in the Cincinnati area,
it'd be like going to GE or Procter and Gamble or whomever. And occasionally someone would end up with one of these far away, you know, somewhere else.
And it just happened that one of my, one of the other students that I knew
got an internship by, like, somehow his resume ended up at Apple. And as far as we understand,
it wasn't supposed to have happened, but it did.
And he ended up getting an internship from that working with their compiler team.
And from that, it was, hey, he's coming out and it's a six-month internship, not three months during the summer, so they could have interns throughout the year, effectively. So that team started to spread the word internally that they could pull more people
in from the school, which then turned into, well, hey, who would you refer as someone to come in?
So that person referred a friend of mine who ended up being more, he was an electrical engineer
student. And so he ended up coming in on the hardware development side and working with the architecture performance group, and he ultimately referred me as someone else who could come in more on the programming side and the system analysis side. So
that's how I ended up there. But it sort of built around, there was just a manager that really
understood how to get internships moving through the system and get them brought into this organization and kind of
fan them out through that. We actually ended up with a lot of people there. There's probably a
good seven or eight year timeline where people from University of Cincinnati were coming into
the Mac hardware group via that. Oh, wow. That's awesome.
What did you work on during the co-op?
And then ultimately, did that end up being what you worked on when you joined Apple after university as well?
So the very first thing that I got as an internship project
was validating that the performance counters
on the PowerPC 970 actually counted
the microarchitectural events
that they claimed to count.
Which turns out, I had no idea what a performance counter was.
I had no idea what most of these events were.
And I had no idea how I was going to write code
to trigger these particular events to even do validation.
So that was a very challenging project. I did actually ultimately figure it out.
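As a loose sketch of the kind of check involved, validating a counter by running a workload whose event count you can bound ahead of time and seeing whether the hardware agrees, here is roughly what it looks like with Linux perf on a modern machine. The real work was on the PowerPC 970 with Apple- and IBM-internal tooling; the event name and margin below are only illustrative.

```python
#!/usr/bin/env python3
"""Sketch: check that a hardware counter at least scales with a workload that
is guaranteed to generate the events it claims to count. Assumes a Linux box
with `perf` installed and permission to read performance counters."""
import subprocess

ITERATIONS = 5_000_000
# A counted loop retires at least one conditional branch per iteration, so the
# "branches" counter should report a value comfortably above ITERATIONS.
workload = f"i = 0\nwhile i < {ITERATIONS}:\n    i += 1"

proc = subprocess.run(
    ["perf", "stat", "-x,", "-e", "branches", "python3", "-c", workload],
    capture_output=True, text=True, check=True,
)
# With -x, perf prints machine-readable CSV to stderr; the first field of the
# event line is the raw count.
branches = int(proc.stderr.strip().splitlines()[-1].split(",")[0])
print(f"{branches:,} branches counted for {ITERATIONS:,} loop iterations")
assert branches >= ITERATIONS, "counter reads lower than the workload guarantees"
```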
And through that, I actually ended up writing the documentation for IBM on how their performance
counters worked, because all of their documentation was wrong. And they just eventually were like,
here is the FrameMaker source code, please update it, and we will ship whatever you write. And so that taught me a lot about how the processor
worked, very, very detailed, inside. And that was on PowerPC 970, which is the G5, which is a very complex, out-of-order processor. There was just a lot going on. And so that was how my internship kind of leapfrogged me ahead of my classmates: I had industry experience with bleeding-edge processor microarchitectural design right up front.
And so I ended up staying with that team for a long time.
Actually, they hired me on
and I worked there for, what, four years?
And like I started in that area, I moved a little bit more over to developing the tool. So that same team, that team was kind of this
cross-functional thing where it was part of the Mac hardware group, and it was definitely focused
on understanding the performance of Mac hardware systems. They were looking at the new systems that were being developed to kind of inform
and work with the vendors on how we might modify things
to improve performance.
But they were also working with third parties
on how to tune software for those new platforms.
And so the third part of it was developing all the tools
for all Apple developers to be able to do performance analysis work
and come back with it.
So I ended up moving more into the tool development side.
That team notably released the CHUD tools, the Computer Hardware Understanding Development tools, which was actually a reference to the terrible 80s movie.
But the whole idea there was these are the tools that you can use to
figure out how your application actually is running on the hardware. Beyond sort of like
your basic statistical sampler of CPU time, this was, how do I actually dig in and find out what's happening at the microarchitectural level, what's happening throughout the system, and raising that information in the way that you kind of expect from Apple. It's like, here's your code and here's the tooltip that tells you exactly what's happening on this line.
Right.
That was kind of the thing. And so I ended
up being tech lead of that team for a couple of years through kind of the later versions of that
software. And I worked on almost every product that Apple shipped from 2004-ish to 2009.
So right in the middle of your time there was the PowerPC to x86 switchover, is that right?
That is correct.
Okay.
Yeah, that happened like right as I was joining full-time, going from my last internship to full-time.
I joined full-time in 2005.
And yeah, we had development units that were hidden inside G5 cases and all sorts of things.
It was kind of a big deal.
Very big change for our team, certainly.
You can imagine that when you're doing the performance work at that level, right, digging into the chip details, going from PowerPC to any other architecture is a massive change in the tooling. To understand, you know, the conceptual model of how the processor works probably isn't that huge of a change, but the exact details of how you control performance counters, how you actually know the subtle details of what's happening through execution units, etc., it took a lot of time to build up that knowledge again for the new platform.
Absolutely. Did y'all have kind of an abstraction layer over the PowerPC that allowed you to kind of like write a new back end, if you will, for it? Or was it mostly...
Oh no, not at all. It was entirely written to support PowerPC and nothing else.
So, I mean, and keep in mind that with Motorola slash Freescale and IBM, we had a very close
relationship to the extent that if you install the Chud tools, it actually ships full
cycle accurate simulators for G4 and G5. Oh, wow.
You know, it's for running instruction traces of small snippets, not like booting a whole system.
But we had that level of relationship with them where that was just there. We shipped
the entire assembler reference. So when you were
looking at an instruction, you could just click show me the instruction and it would pull up the
PDF to that exact page kind of thing. And working with Intel, you know, it took a while to build up
all of that because we had just never assumed that we would change to a different architecture.
That's kind of the classical problem with changing architectures is you just build up all these
subtle assumptions about how the system would work. And a lot of folks are more familiar with
the like, oh, you're running on big onion and you're switching to little onion, you're going
to have a lot of pain. There's also just a lot of pain for, I wrote my entire stack assuming that all my instructions were 32 bits.
Oops.
And then we turned around and did it again as the iPhone started development, where now I had to do ARM.
I ended up writing an ARM disassembler over a weekend to support the iPhone program.
We were trying to make that happen as fast as possible, and she wanted it, and at least we had done it once with x86 at that point. But yeah, going to three different architectures that we supported, the code base started to get quite ugly at that point.
Right, right. Did y'all have more or less heads up with the iPhone in terms of the transition coming? Because it sounds like y'all might have not had a ton of heads up on the PowerPC to x86 transition.
Apple is, as is well known, very tight-lipped, both externally and internally, and very siloed.
And so the x86 development started as a small team somewhere else, not in the hardware group, kind of doing their exploratory
work. And then eventually the hardware project started, and then eventually the performance team
came in. And the iPhone kind of happened the same way. It started with some folks from iPod
kind of speccing out what the system would look like, etc. And by the time we got pulled in, they already had engineering units, not form factor, but
prototype board units.
They had a software stack that was booting.
They had been doing a huge amount of work without us.
And so we got brought in late, wherein it was actually, we're seeing problems that we
need your tools to understand what's going wrong.
And we said, well, now we have to go actually figure out
how to pull out performance information
out of the system that you're running on.
So give us a couple months.
You don't have a couple months.
iPhone was specifically to the extent
where I had to sign a separate NDA just for iPhone.
And I actually had two office locations
because I had one in building five,
which is where the Mac hardware group was, which is my normal office.
But then building two was where they were actually doing all of the iPhone development.
And so I had to go up there to a different desk for any time I was working on iPhone related stuff.
Interesting.
Were there any, you know, performance bottlenecks or anything like that, or any bugs that you all encountered, I guess, you know, not just in the transitions but just in general, being the authors of this tooling, where y'all had, you know, a vendor or a third party or someone like that come in and say, hey, like, what's happening here, your tool is telling me that, you know, there's a performance bottleneck, but I don't understand?
Any particularly interesting ones? I mean, we worked with the Adobe folks on Photoshop a lot, and so there were always, you know, subtle things that came up there. But they also were quite well versed in what could happen, right, like how their software worked, and we're like, we're working on very core loops. I do remember a lot of game developers coming in. Um, you know, the Mac is not particularly viewed as a game platform, at least it wasn't, especially back in the PowerPC days. But I remember sitting down with the developer who was porting one of the Rock Band games, and we're just like, why is this burning three cores at 100% and still struggling? And we looked, and
it was the game was written assuming
the Xbox platform and that it was just going to have three cores. And so they had just written
three busy loops that one on each core that ran, you know, it was like one was disk IO and one was
network IO and one was everything else. And we're, this does not map well, right? Like it was entirely
assuming it was going to run on a Windows NT based system and how critical sections worked and all that, and it was very terrible. But I also worked with, like, the G5 was used to build a supercomputing cluster. So I ended up getting a lot of, you know, here's an application they're running, you don't get the whole application, but you get this tiny piece of it, like, iterate on that. We used to do a lot of demos for
uh, like, we would work on an open source program to show how you would go through the process, for talks at different conferences and things. And I remember one time we looked at an open source program called Celestia. It's a star system simulation where you can travel to different planets and see the simulation of the orbital motions and things.
And we had all these different options.
We'd show how you could replace calls to sine with table estimations and basic stuff.
And at the end, as we're getting down further and further into like, here's how you can reorder your instruction scheduling to be able to be tuned for G5 behavior.
We're like,
and now you're at 16,000 X performance.
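The "replace calls to sine with table estimations" step looks roughly like the sketch below; in the actual demo it lived in Celestia's compiled hot loops, and the table size here is arbitrary.

```python
#!/usr/bin/env python3
"""Sketch of a table-based sine approximation of the sort used in the demo:
precompute sin() at fixed steps, then interpolate instead of calling libm."""
import math

TABLE_SIZE = 1024
STEP = 2 * math.pi / TABLE_SIZE
# One extra entry so interpolation at the top of the range never runs off the end.
SIN_TABLE = [math.sin(i * STEP) for i in range(TABLE_SIZE + 1)]

def table_sin(x: float) -> float:
    """Approximate sin(x) by linear interpolation into the precomputed table."""
    x %= 2 * math.pi
    idx, frac = divmod(x / STEP, 1.0)
    i = int(idx)
    return SIN_TABLE[i] + frac * (SIN_TABLE[i + 1] - SIN_TABLE[i])

if __name__ == "__main__":
    worst = max(abs(table_sin(x / 100) - math.sin(x / 100)) for x in range(629))
    print(f"worst-case error over [0, 2*pi): {worst:.2e}")
```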
Did y'all end up making any,
any upstream contributions?
Oh yeah.
Yeah.
We always did.
That was the thing, is we would do the work to do this and we would send it all upstream. And often they accepted it. You know, they were just, like, they were very gracious on it.
We did that with a lot of projects, actually.
Awesome.
So two questions kind of wrapping up your time at Apple.
One, from my research, it looks like the Chud tools are still used today by Apple developers.
I'm not entirely sure.
Like, I honestly haven't paid attention to the Apple platform for a long time.
When I left in 2009, there was a big push to fold the chud tools into instruments.
And so in some ways, when you're using instruments and you're using time profile in there,
it's actually using all the same infrastructure.
It's just kind of through a different UI.
I see. there, it's actually using all the same infrastructure. It's just kind of through a different UI. So, you know, people are using a lot of the pieces, but not necessarily through the front end. Absolutely. And so when you did decide to leave, I think you said 2009, was it mostly
you wanted to go work on something else? Or was there, I know you ended up going to Google,
was there something particular about Google that was alluring at that time?
No, actually, I followed a friend who left from Apple and went over to Google.
And so I just had a good path there and kind of knew there were interesting things happening there.
But I was looking for something different to go do.
The architecture performance group had been located in the Mac hardware division and got
split up and moved.
The Chud tools ended up reporting to developer tools.
And we just had a lot of culture clash.
Developer tools from the CoreOS group was very much of the opinion that you were there
to work on software and it was there to work with the operating system.
And you didn't actually care what the hardware was too much.
I see.
And of course, the entire thesis of Chud was, if you know what the hardware is, you can
do a lot more.
So we just kept running into issues of like, I need a kernel extension to do these things.
You can't install a kernel extension.
Sorry.
Gotcha.
So you follow your friend to Google. Um, you know, like, naively in my mind, I think you also did some performance work at Google, but, um, you know, you're kind of going from this focus on, you know, Apple, especially at that time, I imagine, was mostly, you know, a single machine that you're kind of analyzing and seeing how the CPU or other components in that system are working.
And then Google is, you know, both large scale companies,
but Google is very different in that you're working with large network systems
where you're kind of, you know, analyzing perhaps at a data center layer.
Was that kind of like a, did you find that the skills translated quite a bit?
Or was Google kind of like walking into a new environment for you?
Some skills translated. But certainly, I mean, it's hard to answer that question, really. And
anybody who's had the privilege of working at Google will relate and everybody who hasn't,
it's just hard to explain. Not only is the scale of the systems that you're working on so vastly different, right? Like you
said, it's not looking at one or 10 or a thousand machines anymore. It's looking at tens of thousands
of machines, hundreds of thousands of machines. And so there's that angle, but also everything
is managed in-house through bespoke systems and they're not intended to be shipped to customers.
Whereas with Apple, yeah, it was all a big proprietary stack, but there was definitely
a focus on you have to ship things to customers. And so you can do some hacks, but you have to
make it a little pretty to make it acceptable to the customer. With Google, as long as it worked,
it didn't matter too much. You needed it to be reliable, but it didn't have to be pretty, necessarily.
But it also meant that you didn't have to abide by any industry standards.
You didn't have to, like, the design space was wide open, as long as you could justify why to do it that way.
And so, you know, you put those two together and you get into odd decision matrices where you're like, we're doing it completely different than everybody else does, but that's because it saves us
this amount per machine, which times the number of machines makes a huge difference.
We used to talk about like the team I was in wrote a lot of the software that runs on every
single server in Google's data centers. So there's system daemons that manage the hardware and do telemetry and health monitoring and a variety of other things.
But we at one point did a calculation where we figured out if anyone on our team saved one megabyte of RAM, we paid for the entire team's salary for the entire year.
Oh, my gosh.
No, I mean, it became difficult to actually save one megabyte of RAM because we had already done
a lot of that optimization to fit in there. But that was like the stakes, right? The scale of
things changes how you think about the problem a lot.
Right. That's actually interesting. The episode that comes out tomorrow is with Matt
Godbolt, and he spent some time working at Google as well. And one of the things that
we talk about on that episode is kind of like having that scale where, you know, like saving
a meg of RAM is having that large of a cost implication provides a lot of justification
for working on some interesting things that,
you know, just wouldn't make sense economically at other companies.
Did you have any examples?
I mean, maybe that is one right there, right?
Of like saving a bit of RAM.
But did you have any examples of things that you worked on or you saw folks working on
where they got to kind of like go down that optimization rabbit hole because of the sheer
scale of the systems y'all were working
on? Oh, absolutely. It happened all the time. Things that come to mind. So you know how
when you type in on Google search, it auto completes for you? Right. I had nothing to do
with that feature. However, that feature was burning so much CPU time in one cluster that was depending upon a search
indexing that was running in another cluster that they kept coming to me and saying, your
software is saying that there are problems on the system and taking machines out, which
is causing our service to be unreliable.
We can't actually launch our service publicly. And that led down a digging through so many layers
of the system and ultimately figuring out that there was a bug in the CPU scheduler in the kernel
where I was not being given enough CPU time to actually do the work because the system was so
heavily overloaded. I've never seen load averages in the 500s ever again.
Right, right.
So there were things like that, but also things like, uh, you know how PCI Express has a lot of pins on it, and some of those pins are not commonly used, right? There's like JTAG and there's other things, and there's actually ones that are reserved for future use. Well, we justified why we should dedicate a pair of those to be a USB pair so that we can have USB out to our expansion cards to talk to different things.
I worked on a system where we used SR-IOV. I don't know how familiar folks are, but with PCI Express, you can do I/O virtualization, where you make a single PCI device look like multiple devices. So you can share them with guest virtual machines or things.
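For a sense of what that looks like from the host side today, here is a minimal sketch using Linux's standard sysfs controls for SR-IOV; the PCI address is hypothetical, and it needs root on a NIC whose driver actually supports virtual functions.

```python
#!/usr/bin/env python3
"""Sketch: enable and list SR-IOV virtual functions on a physical NIC via sysfs.
The device address below is made up; adjust it to a real PF on your system."""
from pathlib import Path

PF = Path("/sys/bus/pci/devices/0000:03:00.0")   # hypothetical physical function

total = int((PF / "sriov_totalvfs").read_text())
print(f"device supports up to {total} virtual functions")

# Writing to sriov_numvfs asks the driver to create that many VFs; each one then
# appears as its own PCI device that can be handed to a guest VM.
(PF / "sriov_numvfs").write_text("4")
for vf_link in sorted(PF.glob("virtfn*")):
    print(vf_link.name, "->", vf_link.resolve().name)
```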
And we used that feature to emulate MR-IOV, which is where you have multiple computers attached to a common PCI Express fabric.
And then we use PCI Express switches.
So I figured out and worked with a team where we ended up booting eight machines off of a single network card that they all shared. You know, it's just like, not a problem that anybody else is ever
going to look at, but it was a way of looking at how would you deal with bandwidth problems and
the scale of how much cabling and deployment of the system, etc. So yeah, lots of edge cases. I
mean, it just probably goes on and on and on. And some of that work ended up coming out through
Open Compute Project, you know, eventually Google joined and some of the design work that came out, even OpenBMC is a case where,
funny story on that one, we were working with Rackspace on Barreleye G2, is what they called it; Google called it Zaius. It was a POWER8 system, IBM POWER8. And because it was going to be public through Open Compute Project, and because Rackspace was going to use it, they wanted to have a common BMC, or Baseboard Management Controller, that you would normally find on a server, because that's how Rackspace's infrastructure is designed, but Google's is not. And we were trying to figure out how we were going to support it internally.
And my boss was in charge of talking with AMI about their software stack, getting a license for MegaRAC to run on this.
And I made a friendly wager with him one day that after he had been trying to work with them on getting just a price quote, I said, I bet I can actually get Linux to boot on the BMC faster than you can get a quote
from them. And I did it in two days. And it took like two and a half weeks for him to get a sales
quote. So by the time he got a quote, we actually had a fully booting Linux stack. And that's kind
of how we ended up working with the OpenBMC folks. And I ended up working in that space of
getting a bunch of the big industry players at the time, Facebook, Google, Microsoft, IBM,
all to come together and actually form that as a proper project under Linux Foundation to,
you know, here's how you actually build an open source management stack for these systems. But that was also driven from a need for, you know, solving some of these problems at scale and starting to work with other players. But you see a lot of these things. We built, you know, 48 volt to point-of-load voltage regulation. I think no one else does this. But, you know, that's going from like 48 volts directly to your CPU core in one stage of conversion, but the efficiency makes a lot of sense when you're at the scale of Google.
Right. So you talked about BMCs, and we're definitely going to get back to that in the future, but can you talk a little bit about just, like, the general architecture of Google servers and what role a BMC is playing, I guess, in Google servers, but also more generally?
Yeah, the concept in like a server is that the main CPU is usually doing some work that's owned
by some team. And they have applications that they're dealing with and whatever. But often,
it's a separate team or organization that's responsible for the hardware management.
So it's like keeping the machine physically turned on and running and reporting the health
information about hard drives and fan failures and that kind of stuff goes to your IT or data
center operations folks, whereas application type failures go to the team who actually is using the machine.
And so a long time ago in a faraway place, Intel worked with a couple of other companies to create IPMI, which specifies how a BMC exists on a system. And it really starts as, like, doing environmental control
and power monitoring and having a way of querying
that type of information from the host processor.
So it's about offloading all of that.
And one of the key aspects is that you can access
that information even if the host machine
is completely locked up or turned off.
So you can also do things like reset the system.
You can power it off.
You can power it back on.
And over time, that feature set has grown more and more and more to the point where
you have an integrated KVM support. So you can actually just go to a website that's run by the BMC on that server board. It's actually an entirely separate computer on the same motherboard
as the main system. But you can go to a website that's hosted by that,
get a UI that loads and gives you a display of what the actual console output looks like from the,
you know, if you hooked up a monitor on the VGA port,
you see the exact same thing.
And you have keyboard and mouse control.
And you can select like an ISO image,
you know, like a CD image and have it be mounted as a virtual CD-ROM drive on the remote system.
So you can do like complete from scratch OS installs remotely.
And so there's a lot of useful features for that.
I mean, it falls under a lot of larger category folks talk about as lights out management.
You know, the idea that the lights are turned off in the data center.
Nobody has to be there.
This is how you know how you manage and interact with the systems without having to physically
be there.
And also knowing when you actually do have to go physically touch them, you know, oh,
a fan failed.
I got to go replace that.
So is the whole interface, like the HTTP API that it sounds like you're explaining there, is that part of IPMI, or is that just kind of vendor-specific implementation?
So IPMI is old. IPMI came out of the late 90s, and there was a couple revisions of it. And it is its own bespoke protocol. It's something that you talk over TCP/IP and have very specific data frames and stuff. And so when you run, like, ipmitool on a Linux machine or something like that, you're speaking this custom protocol to that device. So the web UIs came about as a way to introduce a more friendly way of doing it through the web browser, once the web had really become a big thing and people realized that you could run a web server on one of these
as a friendly thing. But yeah, it's more bespoke. There's been a change in the past 10 years of, there is a later standard called Redfish, which is kind of reimagining IPMI as a REST API, so it looks more like you'd expect for a web interface as an API. But it kind of has the same data model as IPMI. It's like, you ask the system, how many sensors do I have, and then it comes back, and it's like, oh, well, which ones of those are temperature sensors, you know?
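As a sketch of that Redfish data model, walking the sensors over HTTP looks roughly like the snippet below; the address, credentials, and which resources exist vary by BMC, so the paths follow the DMTF schema rather than any particular vendor's firmware.

```python
#!/usr/bin/env python3
"""Sketch: list temperature sensors from a Redfish-speaking BMC.
Host and credentials are placeholders; many BMCs ship self-signed TLS certs."""
import requests

BMC = "https://bmc.example.com"       # hypothetical BMC address
session = requests.Session()
session.auth = ("admin", "password")  # placeholder credentials
session.verify = False                # typical for out-of-the-box BMC certs

chassis = session.get(f"{BMC}/redfish/v1/Chassis").json()
for member in chassis.get("Members", []):
    thermal = session.get(f"{BMC}{member['@odata.id']}/Thermal").json()
    for sensor in thermal.get("Temperatures", []):
        print(sensor.get("Name"), sensor.get("ReadingCelsius"))
```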
Right. Are IPMI or Redfish, or any alternatives if there are any, are they ever used outside the context of data centers? Or is this, you know, mostly a pretty focused protocol?
I mean, it's mostly used for dedicated server hardware. Whether that's in a data center or not kind of depends. I mean, the machine I'm on right now happens to have a BMC and can actually do all this stuff.
Do I use it in my house on a regular basis?
No, not really.
But yeah, it tends to be focused on that out-of-band management, lights-out management type situation.
And so that mostly finds its way in server gear.
Sometimes you find it in like industrial control
equipment that's based on pc stuff you know like there's various applications for it but it doesn't
show up in your usually doesn't show up in your average pc although there was a There is a parallel in laptops.
Intel has what they call AMT, which is not a BMC, and it doesn't really speak IPMI, but it offers most of the same functionality.
So you can do remote management of laptops.
And again, it's intended for an IT department to be able to fix your laptop for you from
wherever you happen to be.
Gotcha.
One of the things that kind of stuck
out to me in your description of BMCs was you led with the like organizational component of it.
And obviously there are some technical benefits to the isolation. And like you said, you know,
the lights off management and that sort of thing. But I think that is always an interesting way to,
you know, approach any system design as, oh yeah, there's a different team that does that and is responsible for that. So we literally have, you know, approach any system design as, oh, yeah, there's a different team that
does that and is responsible for that. So we literally have, you know, a different computer
for them. I haven't heard it explained that way. So but it makes a lot of sense.
I mean, it's a long term thing where I don't know that that was the original intent.
Right, right.
However, a lot of solutions end up growing according to how organizational
lines fall. And so that's how a lot of the feature set has developed for BMCs over time,
which has interesting implications because the security model did not evolve the same way.
Right. Well, that was a little bit of foreshadowing, perhaps. But while we're still
at Google, you spent time working on these servers and the data centers and that sort of thing.
I think I also read that you spent some time working on flight control software.
Is that right?
Yeah.
After spending five or six years working on servers and hard drives. I was the manager for the hard
drive team and, you know, had done a lot in that space. I wanted to try something different. And
Google being Google, it has a lot of different areas that are going on. And it just happened
that around that time was just before the whole shift to being Alphabet and having the other bets, but X had existed for a while,
and there were some of these more grandiose, very different ideas, not the typical Google
development things. One of those was building a satellite internet project, so a LEO constellation
of satellites that provided internet access. If this sounds a lot like Starlink, that's because it is.
This is actually a predecessor project.
And through a series of unfortunate events,
the team forked and part of them went to create Starlink.
So common lineage there.
But yeah, I was working on flight control systems
and a lot of the high level
architecture for how the system would work. You know, when you're designing a satellite,
you kind of have the part of the bus, they call the satellite the bus, the part of the bus that's
responsible for keeping the satellite where it needs to be, and doing all of the flight related
things. And then you have the payload, and they're supposed to be entirely separate systems like they should have no
interaction, really. And so I was mostly focused on flying the bus. But a lot of that was just figuring out what did it need to do, like what was the sequencing that needs to happen, what sort of orbital maneuvers would we need to do. I learned way more about orbital mechanics than I ever had intended to ever learn. But I was only with that project for, geez, maybe a year or so before that project got turned down, and that's around the time that Starlink got started up. But one of the pieces that we had built, or started to build, out of this was,
even though I was mostly focused on the flight control side officially,
I was working with a lot of folks that I knew from the server teams and the networking teams on,
there's an interesting data routing problem.
How do you actually build a network that goes through multiple satellites?
Because you not only have the classic networking problems like in Wi-Fi of what's my quality of my link?
You now have a problem of everybody's moving.
And so the potential set of links that you could form is constantly changing.
Right.
So I was working with them through that problem because it was just kind of fascinating. And we ended up building that system. And we ended up calling it the first
spatial temporal software defined network. And it literally is, it's a cluster system that just
kind of evaluates where everything is in space and evaluates all the potential RF links based upon
the radios and pointing capabilities that each of them have
and comes back with solution sets
so you can predict in the future where it's going to go.
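Stripped way down, the "evaluate where everything is and which links are possible" loop is something like the sketch below; the real system also modeled antenna pointing limits, link budgets, and handover costs, and all of these positions and ranges are invented.

```python
#!/usr/bin/env python3
"""Toy version of spatial-temporal link planning: given predicted positions of
moving nodes at future times, list which pairs could form an RF link at each
time step. Every number here is made up for illustration."""
from itertools import combinations
import math

MAX_RANGE_KM = 2000.0   # hypothetical maximum usable link distance

# predicted positions (x, y, z in km) at two future time offsets, in seconds
ephemeris = {
    0:  {"sat-1": (0, 0, 550),   "sat-2": (800, 0, 550),  "ground-1": (100, 50, 0)},
    60: {"sat-1": (400, 0, 550), "sat-2": (1200, 0, 550), "ground-1": (100, 50, 0)},
}

for t, positions in sorted(ephemeris.items()):
    links = [
        (a, b)
        for a, b in combinations(positions, 2)
        if math.dist(positions[a], positions[b]) <= MAX_RANGE_KM
    ]
    print(f"t+{t}s feasible links: {links}")
```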
And that ended up following through
a couple of different follow-on systems of repurposing it
because we built the system to track anything.
We didn't care how it was moving.
So it was like, oh, we're not on satellites anymore?
Well, guess what? It works on aircraft.
Oh, you're not on aircraft anymore?
Well, the Loon folks with their balloons,
they could use that too. You want to do between balloons and
aircraft? Sure, why not? So that project actually continued on and eventually got spun out. And
some friends of mine still run that as a company. I can't remember the name.
But yeah, so that's actually a product that they're trying to sell now on the market.
Yeah.
So obviously, Starlink has gone on to be successful by a lot of measures presently.
What was kind of some of the blockers at that point in time?
Were there any technical blockers or was it mostly organizational in terms of that project's fate within Google?
I mean, yeah, there were a lot of challenges.
Not only, like, from a technological standpoint, there was a lot of how do you actually build
Leo to ground radio comms that tolerate the,
you know, basically how do you build the ground station
or like the user access
terminals, not the gateways. But there were a lot of challenges with just the technology for
building the antennas and stuff to hit a price point. You know, like it would be trivial if I
could have you spend $5,000 to buy a steerable antenna, but that's not practical for that type
of system.
So the idea of getting to that panel that's actually a phased array of all sorts of antenna elements was a lot of technical development that was going on.
And Starlink has definitely moved in that direction and shown to be successful in that
area.
There's a lot of other aspects in terms of the software backend of not only the data
routing, but the flight planning.
We had a lot of discussions around how do you actually keep in touch with all of the other organizations that you need to about potential collisions in space?
There are processes established for this, but they were never designed for the scale of the number of satellites in space.
And so they don't have the ability to do that conjunction analysis that fast.
So there were problems associated with that.
Yeah, lots of different things.
But from a technical standpoint, a lot of it was, how do you just build the radios?
How do you build the power systems?
How long is the satellite even going to last before you have to intentionally
de-orbit it? And a lot of those turn directly back into business problems. If I have to replace a
satellite every month, that's expensive. Right. Yeah. So you mentioned also kind of like
collaborating with the server and networking folks during that time. I'm not very familiar
with the architecture of satellites
and especially internet satellites, if you will.
Is it essentially like a flying server?
It kind of sounds like I couldn't help but draw analogies
between what you were talking about with the flight control system
and the BMC that we were just talking about,
about doing similar operational tasks for the server. And it's not a bad parallel. The main difference is that,
you know, a flight computer has to be radiation tolerant. It has to not fail effectively.
Otherwise, your satellite's completely useless. And that's the worst case, right? Like,
the absolute worst case for a satellite is you lose control of it and it's just there and you can't intentionally de-orbit
it. Like, if the payload fails, I can intentionally de-orbit it and get it out of space, right? I can get it to burn up in the atmosphere, and that's a much better result than having it just be there as inoperable.
So the flight computers and everything related to flying the bus tended to be specced for much higher reliability and a much longer lifespan so that you were pretty confident that you'd be able to fly the thing
even if you lost the ability to use the payload to actually run the data comms.
I see.
And I can't help but think about as well, in this, you know, you mentioned the episode with Nathaniel earlier, and we talked about their CT scanners, and naturally what came to mind to me was, how are you updating these systems? Partly because I work for an IoT company during the day, so, you know, we're always thinking about OTA updates and
that sort of thing. What was the software update process like for these satellites?
We didn't actually get that far in terms of the development. I mean, it was like,
try to make the system work first and figure out what all components you were going to have
and then figure it out. But it was definitely a very key thing on our minds that a firmware update that kills the machine is not okay, right? Like, you have to be safe no matter what happens. And so there were a lot of thoughts towards, how do I actually get the system to come up in a mode where I can use one of the radios to basically be a serial port to the console port, and be able to upload things over XMODEM if I have to, kind of stuff.
Gotcha, gotcha. So after working on that, you already
mentioned the Open Compute Project and your involvement there. And maybe somewhat kind of tangential to that, I know you started getting involved in open source FPGA tooling.
Obviously, there's quite a bit of FPGA usage in data centers for things that are kind of like similar to a BMC or where they're doing some sort of operational task.
Once again, Nathaniel and I talked about that on the Oxide sled. But, you know, how did you get interested in FPGAs and then especially get interested in the development tooling? I mean, from my perspective, it's hard
not to get interested in it when you suffer through using it. But what was your path to
kind of getting exposed to that? I had used FPGAs a little bit in college.
And frankly, with the way Google did their server designs,
there wasn't much involvement.
The network group tended to do a lot more with FPGAs,
and I just didn't have a lot of exposure to it there.
But as we got into working on BMCs and things,
one of the things you find quickly in that space is there are two companies that make BMC processors. There's ASPEED and there's Nuvoton, and that's it. And part of that just
comes with, it's a very strange set of hardware capabilities that you're looking for. It's not
about the software. Why can't you use an STMicro for it? Well, I need like 15 I2C controllers.
And I need six PWM-controlled fans with tachometers.
And I also need custom protocols that the processor vendors make up.
And that's just not a thing that you find in general.
And so there was always this thing of, well, the ASPEED parts are not great. They're not well designed. And the Nuvoton ones are expensive.
So what if we made our own?
And, of course, the answer is that's ridiculous.
Nobody wants to make their own chips.
You can tell that was a certain time period at Google.
Right, right. And there's an announcement today about some new Arm server processors at Google, so...
Right, so that's changed over time. But at the time, I, you know, had discussions with, like, Bart Sano, and Bart was just like, I've done chip design before, we're not doing chips. And it was totally reasonable.
So there was always somebody that was like,
well, what if we used FPGAs for this?
And so somehow through the open compute work
and through those kinds of discussions
where we kept wanting to use FPGAs to replace these BMCs,
I got connected with Tim Ansell.
And Tim was trying to get together a pitch to the internal startup incubator at Google
to develop open source tooling for FPGAs. And I thought, you know what, this sounds really cool.
Like, I understand where this problem is. And the FPGA tooling is absolutely horrific.
So sure, why not, right? Like, this is an area where I can go in, and they'll give us a year to work on this and see where we get. And that was really the introduction to it. It was just, I kind of understood where the problem was and figured I could help, and they were happy to have me come join. It was only three of us really working on it. And a lot of it was actually not so much building the tooling as it was figuring out how to reverse engineer the chips themselves, which is a very different skill set.
Right. And I think you
worked on Project X-Ray, the Xilinx bitstream reverse engineering?
Yeah. So, I mean, Project X-Ray has an interesting history. Claire Wolf, who did Project IceStorm for the iCE40s, actually had started Project X-Ray much earlier and figured out that these are big, complicated devices and it was going to be really hard to do. And the iCE40 was an easier target. So this was actually picking up that work and going with it. And this was really the reason for having the project be done through the startup
incubator was to give us time to actually go figure out how to do this. And so it was three of us working on Project X-Ray, which was, you can think of it as coming up with arbitrary, like, designs that probably didn't make any sense, you know, feeding in Verilog or specific device placement things, and then setting really strict constraints to force very simple circuits to be placed in very specific locations on the chip, so that we could then take the resulting bitstreams and compare them against various permutations and see what bits changed. And from that, infer what behavior each bit actually was doing. So you'd have to settle out, was this something where this section had always changed, that's actually the ECC over this section of the configuration ROM, or, you know, the configuration data, or is this actually the bit that tells me, use this input to this LUT, kind of thing. And so a lot of it was writing tools that did that analysis. And I wrote a lot of tooling for actually picking apart the multiple layers of encoding that they do for the bitstream that actually gets stored in flash, to get to the actual configuration data.
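At its simplest, the correlation step reduces to diffing bitstreams from two builds that differ by one forced change and recording which bits flipped. Here is a toy version of that core operation; real prj-xray works over many permutations and a frame-addressed view of the configuration data, and the file names are hypothetical.

```python
#!/usr/bin/env python3
"""Toy bitstream diff: report the bit offsets that differ between two images,
e.g. builds where a single LUT input was forced on versus off."""
import sys

def load_bits(path):
    with open(path, "rb") as f:
        return f.read()

def flipped_bits(a, b):
    """Yield absolute bit offsets where the two images disagree."""
    for byte_off, (x, y) in enumerate(zip(a, b)):
        delta = x ^ y
        for bit in range(8):
            if delta & (1 << bit):
                yield byte_off * 8 + bit

if __name__ == "__main__":
    base, variant = sys.argv[1], sys.argv[2]   # e.g. lut_off.bit lut_on.bit
    diffs = list(flipped_bits(load_bits(base), load_bits(variant)))
    print(f"{len(diffs)} bits differ")
    for off in diffs[:20]:
        print(f"  bit {off}")
```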
And yeah, I mean, we got pretty far, but it turns out Xilinx 7-series FPGAs have a lot of different tile types, and there's a lot of bits in them.
And they're still working on that project,
actually, many years later. So it's, it's a tall order, but it was a lot of fun.
Absolutely. So did you also get involved in any way and kind of like some of these modern HDLs?
Because obviously, you know, you already mentioned Verilog. There's System Verilog, VHDL, but it seems like also on Hacker News every week, there's a new kind of like HDL that someone has done as part of their PhD thesis or something like that.
Did you have any involvement in those areas? Because I know there was SymbiFlow, right? Was that the name of the whole project? Yeah, that was the top-level name for doing Project X-Ray and trying to use available open source tooling to actually build out a toolchain around it.
So it's do the reverse engineering work and then be able to use that to build out a tool chain.
And there's there's just this small community that actually works on this kind of stuff.
And once you get introduced to that community, you also find that that's where all of these modern HDL folks come from.
So that's where I met folks like whitequark, who does Amaranth. I was introduced through that path of, hey, I'm working on Project X-Ray, and getting into
IRC chat rooms and talking about the tool chains and what's going on.
And so I know some of the folks that do some of them.
And I keep track of it because ultimately, when you approach it as like a compiler problem
or a software engineering problem, Verilog's a terrible language. And SystemVerilog is the C++ of HDLs. They're useful, but they give you a lot of foot guns.
And the modern HDLs, some of them do better at some areas than others. And, you know,
they all have their different trade-offs. And I think it's really just a renaissance of trying to come up with an HDL that better
represents the problem at hand.
But there's kind of this fundamental challenge that happens there, too: because the tooling is still nascent in the open source space (Yosys works but has its limitations, and nextpnr has its limitations), everybody falls back on, well, if my modern HDL generates Verilog, then I can feed it into the vendor toolchain.
And it's kind of like the JavaScript situation.
You know, you have TypeScript, but TypeScript really just compiles down to JavaScript anyway.
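As a small illustration of that "modern HDL emits Verilog" flow, here is a minimal Amaranth module converted to Verilog. This is a generic sketch: the import paths reflect recent Amaranth releases and may differ slightly between versions, and the emitted Verilog is what would then be handed to Yosys/nextpnr or a vendor toolchain.

# Minimal sketch, assuming a recent Amaranth release (amaranth / amaranth.back.verilog).
from amaranth import Elaboratable, Module, Signal
from amaranth.back import verilog

class Blinker(Elaboratable):
    def __init__(self, width=24):
        self.led = Signal()
        self.counter = Signal(width)

    def elaborate(self, platform):
        m = Module()
        m.d.sync += self.counter.eq(self.counter + 1)   # free-running counter
        m.d.comb += self.led.eq(self.counter[-1])       # blink on the MSB
        return m

top = Blinker()
# The generated Verilog is the hand-off point to the rest of the toolchain,
# much like TypeScript compiling down to JavaScript.
print(verilog.convert(top, ports=[top.led]))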
Right.
So you worked on that for a year, and then that was kind of when you started to wrap up your time at Google, right?
Yeah. I mean, the way that the startup incubator worked, you do your work for a year, and then you come in for a review, and then you do a review every six months to get renewed or not. And just after doing that first year, we did a lot of work, but I was
having difficulty seeing how the project would succeed with what the goals were being outlined
and where to go with it from a big business perspective. And having spent nine years at
Google at that point, like I had watched the company grow in a lot of different ways and
kind of change culturally and just decided it was kind of time for me to move on to something
smaller. As you may have gathered, I at this point had worked at two extremely large companies
and just had an opportunity finally where I could actually go work at a startup. Like I'm
at a point in my career where that works okay. And so I started just shopping around for what was
out there. And, you know, we didn't really touch on it, but while I was working on server designs and things, I got heavily involved in hardware and firmware security aspects, and more into PC security in general.
And so I ended up, you know, finding some startups that were working in that space and
moving over that way.
What kind of drew you into the security realm? Was it just like once you start understanding a system really deeply, you kind of have this natural kind of thinking about, you know, I wonder how this fails or I wonder where the holes are in this.
Was that kind of the impetus for you? I always had the aspect of looking at it from, you know, how are things going to fail?
Because that's what health monitoring and like, you know, automated diagnostics is. And, but
there's a slightly different angle when you start looking at how operating as a public cloud vendor
works. And that was really, like, there were two big shifts that happened while I was at Google. One was the Aurora attacks, which is a well-documented campaign against many different companies by threat actors that are believed to be associated with a nation state. And that affected us; my team was one of their targets, right? So
they were trying to figure out how you could do persistence by getting access to firmware and,
you know, building backdoors and things into firmware. And so that, that just kind of spooked
things in a lot of ways and really caused a lot of rapid changes to how we developed software and firmware and how we deployed things to production, et cetera. So I got a lot of front-row experience there too. We had been pretty free and open in how we did all of this, and then it got locked down very quickly.
But then as Google became a public cloud provider, there was another thing of, well,
now I'm running untrusted code. And so how do I actually provide isolation? How do I do all these different things? And I was not actually working on that,
those set of problems, but I was working on the hardware they were running on.
And so we'd get these questions of, you know, how do you bootstrap from nothing? And how do you know
that it's trusted? And it mostly tied into, you know, manufacturing operations. How do I know that
when I receive a server from the manufacturer at the data center door that it wasn't tampered with at some point? And we knew that we were a
big enough target that we had to pay attention to all of these cases that a lot of folks just
don't, right? Like when you buy a laptop, you probably assume that you're not interesting
enough for a three-letter agency to intercept it and do something to it before it arrives at your
door. But when you're Google, you don't get to do that, right? You have to think through some of those cases and how you would actually find it. So that's a lot of where my introduction came into it, and working on some of those things, like I worked on how Titan was used as the root of trust and measurement in the systems and the server designs.
And I co-wrote one of the white papers on how that system works.
And so I had a basis there from that perspective.
But then I ended up joining a startup that was much more focused on application security.
Like, how do you defend against people sending malicious attacks to your application running on a server?
Which was just a very different field for me, but it was also at a startup where I was working with like 10 people.
So it was trying something different.
Right.
So talking a little bit about the Titan system that you were mentioning there,
can you talk a little bit about how that system works
and maybe some of the strategies you'll have at the hardware slash firmware level for mitigating
some of these attacks? Yeah. I mean, a lot of folks often think about, you know,
UEFI Secure Boot or Intel Boot Guard as this way of defending sort of what we ended up calling the
first instruction integrity.
When I press the power button,
how do I know that the first instruction
that the CPU executes is actually trustworthy?
Because once that instruction executes,
I have no control over...
Like, if I can't trust that first instruction,
I can't trust the second instruction either.
Right.
And it's a very difficult problem, actually,
because the PC architecture has grown over time, and this was never a concern in the early days.
So all of your firmware is stored on a flash chip that is outside of the chipset.
It's on the motherboard, for sure, but if you have to consider somebody coming through and having physical access to a machine, they could change the contents of that.
Or if you have an attacker who managed to gain access where they could write to that, then how would you know? How do you
detect tampering? And frankly, a lot of the question is not so much about defending. It's
not prevention, usually. That's certainly a goal, but the assumption is that someone's going to get through.
And so the question becomes, how do you detect and how do you mitigate or remediate?
And so that's a lot of what the Titan design was about.
It was, how do I do opportunistic prevention of someone writing to the firmware?
But also, if someone managed to get something in, how would I actually be able to power down the system, power it back up, and know that I had gotten full control of the system back, and that I didn't have some tampering in, you know, the system firmware somewhere? So in that design, it's actually an interposer on SPI. So essentially it sits between the host processor and the SPI flash and watches. And actually, it does more than watch.
It actually intercepts all of the traffic, all the read and write requests over the SPI interface for the contents of your firmware.
It makes a decision about whether or not it should actually pass it to the flash and whether or not it should get the contents back to the host system.
So it essentially gets to make decisions about, are you allowed to write to flash?
And it also does cryptographic verification. So when you write to the device, it actually is
building up a hash of the firmware and then validating that that matches a signature
before it will ever allow the write to actually complete. And so by acting as this intermediary, and also having control over
the system reset line, when you power on the system, it actually does a hash over the contents
of the flash, and then it verifies the signature chain, and only then does it actually let the
system boot. So you get to a level where attacking the system requires much more physical access.
It's not as simple as, oh, I get root access to the machine.
Now I can just write to the flash.
Nope, it's not going to let you.
And there's nothing you can really do about it other than physically get hands on it.
And even if you do, even if you physically change the contents of the SPI flash, it's still going to notice that you did that.
Right. So you'd have to actually physically manipulate the Titan component to make any headway there.
Right. And of course, there's a lot of technical limitations about this. SPI is very fast. I mean, it's usually only like 33 or 66 megahertz, which doesn't seem particularly fast, but the way the protocol works, when there
is a read command, you have two cycles before you actually have to send data. So you have very
limited time to make any decision or do anything. So there's a lot of small things there of getting
that system to work. And it's really a bolt-on solution. You know, it's not great. And this is why you see other systems like TPMs
are another system that Google doesn't actually,
didn't use up to that point.
I don't know if they do currently or not,
but it's another bolt-on system, right?
It's another way of adding something
to the PC infrastructure
without radically changing how the system works
to kind of add additional security
properties to the system. And you see that kind of trend over time that it's how do I take the PC
and add something to it to give me a little bit more guarantee about how it works.
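A rough Python sketch of the boot-time policy described above: measure (hash) the SPI flash contents, verify a signature over that measurement, and only then release the host from reset, while also filtering writes. Purely illustrative; this is not Titan's firmware, and the flash, reset-line, and signature-verification interfaces are hypothetical placeholders.

import hashlib

class FlashInterposer:
    """Illustrative model of an interposer sitting between the host and SPI flash."""
    def __init__(self, flash, reset_line, pubkey):
        self.flash = flash            # hypothetical object with .size, .read(), .write()
        self.reset_line = reset_line  # hypothetical object with .hold() / .release()
        self.pubkey = pubkey          # hypothetical object with .verify(digest, signature)

    def measure(self):
        """Hash the whole firmware region: the 'measurement' in measured boot."""
        h = hashlib.sha256()
        h.update(self.flash.read(0, self.flash.size))
        return h.digest()

    def verify_and_boot(self, signature):
        """Only release the host from reset if the measurement verifies."""
        if not self.pubkey.verify(self.measure(), signature):
            self.reset_line.hold()    # keep the host in reset: first instruction untrusted
            return False
        self.reset_line.release()     # first instruction can now be trusted
        return True

    def filter_write(self, offset, data, policy_allows):
        """Host writes only reach the flash if the policy allows them."""
        if policy_allows(offset, data):
            self.flash.write(offset, data)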
Gotcha. So as you moved into some of these security focused startups, I think you had
kind of more of like a researcher role,
perhaps, than you previously had been more of an engineer. Was there a practical distinction
between those two? And what were the differences, if so? There was. I mean, certainly, as I moved
into working at Eclypsium, which is much more focused on firmware, there was a need for
someone who actually understood the system infrastructure and could do research work
of how could I break the system, right?
Like, how do I get ahead of attackers and the things that they could find, what vulnerabilities
they could find?
But there was also a need for somebody who understood that and could translate it into how you would build defenses for it.
So you'll hear security folks talk about red team, blue team,
and now there's purple teaming because why not have both?
And red teaming is figuring out the attacks, right? It's offensive trying to figure out how
to break systems. And blue team is how you defend.
And so they needed someone who spanned those two teams.
So depending on the day, it was sitting down and writing in their application how to detect whether a system was configured in a way that would allow for certain vulnerabilities.
Like looking at chipset registers, looking at flash configuration, et cetera, and reporting on that information so that you can make decisions about the risk of your fleet of machines.
But there were other days where it was, hey, I've got this machine.
Let's go see what we can break on it.
Right.
Like that's.
Right.
Yeah. Was it mostly vendors or companies that were running data-center-scale compute that would hire Eclypsium to come in and kind of like do this analysis?
Or what's the business model for that type of business?
I mean, the business model for the product is more selling to companies that have large deployments of machines as kind of a monitoring system, right? Think of it as like your antivirus deployments, right? Except it's something
that's scanning your machines, looking to see, do you have systems that have out-of-date firmware
that have known vulnerabilities? That kind of information. But the research side was more about keeping up interest, because a lot of folks just assume that attacks are going to be more application-focused or operating-system-focused.
And they just don't even need to think about hardware.
I bought my equipment from HP.
It's HP's problem.
And so a lot of it was, how do we keep finding new things to point out that the vendors aren't actually doing this work?
They're building the machines, but they're not actually thinking through the security problems.
And occasionally we would have a vendor who said, you know, we did something cool.
We want you to actually look at it and see if you can break it.
But that was very rare.
Most of the time it was just, hey, we bought this thing off eBay.
Let's go see what we can find.
And you'd be amazed at the stuff that was found. I mean, sometimes you just buy something off eBay
because it looks strange
and it turns out it was a vehicle-mounted computer that has a cellular modem and is still registered with FedEx's Active Directory domain.
You know, it's just too easy.
And other times it was,
we bought the server and we started poking around
and we were just like, it looked like maybe there was something that you could do over here, and we'll just keep poking at it. And then you find, oh, I can do an authentication bypass on the BMC. And then we'd use that to raise awareness, so that people thought about buying the solution of having a way to know what vulnerabilities were in there and how to do remediation.
Gotcha. That makes a lot of sense. So kind of like talking about BMC
vulnerabilities, one large thing you were a part of or kind of discovered while at Eclypsium was this USBAnywhere vulnerability. And you've given some talks on it, and we can certainly link those in the show notes as well as the report itself. But do you kind of want to run through what USBAnywhere is and also how you went about discovering it?
Yeah. USBAnywhere is a vulnerability related to BMC virtual media. So we kind of mentioned
this earlier: BMCs offer this lights-out management capability, and one of those things is, I don't want to have to walk into the data center to stick a CD into a machine to reinstall the operating system or to update the device drivers or whatever. So instead, it has the capability for mounting a CD image from your web browser as a virtual CD-ROM drive on a server somewhere else.
And I was always just curious how that actually worked, is really how it started.
I knew from my work on OpenBMC that the hardware level was a dedicated piece of hardware that
emulates a USB device.
So you see this in mobile phones.
This is kind of how USB On-The-Go works, where when you plug your phone into your computer
and it's like, here, you want to download your photos,
the phone chip actually has the same kind of hardware
where it can emulate any USB device.
It's a USB endpoint that doesn't have a fixed device to it.
It's not one specific thing.
So you write software to implement what it does, how it responds to requests.
So I knew that that existed in the BMCs and that the firmware was actually deciding what kind of devices it was.
That's how your keyboard
emulation works how your mouse emulation works how your you know a lot of different things but
the virtual media i was like how is this getting from my browser all the way there because it
doesn't seem like it was transferring the entire file over before starting it. And so it turns out that it's a horribly insecure protocol. What actually is happening is that on older BMCs, they would ship a Java application to you. On newer ones, they do it via HTML5, which is kind of even worse in some ways. But essentially, in the HTML5 version, there is a JavaScript library running in your browser that is an entire SCSI stack and an ISO file parser. So it's actually JavaScript in your browser that opens the file on your local machine and presents that as a block device to a virtual SCSI device that knows how to answer SCSI requests like it is a virtual CD-ROM drive. And that is connected over a WebSocket to the BMC. And then the BMC is effectively just forwarding the requests back and forth. It's actually speaking raw USB requests over that.
It happens to be sending USB mass storage, but it's actually just sending raw USB packets.
So this always comes as my example for when folks are like, how bad can it be?
I've seen the worst thing in programming. I'm like, have you ever seen a SCSI implementation in JavaScript? And so the
Java version is very similar. It happens to be written in Java with a JNI extension, but it's
the same basic problem. And it turns out that, you know, they had just built their own thing.
The vendor had a long time ago and had not really updated it.
And the way the ecosystem works around BMCs, there's a couple of key companies that make the hardware,
Nuvoton and ASPEED, and then there's a couple of key companies that make the operating systems for it.
So like AMI and Vertiv and a couple of others. And then the companies that actually manufacture your motherboards
just license both of those.
They buy the chips and they license the OS,
and they do a little bit of customization to it.
And then that is sold to whoever puts the name on the box,
and then that's what gets sent to you.
And so fixing bugs is a complicated process.
And so oftentimes the same bugs show up again
or they just never fix it
because it's hard to get the communication
across all these teams and companies.
So in this case, it was doing things like
it was unauthenticated.
If you did authenticate,
it was trying to use encryption,
but it was a very, very, very old encryption
that was trivial to break.
It had hard-coded passwords in it. It
had just all sorts of things. And so my proof of concept for this initially was using a framework
called FaceDancer that lets you develop virtual USB devices in Python. And I wrote a backend that
connected it so it would actually connect to the virtual media or virtual media service on the BMC.
And the very first time I got it working, FaceDancer's default is to emulate a TI calculator.
And so I logged into the server and I ran lsusb, and it says that there's a TI-83 Plus connected.
I'm like, I don't think that's right.
Right.
But yeah, I ended up doing a demo where I actually plugged in a virtual USB stick across the internet over to a server many states away.
And it was actually just a file on my drive locally.
And this is kind of terrifying because this was unauthenticated access.
So you could literally, if you could find a way to talk to one of these BMCs, plug in any USB device you wanted, which seems like not a big deal until you think about it a little bit more. That's actually terrifying, right?
Wow. So what was the reception like when you put this out? Obviously, we've had recently some vulnerability discoveries
that caused some hysteria, I would say.
Was there a lot of feedback when you put this out there?
You know, on one hand,
there was a lot of folks that showed up that said,
oh, you figured this out too, right?
Like this was just sort of an open secret that BMCs...
I actually had someone write to me and say that
finding vulnerabilities in BMCs was unsportsmanlike.
So, I mean, there was that sort of reception and then on the other hand there were
a lot of folks who just had no idea that this functionality existed in the first place
and then to see that you could abuse it in this way was absolutely terrifying
so I did end up getting some national press and things for it but it really didn't have a huge...
I mean, I did some talks on it, but it didn't cause the fervor
that some of the more recent vulnerabilities
that show up on literally everything have caused.
And part of that was just that most people try not to put
their BMCs on the internet.
It turned out that as part of this, I have a friend
who happens to run an internet exchange,
and so he let me have a VM, and I ran a scan
of the entire internet for
affected BMCs, and it came back. And so, you know, part of the story was actually that I found 30,000-plus BMCs that I could just arbitrarily plug USB devices into.
Right. Because, I mean, obviously the USB part is bad, but you have to have access to it. Right. So it being on the network is kind of the prereq. And I mean, how does that happen?
A lot of people just don't know. I mean, they just, they're like, Oh,
I got this cool feature from the vendor that says I can do remote management.
Let me just plug that in so I can do it from home.
And not thinking about the vulnerabilities that it might have or anything else.
And so there's just, how do you actually do the risk assessment of equipment that you're buying?
And that's a perennial problem.
You know, it's, there doesn't seem to be a good answer to that.
Right, right.
So kind of like bringing the security and your previous server work together from Eclipsium, you went to Oxide, which listeners of the show will be familiar with at this point after the last episode.
What was kind of the decision like to join Oxide, and what was the kind of role of looking into security there? And I know you, and I believe your colleague, Laura Abbott, also
identified some not similar vulnerabilities, but vulnerabilities at similar layers of the stack
there. Yeah. What was the decision like to join Oxide and what kind of work did you do there?
Yeah. So I had been at Eclypsium for a while and kind of had the viewpoint of, it was fun finding vulnerabilities and reporting them.
Getting vendors to fix them was hard.
And actually building the detections for some of these things was also difficult.
And it just wasn't a good fit for me anymore.
Like I needed something a little bit more concrete.
I like building tools.
I like building systems and stuff like that. And so I had gotten in touch with Jess Frazelle about BMCs and that kind of stuff. One of the things that stuck out to me while I was still at Google
was folks coming up and saying,
hey, will you just sell us your machines?
And we don't want just the machine.
We actually want a software stack that goes with it.
Like, you already know how to run all this stuff.
We just want to buy it and use it.
And you would think this is small companies
or something that we're asking this.
No, no, no.
These were folks who were at the scale where they were buying hundreds to thousands of racks of machines from HP or Dell monthly, right? Like, they were at the scale where they were starting to question whether they should actually be designing their own machines or not. And so they were looking for a solution, and they were looking at Open Compute as, how do I get cheaper, but I need a more complete system. And so this had been front of mind, and I had actually tried to work internally at Google on how we could improve this.
How could we provide a more complete software and firmware stack for these machines so that people can
actually buy them and use them. And it wasn't very receptive at Google. You know, Google's
position was basically we're designing machines for ourselves and we're sharing with the industry
so that we don't have to keep doing the new development work all on our own and that it
actually gets picked up by other companies. So when Bryan and Jess came around and they were talking about, hey, let's do this thing, I'm like, I know exactly how to do this.
Like, I know that there's a market for this.
I've been talking with folks for years about this.
I know what a lot of the problems are, but I don't think you're going to get funding, because I don't think you're going to be able to convince any VC to fund you for long enough to actually build the system that somebody would buy.
Right.
And they succeeded, right?
They came back with, they had got their seed round and it was enough.
And it was on terms that was,
this is going to take years to build and that's okay. So at that point I was like, all right,
I'm in. I came in not really focused specifically on security. I certainly brought that background,
but it was definitely just a, let's build this the right way, right? What would open compute
look like if you were actually trying to build it
as a sold product,
as opposed to a repository of standards
and like prototypes for other people
to draw their technology ideas from?
And yeah, I mean, I spent a lot of time there. I worked on all sorts of different pieces of the stack, because I wasn't particularly focused on any one piece of the system.
I definitely had a significant security role there over time, like looking at how to build the root of trust and measurement, how to do firmware signing, how to build out a system where, at power-on, you can trust the system in much the same way
that Google was designing their systems to do the same thing. But I also worked on fan control,
and I worked on, you know, how do you do the lights out management features? And
how do you actually power up the 100-gig network cards, and all sorts of different things.
Right. And so you'd had this kind of experience with BMCs made by two vendors that were not necessarily great. When you had the opportunity to, you know, do it the right way, what was the right way?
Well, I don't know that there's any right way necessarily. But there was definitely a, if I don't have to play in the sandbox of building a PC-compatible system, if I'm not assuming that I'm just going to boot an off-the-shelf operating system with no work, then I can throw out a lot of things. And if I can do that, then I don't need these specialized chips; I can build it out of commodity chips. Which, by the way, Nuvoton and ASPEED didn't really want to sell to me anyway, because we're a startup and they don't want to sell to us, right? So just looking at it of,
how do we build this out in a way that meets the needs? And it turned out, like, one of the things is that BMCs were starting to be seen as a way of building your root of trust and measurement, so the repository of information about what was booted on the system. And that's kind of scary, because the BMC also is a network-accessible device, and that just sort of feels wrong, right? I don't want my secure element to also be network accessible, necessarily. That's odd. So a lot of it was just trying to pick apart what sort of functionality we actually need. Do I actually need a graphical console to the machine? And the answer is no, right? Like, Oxide's thesis is you don't actually need to know any of the details of the
machine. That's not what you're buying. You're buying a rack and the rack has the interface.
The notion that the machine, each individual machine is booting an operating system and has
management controllers is an internal detail that can be hidden from you.
And not necessarily like we're trying to intentionally hide the information, just like
you need the output of that information, not all the details of how it worked. And you don't need
to write the software to deal with that problem. That's my job. And so being able to split that up
into multiple microcontrollers and design the system to have all the features that we always wanted. You know, Nathaniel mentioned the Ignition system, where you can actually control the machines even if they're powered off, even if 90% of the board is dead. Like, you can control the main power control to each tray. And that was a decision that we made, you know, a design choice of having that level of control of the system as a way of being able to recover from various scenarios.
You know, oh, the machine's locked up and the service processor crashed.
I can actually hit the reset even harder.
Right, right. So you mentioned having
kind of like multiple microcontrollers playing the role that the BMC traditionally would. So you have the service processor. Is that doing kind of like the bulk of the BMC-related management? And what other microcontrollers or, you know, FPGAs and other chips are kind of involved in providing a chorus of components that are playing the role of the BMC?
So Ignition is its own dedicated FPGA that just does this lowest level of being able to turn the power on and tell you if the initial power stage is functioning. And so that was designed to be as simple as possible; it should never fail, that was the goal. And so that becomes its own dedicated thing. Then the service processor, yeah, that's the bulk of the functionality. It's doing the fan control, it's monitoring the power rails. It's doing a lot of different things.
Inventory of the system, et cetera.
And then the other main piece of it is the actual root of trust.
So root of trust measurement is very similar in concept to what Titan actually was doing in the Google design.
But a little bit more integrated because we could.
And build on commodity hardware because we could.
Which took a lot of time. It took a lot of time to find a device, because we were very keen on not having to sign NDAs to get the documentation for these parts. Like, you can go to vendors and get details on actual secure elements, but they are heavily restricted on just getting the documentation
on it. Like even getting a brochure on which parts are available is a restricted document,
where we were looking at parts that, yes, there are some aspects of them that are under an NDA,
but generally you just had to maybe create an account
to download the bulk of the documentation.
And the intent was that we were going to open source all the firmware anyway.
So we wanted you to be able to have not only the firmware,
but also the ability to actually go comprehend what it's doing.
You should be able to read the data sheets.
So the root of trust got built off of another commodity part,
and it happened to be one of NXP's LPC55 series parts for a variety of reasons.
I had a very long and detailed document around how we did the assessment of selecting which device based upon its security properties.
And that was the whole point.
This was a device to manage the root of trust and do key management. You know,
it was very dedicated to that purpose. And, and, you know, compared to some of the other options,
it likely was, you know, had some security capabilities that were preferable, but you
also discovered that there were some issues.
What was the process like for identifying?
I think there's two vulnerabilities that y'all ended up discovering.
Is that right?
Yeah.
I mean, honestly, we weren't specifically looking for either one.
We were looking through the chips, how they worked and the potential for vulnerabilities, everything from, I sent it off to a friend of mine who does decapping to be able to look at it and see, does this have a security mesh on it? You know, just those high-level type things.
But there were definitely some features available in the chip. Like it has a physically unclonable function, which is basically it's an SRAM cell.
It's a memory cell that intentionally you can't write to.
You can only read from it.
And so it comes up in an uninitialized state.
But because of the silicon processing, the state is unique to the particular section of wafer that it was cut out of.
So you can apply a probability weighting to it
and get a consistent readout.
So you can say like, every time I power on,
I apply this probability to each bit of the SRAM readout.
And that gives me a way to generate a unique identifier for this device.
And since it's based upon the actual silicon properties, you can't copy it.
And so that was a cool piece, right?
We used that, actually.
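Here is a small Python sketch of the "probability weighting" idea described above: repeated power-up readouts of the PUF SRAM are noisy, but majority-voting the bits yields a stable, device-unique fingerprint. Real designs, including whatever the LPC55 does internally, layer error correction and key derivation on top; this is only the intuition.

import hashlib

def majority_vote(readouts):
    """readouts: list of equal-length bytes objects captured on separate power-ups."""
    stable = bytearray(len(readouts[0]))
    for bit in range(len(readouts[0]) * 8):
        ones = sum((r[bit // 8] >> (bit % 8)) & 1 for r in readouts)
        if ones * 2 > len(readouts):          # bit was 1 in most readouts
            stable[bit // 8] |= 1 << (bit % 8)
    return bytes(stable)

def device_id(readouts):
    """Hash the stabilized pattern into a fixed-size, device-unique identifier."""
    return hashlib.sha256(majority_vote(readouts)).hexdigest()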
But as far as the vulnerabilities, we weren't digging really that far into it.
We were actually just trying to figure out how to use the device.
A lot of microcontrollers nowadays have a built-in ROM that does their initial bootstrapping, offers some of the common library functions or power management type things.
Instead of you compiling in their code as a library, it's just you call this
section of the ROM. And we were just trying to understand how the code signing system worked
initially. And we were generating a variety of different binaries that we wrote with the
different headers, and we were signing with different key types and things, and just encountered unusual behavior. We started reverse engineering the ROM just because we needed to understand how it worked, because the documentation didn't tell you, and NXP wasn't really particularly forthcoming about how it worked. They just wanted you to use their tools, and their tools were insufficient for our process. So we had to dig deeper into it.
But then once it hit the chat room, it was sort of like,
hey, can somebody look at this?
This is odd.
Followed by a lot of active discussion around,
well, what is going on here?
And then, oh no.
Oh no.
And that really happened for both of them.
We were looking for something else entirely, trying to just understand how the part worked, and realized that you could manipulate it in ways that bypassed various countermeasures or caused the chip to do things that it's not supposed to do. And then, having come from the world of working at Eclypsium and doing vulnerability disclosures,
it was like, okay, I have a process for this.
I know how to do this.
So here's how we talk to everybody.
Here's how we coordinate it.
And honestly, the talk that Laura and I did at DEF CON
was entirely driven by spite.
The vendor's response was so poor and so frustrating that I submitted the talk to the call for proposals and, you know, didn't expect to get accepted. But we did, so we had fun.
That's awesome. How did y'all go about actually dumping the ROM and then analyzing the code that you got off of there?
So on those parts, the ROM is actually intended to be readable.
So it's actually just really trivial if you just write a program that goes and reads it or use a debugger to dump it.
It's just accessible space.
I mean, other chips I've worked
on, you have to do clever tricks. You have to find a vulnerability itself to get access to it or
something. But then once we had it, it's mostly loading it up in Ghidra and spending a lot of
time. Reverse engineering is a skill. A lot of people see it as this very difficult task,
and in a lot of ways it is, but it's really a puzzle problem.
It's finding patterns and recognizing things and building up an understanding of how things work from small levels to bigger understanding.
So we just had a couple folks that we would share it around with, and we all had Ghidra, and we would just pull up different sections of it and start working through, oh, this is a memcpy. Oh, this is a strlen. Oh, this lines up with these registers that the
datasheet says. And then you could start to infer what the behavior was. From the breadcrumbs that
I had, you could trace what it was actually doing. And the ROM was small enough that you
could mostly get through it and understand it. Whereas on some other devices I've worked on, or other software, once you have a megabyte of code, you can't realistically reverse engineer the whole thing.
Right, right. So, shifting the conversation now, I am going to ask you to explain quantum computing. You went on and you joined IonQ, and I have spent the bulk of my research for this episode on quantum computing, because I'm coming in with very little experience.
I do have one friend who is doing a PhD in quantum error correction. So I'm familiar with
some of the constraints of quantum computing. But for the listeners and for me, talk about,
one, your decision to join IonQ, but also,
can you give us kind of a bit of a primer on quantum computing?
Sure. So I was looking for a place to work. I had moved to the Seattle area
while I was working at Oxide. And for reasons, I was just looking for something in this area.
And I honestly applied to be director of security at the company.
And in the interviews, they said,
oh, well, we're not actually hiring for that role right now.
We've decided we're not going to hire anyone for that role.
But your background is fascinating.
And we would just like you to interview with some other folks about this embedded control systems.
I'm like, sure.
I know absolutely nothing about quantum computers.
But I just assume that you're going to have physical hardware.
And I'm going to understand the security properties of this.
And we'll figure it out.
And between all of the work that we've talked about, about, you know, processor design and health monitoring aspects and flight control systems and all these different pieces, it's just
like, yeah, I've worked on embedded real-time control systems sometimes, you know, and I have
some understanding there. So I really came in knowing nothing
and not expecting to end up in this team.
As for how quantum computing works,
I always get asked this question
after like an hour of discussion,
and it's great.
So here's the capsule summary.
Quantum computing,
there's actually a really great tutorial on this online.
I think I sent this to you, quantum.country, that kind of walks you through thinking about how quantum computers work from a programming-centric or algorithm-centric model.
And it kind of walks you through the basics.
But the top-level idea is that a quantum computer is something that can represent or do computation of quantum mechanics more efficiently than a classical computer. The practical aspects are
that you're really dealing with this model where somehow you have a thing called a qubit,
and we call it a qubit because it's quantum.
And it's not actually a bit.
It doesn't have just a zero or a one state.
It's a probability function.
And often what people express it as
is what they call the Bloch sphere,
where imagine you have a sphere
and the north pole of the sphere is the zero basis state, and the south pole is the one basis state.
Or vice versa. It doesn't actually matter.
There is common nomenclature for it, but I don't remember what it is.
The idea with the Bloch sphere is that it's a unit sphere. So you've got the two poles defined as known states, and then every other point on the surface is a unit vector in spherical coordinates. And wherever you are on
the surface defines a probability that if I measure that qubit when you're at that position,
whichever pole you're closest to, you have a higher probability
of going to it. So if you're at the equator, you have a 50-50 chance of going to either side.
But if I'm a quarter of the way up the north, then I'm like 75% chance of going to the north.
So I would be a zero, right? So the output of a qubit is always either a zero or one,
but it's a probability. And so quantum computing is mostly around doing operations to move the point
around the unit sphere, such that when you measure it, you get some result
that collapses to a zero or a one. And an interesting part about quantum computation
is when you do the measurement, you destroy all of the other information about where exactly you were on the sphere. You only
know the output of a zero or a one. You don't know anything else. But while you're doing the
computation, you actually can control very finely where you are in terms of spherical coordinates.
You can do sort of arbitrary rotations around the sphere.
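A toy Python model of the single-qubit picture just described: the state is a point on the Bloch sphere, its polar angle sets the probability of reading a zero or a one, a gate just moves the point, and measurement collapses it and throws everything else away. It ignores the azimuthal phase entirely and uses no quantum library; it is only meant to make the probability language concrete.

import math
import random

class ToyQubit:
    def __init__(self):
        self.theta = 0.0                    # 0 = north pole (|0>), pi = south pole (|1>)

    def rotate(self, angle):
        """A '1Q gate' in this toy model: move the point by some angle."""
        self.theta = (self.theta + angle) % (2 * math.pi)

    def p_zero(self):
        """Probability of reading 0 at the current position on the sphere."""
        return math.cos(self.theta / 2) ** 2

    def measure(self):
        """Collapse to 0 or 1; all other information about the state is destroyed."""
        outcome = 0 if random.random() < self.p_zero() else 1
        self.theta = 0.0 if outcome == 0 else math.pi
        return outcome

q = ToyQubit()
q.rotate(math.pi / 2)                       # move to the equator: a 50/50 outcome
print(q.p_zero(), q.measure())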
So that's the computational basis of how this kind of works
and how the algorithms work and other stuff
is a whole separate discussion
around quantum computing theory.
And the quantum computing practice
is more of what I deal with, how the machine works.
And to that end, you know, quantum computers
are kind of at the state that classical computing was in maybe the 40s, where there's a theory of how quantum computing works.
There's a couple of specific use cases that show promise for why you might use it.
But nobody really knows how to build a good one yet.
And you're trying to figure out the raw technologies that actually make it work.
So like with classical computing, there were people doing relay computers.
There were people doing vacuum tubes.
There were people doing all sorts of things, right?
And for data storage, we had Williams tubes and Mercury delay lines and all sorts of different things, right?
Because we just didn't know how.
Well, quantum computing is kind of the same way.
We know that you can build quantum computers out of transmons and out of trapped ions and out of superconducting systems and all sorts of things.
But we don't really know which one of these is going to be a good way of doing it.
And so everybody's going in different directions. And similarly, there's no coherent or consistent programming model, I guess would be the right word, around what the correct set of operations would be.
We don't know that incrementing one makes sense as an operation, where that makes obvious sense in a binary world.
It took a while to figure out that using binary was the right answer, and that when you had binary, you should use two's complement to express the numbers and all those things.
We're still figuring that out in the quantum world.
So at IonQ, where I work, they work exclusively on trapped-ion quantum computers, where the idea is that you take a material; the systems that IonQ produces use ytterbium. And basically you use a laser to ablate an ytterbium target to cause a plume of ions in a cryostat chamber that's under vacuum. And that plume gets up into what they call the loading chamber of the trap.
And the trap is effectively a series of electrodes kind of made in a line.
And you've got the series of electrodes that constrain the ions
using both RF and DC voltages
to hold it in X, Y, and Z coordinate space.
And so you're literally using an RF pattern as well as DC to force this ion to sit
kind of in space and then be able to move it along in a line.
And you do this so that you can hold the ion in very specific places so
you can hit it with lasers. And the reason you do that is if you remember your chemistry from
college or high school, remember how atoms have S shells and P shells and D shells and
all these things where your electron states and your energy levels, yeah, that's what we're playing with.
We're holding that ion there, and it's an ion, so it's electrically charged. And we're using laser pulses to essentially inject and remove energy to cause an outer electron to move between different energy states.
And the reason we do that is we can encode quantum information that way.
And then when you get to the measurement aspect, it's using lasers where in a different sequence, you hit it with a
laser and it either generates a photon or it does not. So that's how you do the collapsing onto
binary zero or one. So the whole system that I work on is basically a very complicated real-time system controlling
lasers very precisely to aim and fire waveforms, like modulation of lasers, at individual ions
to be able to then cause them to emit photons that we then have photomultiplier tubes that
we actually count how many photons come out to figure out what the probability
distribution was. Gotcha. Well, I probably shouldn't say gotcha because, you know, I'm sure
that there's a tremendous information loss that just happened over our internet connection here.
But the part that I guess I want to press in on a little bit. So, you know, I will give a plug to IonQ's resources page here, because the background and glossary section is extremely useful. So in this kind of period, I kind of think of it, and this may be a
faulty mental model here, so feel free to correct this. But, you know, when I was learning, I started in kind of a software background, and then when I was learning about hardware and working with FPGAs and stuff like that, you have this kind of settle time for whatever logic you're trying to represent in your sequence of circuits. There's this time where everything reaches its steady state, and then you get the output out of it. And so when we're talking about quantum computing, and you're saying this kind of collapses down to the photon, or the mapping onto one or zero, the compute that happens, compute, I'm doing air quotes for anyone on the podcast, in that interim, though, is happening in a non-binary fashion, right? So, and I'm going to use terms incorrectly here, I'm just going to go ahead and put that up front, the process of when you're talking about moving the point around the sphere there, is that what we refer to as entanglement? Or is that when you're taking, like, two of these qubits?
Okay, so this is at a higher level, right?
Because when you're talking about the sphere, you're talking about a single qubit from a
logical level, right?
Correct.
And then entanglement would be when you're taking two of those qubits and they are influencing
each other in a somewhat deterministic way.
Is that correct?
Yeah.
So, I mean, there's a couple different things there.
So, one, when we're talking about this single qubit and like that sphere model
and moving around the point, that's usually what we talk about as a 1Q gate, right?
Okay.
Often, when describing quantum computation, they use a circuit model. And so you think of it like an electrical circuit, except it's not; it's like a timeline where you have the qubit going from the left to the right, and you apply gates in sequence. And so it's like something goes into a gate and it comes out the gate, but it's actually the same qubit. It's just a timeline of when you did operations on it. A little confusing, but you can kind of think of it like you're drawing a schematic; it's just a schematic that can only go left to right. And so a 1Q gate only works on one qubit.
You have, you know, it's an operation that you do on a qubit,
and it changes just that qubit by itself.
It has no influence on others.
And that's doing some transformation.
It's moving that point around.
And the reason we talk about it as gates is there's just different operations you can do.
There's like a gate that can move an arbitrary amount,
or sorry, you can move pi rotation, you know, pi radian rotation around any axis,
right? That's an operation you can do. And you can do like pi divided by two. So like a quarter turn rotation around any axis. And so those become gates. And so there's like a native gate set that the machine actually
can do. And then there are sort of standardized gate sets that are how algorithms are developed.
And so just like we have in a computer, you might write in a higher level language and you compile
it down to the native instruction set of the machine. Same thing happens in quantum. It's just
you have a standardized gate set and then you have the native gate set, and there's a compilation process that happens.
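A toy illustration of that compilation step in Python: circuits written against a standardized gate set get rewritten into whatever the machine natively does. The gate names and the decomposition table here are generic textbook placeholders, not IonQ's actual native gate set.

import math

def native_rotation(qubit, axis, angle):
    """A stand-in for a native single-qubit rotation by 'angle' about 'axis'."""
    return ("rot", qubit, axis, angle)

# One textbook decomposition: H equals Ry(pi/2) followed by Rx(pi), up to a global phase.
STANDARD_TO_NATIVE = {
    "h": lambda q: [native_rotation(q, "y", math.pi / 2),
                    native_rotation(q, "x", math.pi)],
    "x": lambda q: [native_rotation(q, "x", math.pi)],
}

def compile_circuit(circuit):
    """circuit: list of (gate_name, qubit) pairs in the standardized set."""
    native = []
    for gate, qubit in circuit:
        native.extend(STANDARD_TO_NATIVE[gate](qubit))
    return native

print(compile_circuit([("h", 0), ("x", 1)]))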
With the 1Q gates, it really is just moving around that point around one qubit.
Entanglement is, as you said, when you actually cause a coupling between two qubits.
You're actually creating a situation where you're actually changing both of them at the same time in a way that causes them to have a corresponding effect later.
You're mixing the quantum information between these two. And from a physics perspective, it literally is doing a phase
coherent laser modulation on those two simultaneously, which is what makes my job
really, really hard is I have to do basically instruction level parallelism of firing waveforms
at two qubits simultaneously, or at two ions simultaneously. And so that creates the entanglement there. And that's also used to do things like teleportation, which is actually a real thing in quantum. So, yeah, I mean, that's kind of the level of the operations. It's usually 1Q or 2Q gates. There are higher-Q gates, like three or four, but most of the time, most of the systems work on either a 1Q or a 2Q gate, and it's usually a combination of the two. You run a couple 1Q and then you do a 2Q, and then whatever. But yeah, that's what makes the control system so difficult. Like,
you're actually having to sequence all that. And then the whole thing is,
just to add a little bit more complexity to it.
Please do.
Even when you're not actively running a gate, when you're not actually modulating it,
the electron has spin. So there is a phase precession happening just because time is passing.
I see.
And that is unique per qubit. So each qubit actually has a phase precession that's happening
at a different rate. And in order to get your entanglement to work correctly, you have to
actually introduce your operations in a phase-coherent manner with the phase precession of each qubit. So you're actually having to track the phase of the laser pulses and make your modulation match in phase.
Right. So there's no determinism to it? You're just measuring it and then factoring that into your subsequent operations?
Measuring it would be great, but if I measure it, I destroy state. So no, I'm actually predicting what the phase precession is going to be, just based on how much time has passed.
Okay, makes sense.
Yeah. Somewhat. Essentially, you're right: you run a calibration process to figure out what the phase precession for each qubit is, and you do that in a way where you destroy the state, but that's valid as long as the qubit is valid.
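A Python sketch of the bookkeeping just described: each qubit's precession frequency comes out of calibration, the expected phase at any instant is predicted from elapsed time rather than measured, and every pulse is emitted with a phase offset that keeps it coherent with that qubit's frame. The frequencies and the pulse interface here are invented for illustration.

import math

class PhaseTracker:
    def __init__(self, calibrated_freqs_hz):
        self.freqs = calibrated_freqs_hz          # {qubit: precession frequency in Hz}

    def expected_phase(self, qubit, t_seconds):
        """Predicted accumulated phase (radians) for this qubit at time t."""
        return (2 * math.pi * self.freqs[qubit] * t_seconds) % (2 * math.pi)

    def pulse_phase(self, qubit, t_seconds, gate_phase):
        """Phase to program into the waveform generator so the drive stays
        coherent with the qubit's rotating frame at time t."""
        return (gate_phase + self.expected_phase(qubit, t_seconds)) % (2 * math.pi)

tracker = PhaseTracker({0: 12_500.0, 1: 12_731.4})    # made-up per-qubit frequencies
print(tracker.pulse_phase(qubit=0, t_seconds=0.25, gate_phase=math.pi / 2))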
I see, I see. So moving back up to sort of the hardware-software interface, if you will, for these machines. Because I'm kind of envisioning that the embedded controllers you're working on are kind of like the microcode of this machine that we're going to have instructions for, perhaps.
And you need to have some sort of instruction set, right?
So you talked about this model of circuits
that are used to program a quantum computer.
What does an instruction set look like? And I assume that due to the wide variety in implementations of quantum computing that
it's fairly varied across these machines. It is. In a lot of ways, it's similar to
how the GPU world works, right? Every GPU has its own native instruction
set that they don't tell you what it is. What they do is they give you software stack that compiles
from standardized shader libraries or GPGPU frameworks down to that instruction set. And
that's kind of where we are today. And that's probably the model that will continue for quantum
stuff. But like, what does the instruction set actually look like? Well, you do express it at a high level as sort of these standardized gates, and then it gets compiled down to the native gate set. But it's still done as a circuit, right? It's like this time-progression circuit model. But that's not what the machine executes. You know, as we just talked about, it's like tracking phase-coherent laser modulations, among other things, right? We also have to aim the lasers, and so there's a whole separate series of things going on there to steer them, along with dozens of other things, because we can also move the ions around if we need to. We can change their locations. So the control to the machine,
the native instruction set of the machine, is actually much more low level, but it needs to be
incredibly time consistent, right? Like we actually have to have all of these operations happen
as simultaneous as we possibly can in a phase coherent way. So like, this is how this system works. It's actually
pretty common in the trapped ion world. And actually, it's somewhat common in the superconducting
world of you have the same problem, you can effectively think of it as you have a whole
bunch of arbitrary waveform generators. And those are going off and being connected as to control a
variety of different types of devices, but you basically just have arbitrary waveform generators.
You hook them to a common clock
so that they are all locked to the same reference,
so they are now phase coherent.
And then you are running programs
where each instruction,
each one of these waveform generators
is getting its own instruction stream.
So they're all running independent programs, but they're running off of the same clock tick, so they're executing the next instruction simultaneously across all of them.
That does map pretty well onto, like, the GPU kernel kind of model, at least conceptually.
It does. I mean, there's a lot of delicacy about how it works. Right. But yeah, I mean, in a lot of cases, you can conceive of it as a VLIW machine, where you just have an instruction per AWG channel, or per whatever you need. And that works. That does work up to a point, until you actually need to do some sort of conditional behavior.
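A toy Python version of the "many arbitrary waveform generators, one clock" model described above: each channel has its own instruction stream and all channels advance in lockstep on a shared tick. The instruction format is invented; the point is just the VLIW-like structure, and why a data-dependent branch on one channel is so disruptive.

def run_lockstep(channel_programs):
    """channel_programs: {channel_name: [instruction, ...]}, all the same length."""
    length = len(next(iter(channel_programs.values())))
    assert all(len(p) == length for p in channel_programs.values()), \
        "every channel needs an instruction (or a nop) for every tick"
    for tick in range(length):
        # On each shared clock tick, every generator issues its next instruction
        # simultaneously; there is no per-channel stalling.
        issued = {ch: prog[tick] for ch, prog in channel_programs.items()}
        print(f"tick {tick}: {issued}")

run_lockstep({
    "awg0": [("pulse", "ion0", 1.57), ("nop",), ("pulse", "ion0", 3.14)],
    "awg1": [("nop",), ("pulse", "ion1", 1.57), ("pulse", "ion1", 3.14)],
})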
So kind of state of the art is, well, I can compile a circuit and I can execute it as long
as it's a straight line code, right? Like I have a basic block that's just executing
straightforward. And then I do a measurement at the end, and that's fine.
But if I need to do a measurement midway through the circuit where I actually have moved some of
the ions out of the way, so I'm only measuring a couple of them, but the other ones retain their
quantum information, and then I want to change the behavior of the circuit based upon that
intermediate measurement, well, now across all of those processors, how do you do control flow? Right. Which is, again, similar to the GPU problem. If you've ever done GPGPU programming, you are basically writing giant SIMD-type operations. It can be MIMD-ish, but with diverging threads, right? If you have a program where some of the GPU processors take a branch one direction, but others go the other way. If you
have a data dependent branch, that's a performance penalty there. In the quantum world, that's
disastrous, right? It's really hard to model and figure out how that's going to work. And so that's
an active area of development. You know, What does an instruction stream look like for these machines
to allow that type of operation?
Because as you mentioned,
you have a friend that's working on quantum error correction.
You need that kind of model.
I need to be able to do some amount of work
and then be able to look at a couple of the qubits
to decide whether or not I need to do error correction.
Right. So you can't, essentially it's unacceptable to introduce a stall, because the system is just going to have looked totally different while you were not doing things. Or if you introduce a stall, everyone has to stall, and you have to be keeping track of the phase precession of every single qubit during that stall.
That makes sense.
Yeah.
Okay.
So, naturally, and this is also a little bit informed by listening to some podcasts interviewing folks from IonQ as well: the GPU comparison is helpful here in that you don't want to do all
computation on a GPU, right? So there are some operations that a GPU could be extremely
useful for. I'm sure there are similar ones for a quantum computer, though I'm not aware of all of the different
ones available. But, you know, there is a portion of the
sequence of a program, typically, that is going to be, you know,
massively parallel or something, but there's usually a lot of setup or teardown afterwards, if you
will. And so what I've heard a lot in this kind of quantum computing space is folks talk about hybrid models, where you're kind of mixing classical and quantum computing.
And, you know, that brings up a lot of different questions, both on the instruction set front.
Right. So how do you model instructions that span across that?
And then also on, you know, an interconnect front, you know, what is the speed between your classical and quantum computing there?
Is the work that you're doing examining that space as well?
That is exactly what I'm working on.
It's literally that exact problem: trying to understand, as you said, how close do you need to have the classical and quantum,
and how do you exchange data and control flow between them. As you said, as we got to,
stalling instructions on a quantum computer is really hard, and for more reasons than
you might expect, right? Not only do you have to track the phase precession, but, and we didn't really cover this earlier in the physics part, when you actually initialize the state of your qubit,
it's only going to maintain coherence for so long. So basically the quantum information will
deteriorate over time. And so you only get a second or two before the state's gone, right?
That's as long as you can run the quantum program.
And then there's also another dimension of the problem, which is that quantum computers
are effectively analog.
They're very similar to an analog computer in that you have infinite precision because
you're ultimately moving that point around the sphere.
You can literally move it in as tiny an increment as you want,
but there's error, right?
Just like when you had an analog computer
and you build an integrator,
there's error introduced just by noise of the op amps
and other things going on in the system
or the tolerances in the resistors.
Same problem in the quantum space.
So you end up with this notion of fidelity.
And so the number of gates that you can apply to a qubit
introduces cumulative error
until you get to a point where you can't actually discern what the output is anymore.
So this is kind of the driving constraint on how much computation you can do: how long do I have before coherence is lost, and how many gates can I run in that amount of time and still get a usable result out?
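As a back-of-the-envelope illustration of that trade-off, here is a tiny Python sketch: per-gate error accumulates multiplicatively, and the whole circuit also has to fit inside the coherence window. Every number below is made up for illustration; none of them are IonQ specifications.

```python
import math

per_gate_fidelity = 0.998     # assumed average gate fidelity (illustrative)
gate_time_s = 200e-6          # assumed time per gate (illustrative)
coherence_window_s = 1.0      # "a second or two" of usable coherence
min_usable_fidelity = 0.5     # below this, the output is lost in the noise

# Depth limit from the coherence time alone.
depth_from_time = int(coherence_window_s / gate_time_s)

# Depth limit from accumulated gate error: overall fidelity ~ f**n, so solve f**n >= min.
depth_from_error = int(math.log(min_usable_fidelity) / math.log(per_gate_fidelity))

print(f"time budget allows   ~{depth_from_time} gates")
print(f"error budget allows  ~{depth_from_error} gates")
print(f"usable circuit depth ~{min(depth_from_time, depth_from_error)} gates")
```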
So box all of that in with: by the way, while I'm doing that, in that one-second interval,
I want to take a measurement and run some computation on it and come back and change
what I'm going to do. So I can't stall for, you know, a minute to go run a different classical
operation and come back to it. Even waiting a hundred milliseconds for a web service request to come back is a tenth of my coherence window.
Right. So do you bring the classical computation right into the control system and actually treat it like the quantum operations are just specialized functional units,
and be able to do sort of microarchitectural design of passing data between registers and
queuing it up, so that, oh, it did the measurement, and then that's being queued into an ALU that's
going to turn the result around? That's one way of doing it. Or do you have a separate
processor that's right nearby that can punt the data back and run some straight-line classical code and then push a result back the other way? And that's a very open area that the industry is trying to figure out. And the answer is not clear, because sometimes you want to have a model where the classical computation is what we call between shots. So between those coherence windows, you maybe run five different runs of the circuit, which
each has its own one second coherence window.
And then you look at the results of those five to decide what you're going to change
about it, right?
You're doing some sort of statistical analysis.
So you do a little bit of classical and then you do more quantum.
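That between-shots pattern looks roughly like the loop below, a hypothetical Python sketch: run the circuit for a handful of shots (each shot gets its own coherence window), do some classical statistics on the outcomes, adjust a parameter, and go again. The quantum_execute function and the parameter names are made-up stand-ins for whatever service or control system actually runs the circuit.

```python
import random

def quantum_execute(theta, shots=5):
    """Stand-in for the real machine: returns one 0/1 outcome per shot."""
    p_one = min(1.0, abs(theta) / 3.14159)        # toy model, not real physics
    return [1 if random.random() < p_one else 0 for _ in range(shots)]

theta = 0.5
for iteration in range(10):
    outcomes = quantum_execute(theta, shots=5)    # five separate coherence windows
    estimate = sum(outcomes) / len(outcomes)      # classical statistics between shots
    theta += 0.2 * (0.8 - estimate)               # classical update before the next batch
    print(f"iter {iteration}: estimate={estimate:.2f}, next theta={theta:.3f}")
```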
But there are other times,
like with error correction,
where you actually want to do
that mid-circuit measurement
and insert it
and different time scales get involved.
So yeah, this is a problem
that I'm actively working on
and the industry as a whole
is working on
of how do you introduce
these capabilities?
How do you even build,
like what does it look like
conceptually from like an instruction set standpoint is one part of the problem.
What does this look like from a language perspective is a whole separate problem.
What does this look like from a microarchitectural standpoint is a whole other problem.
And then there's all the resulting problems that we create in the physics when we actually do classical computation and we're not running quantum stuff.
Right. So in the absence of kind of like this generalized computing model, or, you know, an instruction set or something like that,
are most of the present-day applications that are run on quantum computers
essentially a case where you're just rewriting the microarchitecture to do a very specific thing? Or is there
kind of an interim state that we're in right now that allows us to, you know, do useful things
on quantum computers without having a fully generalizable compute interface?
So the very high level, like the circuit level with standardized gates, is where most of the algorithm research and, like, useful work is done. And then it's
internal compilation phases and things where it gets changed into the native gate set and then
basically does get compiled into what would look like a microcode, you know, like a VLIW-style
microcode that's running. Right. And so that's usually hidden by whatever provider
you're using, right? Because most quantum computers today are as a service, not outright purchased. So you can submit a job via one of the common frameworks like Qiskit or, you know, things like that, to AWS's Braket service, and it'll run on a simulator or on a quantum computer. And basically, you know,
once it gets to the simulator or the quantum computer, that's where the compilation to the
native instruction set happens. So it's a little bit isolated, similar to how GPGPU works,
right? You write it against more of a common language, and then it does that compilation
behind the scenes for you. But the expression of how you interleave
classical operations with quantum operations
has traditionally been more of the GPU kernel model,
where I'm going to run...
I write a Python program that builds a quantum circuit
and then sends that
off to the service. It runs it, it gives me the result back. And then I look at the results
in my Python program and I do, you know, a little bit of work and then I submit another job.
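A minimal sketch of that workflow in Qiskit-style Python, as a rough illustration: build a circuit locally, hand it to a backend, and do the classical thinking in the outer Python loop between job submissions. The backend handle is assumed to come from whatever provider or simulator you use; the specific circuit and loop are invented for illustration.

```python
from qiskit import QuantumCircuit

def build_circuit(label):
    """Build a small circuit locally in the Python program."""
    qc = QuantumCircuit(2, 2)
    qc.h(0)
    qc.cx(0, 1)
    qc.measure([0, 1], [0, 1])
    qc.name = f"circuit_{label}"
    return qc

def run_once(backend, qc, shots=1000):
    """Ship the circuit off to the service and wait for counts to come back."""
    job = backend.run(qc, shots=shots)
    return job.result().get_counts()

# The classical "outer loop" lives entirely in this Python process, far away
# from the machine; every iteration is a whole new job submission.
# for step in range(3):
#     counts = run_once(backend, build_circuit(step))
#     ... inspect counts, decide what circuit to build and submit next ...
```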
And so there's a lot of active work around how do you move that lower? How do you intermingle those
classical operations more and make this more expressive, right? Like most of the time,
you don't want to write just a quantum circuit with a separate classical piece around it. That's, you know,
most of the problems that you start to look at are a little bit more complicated and intermingled
than that. Makes sense. So I think we're going to have to have another check-in with you on
quantum computing.
Perhaps I'll do even more research and come with some more questions.
But I definitely appreciate you being willing to dive in depth on that with us here.
I did want to kind of wrap up by talking about some of the things you do outside of, you know, your day job responsibilities. One of the things we chatted about
leading up to this conversation is mentoring. And so that, as well as, you know,
being on podcasts and things like that, one of the things that I've observed
through watching your work and your involvement in communities is your interest in bringing
other folks along on that journey as well. So I just wanted to give you an opportunity
to talk a little about, you know, what role that has played in your career and why that's
important to you.
Well, I mean, as we covered, I grew up in rural Ohio. Like, I literally grew up
around farmland. And so the chances of me ending up working in Silicon Valley at some of the, you know,
biggest names in computing is a wild story.
I wouldn't have believed it if you had told me.
And I recognize that a lot of that is luck and a lot of that is privilege.
So even though I had all that luck and all that privilege, I also recognized that I didn't have anybody giving me any direction there. I just kind of stumbled through that whole path. And it was difficult. I just didn't have mentors. I didn't know people. And I couldn't ask a lot of questions until much later in my career where I built a bigger social network. So quite a few years ago, I guess it'd be like
eight, seven or eight years ago, I started offering just public signups for people to do
mostly mock interviews initially, but it kind of expanded into resume reviews and mentoring
sessions of just, I've been in this industry for a long time. I've been a hiring manager.
I've interviewed hundreds of people.
I've worked at, you know, the big companies.
I've worked with VC.
I've worked with startups, et cetera.
Like, come ask me your questions.
Like, you know, where are you at?
Are you trying to change from a different industry to come in here?
Are you trying to graduate from a school and figure out where to go career-wise?
Are you a junior engineer that is kind of reaching a limit with your current job and you want to
bounce some ideas off? I'm just happy to sit down and chat with folks. And it's really just to be
that resource that I wish I had had. Some folks do similar things, but they actually ask for, you know, pay to
compensate for their time.
And for me, it's just: no, it's free, right?
I set aside a certain amount of time each week and have a sign-up,
and folks sign up, and I show up and we just have chats.
That's awesome. I'm sure that's hugely impactful for
a lot of folks. And, you know, alongside that, as I mentioned earlier, you've also been pretty
instrumental in going out and kind of sharing things that you've worked on. I know we already
talked about your DEF CON talk, but I came across a talk where you were
detailing some of your work on engine motors. So things outside of your day job,
although that does seem somewhat related to some of the control systems you've worked on,
perhaps. But what's kind of your motivation for sharing your work more broadly as well?
I mean, largely, I just enjoy working on various things, right? Like, cars is a hobby for me. I'm not a race car
driver, you know, that's not my profession. But do I enjoy driving on a racetrack? Yes.
Do I like tinkering with cars? Absolutely. So when we get into where does that overlap with electronics and software and things like that, a lot of it is just, I find it fascinating, and I want to show other people what's cool about it, and why it's an interesting problem. Right? Like, I suspect that there's a lot of folks who are just like, oh, my car is there, it takes me from A to B, and that's fine. And it's like, well, actually, there's a very complicated control system running under the hood that you don't know about, and all the
different subtle things that are happening. And I don't know if you're familiar with the 99%
Invisible podcast, but the ethos there is finding architecture or design out in the world, in subtle ways that you might not have really considered.
And there's a similar thing here of, well, let's go find some of the engineering and things that you may not have seen, which I found because I find it interesting and I'm aware of it.
And I just want to share that to have you think about it next time you are in your car, that there's actually a whole fleet of computers keeping track of how the engine itself is
running and making that work.
And, you know, it's a similar thing with coming and talking about quantum computing.
I don't expect that a lot of folks are going to actively write quantum computing programs,
but I'm sure you're all very interested in how it works now.
Right, right.
Absolutely.
Well, one of my goals here is definitely for those folks who are curious about how things work,
a lot of times kind of like below the surface of the interfaces that we commonly interact with,
that folks like you will come along and be willing to kind of go a little deeper down the stack
and talk about what's really going on.
So I appreciate all the times you've done that elsewhere, and I appreciate you coming by the show today
and sharing your expertise on a lot of different topics here.
It's been hugely informative for me, so I'm sure a lot of folks are going to get a lot out of it.
Yeah, well, I'm glad to have been available to come talk and share things.
Absolutely. All right, well, I think we can probably wrap it up here. But Rick, thank you for joining, and have a great rest of your week.
Thanks.