Microarch Club - 110: Rick Altherr
Episode Date: April 24, 2024
Rick Altherr joins to talk about working on hardware performance analysis tools at Apple during the PowerPC to x86 transition, building flight control software for internet satellites at Google, discovering vulnerabilities in baseboard management controllers, and much more. We also spend an extended portion of the conversation on Rick's current work in quantum computing, including comparing and contrasting with classical computing, and examining some of the challenges of interfacing with these machines today.
Rick's Site: https://www.kc8apf.net/
Rick on LinkedIn: https://www.linkedin.com/in/mxshift/
Rick on Mastodon: https://social.treehouse.systems/@mxshift
Rick on GitHub: https://github.com/mx-shift
Rick's Mentoring Sign-Up: https://calendly.com/mxshift
Detailed Show Notes: https://microarch.club/episodes/110
Transcript
Hey folks, Dan here. Today on the MicroArch Club podcast, I am joined by Rick Altherr. Rick has
worked at nearly every level of the computing stack, and we touch on many of them in this
conversation. We start off with Rick's upbringing in the Midwest and how they learned about
how machines work by tinkering on a theater organ. From there, we get into Rick's career,
beginning with working on hardware performance analysis tools at Apple during the PowerPC
to x86 transition,
then moving to Google to work on everything from system software in the data center to flight control systems for internet satellites to open source FPGA toolchains.
Rick's exposure to the hardware-software interface subsequently led them to security
research, including their discovery of a widespread vulnerability in baseboard management
controllers, and their collaboration with Laura Abbott to identify multiple vulnerabilities in an NXP hardware root-of-trust device while
working at Oxide Computer.
We round out our conversation with a deep dive into Rick's current work on quantum
computers at IonQ.
Rick does a great job of explaining quantum computing in relation to classical computing,
and outlines some of the current challenges of interfacing with these powerful machines. As evidenced by their willingness to talk with me
for nearly two and a half hours, Rick is passionate about opening doors for other folks in technology,
and they regularly make time to do resume review, mock interviews, and mentorship sessions with
anyone who is interested in talking with them. If you're interested, look for the link in the
show notes. With that, let's get into the conversation.
All right. Hey, Rick, thanks for joining the show today.
Yeah, glad to be here.
Absolutely. Well, we've obviously chatted a little bit before the show, and I've also followed some of your work for some time, including some of your talks and appearances on other podcasts. Actually, earlier today, and I think I mentioned this before, I was listening to your On the Metal episode, which definitely had a lot of good tidbits in it. But also, one of my co-workers in my day job, Chris Gammell, has a podcast,
The Amp Hour. And you were on that, I think, a few years ago now. So I've definitely run into you in a few places.
But super glad to have you on the MicroArch Club. And I think we'll have a lot of fun stuff to get
into today. Yeah, sounds great. Awesome. Well, I know you've listened to a couple of the episodes,
so you know a little bit of my preference for kind of starting off in the beginning with folks.
I'd love to learn what your introduction to computing was
and maybe what some of the first machines that you interacted with are.
Yeah, I mean, that is a little tricky
in that it kind of depends on how you define computing.
So like my family has a history of auto mechanics.
And so you get into things that are like, not quite computers, but kind of are computers, you know, like how a transmission, like an automatic transmission, works is effectively a hydraulic computer. That mixed with learning about electronics and mechanical systems through my grandfather. Before I was
ever born, like when my mom was a teenager, he ended up buying a Wurlitzer Hope Jones
unit orchestra, which is a theater organ, and installing it in their house. And so I learned
growing up, like, this was just the thing we had.
It was there, and we had it for demonstrations to local school groups and things.
And I was small enough that I could climb through to go in and fix all of the stuck electromagnets and do all sorts of adjustments and things.
And learned a lot about how to build relay logic type things and that stuff.
So I learned a lot from my grandfather, who was very much a perpetual tinkerer in electronics,
mechanical things, all sorts of stuff.
And then really get on to computing.
Jeez, I was probably like six or seven.
My dad ended up getting a job as a computer salesman because that was an actual job at one
point. And so I got to periodically go into the computer store after school and just kind of hang
out and learn a lot. We had a 286 at home, got to play the video games and things like that,
and really just kind of took to it. And so I spent a lot of time, you know, learning about programming languages and how the system
worked and as much as I could. And that really progressed up through high school and everything
else. You know, I eventually got my own computer so that I would stop breaking the family computer. And my first job actually ended up being working at an ISP locally. So, they were, like, in the late 90s, in the rural areas, you'd have these combination computer store slash web development firm slash ISP. And so I was more on the computer repair side of things, but I was also doing the ISP side and got a lot of exposure to just how the Internet works and that kind of stuff.
And yeah, so that's a lot of where I started.
And then, you know, ended up going to college for originally for a computer engineering degree, but then decided pretty quickly that I
didn't like doing all the math for the electrical work. So switched over to computer science. And
yeah, a lot of it just kind of kept going from there, career-wise.
Gotcha. With the earliest machine that you had for yourself, what model was that?
Uh, I believe it was an 8086 clone. I don't remember exactly what it was. It was not new, right? It was definitely like a hand-me-down that nobody wanted because everybody had 386s, kind of thing.
Absolutely. So you kind of got started in maybe programming your family's computer and then eventually your own computer. Was that mostly in, um, you know, a higher-level language like BASIC or something like that, which it may be comical to call BASIC a high-level language now, but as opposed to getting down into assembly and the lower-level parts of the hardware?
Yeah, I mean, growing up, like, I grew up in rural Ohio, Northwest Ohio. I didn't have a lot of people around me who knew this stuff. I had the documentation that came with DOS, and that was it.
I didn't really have access to a C compiler or any of those things. That really didn't come around
until I was a teenager when finally that kind of thing was available. And by that point, I
already had my own computer and was working in an ISP. So I had internet access. That was
a very big change. But before that,
yeah, I was writing complicated basic programs, trying to understand more of how the hardware actually worked, but was very limited just by what was available. I mean, even going to the
local library, there wasn't any information, like programming books were not a thing that
the local library was going to carry.
Right. And at your high school, was there any sort of computer science education, or was this strictly kind of like outside of school?
There wasn't until like my junior year, at which point, um, like, they had had a computer lab, but it was very focused on learning to type, using Word and Excel,
kind of more basic computer user skills than development stuff.
So there were no computer science classes.
But as we got into my junior year, the person who ran the IT staff knew that there were a couple of us
who actually had a lot of interest in computers and things. And so he worked with the local community college to have us do the Cisco
CCNA courses. But we had to show up at 7am, before school. And like, we did CCNA before we actually started our regular school day. And so we did all that. Neither one of us actually ended up taking any of the
exams for it. But we definitely got all of the exposure to like, here's how you set up routers
here, you know, all that kind of stuff. And, but that was the extent of it, you know, that we
didn't do any AP courses or anything else. Mostly, it was just a small set of people that happened to have computers and that kind of interest sharing skills and things behind the scenes.
At that stage, like in high school, I probably was more known for running the server that hosted TI-Net and TI-News for the calculator programming stuff than anything else.
Yeah, that's really interesting. I feel like one of the things that maybe gets discounted by folks who don't have that experience. So I grew up in the Southeast United States, and there is just a really big geographic difference in terms of computer science education.
I graduated from high school in 2015, and there was no computer science at the school that I went to.
And it was a fairly good academic high school, but it wasn't really top of focus.
And I do think in the past five years or so, there's been even more of a shift.
But, you know, I think there is a pretty big geographic difference, potentially between
the coasts and somewhere like the southeast or the Midwest.
Oh, absolutely. I haven't looked at what they're doing at my school recently, but certainly I
wouldn't expect them to have much of a formal computer science curriculum. It's just it's not
top of mind. You know, they graduate maybe a hundred students a year, and that's not
like a private school or anything. That's just the county school system like that. Right. Right.
Well, so, okay. So then you go to college and you decide to study computer science. What was that decision like, especially coming from, you know, a background where there weren't a lot of folks that were into that, or it wasn't necessarily, I don't know if it was seen as a kind of like viable career path?
Well, my parents saw that. From, you know, my dad's job
working in computer sales like it was obvious that computers were an up-and-coming thing
and my dad actually had interest in computers going back to when he was graduating from high school and going into college.
He just couldn't afford college at the time.
And so he never actually got a college degree, but he got a little bit of exposure to it through kind of the math to computer science path that was more common back then.
And so they understood that there was utility there and that it was an up and coming area, that it was probably a good place as far as
career-wise, and that I had a lot of interest in it.
It also helped that my brother had decided when he was going to go to university that
he also decided to go to computer science.
And later, my sister also went for computer science.
So all three of us went for computer science.
But it was really just a, I like doing this, and I don't know what else I would do.
And my parents were fairly supportive of that.
I was going to a state school, so tuition and everything was pretty affordable.
And it was just like, sure, why not?
We'll see how it goes and see what happens.
And it turned out it worked really well for me.
I knew I already enjoyed computers. I didn't actually know what working in computers would look like. But it seemed like I was asking a lot of the right questions to go ask more and figure out where I wanted to work in that space.
And did you find that the computer science
curriculum was fairly strong? And was that kind of a foundational part of
how you went on in your career? Or did you feel like that a lot of the kind of skills
and things like that that you gained were mostly outside of the classroom? And
those were the biggest contributing factors to what you ultimately went ahead and did?
It's a mix. So I went to University of Cincinnati, which is a pretty well-regarded engineering school.
Certainly not Ivy League or anything, but reasonably well-regarded.
And their computer science program is decent.
I felt the curriculum was... When I started, I was ahead of the curriculum.
There was definitely a lot of areas where I learned more through the curriculum.
But just like Nathan, your past guest, talked about, his school did co-ops.
Actually, University of Cincinnati is where co-ops were created.
So all of the engineering schools do a, usually it's a one quarter on, one quarter off. Computer science tended to do more of a six-month on, six-month off. And so I ended up through that getting an internship at Apple. And so I went from like,
I really like computers and I'm learning about programming and some of the details of how the
computer works to I'm working in a group that actually does performance analysis of up-and-coming
processors for Apple computers. And so I ended up just kind of leapfrogging everybody else at the school in
terms of what I understood and everything. So it was this mix of like, would I have done well just
being in the curriculum there? Yes. But honestly, the co-op really brought a lot more to the table
in that regard. That makes a lot of sense. How did you end up getting
into the performance analysis kind of aspect of Apple?
So my friends and I dispute exactly the exact sequence of events here. But essentially,
what happened is there was a centralized office that helped me with co-op placements.
And they were generally well known as being absolutely terrible at their job. So often it was you had to negotiate your own internships or co-ops
with a lot of companies. And so they often were in the area, you know, in the Cincinnati area,
it'd be like going to GE or Procter and Gamble or whomever. And occasionally someone would end up with one of these far away, you know, somewhere else.
And it just happened that one of my, one of the other students that I knew
got an internship by, like, somehow his resume ended up at Apple. And as far as we understand,
it wasn't supposed to have happened, but it did.
And he ended up getting an internship from that working with their compiler team.
And from that, it was, hey, he's coming out and it's a six-month internship, not three months during the summer, so they could have interns throughout the year, effectively. So that team started to spread the word internally that they could pull more people
in from the school, which then turned into, well, hey, who would you refer as someone to come in?
So that person referred a friend of mine who ended up being more, he was an electrical engineer
student. And so he ended up coming in on the hardware development side and working with the architecture performance group, and he ultimately referred me as someone else who could come in more on the programming side and the system analysis side. So
that's how I ended up there. But it sort of built around, there was just a manager that really
understood how to get internships moving through the system and get them brought into this organization and kind of
fan them out through that. We actually ended up with a lot of people there. There's probably a
good seven or eight year timeline where people from University of Cincinnati were coming into
the Mac hardware group via that. Oh, wow. That's awesome.
What did you work on during the co-op?
And then ultimately, did that end up being what you worked on when you joined Apple after university as well?
So the very first thing that I got as an internship project
was validating that the performance counters
on the PowerPC 970 actually counted
the microarchitectural events
that they claimed to count.
Which turns out, I had no idea what a performance counter was.
I had no idea what most of these events were.
And I had no idea how I was going to write code
to trigger these particular events to even do validation.
So that was a very challenging project. I did actually ultimately figure it out.
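As a loose sketch of the kind of check involved, validating a counter by running a workload whose event count you can bound ahead of time and seeing whether the hardware agrees, here is roughly what it looks like with Linux perf on a modern machine. The real work was on the PowerPC 970 with Apple- and IBM-internal tooling; the event name and margin below are only illustrative.

```python
#!/usr/bin/env python3
"""Sketch: check that a hardware counter at least scales with a workload that
is guaranteed to generate the events it claims to count. Assumes a Linux box
with `perf` installed and permission to read performance counters."""
import subprocess

ITERATIONS = 5_000_000
# A counted loop retires at least one conditional branch per iteration, so the
# "branches" counter should report a value comfortably above ITERATIONS.
workload = f"i = 0\nwhile i < {ITERATIONS}:\n    i += 1"

proc = subprocess.run(
    ["perf", "stat", "-x,", "-e", "branches", "python3", "-c", workload],
    capture_output=True, text=True, check=True,
)
# With -x, perf prints machine-readable CSV to stderr; the first field of the
# event line is the raw count.
branches = int(proc.stderr.strip().splitlines()[-1].split(",")[0])
print(f"{branches:,} branches counted for {ITERATIONS:,} loop iterations")
assert branches >= ITERATIONS, "counter reads lower than the workload guarantees"
```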
And through that, I actually ended up writing the documentation for IBM on how their performance
counters worked, because all of their documentation was wrong. And they just eventually were like,
here is the FrameMaker source code, please update it, and we will ship whatever you write. And so that taught me a lot about how the processor
worked, very, very detailed, inside. And that was on PowerPC 970, which is the G5, which is a very complex, out-of-order processor. There was just a lot going on. And so that was how my internship kind of leapfrogged me ahead of my classmates: I had industry experience with bleeding-edge processor microarchitectural design right up front.
And so I ended up staying with that team for a long time.
Actually, they hired me on
and I worked there for, what, four years?
And like I started in that area, I moved a little bit more over to developing the tool. So that same team, that team was kind of this
cross-functional thing where it was part of the Mac hardware group, and it was definitely focused
on understanding the performance of Mac hardware systems. They were looking at the new systems that were being developed to kind of inform
and work with the vendors on how we might modify things
to improve performance.
But they were also working with third parties
on how to tune software for those new platforms.
And so the third part of it was developing all the tools
for all Apple developers to be able to do performance analysis work
and come back with it.
So I ended up moving more into the tool development side.
That team notably released the CHUD tools, the Computer Hardware Understanding Development tools, which was actually a reference to the terrible 80s movie.
But the whole idea there was these are the tools that you can use to
figure out how your application actually is running on the hardware. Beyond sort of like
your basic statistical sampler of CPU time, this was, how do I actually dig in and find out what's happening at the microarchitectural level, what's happening throughout the system, and raising that information in the way that you kind of expect from Apple. It's like, here's your code and here's the tooltip that tells you exactly what's happening on this line.
Right.
That was kind of the thing. And so I ended
up being tech lead of that team for a couple of years through kind of the later versions of that
software. And I worked on almost every product that Apple shipped from 2004-ish to 2009.
So right in the middle of your time there was the PowerPC to x86 switchover, is that right?
That is correct.
Okay.
Yeah, that happened like right as I was joining full-time, going from my last internship to full-time.
I joined full-time in 2005.
And yeah, we had development units that were hidden inside G5 cases and all sorts of things.
It was kind of a big deal.
Very big change for our team, certainly.
You can imagine that when you're doing the performance work at that level, right, digging into the chip details, going from PowerPC to any other architecture is a massive change in the tooling. To understand, you know, the conceptual model of how the processor works probably isn't that huge of a change, but the exact details of how you control performance counters, how you actually know the subtle details of what's happening through execution units, etc., it took a lot of time to build up that knowledge again for the new platform.
Absolutely. Did y'all have kind of an abstraction layer over the PowerPC that allowed you to kind of like write a new back end, if you will, for it? Or was it mostly...
Oh no, not at all. It was entirely written to support PowerPC and nothing else.
So, I mean, and keep in mind that with Motorola slash Freescale and IBM, we had a very close
relationship to the extent that if you install the Chud tools, it actually ships full
cycle accurate simulators for G4 and G5. Oh, wow.
You know, it's for running instruction traces of small snippets, not like booting a whole system.
But we had that level of relationship with them where that was just there. We shipped
the entire assembler reference. So when you were
looking at an instruction, you could just click show me the instruction and it would pull up the
PDF to that exact page kind of thing. And working with Intel, you know, it took a while to build up
all of that because we had just never assumed that we would change to a different architecture.
That's kind of the classical problem with changing architectures is you just build up all these
subtle assumptions about how the system would work. And a lot of folks are more familiar with
the like, oh, you're running on big onion and you're switching to little onion, you're going
to have a lot of pain. There's also just a lot of pain for, I wrote my entire stack assuming that all my instructions were 32 bits.
Oops.
And then we turned around and did it again as the iPhone started development, where now I had to do ARM.
I ended up writing an ARM disassembler over a weekend to support the iPhone program.
We were trying to make that happen as fast as possible, and she wanted it, and at least we had done it once with x86 at that point. But yeah, going to three different architectures that we supported, the code base started to get quite ugly at that point.
Right, right. Did y'all have more or less heads up with the iPhone in terms of the transition coming? Because it sounds like y'all might have not had a ton of heads up on the PowerPC to x86 transition.
Apple is, as is well known, very tight-lipped, both externally and internally, and very siloed.
And so the x86 development started as a small team somewhere else, not in the hardware group, kind of doing their exploratory
work. And then eventually the hardware project started, and then eventually the performance team
came in. And the iPhone kind of happened the same way. It started with some folks from iPod
kind of speccing out what the system would look like, etc. And by the time we got pulled in, they already had engineering units, not form factor, but
prototype board units.
They had a software stack that was booting.
They had been doing a huge amount of work without us.
And so we got brought in late, wherein it was actually, we're seeing problems that we
need your tools to understand what's going wrong.
And we said, well, now we have to go actually figure out
how to pull out performance information
out of the system that you're running on.
So give us a couple months.
You don't have a couple months.
iPhone was specifically to the extent
where I had to sign a separate NDA just for iPhone.
And I actually had two office locations
because I had one in building five,
which is where the Mac hardware group was, which is my normal office.
But then building two was where they were actually doing all of the iPhone development.
And so I had to go up there to a different desk for any time I was working on iPhone related stuff.
Interesting.
Were there any, you know, performance bottlenecks or anything like that, or any bugs that you all encountered, I guess, you know, not just in the transitions but just in general, being the authors of this tooling, where y'all had, you know, a vendor or a third party or someone like that come in and say, hey, like, what's happening here, your tool is telling me that, you know, there's a performance bottleneck, but I don't understand?
Any particularly interesting ones? I mean, we worked with the Adobe folks on Photoshop a lot, and so there were always, you know, subtle things that came up there. But they also were quite well versed in what could happen, right, like how their software worked, and we're like, we're working on very core loops. I do remember a lot of game developers coming in. Um, you know, the Mac is not particularly viewed as a game platform, at least it wasn't, especially back in the PowerPC days. But I remember sitting down with the developer who was porting one of the Rock Band games, and we're just like, why is this burning three cores at 100% and still struggling? And we looked, and
it was the game was written assuming
the Xbox platform and that it was just going to have three cores. And so they had just written
three busy loops that one on each core that ran, you know, it was like one was disk IO and one was
network IO and one was everything else. And we're, this does not map well, right? Like it was entirely
assuming it was going to run on a Windows NT based system and how critical sections worked and all that, and it was very terrible. But I also worked with, like, the G5 was used to build a supercomputing cluster. So I ended up getting a lot of, you know, here's an application they're running, you don't get the whole application, but you get this tiny piece of it, like, iterate on that. We used to do a lot of demos for
uh, like, we would work on an open source program to show how you would go through the process, for talks at different conferences and things. And I remember one time we looked at an open source program called Celestia. It's a star system simulation where you can travel to different planets and see the simulation of the orbital motions and things.
And we had all these different options.
We'd show how you could replace calls to sine with table estimations and basic stuff.
And at the end, as we're getting down further and further into like, here's how you can reorder your instruction scheduling to be able to be tuned for G5 behavior.
We're like,
and now you're at 16,000 X performance.
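The "replace calls to sine with table estimations" step looks roughly like the sketch below; in the actual demo it lived in Celestia's compiled hot loops, and the table size here is arbitrary.

```python
#!/usr/bin/env python3
"""Sketch of a table-based sine approximation of the sort used in the demo:
precompute sin() at fixed steps, then interpolate instead of calling libm."""
import math

TABLE_SIZE = 1024
STEP = 2 * math.pi / TABLE_SIZE
# One extra entry so interpolation at the top of the range never runs off the end.
SIN_TABLE = [math.sin(i * STEP) for i in range(TABLE_SIZE + 1)]

def table_sin(x: float) -> float:
    """Approximate sin(x) by linear interpolation into the precomputed table."""
    x %= 2 * math.pi
    idx, frac = divmod(x / STEP, 1.0)
    i = int(idx)
    return SIN_TABLE[i] + frac * (SIN_TABLE[i + 1] - SIN_TABLE[i])

if __name__ == "__main__":
    worst = max(abs(table_sin(x / 100) - math.sin(x / 100)) for x in range(629))
    print(f"worst-case error over [0, 2*pi): {worst:.2e}")
```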
Did y'all end up making any,
any upstream contributions?
Oh yeah.
Yeah.
We always did.
That was the thing, is we would do the work to do this and we would send it all upstream. And often they accepted it. You know, they were just, like, they were very gracious on it.
We did that with a lot of projects, actually.
Awesome.
So two questions kind of wrapping up your time at Apple.
One, from my research, it looks like the Chud tools are still used today by Apple developers.
I'm not entirely sure.
Like, I honestly haven't paid attention to the Apple platform for a long time.
When I left in 2009, there was a big push to fold the chud tools into instruments.
And so in some ways, when you're using instruments and you're using time profile in there,
it's actually using all the same infrastructure.
It's just kind of through a different UI.
I see. there, it's actually using all the same infrastructure. It's just kind of through a different UI. So, you know, people are using a lot of the pieces, but not necessarily through the front end. Absolutely. And so when you did decide to leave, I think you said 2009, was it mostly
you wanted to go work on something else? Or was there, I know you ended up going to Google,
was there something particular about Google that was alluring at that time?
No, actually, I followed a friend who left from Apple and went over to Google.
And so I just had a good path there and kind of knew there were interesting things happening there.
But I was looking for something different to go do.
The architecture performance group had been located in the Mac hardware division and got
split up and moved.
The Chud tools ended up reporting to developer tools.
And we just had a lot of culture clash.
Developer tools from the CoreOS group was very much of the opinion that you were there
to work on software and it was there to work with the operating system.
And you didn't actually care what the hardware was too much.
I see.
And of course, the entire thesis of Chud was, if you know what the hardware is, you can
do a lot more.
So we just kept running into issues of like, I need a kernel extension to do these things.
You can't install a kernel extension.
Sorry.
Gotcha.
So you follow your friend to Google. Um, you know, like, naively in my mind, I think you also did some performance work at Google, but, um, you know, you're kind of going from this focus on, you know, Apple, especially at that time, I imagine, was mostly, you know, a single machine that you're kind of analyzing and seeing how the CPU or other components in that system are working.
And then Google is, you know, both large scale companies,
but Google is very different in that you're working with large network systems
where you're kind of, you know, analyzing perhaps at a data center layer.
Was that kind of like a, did you find that the skills translated quite a bit?
Or was Google kind of like walking into a new environment for you?
Some skills translated. But certainly, I mean, it's hard to answer that question, really. And
anybody who's had the privilege of working at Google will relate and everybody who hasn't,
it's just hard to explain. Not only is the scale of the systems that you're working on so vastly different, right? Like you
said, it's not looking at one or 10 or a thousand machines anymore. It's looking at tens of thousands
of machines, hundreds of thousands of machines. And so there's that angle, but also everything
is managed in-house through bespoke systems and they're not intended to be shipped to customers.
Whereas with Apple, yeah, it was all a big proprietary stack, but there was definitely
a focus on you have to ship things to customers. And so you can do some hacks, but you have to
make it a little pretty to make it acceptable to the customer. With Google, as long as it worked,
it didn't matter too much. You needed it to be reliable, but it didn't have to be pretty, necessarily.
But it also meant that you didn't have to abide by any industry standards.
You didn't have to, like, the design space was wide open, as long as you could justify why to do it that way.
And so, you know, you put those two together and you get into odd decision matrices where you're like, we're doing it completely different than everybody else does, but that's because it saves us
this amount per machine, which times the number of machines makes a huge difference.
We used to talk about like the team I was in wrote a lot of the software that runs on every
single server in Google's data centers. So there's system daemons that manage the hardware and do telemetry and health monitoring and a variety of other things.
But we at one point did a calculation where we figured out if anyone on our team saved one megabyte of RAM, we paid for the entire team's salary for the entire year.
Oh, my gosh.
No, I mean, it became difficult to actually save one megabyte of RAM because we had already done
a lot of that optimization to fit in there. But that was like the stakes, right? The scale of
things changes how you think about the problem a lot.
Right. That's actually interesting. The episode that comes out tomorrow is with Matt
Godbolt, and he spent some time working at Google as well. And one of the things that
we talk about on that episode is kind of like having that scale where, you know, like saving
a meg of RAM is having that large of a cost implication provides a lot of justification
for working on some interesting things that,
you know, just wouldn't make sense economically at other companies.
Did you have any examples?
I mean, maybe that is one right there, right?
Of like saving a bit of RAM.
But did you have any examples of things that you worked on or you saw folks working on
where they got to kind of like go down that optimization rabbit hole because of the sheer
scale of the systems y'all were working
on? Oh, absolutely. It happened all the time. Things that come to mind. So you know how
when you type in on Google search, it auto completes for you? Right. I had nothing to do
with that feature. However, that feature was burning so much CPU time in one cluster that was depending upon a search
indexing that was running in another cluster that they kept coming to me and saying, your
software is saying that there are problems on the system and taking machines out, which
is causing our service to be unreliable.
We can't actually launch our service publicly. And that led down a digging through so many layers
of the system and ultimately figuring out that there was a bug in the CPU scheduler in the kernel
where I was not being given enough CPU time to actually do the work because the system was so
heavily overloaded. I've never seen load averages in the 500s ever again.
Right, right.
So there were things like that, but also things like, uh, you know how PCI Express has a lot of pins on it, and some of those pins are not commonly used, right? There's like JTAG and there's other things, and there's actually ones that are reserved for future use. Well, we justified why we should dedicate a pair of those to be a USB pair so that we can have USB out to our expansion cards to talk to different things.
I worked on a system where we used SR-IOV. I don't know how familiar folks are, but with PCI Express, you can do I/O virtualization, where you make a single PCI device look like multiple devices. So you can share them with guest virtual machines or things.
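For a sense of what that looks like from the host side today, here is a minimal sketch using Linux's standard sysfs controls for SR-IOV; the PCI address is hypothetical, and it needs root on a NIC whose driver actually supports virtual functions.

```python
#!/usr/bin/env python3
"""Sketch: enable and list SR-IOV virtual functions on a physical NIC via sysfs.
The device address below is made up; adjust it to a real PF on your system."""
from pathlib import Path

PF = Path("/sys/bus/pci/devices/0000:03:00.0")   # hypothetical physical function

total = int((PF / "sriov_totalvfs").read_text())
print(f"device supports up to {total} virtual functions")

# Writing to sriov_numvfs asks the driver to create that many VFs; each one then
# appears as its own PCI device that can be handed to a guest VM.
(PF / "sriov_numvfs").write_text("4")
for vf_link in sorted(PF.glob("virtfn*")):
    print(vf_link.name, "->", vf_link.resolve().name)
```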
And we used that feature to emulate MR-IOV, which is where you have multiple computers attached to a common PCI Express fabric.
And then we use PCI Express switches.
So I figured out and worked with a team where we ended up booting eight machines off of a single network card that they all shared. You know, it's just like, not a problem that anybody else is ever
going to look at, but it was a way of looking at how would you deal with bandwidth problems and
the scale of how much cabling and deployment of the system, etc. So yeah, lots of edge cases. I
mean, it just probably goes on and on and on. And some of that work ended up coming out through
Open Compute Project, you know, eventually Google joined and some of the design work that came out, even OpenBMC is a case where,
funny story on that one, we were working with Rackspace on Barreleye G2, is what they called it; Google called it Zaius. It was a POWER8 system, IBM POWER8. And because it was going to be public through Open Compute Project, and because Rackspace was going to use it, they wanted to have a common BMC, or Baseboard Management Controller, that you would normally find on a server, because that's how Rackspace's infrastructure is designed, but Google's is not. And we were trying to figure out how we were going to support it internally.
And my boss was in charge of talking with AMI about their software stack, getting a license for MegaRAC to run on this.
And I made a friendly wager with him one day that after he had been trying to work with them on getting just a price quote, I said, I bet I can actually get Linux to boot on the BMC faster than you can get a quote
from them. And I did it in two days. And it took like two and a half weeks for him to get a sales
quote. So by the time he got a quote, we actually had a fully booting Linux stack. And that's kind
of how we ended up working with the OpenBMC folks. And I ended up working in that space of
getting a bunch of the big industry players at the time, Facebook, Google, Microsoft, IBM,
all to come together and actually form that as a proper project under Linux Foundation to,
you know, here's how you actually build an open source management stack for these systems. But that was also driven from a need for, you know, solving some of these problems at scale and starting to work with other players. But you see a lot of these things. We built, you know, 48 volt to point-of-load voltage regulation. I think no one else does this. But, you know, that's going from like 48 volts directly to your CPU core in one stage of conversion, but the efficiency makes a lot of sense when you're at the scale of Google.
Right. So you talked about BMCs, and we're definitely going to get back to that in the future, but can you talk a little bit about just, like, the general architecture of Google servers and what role a BMC is playing, I guess, in Google servers, but also more generally?
Yeah, the concept in like a server is that the main CPU is usually doing some work that's owned
by some team. And they have applications that they're dealing with and whatever. But often,
it's a separate team or organization that's responsible for the hardware management.
So it's like keeping the machine physically turned on and running and reporting the health
information about hard drives and fan failures and that kind of stuff goes to your IT or data
center operations folks, whereas application type failures go to the team who actually is using the machine.
And so a long time ago in a faraway place, Intel worked with a couple of other companies to create IPMI, which specifies how a BMC exists on a system. And it really starts as, like, doing environmental control
and power monitoring and having a way of querying
that type of information from the host processor.
So it's about offloading all of that.
And one of the key aspects is that you can access
that information even if the host machine
is completely locked up or turned off.
So you can also do things like reset the system.
You can power it off.
You can power it back on.
And over time, that feature set has grown more and more and more to the point where
you have an integrated KVM support. So you can actually just go to a website that's run by the BMC on that server board. It's actually an entirely separate computer on the same motherboard
as the main system. But you can go to a website that's hosted by that,
get a UI that loads and gives you a display of what the actual console output looks like from the,
you know, if you hooked up a monitor on the VGA port,
you see the exact same thing.
And you have keyboard and mouse control.
And you can select like an ISO image,
you know, like a CD image and have it be mounted as a virtual CD-ROM drive on the remote system.
So you can do like complete from scratch OS installs remotely.
And so there's a lot of useful features for that.
I mean, it falls under a lot of larger category folks talk about as lights out management.
You know, the idea that the lights are turned off in the data center.
Nobody has to be there.
This is how you know how you manage and interact with the systems without having to physically
be there.
And also knowing when you actually do have to go physically touch them, you know, oh,
a fan failed.
I got to go replace that.
So is the whole interface, like the HTTP API that it sounds like you're explaining there, is that part of IPMI, or is that just kind of vendor-specific implementation?
So IPMI is old. IPMI came out of the late 90s, and there was a couple revisions of it. And it is its own bespoke protocol. It's something that you talk over TCP/IP and have very specific data frames and stuff. And so when you run, like, ipmitool on a Linux machine or something like that, you're speaking this custom protocol to that device. So the web UIs came about as a way to introduce a more friendly way of doing it through the web browser, once the web had really become a big thing and people realized that you could run a web server on one of these
as a friendly thing. But yeah, it's more bespoke. There's been a change in the past 10 years of, there is a later standard called Redfish, which is kind of reimagining IPMI as a REST API, so it looks more like you'd expect for a web interface as an API. But it kind of has the same data model as IPMI. It's like, you ask the system, how many sensors do I have, and then it comes back, and it's like, oh, well, which ones of those are temperature sensors, you know?
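As a sketch of that Redfish data model, walking the sensors over HTTP looks roughly like the snippet below; the address, credentials, and which resources exist vary by BMC, so the paths follow the DMTF schema rather than any particular vendor's firmware.

```python
#!/usr/bin/env python3
"""Sketch: list temperature sensors from a Redfish-speaking BMC.
Host and credentials are placeholders; many BMCs ship self-signed TLS certs."""
import requests

BMC = "https://bmc.example.com"       # hypothetical BMC address
session = requests.Session()
session.auth = ("admin", "password")  # placeholder credentials
session.verify = False                # typical for out-of-the-box BMC certs

chassis = session.get(f"{BMC}/redfish/v1/Chassis").json()
for member in chassis.get("Members", []):
    thermal = session.get(f"{BMC}{member['@odata.id']}/Thermal").json()
    for sensor in thermal.get("Temperatures", []):
        print(sensor.get("Name"), sensor.get("ReadingCelsius"))
```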
Right. Are IPMI or Redfish, or any alternatives if there are any, are they ever used outside the context of data centers? Or is this, you know, mostly a pretty focused protocol?
I mean, it's mostly used for dedicated server hardware. Whether that's in a data center or not kind of depends. I mean, the machine I'm on right now happens to have a BMC and can actually do all this stuff.
Do I use it in my house on a regular basis?
No, not really.
But yeah, it tends to be focused on that out-of-band management, lights-out management type situation.
And so that mostly finds its way in server gear.
Sometimes you find it in like industrial control
equipment that's based on pc stuff you know like there's various applications for it but it doesn't
show up in your usually doesn't show up in your average pc although there was a There is a parallel in laptops.
Intel has what they call AMT, which is not a BMC, and it doesn't really speak IPMI, but it offers most of the same functionality.
So you can do remote management of laptops.
And again, it's intended for an IT department to be able to fix your laptop for you from
wherever you happen to be.
Gotcha.
One of the things that kind of stuck
out to me in your description of BMCs was you led with the like organizational component of it.
And obviously there are some technical benefits to the isolation. And like you said, you know,
the lights off management and that sort of thing. But I think that is always an interesting way to,
you know, approach any system design as, oh yeah, there's a different team that does that and is responsible for that. So we literally have, you know, approach any system design as, oh, yeah, there's a different team that
does that and is responsible for that. So we literally have, you know, a different computer
for them. I haven't heard it explained that way. So but it makes a lot of sense.
I mean, it's a long term thing where I don't know that that was the original intent.
Right, right.
However, a lot of solutions end up growing according to how organizational
lines fall. And so that's how a lot of the feature set has developed for BMCs over time,
which has interesting implications because the security model did not evolve the same way.
Right. Well, that was a little bit of foreshadowing, perhaps. But while we're still
at Google, you spent time working on these servers and the data centers and that sort of thing.
I think I also read that you spent some time working on flight control software.
Is that right?
Yeah.
After spending five or six years working on servers and hard drives. I was the manager for the hard
drive team and, you know, had done a lot in that space. I wanted to try something different. And
Google being Google, it has a lot of different areas that are going on. And it just happened
that around that time was just before the whole shift to being Alphabet and having the other bets, but X had existed for a while,
and there were some of these more grandiose, very different ideas, not the typical Google
development things. One of those was building a satellite internet project, so a LEO constellation
of satellites that provided internet access. If this sounds a lot like Starlink, that's because it is.
This is actually a predecessor project.
And through a series of unfortunate events,
the team forked and part of them went to create Starlink.
So common lineage there.
But yeah, I was working on flight control systems
and a lot of the high level
architecture for how the system would work. You know, when you're designing a satellite,
you kind of have the part of the bus, they call the satellite the bus, the part of the bus that's
responsible for keeping the satellite where it needs to be, and doing all of the flight related
things. And then you have the payload, and they're supposed to be entirely separate systems like they should have no
interaction, really. And so I was mostly focused on flying the bus. But a lot of that was just figuring out what did it need to do, like what was the sequencing that needs to happen, what sort of orbital maneuvers would we need to do. I learned way more about orbital mechanics than I ever had intended to ever learn. But I was only with that project for, geez, maybe a year or so before that project got turned down, and that's around the time that Starlink got started up. But one of the pieces that we had built, or started to build, out of this was,
even though I was mostly focused on the flight control side officially,
I was working with a lot of folks that I knew from the server teams and the networking teams on,
there's an interesting data routing problem.
How do you actually build a network that goes through multiple satellites?
Because you not only have the classic networking problems like in Wi-Fi of what's my quality of my link?
You now have a problem of everybody's moving.
And so the potential set of links that you could form is constantly changing.
Right.
So I was working with them through that problem because it was just kind of fascinating. And we ended up building that system. And we ended up calling it the first
spatial temporal software defined network. And it literally is, it's a cluster system that just
kind of evaluates where everything is in space and evaluates all the potential RF links based upon
the radios and pointing capabilities that each of them have
and comes back with solution sets
so you can predict in the future where it's going to go.
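Stripped way down, the "evaluate where everything is and which links are possible" loop is something like the sketch below; the real system also modeled antenna pointing limits, link budgets, and handover costs, and all of these positions and ranges are invented.

```python
#!/usr/bin/env python3
"""Toy version of spatial-temporal link planning: given predicted positions of
moving nodes at future times, list which pairs could form an RF link at each
time step. Every number here is made up for illustration."""
from itertools import combinations
import math

MAX_RANGE_KM = 2000.0   # hypothetical maximum usable link distance

# predicted positions (x, y, z in km) at two future time offsets, in seconds
ephemeris = {
    0:  {"sat-1": (0, 0, 550),   "sat-2": (800, 0, 550),  "ground-1": (100, 50, 0)},
    60: {"sat-1": (400, 0, 550), "sat-2": (1200, 0, 550), "ground-1": (100, 50, 0)},
}

for t, positions in sorted(ephemeris.items()):
    links = [
        (a, b)
        for a, b in combinations(positions, 2)
        if math.dist(positions[a], positions[b]) <= MAX_RANGE_KM
    ]
    print(f"t+{t}s feasible links: {links}")
```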
And that ended up following through
a couple of different follow-on systems of repurposing it
because we built the system to track anything.
We didn't care how it was moving.
So it was like, oh, we're not on satellites anymore?
Well, guess what? It works on aircraft.
Oh, you're not on aircraft anymore?
Well, the Loon folks with their balloons,
they could use that too. You want to do between balloons and
aircraft? Sure, why not? So that project actually continued on and eventually got spun out. And
some friends of mine still run that as a company. I can't remember the name.
But yeah, so that's actually a product that they're trying to sell now on the market.
Yeah.
So obviously, Starlink has gone on to be successful by a lot of measures presently.
What was kind of some of the blockers at that point in time?
Were there any technical blockers or was it mostly organizational in terms of that project's fate within Google?
I mean, yeah, there were a lot of challenges.
Not only, like, from a technological standpoint, there was a lot of how do you actually build
Leo to ground radio comms that tolerate the,
you know, basically how do you build the ground station
or like the user access
terminals, not the gateways. But there were a lot of challenges with just the technology for
building the antennas and stuff to hit a price point. You know, like it would be trivial if I
could have you spend $5,000 to buy a steerable antenna, but that's not practical for that type
of system.
So the idea of getting to that panel that's actually a phased array of all sorts of antenna elements was a lot of technical development that was going on.
And Starlink has definitely moved in that direction and shown to be successful in that
area.
There's a lot of other aspects in terms of the software backend of not only the data
routing, but the flight planning.
We had a lot of discussions around how do you actually keep in touch with all of the other organizations that you need to about potential collisions in space?
There are processes established for this, but they were never designed for the scale of the number of satellites in space.
And so they don't have the ability to do that conjunction analysis that fast.
So there were problems associated with that.
Yeah, lots of different things.
But from a technical standpoint, a lot of it was, how do you just build the radios?
How do you build the power systems?
How long is the satellite even going to last before you have to intentionally
de-orbit it? And a lot of those turn directly back into business problems. If I have to replace a
satellite every month, that's expensive. Right. Yeah. So you mentioned also kind of like
collaborating with the server and networking folks during that time. I'm not very familiar
with the architecture of satellites
and especially internet satellites, if you will.
Is it essentially like a flying server?
It kind of sounds like I couldn't help but draw analogies
between what you were talking about with the flight control system
and the BMC that we were just talking about,
about doing similar operational tasks for the server. And it's not a bad parallel. The main difference is that,
you know, a flight computer has to be radiation tolerant. It has to not fail effectively.
Otherwise, your satellite's completely useless. And that's the worst case, right? Like,
the absolute worst case for a satellite is you lose control of it and it's just there and you can't intentionally de-orbit
it. Like, if the payload fails, I can intentionally de-orbit it and get it out of space, right? I can get it to burn up in the atmosphere, and that's a much better result than having it just be there as inoperable.
So the flight computers and everything related to flying the bus tended to be specced for much higher reliability and a much longer lifespan so that you were pretty confident that you'd be able to fly the thing
even if you lost the ability to use the payload to actually run the data comms.
I see.
And I can't help but think about as well, in this, you know, you mentioned the episode with Nathaniel earlier, and we talked about their CT scanners, and naturally what came to mind to me was, how are you updating these systems? Partly because I work for an IoT company during the day, so, you know, we're always thinking about OTA updates and
that sort of thing. What was the software update process like for these satellites?
We didn't actually get that far in terms of the development. I mean, it was like,
try to make the system work first and figure out what all components you were going to have
and then figure it out. But it was definitely a very key thing on our minds that a firmware update that kills the machine is not okay, right? Like, you have to be safe no matter what happens. And so there were a lot of thoughts towards, how do I actually get the system to come up in a mode where I can use one of the radios to basically be a serial port to the console port, and be able to upload things over XMODEM if I have to, kind of stuff.
Gotcha, gotcha. So after working on that, you already
mentioned the Open Compute Project and your involvement there. And maybe somewhat kind of tangential to that, I know you started getting involved in open source FPGA tooling.
Obviously, there's quite a bit of FPGA usage in data centers for things that are kind of like similar to a BMC or where they're doing some sort of operational task.
Once again, Nathaniel and I talked about that on the Oxide sled. But, you know, how did you get interested in FPGAs and then especially get interested in the development tooling? I mean, from my perspective, it's hard
not to get interested in it when you suffer through using it. But what was your path to
kind of getting exposed to that? I had used FPGAs a little bit in college.
And frankly, with the way Google did their server designs,
there wasn't much involvement.
The network group tended to do a lot more with FPGAs,
and I just didn't have a lot of exposure to it there.
But as we got into working on BMCs and things,
one of the things you find quickly in that space is there are two companies that make BMC processors. There's ASPEED and there's Nuvoton, and that's it. And part of that just
comes with, it's a very strange set of hardware capabilities that you're looking for. It's not
about the software. Why can't you use an STMicro for it? Well, I need like 15 I2C controllers.
And I need six PWM-controlled fans with tachometers.
And I also need custom protocols that the processor vendors make up.
And that's just not a thing that you find in general.
And so there was always this thing of, well, the ASPEED parts are not great. They're not well designed. And the Nuvoton ones are expensive.
So what if we made our own?
And, of course, the answer is that's ridiculous.
Nobody wants to make their own chips.
You can tell that was a certain time period at Google.
Right, right. And there's an announcement today about some new Arm server processors at Google, so...
Right, so that's changed over time. But at the time, I, you know, had discussions with, like, Bart Sano, and Bart was just like, I've done chip design before, we're not doing chips. And it was totally reasonable.
So there was always somebody that was like,
well, what if we used FPGAs for this?
And so somehow through the open compute work
and through those kinds of discussions
where we kept wanting to use FPGAs to replace these BMCs,
I got connected with Tim Ansell.
And Tim was trying to get together a pitch to the internal startup incubator at Google
to develop open source tooling for FPGAs. And I thought, you know what, this sounds really cool.
Like, I understand where this problem is. And the FPGA tooling is absolutely horrific.
So sure, why not, right? Like, this is an area where I can go in, and they'll give us a year to work on this and see where we get. And that was really the introduction to it. It was just, I kind of understood where the problem was and figured I could help, and they were happy to have me come join. It was only three of us really working on it. And a lot of it was actually not so much building the tooling as it was figuring out how to reverse engineer the chips themselves, which is a very different skill set.
Right. And I think you
worked on Project X-Ray, the Xilinx bitstream reverse engineering?
Yeah. So, I mean, Project X-Ray has an interesting history. Claire Wolf, who did Project IceStorm for the iCE40s, actually had started Project X-Ray much earlier and figured out that these are big, complicated devices and it was going to be really hard to do. And the iCE40 was an easier target. So this was actually picking up that work and going with it. And this was really the reason for having the project be done through the startup
incubator was to give us time to actually go figure out how to do this. And so it was three of us working on Project X-Ray, which was, you can think of it as coming up with arbitrary, like, designs that probably didn't make any sense, you know, feeding in Verilog or specific device placement things, and then setting really strict constraints to force very simple circuits to be placed in very specific locations on the chip, so that we could then take the resulting bitstreams and compare them against various permutations and see what bits changed. And from that, infer what behavior each bit actually was doing. So you'd have to settle out, was this something where this section had always changed, that's actually the ECC over this section of the configuration ROM, or, you know, the configuration data, or is this actually the bit that tells me, use this input to this LUT, kind of thing. And so a lot of it was writing tools that did that analysis. And I wrote a lot of tooling for actually picking apart the multiple layers of encoding that they do for the bitstream that actually gets stored in flash, to get to the actual configuration data.
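At its simplest, the correlation step reduces to diffing bitstreams from two builds that differ by one forced change and recording which bits flipped. Here is a toy version of that core operation; real prj-xray works over many permutations and a frame-addressed view of the configuration data, and the file names are hypothetical.

```python
#!/usr/bin/env python3
"""Toy bitstream diff: report the bit offsets that differ between two images,
e.g. builds where a single LUT input was forced on versus off."""
import sys

def load_bits(path):
    with open(path, "rb") as f:
        return f.read()

def flipped_bits(a, b):
    """Yield absolute bit offsets where the two images disagree."""
    for byte_off, (x, y) in enumerate(zip(a, b)):
        delta = x ^ y
        for bit in range(8):
            if delta & (1 << bit):
                yield byte_off * 8 + bit

if __name__ == "__main__":
    base, variant = sys.argv[1], sys.argv[2]   # e.g. lut_off.bit lut_on.bit
    diffs = list(flipped_bits(load_bits(base), load_bits(variant)))
    print(f"{len(diffs)} bits differ")
    for off in diffs[:20]:
        print(f"  bit {off}")
```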
And yeah, I mean, we got pretty far, but it turns out Xilinx 7-series FPGAs have a lot of different tile types, and there's a lot of bits in them.
And they're still working on that project,
actually, many years later. So it's, it's a tall order, but it was a lot of fun.
Absolutely. So did you also get involved in any way and kind of like some of these modern HDLs?
Because obviously, you know, you already mentioned Verilog. There's System Verilog, VHDL, but it seems like also on Hacker News every week, there's a new kind of like HDL that someone has done as part of their PhD thesis or something like that.
Did you have any involvement in those areas? Because I know there was SymbiFlow, right? Was that the name of the whole project? Yeah, that was the top-level name for doing Project X-Ray and trying to use available open source tooling to actually build out a toolchain around it.
So it's do the reverse engineering work and then be able to use that to build out a tool chain.
And there's there's just this small community that actually works on this kind of stuff.
And once you get introduced to that community, you also find that that's where all of these modern HDL folks come from.
So that's where I met folks like whitequark, who does Amaranth. I was introduced through that path of, hey, I'm working on Project X-Ray, and getting into
IRC chat rooms and talking about the tool chains and what's going on.
And so I know some of the folks that do some of them.
And I keep track of it because ultimately, when you approach it as like a compiler problem
or a software engineering problem, Verilog's a terrible language. And SystemVerilog is the C++ of HDLs. They're useful, but they give you a lot of foot guns.
And the modern HDLs, some of them do better at some areas than others. And, you know,
they all have their different trade-offs. And I think it's really just a renaissance of trying to come up with an HDL that better
represents the problem at hand.
But there's kind of this fundamental challenge that happens there, too: because the tooling is still nascent in the open source space (Yosys works but has its limitations, and nextpnr has its limitations), everybody falls back on, well, if my modern HDL generates Verilog, then I can feed it into the vendor toolchain.
And it's kind of like the JavaScript situation.
You know, you have TypeScript, but TypeScript really just compiles down to JavaScript anyway.
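As a small illustration of that "modern HDL emits Verilog" flow, here is a minimal Amaranth module converted to Verilog. This is a generic sketch: the import paths reflect recent Amaranth releases and may differ slightly between versions, and the emitted Verilog is what would then be handed to Yosys/nextpnr or a vendor toolchain.

# Minimal sketch, assuming a recent Amaranth release (amaranth / amaranth.back.verilog).
from amaranth import Elaboratable, Module, Signal
from amaranth.back import verilog

class Blinker(Elaboratable):
    def __init__(self, width=24):
        self.led = Signal()
        self.counter = Signal(width)

    def elaborate(self, platform):
        m = Module()
        m.d.sync += self.counter.eq(self.counter + 1)   # free-running counter
        m.d.comb += self.led.eq(self.counter[-1])       # blink on the MSB
        return m

top = Blinker()
# The generated Verilog is the hand-off point to the rest of the toolchain,
# much like TypeScript compiling down to JavaScript.
print(verilog.convert(top, ports=[top.led]))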
Right.
So you worked on that for a year, and then that was kind of when you started to wrap up your time at Google, right?
Yeah. I mean, the way that the startup incubator worked, you do your work for a year, and then you come in for a review, and then you do a review every six months to get renewed or not. And just after doing that first year, we did a lot of work, but I was
having difficulty seeing how the project would succeed with what the goals were being outlined
and where to go with it from a big business perspective. And having spent nine years at
Google at that point, like I had watched the company grow in a lot of different ways and
kind of change culturally and just decided it was kind of time for me to move on to something
smaller. As you may have gathered, I at this point had worked at two extremely large companies
and just had an opportunity finally where I could actually go work at a startup. Like I'm
at a point in my career where that works okay. And so I started just shopping around for what was
out there. And, you know, we didn't really touch on it, but while I was working on server designs and things, I got heavily involved in hardware and firmware security aspects, and more into PC security in general.
And so I ended up, you know, finding some startups that were working in that space and
moving over that way.
What kind of drew you into the security realm? Was it just like once you start understanding a system really deeply, you kind of have this natural kind of thinking about, you know, I wonder how this fails or I wonder where the holes are in this.
Was that kind of the impetus for you? I always had the aspect of looking at it from, you know, how are things going to fail?
Because that's what health monitoring and like, you know, automated diagnostics is. And, but
there's a slightly different angle when you start looking at how operating as a public cloud vendor
works. And that was really, like, there were two big shifts that happened while I was at Google. One was the Aurora attacks, which is a well-documented campaign against many different companies by threat actors that are believed to be associated with a nation state. And that affected us; my team was one of their targets, right? So
they were trying to figure out how you could do persistence by getting access to firmware and,
you know, building backdoors and things into firmware. And so that, that just kind of spooked
things in a lot of ways and really caused a lot of rapid changes to how we developed software and firmware and how we deployed things to production, et cetera. So I got a lot of front-row experience there too. We had been pretty free and open in how we did all of this, and then it got locked down very quickly.
But then as Google became a public cloud provider, there was another thing of, well,
now I'm running untrusted code. And so how do I actually provide isolation? How do I do all these different things? And I was not actually working on that,
those set of problems, but I was working on the hardware they were running on.
And so we'd get these questions of, you know, how do you bootstrap from nothing? And how do you know
that it's trusted? And it mostly tied into, you know, manufacturing operations. How do I know that
when I receive a server from the manufacturer at the data center door that it wasn't tampered with at some point? And we knew that we were a
big enough target that we had to pay attention to all of these cases that a lot of folks just
don't, right? Like when you buy a laptop, you probably assume that you're not interesting
enough for a three-letter agency to intercept it and do something to it before it arrives at your
door. But when you're Google, you don't get to do that, right? You have to think through some of those cases and how you would actually find it. So that's a lot of where my introduction came into it, and working on some of those things, like I worked on how Titan was used as the root of trust and measurement in the systems and the server designs.
And I co-wrote one of the white papers on how that system works.
And so I had a basis there from that perspective.
But then I ended up joining a startup that was much more focused on application security.
Like, how do you defend against people sending malicious attacks to your application running on a server?
Which was just a very different field for me, but it was also at a startup where I was working with like 10 people.
So it was trying something different.
Right.
So talking a little bit about the Titan system that you were mentioning there,
can you talk a little bit about how that system works
and maybe some of the strategies you'll have at the hardware slash firmware level for mitigating
some of these attacks? Yeah. I mean, a lot of folks often think about, you know,
UEFI Secure Boot or Intel Boot Guard as this way of defending sort of what we ended up calling the
first instruction integrity.
When I press the power button,
how do I know that the first instruction
that the CPU executes is actually trustworthy?
Because once that instruction executes,
I have no control over...
Like, if I can't trust that first instruction,
I can't trust the second instruction either.
Right.
And it's a very difficult problem, actually,
because the PC architecture has grown over time, and this was never a concern in the early days.
So all of your firmware is stored on a flash chip that is outside of the chipset.
It's on the motherboard, for sure, but if you have to consider somebody coming through and having physical access to a machine, they could change the contents of that.
Or if you have an attacker who managed to gain access where they could write to that, then how would you know? How do you
detect tampering? And frankly, a lot of the question is not so much about defending. It's
not prevention, usually. That's certainly a goal, but the assumption is that someone's going to get through.
And so the question becomes, how do you detect and how do you mitigate or remediate?
And so that's a lot of what the Titan design was about.
It was, how do I do opportunistic prevention of someone writing to the firmware?
But also, if someone managed to get something in, how would I actually be able to power down the system, power it back up, and know that I had gotten full control of the system back, and that I didn't have some tampering in, you know, the system firmware somewhere? So in that design, it's actually an interposer on SPI. So essentially it sits between the host processor and the SPI flash and watches. And actually, it does more than watch.
It actually intercepts all of the traffic, all the read and write requests over the SPI interface for the contents of your firmware.
It makes a decision about whether or not it should actually pass it to the flash and whether or not it should get the contents back to the host system.
So it essentially gets to make decisions about, are you allowed to write to flash?
And it also does cryptographic verification. So when you write to the device, it actually is
building up a hash of the firmware and then validating that that matches a signature
before it will ever allow the write to actually complete. And so by acting as this intermediary, and also having control over
the system reset line, when you power on the system, it actually does a hash over the contents
of the flash, and then it verifies the signature chain, and only then does it actually let the
system boot. So you get to a level where attacking the system requires much more physical access.
It's not as simple as, oh, I get root access to the machine.
Now I can just write to the flash.
Nope, it's not going to let you.
And there's nothing you can really do about it other than physically get hands on it.
And even if you do, even if you physically change the contents of the SPI flash, it's still going to notice that you did that.
Right. So you'd have to actually physically manipulate the Titan component to make any headway there.
Right. And of course, there's a lot of technical limitations about this. SPI is very fast. I mean, it's usually only like 33 or 66 megahertz, which doesn't seem particularly fast, but the way the protocol works, when there
is a read command, you have two cycles before you actually have to send data. So you have very
limited time to make any decision or do anything. So there's a lot of small things there of getting
that system to work. And it's really a bolt-on solution. You know, it's not great. And this is why you see other systems like TPMs
are another system that Google doesn't actually,
didn't use up to that point.
I don't know if they do currently or not,
but it's another bolt-on system, right?
It's another way of adding something
to the PC infrastructure
without radically changing how the system works
to kind of add additional security
properties to the system. And you see that kind of trend over time that it's how do I take the PC
and add something to it to give me a little bit more guarantee about how it works.
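A rough Python sketch of the boot-time policy described above: measure (hash) the SPI flash contents, verify a signature over that measurement, and only then release the host from reset, while also filtering writes. Purely illustrative; this is not Titan's firmware, and the flash, reset-line, and signature-verification interfaces are hypothetical placeholders.

import hashlib

class FlashInterposer:
    """Illustrative model of an interposer sitting between the host and SPI flash."""
    def __init__(self, flash, reset_line, pubkey):
        self.flash = flash            # hypothetical object with .size, .read(), .write()
        self.reset_line = reset_line  # hypothetical object with .hold() / .release()
        self.pubkey = pubkey          # hypothetical object with .verify(digest, signature)

    def measure(self):
        """Hash the whole firmware region: the 'measurement' in measured boot."""
        h = hashlib.sha256()
        h.update(self.flash.read(0, self.flash.size))
        return h.digest()

    def verify_and_boot(self, signature):
        """Only release the host from reset if the measurement verifies."""
        if not self.pubkey.verify(self.measure(), signature):
            self.reset_line.hold()    # keep the host in reset: first instruction untrusted
            return False
        self.reset_line.release()     # first instruction can now be trusted
        return True

    def filter_write(self, offset, data, policy_allows):
        """Host writes only reach the flash if the policy allows them."""
        if policy_allows(offset, data):
            self.flash.write(offset, data)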
Gotcha. So as you moved into some of these security focused startups, I think you had
kind of more of like a researcher role,
perhaps, than you previously had been more of an engineer. Was there a practical distinction
between those two? And what were the differences, if so? There was. I mean, certainly, as I moved
into working at Eclypsium, which is much more focused on firmware, there was a need for
someone who actually understood the system infrastructure and could do research work
of how could I break the system, right?
Like, how do I get ahead of attackers and the things that they could find, what vulnerabilities
they could find?
But there was also a need for somebody who understood that and could translate it into how you would build defenses for it.
So you'll hear security folks talk about red team, blue team,
and now there's purple teaming because why not have both?
And red teaming is figuring out the attacks, right? It's offensive trying to figure out how
to break systems. And blue team is how you defend.
And so they needed someone who spanned those two teams.
So depending on the day, it was sitting down and writing in their application how to detect whether a system was configured in a way that would allow for certain vulnerabilities.
Like looking at chipset registers, looking at flash configuration, et cetera, and reporting on that information so that you can make decisions about the risk of your fleet of machines.
But there were other days where it was, hey, I've got this machine.
Let's go see what we can break on it.
Right.
Like that's.
Right.
Yeah. Was it mostly vendors or companies that were running data-center-scale compute that would hire Eclypsium to come in and kind of like do this analysis?
Or what's the business model for that type of business?
I mean, the business model for the product is more selling to companies that have large deployments of machines as kind of a monitoring system, right? Think of it as like your antivirus deployments, right? Except it's something
that's scanning your machines, looking to see, do you have systems that have out-of-date firmware
that have known vulnerabilities? That kind of information. But the research side was more about keeping up interest, because a lot of folks just assume that attacks are going to be more application-focused or operating-system-focused.
And they just don't even need to think about hardware.
I bought my equipment from HP.
It's HP's problem.
And so a lot of it was, how do we keep finding new things to point out that the vendors aren't actually doing this work?
They're building the machines, but they're not actually thinking through the security problems.
And occasionally we would have a vendor who said, you know, we did something cool.
We want you to actually look at it and see if you can break it.
But that was very rare.
Most of the time it was just, hey, we bought this thing off eBay.
Let's go see what we can find.
And you'd be amazed at the stuff that was found. I mean, sometimes you just buy something off eBay
because it looks strange
and it turns out it was a vehicle-mounted computer that has a cellular modem and is still registered with FedEx's Active Directory domain.
You know, it's just too easy.
And other times it was,
we bought the server and we started poking around
and we were just like, it looked like maybe there was something that you could do over here, and we'll just keep poking at it. And then you find, oh, I can do an authentication bypass on the BMC. And then we'd use that to raise awareness, so that people thought about buying the solution of having a way to know what vulnerabilities were in there and how to do remediation.
Gotcha. That makes a lot of sense. So kind of like talking about BMC
vulnerabilities, one large thing you were a part of or kind of discovered while at Eclypsium was this USBAnywhere vulnerability. And you've given some talks on it, and we can certainly link those in the show notes as well as the report itself. But do you kind of want to run through what USBAnywhere is and also how you went about discovering it?
Yeah. USBAnywhere is a vulnerability related to BMC virtual media. So we kind of mentioned
this earlier: BMCs offer this lights-out management capability, and one of those things is, I don't want to have to walk into the data center to stick a CD into a machine to reinstall the operating system or to update the device drivers or whatever. So instead, it has the capability for mounting a CD image from your web browser as a virtual CD-ROM drive on a server somewhere else.
And I was always just curious how that actually worked, is really how it started.
I knew from my work on OpenBMC that the hardware level was a dedicated piece of hardware that
emulates a USB device.
So you see this in mobile phones.
This is kind of how USB On-The-Go works, where when you plug your phone into your computer
and it's like, here, you want to download your photos,
the phone chip actually has the same kind of hardware
where it can emulate any USB device.
It's a USB endpoint that doesn't have a fixed device to it.
It's not one specific thing.
So you write software to implement what it does, how it responds to requests.
So I knew that that existed in the BMCs and that the firmware was actually deciding what kind of devices it was.
That's how your keyboard
emulation works how your mouse emulation works how your you know a lot of different things but
the virtual media i was like how is this getting from my browser all the way there because it
doesn't seem like it was transferring the entire file over before starting it. And so it turns out that it's a horribly insecure protocol. What actually is happening is that on older BMCs, they would ship a Java application to you. On newer ones, they do it via HTML5, which is kind of even worse in some ways. But essentially, in the HTML5 version, there is a JavaScript library running in your browser that is an entire SCSI stack and an ISO file parser. So it's actually JavaScript in your browser that opens the file on your local machine and presents that as a block device to a virtual SCSI device that knows how to answer SCSI requests like it is a virtual CD-ROM drive. And that is connected over a WebSocket to the BMC. And then the BMC is effectively just forwarding the requests back and forth. It's actually speaking raw USB requests over that.
It happens to be sending USB mass storage, but it's actually just sending raw USB packets.
So this always comes as my example for when folks are like, how bad can it be?
I've seen the worst thing in programming. I'm like, have you ever seen a SCSI implementation in JavaScript? And so the
Java version is very similar. It happens to be written in Java with a JNI extension, but it's
the same basic problem. And it turns out that, you know, they had just built their own thing.
The vendor had a long time ago and had not really updated it.
And the way the ecosystem works around BMCs, there's a couple of key companies that make the hardware,
Nuvoton and ASPEED, and then there's a couple of key companies that make the operating systems for it.
So like AMI and Vertiv and a couple of others. And then the companies that actually manufacture your motherboards
just license both of those.
They buy the chips and they license the OS,
and they do a little bit of customization to it.
And then that is sold to whoever puts the name on the box,
and then that's what gets sent to you.
And so fixing bugs is a complicated process.
And so oftentimes the same bugs show up again
or they just never fix it
because it's hard to get the communication
across all these teams and companies.
So in this case, it was doing things like
it was unauthenticated.
If you did authenticate,
it was trying to use encryption,
but it was a very, very, very old encryption
that was trivial to break.
It had hard-coded passwords in it. It
had just all sorts of things. And so my proof of concept for this initially was using a framework
called FaceDancer that lets you develop virtual USB devices in Python. And I wrote a backend that
connected it so it would actually connect to the virtual media or virtual media service on the BMC.
And the very first time I got it working, FaceDancer's default is to emulate a TI calculator.
And so I logged into the server and I ran lsusb, and it says that there's a TI-83 Plus connected.
I'm like, I don't think that's right.
Right.
But yeah, I ended up doing a demo where I actually plugged in a virtual USB stick across the internet over to a server many states away.
And it was actually just a file on my drive locally.
And this is kind of terrifying because this was unauthenticated access.
So you could literally, if you could find a way to talk to one of these BMCs, plug in any USB device you wanted, which seems like not a big deal until you think about it a little bit more. That's actually terrifying, right?
Wow. So what was the reception like when you put this out? Obviously, we've had recently some vulnerability discoveries
that caused some hysteria, I would say.
Was there a lot of feedback when you put this out there?
You know, on one hand,
there was a lot of folks that showed up that said,
oh, you figured this out too, right?
Like this was just sort of an open secret that BMCs...
I actually had someone write to me and say that
finding vulnerabilities in BMCs was unsportsmanlike.
So, I mean, there was that sort of reception and then on the other hand there were
a lot of folks who just had no idea that this functionality existed in the first place
and then to see that you could abuse it in this way was absolutely terrifying
so I did end up getting some national press and things for it but it really didn't have a huge...
I mean, I did some talks on it, but it didn't cause the fervor
that some of the more recent vulnerabilities
that show up on literally everything have caused.
And part of that was just that most people try not to put
their BMCs on the internet.
It turned out that as part of this, I have a friend
who happens to run an internet exchange,
and so he let me have a VM, and I ran a scan
of the entire internet for
affected BMCs, and it came back. And so, you know, part of the story was actually that I found 30,000-plus BMCs that I could just arbitrarily plug USB devices into.
Right. Because, I mean, obviously the USB part is bad, but you have to have access to it. Right. So it being on the network is kind of the prereq. And I mean, how does that happen?
A lot of people just don't know. I mean, they just, they're like, Oh,
I got this cool feature from the vendor that says I can do remote management.
Let me just plug that in so I can do it from home.
And not thinking about the vulnerabilities that it might have or anything else.
And so there's just, how do you actually do the risk assessment of equipment that you're buying?
And that's a perennial problem.
You know, it's, there doesn't seem to be a good answer to that.
Right, right.
So kind of like bringing the security and your previous server work together from Eclipsium, you went to Oxide, which listeners of the show will be familiar with at this point after the last episode.
What was kind of the decision like to join Oxide, and what was the kind of role of looking into security there? And I know you, and I believe your colleague, Laura Abbott, also
identified some not similar vulnerabilities, but vulnerabilities at similar layers of the stack
there. Yeah. What was the decision like to join Oxide and what kind of work did you do there?
Yeah. So I had been at Eclypsium for a while and kind of had the viewpoint of, it was fun finding vulnerabilities and reporting them.
Getting vendors to fix them was hard.
And actually building the detections for some of these things was also difficult.
And it just wasn't a good fit for me anymore.
Like I needed something a little bit more concrete.
I like building tools.
I like building systems and stuff like that. And so I had gotten in touch with Jess Frazelle about BMCs and that kind of stuff. One of the things that stuck out to me while I was still at Google
was folks coming up and saying,
hey, will you just sell us your machines?
And we don't want just the machine.
We actually want a software stack that goes with it.
Like, you already know how to run all this stuff.
We just want to buy it and use it.
And you would think this is small companies
or something that we're asking this.
No, no, no.
These were folks who were at the scale where they were buying hundreds to thousands of racks of machines from HP or Dell monthly, right? Like, they were at the scale where they were starting to question whether they should actually be designing their own machines or not. And so they were looking for a solution, and they were looking at Open Compute as, how do I get cheaper, but I need a more complete system. And so this had been front of mind, and I had actually tried to work internally at Google on how we could improve this.
How could we provide a more complete software and firmware stack for these machines so that people can
actually buy them and use them. And it wasn't very receptive at Google. You know, Google's
position was basically we're designing machines for ourselves and we're sharing with the industry
so that we don't have to keep doing the new development work all on our own and that it
actually gets picked up by other companies. So when Bryan and Jess came around and they were talking about, hey, let's do this thing, I'm like, I know exactly how to do this.
Like, I know that there's a market for this.
I've been talking with folks for years about this.
I know what a lot of the problems are, but I don't think you're going to get funding, because I don't think you're going to be able to convince any VC to fund you for long enough to actually build the system that somebody would buy.
Right.
And they succeeded, right?
They came back with, they had got their seed round and it was enough.
And it was on terms that was,
this is going to take years to build and that's okay. So at that point I was like, all right,
I'm in. I came in not really focused specifically on security. I certainly brought that background,
but it was definitely just a, let's build this the right way, right? What would open compute
look like if you were actually trying to build it
as a sold product,
as opposed to a repository of standards
and like prototypes for other people
to draw their technology ideas from?
And yeah, I mean, I spent a lot of time there. I worked on all sorts of different pieces of the stack, because I wasn't particularly focused on any one piece of the system.
I definitely had a significant security role there over time, like looking at how to build the root of trust and measurement, how to do firmware signing, how to build out a system where, at power-on, you can trust the system in much the same way
that Google was designing their systems to do the same thing. But I also worked on fan control,
and I worked on, you know, how do you do the lights out management features? And
how do you actually power up the 100-gig network cards, and all sorts of different things.
Right. And so you'd had this kind of experience with BMCs made by two vendors that were not necessarily great. When you had the opportunity to, you know, do it the right way, what was the right way?
Well, I don't know that there's any right way necessarily. But there was definitely a, if I don't have to play in the sandbox of building a PC-compatible system, if I'm not assuming that I'm just going to boot an off-the-shelf operating system with no work, then I can throw out a lot of things. And if I can do that, then I don't need these specialized chips; I can build it out of commodity chips. Which, by the way, Nuvoton and ASPEED didn't really want to sell to me anyway, because we're a startup and they don't want to sell to us, right? So just looking at it of,
how do we build this out in a way that meets the needs? And it turned out, like, one of the things is that BMCs were starting to be seen as a way of building your root of trust and measurement, so the repository of information about what was booted on the system. And that's kind of scary, because the BMC also is a network-accessible device, and that just sort of feels wrong, right? I don't want my secure element to also be network accessible, necessarily. That's odd. So a lot of it was just trying to pick apart what sort of functionality we actually need. Do I actually need a graphical console to the machine? And the answer is no, right? Like, Oxide's thesis is you don't actually need to know any of the details of the
machine. That's not what you're buying. You're buying a rack and the rack has the interface.
The notion that the machine, each individual machine is booting an operating system and has
management controllers is an internal detail that can be hidden from you.
And not necessarily like we're trying to intentionally hide the information, just like
you need the output of that information, not all the details of how it worked. And you don't need
to write the software to deal with that problem. That's my job. And so being able to split that up
into multiple microcontrollers and design the system to have all the features that we always wanted. You know, Nathaniel mentioned the Ignition system, where you can actually control the machines even if they're powered off, even if 90% of the board is dead. Like, you can control the main power control to each tray. And that was a decision that we made, you know, a design choice of having that level of control of the system as a way of being able to recover from various scenarios.
You know, oh, the machine's locked up and the service processor crashed.
I can actually hit the reset even harder.
Right, right. So you mentioned having
kind of like multiple microcontrollers playing the role that the BMC traditionally would. So you have the service processor. Is that doing kind of like the bulk of the BMC-related management? And what other microcontrollers or, you know, FPGAs and other chips are kind of involved in providing a chorus of components that are playing the role of the BMC?
So Ignition is its own dedicated FPGA that just does this lowest level of being able to turn the power on and tell you if the initial power stage is functioning. And so that was designed to be as simple as possible; it should never fail, that was the goal. And so that becomes its own dedicated thing. Then the service processor, yeah, that's the bulk of the functionality. It's doing the fan control, it's monitoring the power rails. It's doing a lot of different things.
Inventory of the system, et cetera.
And then the other main piece of it is the actual root of trust.
So root of trust measurement is very similar in concept to what Titan actually was doing in the Google design.
But a little bit more integrated because we could.
And build on commodity hardware because we could.
Which took a lot of time. It took a lot of time to find a device, because we were very keen on not having to sign NDAs to get the documentation for these parts. Like, you can go to vendors and get details on actual secure elements, but they are heavily restricted on just getting the documentation
on it. Like even getting a brochure on which parts are available is a restricted document,
where we were looking at parts that, yes, there are some aspects of them that are under an NDA,
but generally you just had to maybe create an account
to download the bulk of the documentation.
And the intent was that we were going to open source all the firmware anyway.
So we wanted you to be able to have not only the firmware,
but also the ability to actually go comprehend what it's doing.
You should be able to read the data sheets.
So the root of trust got built off of another commodity part,
and it happened to be one of NXP's LPC55 series parts for a variety of reasons.
I had a very long and detailed document around how we did the assessment of selecting which device based upon its security properties.
And that was the whole point.
This was a device to manage the root of trust and do key management. You know,
it was very dedicated to that purpose. And, and, you know, compared to some of the other options,
it likely was, you know, had some security capabilities that were preferable, but you
also discovered that there were some issues.
What was the process like for identifying?
I think there's two vulnerabilities that y'all ended up discovering.
Is that right?
Yeah.
I mean, honestly, we weren't specifically looking for either one.
We were looking through the chips, how they worked and the potential for vulnerabilities, everything from, I sent it off to a friend of mine who does decapping to be able to look at it and see, does this have a security mesh on it? You know, just those high-level type things.
But there were definitely some features available in the chip. Like it has a physically unclonable function, which is basically it's an SRAM cell.
It's a memory cell that intentionally you can't write to.
You can only read from it.
And so it comes up in an uninitialized state.
But because of the silicon processing, the state is unique to the particular section of wafer that it was cut out of.
So you can apply a probability weighting to it
and get a consistent readout.
So you can say like, every time I power on,
I apply this probability to each bit of the SRAM readout.
And that gives me a way to generate a unique identifier for this device.
And since it's based upon the actual silicon properties, you can't copy it.
And so that was a cool piece, right?
We used that, actually.
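Here is a small Python sketch of the "probability weighting" idea described above: repeated power-up readouts of the PUF SRAM are noisy, but majority-voting the bits yields a stable, device-unique fingerprint. Real designs, including whatever the LPC55 does internally, layer error correction and key derivation on top; this is only the intuition.

import hashlib

def majority_vote(readouts):
    """readouts: list of equal-length bytes objects captured on separate power-ups."""
    stable = bytearray(len(readouts[0]))
    for bit in range(len(readouts[0]) * 8):
        ones = sum((r[bit // 8] >> (bit % 8)) & 1 for r in readouts)
        if ones * 2 > len(readouts):          # bit was 1 in most readouts
            stable[bit // 8] |= 1 << (bit % 8)
    return bytes(stable)

def device_id(readouts):
    """Hash the stabilized pattern into a fixed-size, device-unique identifier."""
    return hashlib.sha256(majority_vote(readouts)).hexdigest()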
But as far as the vulnerabilities, we weren't digging really that far into it.
We were actually just trying to figure out how to use the device.
A lot of microcontrollers nowadays have a built-in ROM that does their initial bootstrapping, offers some of the common library functions or power management type things.
Instead of you compiling in their code as a library, it's just you call this
section of the ROM. And we were just trying to understand how the code signing system worked
initially. And we were generating a variety of different binaries that we wrote with the
different headers, and we were signing with different key types and things, and just encountered unusual behavior. We started reverse engineering the ROM just because we needed to understand how it worked, because the documentation didn't tell you, and NXP wasn't really particularly forthcoming about how it worked. They just wanted you to use their tools, and their tools were insufficient for our process. So we had to dig deeper into it.
But then once it hit the chat room, it was sort of like,
hey, can somebody look at this?
This is odd.
Followed by a lot of active discussion around,
well, what is going on here?
And then, oh no.
Oh no.
And that really happened for both of them.
We were looking for something else entirely, trying to just understand how the part worked, and realized that you could manipulate it in ways that bypassed various countermeasures or caused the chip to do things that it's not supposed to do. And then, having come from the world of working at Eclypsium and doing vulnerability disclosures,
it was like, okay, I have a process for this.
I know how to do this.
So here's how we talk to everybody.
Here's how we coordinate it.
And honestly, the talk that Laura and I did at DEF CON
was entirely driven by spite.
The vendor's response was so poor and so frustrating that I submitted the talk to the call for proposals and, you know, didn't expect to get accepted. But we did, so we had fun.
That's awesome. How did y'all go about actually dumping the ROM and then analyzing the code that you got off of there?
So on those parts, the ROM is actually intended to be readable.
So it's actually just really trivial if you just write a program that goes and reads it or use a debugger to dump it.
It's just accessible space.
I mean, other chips I've worked
on, you have to do clever tricks. You have to find a vulnerability itself to get access to it or
something. But then once we had it, it's mostly loading it up in Ghidra and spending a lot of
time. Reverse engineering is a skill. A lot of people see it as this very difficult task,
and in a lot of ways it is, but it's really a puzzle problem.
It's finding patterns and recognizing things and building up an understanding of how things work from small levels to bigger understanding.
So we just had a couple folks that we would share it around with, and we all had Ghidra, and we would just pull up different sections of it and start working through, oh, this is a memcpy. Oh, this is a strlen. Oh, this lines up with these registers that the
datasheet says. And then you could start to infer what the behavior was. From the breadcrumbs that
I had, you could trace what it was actually doing. And the ROM was small enough that you
could mostly get through it and understand it. Whereas on some other devices I've worked on, or other software, once you have a megabyte of code, you can't realistically reverse engineer the whole thing.
Right, right. So, shifting the conversation now, I am going to ask you to explain quantum computing. You went on and you joined IonQ, and I have spent the bulk of my research for this episode on quantum computing, because I'm coming in with very little experience.
I do have one friend who is doing a PhD in quantum error correction. So I'm familiar with
some of the constraints of quantum computing. But for the listeners and for me, talk about,
one, your decision to join IonQ, but also,
can you give us kind of a bit of a primer on quantum computing?
Sure. So I was looking for a place to work. I had moved to the Seattle area
while I was working at Oxide. And for reasons, I was just looking for something in this area.
And I honestly applied to be director of security at the company.
And in the interviews, they said,
oh, well, we're not actually hiring for that role right now.
We've decided we're not going to hire anyone for that role.
But your background is fascinating.
And we would just like you to interview with some other folks about this embedded control systems.
I'm like, sure.
I know absolutely nothing about quantum computers.
But I just assume that you're going to have physical hardware.
And I'm going to understand the security properties of this.
And we'll figure it out.
And between all of the work that we've talked about, about, you know, processor design and health monitoring aspects and flight control systems and all these different pieces, it's just
like, yeah, I've worked on embedded real-time control systems sometimes, you know, and I have
some understanding there. So I really came in knowing nothing
and not expecting to end up in this team.
As for how quantum computing works,
I always get asked this question
after like an hour of discussion,
and it's great.
So here's the capsule summary.
Quantum computing,
there's actually a really great tutorial on this online.
I think I sent this to you, quantum.country, that kind of walks you through thinking about how quantum computers work from a programming-centric or algorithm-centric model.
And it kind of walks you through the basics.
But the top-level idea is that a quantum computer is something that can represent or do computation of quantum mechanics more efficiently than a classical computer. The practical aspects are
that you're really dealing with this model where somehow you have a thing called a qubit,
and we call it a qubit because it's quantum.
And it's not actually a bit.
It doesn't have just a zero or a one state.
It's a probability function.
And often what people express it as
is what they call the Bloch sphere,
where imagine you have a sphere
and the north pole of the sphere is the zero basis state, and the south pole is the one basis state.
Or vice versa. It doesn't actually matter.
There is common nomenclature for it, but I don't remember what it is.
The idea with the Bloch sphere is that it's a unit sphere. So you've got the two poles defined as known states, and then every other point on the surface is a unit vector in spherical coordinates. And wherever you are on
the surface defines a probability that if I measure that qubit when you're at that position,
whichever pole you're closest to, you have a higher probability
of going to it. So if you're at the equator, you have a 50-50 chance of going to either side.
But if I'm a quarter of the way up the north, then I'm like 75% chance of going to the north.
So I would be a zero, right? So the output of a qubit is always either a zero or one,
but it's a probability. And so quantum computing is mostly around doing operations to move the point
around the unit sphere, such that when you measure it, you get some result
that collapses to a zero or a one. And an interesting part about quantum computation
is when you do the measurement, you destroy all of the other information about where exactly you were on the sphere. You only
know the output of a zero or a one. You don't know anything else. But while you're doing the
computation, you actually can control very finely where you are in terms of spherical coordinates.
You can do sort of arbitrary rotations around the sphere.
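A toy Python model of the single-qubit picture just described: the state is a point on the Bloch sphere, its polar angle sets the probability of reading a zero or a one, a gate just moves the point, and measurement collapses it and throws everything else away. It ignores the azimuthal phase entirely and uses no quantum library; it is only meant to make the probability language concrete.

import math
import random

class ToyQubit:
    def __init__(self):
        self.theta = 0.0                    # 0 = north pole (|0>), pi = south pole (|1>)

    def rotate(self, angle):
        """A '1Q gate' in this toy model: move the point by some angle."""
        self.theta = (self.theta + angle) % (2 * math.pi)

    def p_zero(self):
        """Probability of reading 0 at the current position on the sphere."""
        return math.cos(self.theta / 2) ** 2

    def measure(self):
        """Collapse to 0 or 1; all other information about the state is destroyed."""
        outcome = 0 if random.random() < self.p_zero() else 1
        self.theta = 0.0 if outcome == 0 else math.pi
        return outcome

q = ToyQubit()
q.rotate(math.pi / 2)                       # move to the equator: a 50/50 outcome
print(q.p_zero(), q.measure())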
So that's the computational basis of how this kind of works
and how the algorithms work and other stuff
is a whole separate discussion
around quantum computing theory.
And the quantum computing practice
is more of what I deal with, how the machine works.
And to that end, you know, quantum computers
are kind of at the state that classical computing was in maybe the 40s, where there's a theory of how quantum computing works.
There's a couple of specific use cases that show promise for why you might use it.
But nobody really knows how to build a good one yet.
And you're trying to figure out the raw technologies that actually make it work.
So like with classical computing, there were people doing relay computers.
There were people doing vacuum tubes.
There were people doing all sorts of things, right?
And for data storage, we had Williams tubes and Mercury delay lines and all sorts of different things, right?
Because we just didn't know how.
Well, quantum computing is kind of the same way.
We know that you can build quantum computers out of transmons and out of trapped ions and out of superconducting systems and all sorts of things.
But we don't really know which one of these is going to be a good way of doing it.
And so everybody's going in different directions. And similarly, there's no coherent or consistent programming model, I guess would be the right word, around what the correct set of operations would be.
We don't know that incrementing one makes sense as an operation, where that makes obvious sense in a binary world.
It took a while to figure out that using binary was the right answer, and that when you had binary, you should use two's complement to express the numbers and all those things.
We're still figuring that out in the quantum world.
So at IonQ, where I work, they work exclusively on trapped-ion quantum computers, where the idea is that you take a material; the systems that IonQ produces use ytterbium. And basically you use a laser to ablate an ytterbium target to cause a plume of ions in a cryostat chamber that's under vacuum. And that plume gets up into what they call the loading chamber of the trap.
And the trap is effectively a series of electrodes kind of made in a line.
And you've got the series of electrodes that constrain the ions
using both RF and DC voltages
to hold it in X, Y, and Z coordinate space.
And so you're literally using an RF pattern as well as DC to force this ion to sit
kind of in space and then be able to move it along in a line.
And you do this so that you can hold the ion in very specific places so
you can hit it with lasers. And the reason you do that is if you remember your chemistry from
college or high school, remember how atoms have S shells and P shells and D shells and
all these things where your electron states and your energy levels, yeah, that's what we're playing with.
We're holding that ion there, and it's an ion, so it's electrically charged. And we're using laser pulses to essentially inject and remove energy to cause an outer electron to move between different energy states.
And the reason we do that is we can encode quantum information that way.
And then when you get to the measurement aspect, it's using lasers where in a different sequence, you hit it with a
laser and it either generates a photon or it does not. So that's how you do the collapsing onto
binary zero or one. So the whole system that I work on is basically a very complicated real-time system controlling
lasers very precisely to aim and fire waveforms, like modulation of lasers, at individual ions
to be able to then cause them to emit photons that we then have photomultiplier tubes that
we actually count how many photons come out to figure out what the probability
distribution was. Gotcha. Well, I probably shouldn't say gotcha because, you know, I'm sure
that there's a tremendous information loss that just happened over our internet connection here.
But the part that I guess I want to press in on a little bit. So, you know, I will give a plug to IonQ's resources page here, because the background and glossary section is extremely useful. So in this kind of period, I kind of think of it, and this may be a
faulty mental model here, so feel free to correct this. But, you know, when I was learning, I started in kind of a software background, and then when I was learning about hardware and working with FPGAs and stuff like that, you have this kind of settle time for whatever logic you're trying to represent in your sequence of circuits. There's this time where everything reaches its steady state, and then you get the output out of it. And so when we're talking about quantum computing, and you're saying this kind of collapses down to the photon, or the mapping onto one or zero, the compute that happens, compute, I'm doing air quotes for anyone on the podcast, in that interim, though, is happening in a non-binary fashion, right? So, and I'm going to use terms incorrectly here, I'm just going to go ahead and put that up front, the process of when you're talking about moving the point around the sphere there, is that what we refer to as entanglement? Or is that when you're taking, like, two of these qubits?
Okay, so this is at a higher level, right?
Because when you're talking about the sphere, you're talking about a single qubit from a
logical level, right?
Correct.
And then entanglement would be when you're taking two of those qubits and they are influencing
each other in a somewhat deterministic way.
Is that correct?
Yeah.
So, I mean, there's a couple different things there.
So, one, when we're talking about this single qubit and like that sphere model
and moving around the point, that's usually what we talk about as a 1Q gate, right?
Okay.
Often, when describing quantum computation, they use a circuit model. And so you think of it like an electrical circuit, except it's not; it's like a timeline where you have the qubit going from the left to the right, and you apply gates in sequence. And so it's like something goes into a gate and it comes out the gate, but it's actually the same qubit. It's just a timeline of when you did operations on it. A little confusing, but you can kind of think of it like you're drawing a schematic; it's just a schematic that can only go left to right. And so a 1Q gate only works on one qubit.
You have, you know, it's an operation that you do on a qubit,
and it changes just that qubit by itself.
It has no influence on others.
And that's doing some transformation.
It's moving that point around.
And the reason we talk about it as gates is there's just different operations you can do.
There's like a gate that can move an arbitrary amount,
or sorry, you can move pi rotation, you know, pi radian rotation around any axis,
right? That's an operation you can do. And you can do like pi divided by two. So like a quarter turn rotation around any axis. And so those become gates. And so there's like a native gate set that the machine actually
can do. And then there are sort of standardized gate sets that are how algorithms are developed.
And so just like we have in a computer, you might write in a higher level language and you compile
it down to the native instruction set of the machine. Same thing happens in quantum. It's just
you have a standardized gate set and then you have the native gate set, and there's a compilation process that happens.
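A toy illustration of that compilation step in Python: circuits written against a standardized gate set get rewritten into whatever the machine natively does. The gate names and the decomposition table here are generic textbook placeholders, not IonQ's actual native gate set.

import math

def native_rotation(qubit, axis, angle):
    """A stand-in for a native single-qubit rotation by 'angle' about 'axis'."""
    return ("rot", qubit, axis, angle)

# One textbook decomposition: H equals Ry(pi/2) followed by Rx(pi), up to a global phase.
STANDARD_TO_NATIVE = {
    "h": lambda q: [native_rotation(q, "y", math.pi / 2),
                    native_rotation(q, "x", math.pi)],
    "x": lambda q: [native_rotation(q, "x", math.pi)],
}

def compile_circuit(circuit):
    """circuit: list of (gate_name, qubit) pairs in the standardized set."""
    native = []
    for gate, qubit in circuit:
        native.extend(STANDARD_TO_NATIVE[gate](qubit))
    return native

print(compile_circuit([("h", 0), ("x", 1)]))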
With the 1Q gates, it really is just moving around that point around one qubit.
Entanglement is, as you said, when you actually cause a coupling between two qubits.
You're actually creating a situation where you're actually changing both of them at the same time in a way that causes them to have a corresponding effect later.
You're mixing the quantum information between these two. And from a physics perspective, it literally is doing a phase
coherent laser modulation on those two simultaneously, which is what makes my job
really, really hard is I have to do basically instruction level parallelism of firing waveforms
at two qubits simultaneously, or at two ions simultaneously. And so that creates the entanglement there. And that's also used to do things like teleportation, which is actually a real thing in quantum. So, yeah, I mean, that's kind of the level of the operations. It's usually 1Q or 2Q gates. There are higher-Q gates, like three or four, but most of the time, most of the systems work on either a 1Q or a 2Q gate, and it's usually a combination of the two. You run a couple 1Q and then you do a 2Q, and then whatever. But yeah, that's what makes the control system so difficult. Like,
you're actually having to sequence all that. And then the whole thing is,
just to add a little bit more complexity to it.
Please do.
Even when you're not actively running a gate, when you're not actually modulating it,
the electron has spin. So there is a phase precession happening just because time is passing.
I see.
And that is unique per qubit. So each qubit actually has a phase precession that's happening
at a different rate. And in order to get your entanglement to work correctly, you have to
actually introduce your operations in a phase-coherent manner with the phase precession of each qubit. So you're actually having to track the phase of the laser pulses and make your modulation match in phase.
Right. So there's no determinism to it? You're just measuring it and then factoring that into your subsequent operations?
Measuring it would be great, but if I measure it, I destroy state. So no, I'm actually predicting what the phase precession is going to be, just based on how much time has passed.
Okay, makes sense.
Yeah. Somewhat. Essentially, you're right: you run a calibration process to figure out what the phase precession for each qubit is, and you do that in a way where you destroy the state, but that's valid as long as the qubit is valid.
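A Python sketch of the bookkeeping just described: each qubit's precession frequency comes out of calibration, the expected phase at any instant is predicted from elapsed time rather than measured, and every pulse is emitted with a phase offset that keeps it coherent with that qubit's frame. The frequencies and the pulse interface here are invented for illustration.

import math

class PhaseTracker:
    def __init__(self, calibrated_freqs_hz):
        self.freqs = calibrated_freqs_hz          # {qubit: precession frequency in Hz}

    def expected_phase(self, qubit, t_seconds):
        """Predicted accumulated phase (radians) for this qubit at time t."""
        return (2 * math.pi * self.freqs[qubit] * t_seconds) % (2 * math.pi)

    def pulse_phase(self, qubit, t_seconds, gate_phase):
        """Phase to program into the waveform generator so the drive stays
        coherent with the qubit's rotating frame at time t."""
        return (gate_phase + self.expected_phase(qubit, t_seconds)) % (2 * math.pi)

tracker = PhaseTracker({0: 12_500.0, 1: 12_731.4})    # made-up per-qubit frequencies
print(tracker.pulse_phase(qubit=0, t_seconds=0.25, gate_phase=math.pi / 2))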
I see, I see. So moving back up to sort of the hardware-software interface, if you will, for these machines. Because I'm kind of envisioning that the embedded controllers you're working on are kind of like the microcode of this machine that we're going to have instructions for, perhaps.
And you need to have some sort of instruction set, right?
So you talked about this model of circuits
that are used to program a quantum computer.
What does an instruction set look like? And I assume that due to the wide variety in implementations of quantum computing that
it's fairly varied across these machines. It is. In a lot of ways, it's similar to
how the GPU world works, right? Every GPU has its own native instruction
set that they don't tell you what it is. What they do is they give you software stack that compiles
from standardized shader libraries or GPGPU frameworks down to that instruction set. And
that's kind of where we are today. And that's probably the model that will continue for quantum
stuff. But like, what does the instruction set actually look like? Well, you do express it at a high level as sort of these standardized gates, and then it gets compiled down to the native gate set. But it's still done as a circuit, right? It's like this time-progression circuit model. But that's not what the machine executes. You know, as we just talked about, it's like tracking phase-coherent laser modulations, among other things, right? We also have to aim the lasers, and so there's a whole separate series of things going on there to steer them, along with dozens of other things, because we can also move the ions around if we need to. We can change their locations. So the control to the machine,
the native instruction set of the machine, is actually much more low level, but it needs to be
incredibly time consistent, right? Like we actually have to have all of these operations happen
as simultaneous as we possibly can in a phase coherent way. So like, this is how this system works. It's actually
pretty common in the trapped ion world. And actually, it's somewhat common in the superconducting
world of you have the same problem, you can effectively think of it as you have a whole
bunch of arbitrary waveform generators. And those are going off and being connected as to control a
variety of different types of devices, but you basically just have arbitrary waveform generators.
You hook them to a common clock
so that they are all locked to the same reference,
so they are now phase coherent.
And then you are running programs
where each instruction,
each one of these waveform generators
is getting its own instruction stream.
So they're all running independent programs, but they're running off of the same clock tick, so they're executing the next instruction simultaneously across all of them.
That does map pretty well onto, like, the GPU kernel kind of model, at least conceptually.
It does. I mean, there's a lot of delicacy about how it works. Right. But yeah, I mean, in a lot of cases, you can conceive of it as a VLIW machine, where you just have an instruction per AWG channel, or per whatever you need. And that works. That does work up to a point, until you actually need to do some sort of conditional behavior.
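A toy Python version of the "many arbitrary waveform generators, one clock" model described above: each channel has its own instruction stream and all channels advance in lockstep on a shared tick. The instruction format is invented; the point is just the VLIW-like structure, and why a data-dependent branch on one channel is so disruptive.

def run_lockstep(channel_programs):
    """channel_programs: {channel_name: [instruction, ...]}, all the same length."""
    length = len(next(iter(channel_programs.values())))
    assert all(len(p) == length for p in channel_programs.values()), \
        "every channel needs an instruction (or a nop) for every tick"
    for tick in range(length):
        # On each shared clock tick, every generator issues its next instruction
        # simultaneously; there is no per-channel stalling.
        issued = {ch: prog[tick] for ch, prog in channel_programs.items()}
        print(f"tick {tick}: {issued}")

run_lockstep({
    "awg0": [("pulse", "ion0", 1.57), ("nop",), ("pulse", "ion0", 3.14)],
    "awg1": [("nop",), ("pulse", "ion1", 1.57), ("pulse", "ion1", 3.14)],
})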
So kind of state of the art is, well, I can compile a circuit and I can execute it as long
as it's a straight line code, right? Like I have a basic block that's just executing
straightforward. And then I do a measurement at the end, and that's fine.
But if I need to do a measurement midway through the circuit where I actually have moved some of
the ions out of the way, so I'm only measuring a couple of them, but the other ones retain their
quantum information, and then I want to change the behavior of the circuit based upon that
intermediate measurement, well, now across all of those processors, how do you do control flow? Right. Which is, again, similar to the GPU problem. If you've ever done GPGPU programming, you are basically writing giant SIMD-type operations. It can be MIMD-ish, but with diverging threads, right? If you have a program where some of the GPU processors take a branch one direction, but others go the other way. If you
have a data dependent branch, that's a performance penalty there. In the quantum world, that's
disastrous, right? It's really hard to model and figure out how that's going to work. And so that's
an active area of development. You know, What does an instruction stream look like for these machines
to allow that type of operation?
Because as you mentioned,
you have a friend that's working on quantum error correction.
You need that kind of model.
I need to be able to do some amount of work
and then be able to look at a couple of the qubits
to decide whether or not I need to do error correction.
Right. So you can't, essentially it's unacceptable to introduce a stall, because the system is just going to have looked totally different while you were not doing things. Or if you introduce a stall, everyone has to stall, and you have to be keeping track of the phase precession of every single qubit during that stall.
That makes sense.
Yeah.
Okay.
So, naturally, and this is also a little bit informed by listening to some podcasts interviewing folks from IonQ as well: the GPU comparison is helpful here in that you don't want to do all
computation on a GPU, right? So there are some operations that a GPU could be extremely
useful for. I'm sure there are similar ones for a quantum computer, though I'm not aware of all of the different
ones available. But, you know, there is a portion of the
sequence of a program, typically, that is going to be, you know,
massively parallel or something, but there's usually a lot of setup or teardown afterwards, if you
will. And so what I've heard a lot in this kind of quantum computing space is folks talk about hybrid models, where you're kind of mixing classical and quantum computing.
And, you know, that brings up a lot of different questions, both on the instruction set front.
Right. So how do you model instructions that span across that?
And then also on, you know, an interconnect front, you know, what is the speed between your classical and quantum computing there?
Is the work that you're doing examining that space as well?
That is exactly what I'm working on.
It's literally that exact problem: trying to understand, as you said, how close do you need to have the classical and quantum,
and how do you exchange data and control flow between them. As you said, as we got to,
stalling instructions on a quantum computer is really hard, and for more reasons than
you might expect, right? Not only do you have to track the phase precession, but, and we didn't really cover this earlier in the physics part, when you actually initialize the state of your qubit,
it's only going to maintain coherence for so long. So basically the quantum information will
deteriorate over time. And so you only get a second or two before the state's gone, right?
That's as long as you can run the quantum program.
And then there's also another dimension of the problem, which is that quantum computers
are effectively analog.
They're very similar to an analog computer in that you have infinite precision because
you're ultimately moving that point around the sphere.
You can literally move it in as tiny an increment as you want,
but there's error, right?
Just like when you had an analog computer
and you build an integrator,
there's error introduced just by noise of the op amps
and other things going on in the system
or the tolerances in the resistors.
Same problem in the quantum space.
So you end up with this notion of fidelity.
And so the number of gates that you can apply to a qubit
introduces cumulative error
until you get to a point where you can't actually discern what the output is anymore.
So this is kind of the driving constraint on how much computation you can do: how long do I have before coherence is lost, and how many gates can I run in that amount of time and still get a usable result out?
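As a back-of-the-envelope illustration of that trade-off, here is a tiny Python sketch: per-gate error accumulates multiplicatively, and the whole circuit also has to fit inside the coherence window. Every number below is made up for illustration; none of them are IonQ specifications.

```python
import math

per_gate_fidelity = 0.998     # assumed average gate fidelity (illustrative)
gate_time_s = 200e-6          # assumed time per gate (illustrative)
coherence_window_s = 1.0      # "a second or two" of usable coherence
min_usable_fidelity = 0.5     # below this, the output is lost in the noise

# Depth limit from the coherence time alone.
depth_from_time = int(coherence_window_s / gate_time_s)

# Depth limit from accumulated gate error: overall fidelity ~ f**n, so solve f**n >= min.
depth_from_error = int(math.log(min_usable_fidelity) / math.log(per_gate_fidelity))

print(f"time budget allows   ~{depth_from_time} gates")
print(f"error budget allows  ~{depth_from_error} gates")
print(f"usable circuit depth ~{min(depth_from_time, depth_from_error)} gates")
```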
So box all of that in with: by the way, while I'm doing that, in that one-second interval,
I want to take a measurement and run some computation on it and come back and change
what I'm going to do. So I can't stall for, you know, a minute to go run a different classical
operation and come back to it. Even waiting a hundred milliseconds for a web service request to come back is a tenth of my coherence window.
Right. So do you bring the classical computation right into the control system and actually treat it like the quantum operations are just specialized functional units,
and be able to do sort of microarchitectural design of passing data between registers and
queuing it up, so that, oh, it did the measurement, and then that's being queued into an ALU that's
going to turn the result around? That's one way of doing it. Or do you have a separate
processor that's right nearby that can punt the data back and run some straight-line classical code and then push a result back the other way? And that's a very open area that the industry is trying to figure out. And the answer is not clear, because sometimes you want to have a model where the classical computation is what we call between shots. So between those coherence windows, you maybe run five different runs of the circuit, which
each has its own one second coherence window.
And then you look at the results of those five to decide what you're going to change
about it, right?
You're doing some sort of statistical analysis.
So you do a little bit of classical and then you do more quantum.
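That between-shots pattern looks roughly like the loop below, a hypothetical Python sketch: run the circuit for a handful of shots (each shot gets its own coherence window), do some classical statistics on the outcomes, adjust a parameter, and go again. The quantum_execute function and the parameter names are made-up stand-ins for whatever service or control system actually runs the circuit.

```python
import random

def quantum_execute(theta, shots=5):
    """Stand-in for the real machine: returns one 0/1 outcome per shot."""
    p_one = min(1.0, abs(theta) / 3.14159)        # toy model, not real physics
    return [1 if random.random() < p_one else 0 for _ in range(shots)]

theta = 0.5
for iteration in range(10):
    outcomes = quantum_execute(theta, shots=5)    # five separate coherence windows
    estimate = sum(outcomes) / len(outcomes)      # classical statistics between shots
    theta += 0.2 * (0.8 - estimate)               # classical update before the next batch
    print(f"iter {iteration}: estimate={estimate:.2f}, next theta={theta:.3f}")
```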
But there are other times,
like with error correction,
where you actually want to do
that mid-circuit measurement
and insert it
and different time scales get involved.
So yeah, this is a problem
that I'm actively working on
and the industry as a whole
is working on
of how do you introduce
these capabilities?
How do you even build,
like what does it look like
conceptually from like an instruction set standpoint is one part of the problem.
What does this look like from a language perspective is a whole separate problem.
What does this look like from a microarchitectural standpoint is a whole other problem.
And then there's all the resulting problems that we create in the physics when we actually do classical computation and we're not running quantum stuff.
Right. So in the absence of kind of like this generalized computing model, or, you know, an instruction set or something like that,
are most of the present-day applications that are run on quantum computers
essentially a case where you're just rewriting the microarchitecture to do a very specific thing? Or is there
kind of an interim state that we're in right now that allows us to, you know, do useful things
on quantum computers without having a fully generalizable compute interface?
So the very high level, like the circuit level with standardized gates, is where most of the algorithm research and, like, useful work is done. And then it's
internal compilation phases and things where it gets changed into the native gate set and then
basically does get compiled into what would look like a microcode, you know, like a VLIW-style
microcode that's running. Right. And so that's usually hidden by whatever provider
you're using, right? Because most quantum computers today are as a service, not outright purchased. So you can submit a job via one of the common frameworks like Qiskit or, you know, things like that, to AWS's Braket service, and it'll run on a simulator or on a quantum computer. And basically, you know,
once it gets to the simulator or the quantum computer, that's where the compilation to the
native instruction set happens. So it's a little bit isolated, similar to how GPGPU works,
right? You write it against more of a common language, and then it does that compilation
behind the scenes for you. But the expression of how you interleave
classical operations with quantum operations
has traditionally been more of the GPU kernel model,
where I'm going to run...
I write a Python program that builds a quantum circuit
and then sends that
off to the service. It runs it, it gives me the result back. And then I look at the results
in my Python program and I do, you know, a little bit of work and then I submit another job.
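A minimal sketch of that workflow in Qiskit-style Python, as a rough illustration: build a circuit locally, hand it to a backend, and do the classical thinking in the outer Python loop between job submissions. The backend handle is assumed to come from whatever provider or simulator you use; the specific circuit and loop are invented for illustration.

```python
from qiskit import QuantumCircuit

def build_circuit(label):
    """Build a small circuit locally in the Python program."""
    qc = QuantumCircuit(2, 2)
    qc.h(0)
    qc.cx(0, 1)
    qc.measure([0, 1], [0, 1])
    qc.name = f"circuit_{label}"
    return qc

def run_once(backend, qc, shots=1000):
    """Ship the circuit off to the service and wait for counts to come back."""
    job = backend.run(qc, shots=shots)
    return job.result().get_counts()

# The classical "outer loop" lives entirely in this Python process, far away
# from the machine; every iteration is a whole new job submission.
# for step in range(3):
#     counts = run_once(backend, build_circuit(step))
#     ... inspect counts, decide what circuit to build and submit next ...
```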
And so there's a lot of active work around how do you move that lower? How do you intermingle those
classical operations more and make this more expressive, right? Like most of the time,
you don't want to write just a quantum circuit with a separate classical piece around it. That's, you know,
most of the problems that you start to look at are a little bit more complicated and intermingled
than that. Makes sense. So I think we're going to have to have another check-in with you on
quantum computing.
Perhaps I'll do even more research and come with some more questions.
But I definitely appreciate you being willing to dive in depth on that with us here.
I did want to kind of wrap up by talking about some of the things you do outside of, you know, your day job responsibilities. One of the things we chatted about
leading up to this conversation is mentoring. And so that, as well as, you know,
being on podcasts and things like that, one of the things that I've observed
through watching your work and your involvement in communities is your interest in bringing
other folks along on that journey as well. So I just wanted to give you an opportunity
to talk a little about, you know, what role that has played in your career and why that's
important to you.
Well, I mean, as we covered, I grew up in rural Ohio. Like, I literally grew up
around farmland. And so the chances of me ending up working in Silicon Valley at some of the, you know,
biggest names in computing is a wild story.
I wouldn't have believed it if you had told me.
And I recognize that a lot of that is luck and a lot of that is privilege.
So even though I had all that luck and all that privilege, I also recognized that I didn't have anybody giving me any direction there. I just kind of stumbled through that whole path. And it was difficult. I just didn't have mentors. I didn't know people. And I couldn't ask a lot of questions until much later in my career where I built a bigger social network. So quite a few years ago, I guess it'd be like
eight, seven or eight years ago, I started offering just public signups for people to do
mostly mock interviews initially, but it kind of expanded into resume reviews and mentoring
sessions of just, I've been in this industry for a long time. I've been a hiring manager.
I've interviewed hundreds of people.
I've worked at, you know, the big companies.
I've worked with VC.
I've worked with startups, et cetera.
Like, come ask me your questions.
Like, you know, where are you at?
Are you trying to change from a different industry to come in here?
Are you trying to graduate from a school and figure out where to go career-wise?
Are you a junior engineer that is kind of reaching a limit with your current job and you want to
bounce some ideas off? I'm just happy to sit down and chat with folks. And it's really just to be
that resource that I wish I had had. Some folks do similar things, but they actually ask for, you know, pay to
compensate for their time.
And for me, it's just: no, it's free, right?
I set aside a certain amount of time each week and have a sign-up,
and folks sign up, and I show up and we just have chats.
That's awesome. I'm sure that's hugely impactful for
a lot of folks. And, you know, alongside that, as I mentioned earlier, you've also been pretty
instrumental in going out and kind of sharing things that you've worked on. I know we already
talked about your DEF CON talk, but I came across a talk where you were
detailing some of your work on engine motors. So things outside of your day job,
although that does seem somewhat related to some of the control systems you've worked on,
perhaps. But what's kind of your motivation for sharing your work more broadly as well?
I mean, largely, I just enjoy working on various things, right? Like, cars is a hobby for me. I'm not a race car
driver, you know, that's not my profession. But do I enjoy driving on a racetrack? Yes.
Do I like tinkering with cars? Absolutely. So when we get into where does that overlap with electronics and software and things like that, a lot of it is just, I find it fascinating, and I want to show other people what's cool about it, and why it's an interesting problem. Right? Like, I suspect that there's a lot of folks who are just like, oh, my car is there, it takes me from A to B, and that's fine. And it's like, well, actually, there's a very complicated control system running under the hood that you don't know about, and all the
different subtle things that are happening. And I don't know if you're familiar with the 99%
Invisible podcast, but the ethos there is finding architecture or design out in the world, in subtle ways that you might not have really considered.
And there's a similar thing here of, well, let's go find some of the engineering and things that you may not have seen, which I found because I find it interesting and I'm aware of it.
And I just want to share that to have you think about it next time you are in your car, that there's actually a whole fleet of computers keeping track of how the engine itself is
running and making that work.
And, you know, it's a similar thing with coming and talking about quantum computing.
I don't expect that a lot of folks are going to actively write quantum computing programs,
but I'm sure you're all very interested in how it works now.
Right, right.
Absolutely.
Well, one of my goals here is definitely for those folks who are curious about how things work,
a lot of times kind of like below the surface of the interfaces that we commonly interact with,
that folks like you will come along and be willing to kind of go a little deeper down the stack
and talk about what's really going on.
So I appreciate all the times you've done that elsewhere, and I appreciate you coming by the show today
and sharing your expertise on a lot of different topics here.
It's been hugely informative for me, so I'm sure a lot of folks are going to get a lot out of it.
Yeah, well, I'm glad to have been available to come talk and share things.
Absolutely. All right, well, I think we can probably wrap it up here. But Rick, thank you for joining, and have a great rest of your week.
Thanks.