Microarch Club - 100: Nathanael Huffman
Episode Date: March 27, 2024Nathanael Huffman joins to talk about the magic of FPGAs, the role they play in domains ranging from medical imaging to data centers, and how software development principles can be applied to... logic design. We also discuss how Nathanael and the team at Oxide Computer Company built a new rack-scale computer while working remotely, and what exactly happens when it powers on and boots up.Nathanael on LinkedIn: https://www.linkedin.com/in/nathanael-huffman-5128024a/Nathanael on X: https://twitter.com/SyntheticGateNathanael on Mastodon: https://hachyderm.io/@SyntheticGateDetailed Show Notes: https://microarch.club/episodes/100
Transcript
Hey folks, Dan here. Today on the MicroArch Club podcast, I am joined by Nathanael Huffman.
Nathanael is an engineer at Oxide Computer Company, where he helps build their rack-scale
compute product. He previously worked at GE Healthcare on FPGAs for medical imaging systems.
In this episode, we talk about the role of FPGAs in CT scanners, proprietary and open-source FPGA
tooling, the boot process for the Oxide server,
and much more. Nathanael also shares a number of anecdotes about building the Oxide rack with a
remote team, including picking up servers in the parking lot of a cheese store in Wisconsin.
The thoughtfulness and pragmatism that Nathanael applies to building systems is evident throughout
our conversation, and I'm certain that you'll find his insights and experience as valuable as I did. With that, let's get into the conversation.
All right. Hey, Nathanael. Thanks for joining the show today.
Hey, thanks for having me.
Absolutely. Well, I wanted to, and I kind of try to do this every episode,
give a little bit of backstory for how we got connected. I've mentioned before in some of the
previous episodes, the Oxide and Friends podcast. It is a podcast, but there's the unique nature of it in that y'all do it on Discord.
And, you know, I can stop by and talk and that sort of thing.
So I've heard you on there a few times and specifically was interested in some of the
things you talked about in regards to some of the FPGAs on the board.
And also there were some vague mentions of your background. So I did a little bit more research and then reached out.
But definitely been intrigued and grateful, actually, for you all to have that podcast.
And so glad to now have you on my show here.
Well, thanks.
Yeah, it's been fun to do that.
It's been fun to be able to talk about stuff that we're doing at work, just kind of out in the open. So
yeah, absolutely. Well, you know, part of the goal of the show here is to kind of
both talk, I guess the primary goal is to talk about technical concepts and get deep on that.
But I think, you know, folks' backgrounds and experience inform some of
the technical decisions or career decisions that they've made that have kind of placed them in
their current situation. So I'd love to learn a little bit more about, you know, just you growing
up, if you were interested in computers, if you're interested in hardware or software,
and then kind of, you know, your education and that sort of thing.
Sure. Yeah, well, I've always been kind
of interested in how things work and understanding, you know, like the technology and the things
around you and why they work the way they work and how they work. I didn't have a big
experience with any kind of hardware growing up, really. You know, I'm too young to have remembered
all of like the transistor radios
and I'm just slightly too old
to have missed the like dawn of computers,
you know, so I'm, you know,
kind of like early 80s kid.
And, but as like, as I've grown up,
you know, I've been interested in,
you know, various technology.
You know, I was the guy who would program
the VCR, you know, for mom and dad, you know, if you remember those,
you know, everybody's always blinking 12, because no one could figure that out. And I always enjoyed,
you know, solving those kinds of problems and looking at that. And, you know, as I went through
middle school and high school, especially high school, I took a physics two class,
and physics two class had a lot of circuits in there and I found circuits to be
very fun. And so I was like, Oh, this, this is, you know, something very interesting. I like the
pictorial representation of things and the model behind it. And, you know, understanding like,
this is how like most of our world today works is, you know, these little tiny chips and resistors
and things doing stuff. And, uh, so, you know, my junior and senior
years, I took physics classes, you know, and I was starting to look into what I'd want to go do.
Electrical engineering seemed like an interesting spot. I was actually also interested in computer
engineering, which like for a lot of schools is almost the same thing as electrical engineering,
but with a more software-y focus.
But I wasn't totally convinced that that was maybe a mainstream,
like widely recognized degree at that point in time.
And so I went with a BSEE.
So I majored in electrical engineering.
But I tried to opt for all of my optional classes in engineering to be in the computer engineering curriculum. So I took
a lot of the programming classes, you know, I learned C, I learned microcontroller assembly,
all of that stuff that wasn't strictly required for EE. But then I also took the painful classes
like field and wave electromagnetics and, you know, stuff like that. So just I thought that
would be a little more of a broad marketing strategy. Like I was sure to get a job with a double E degree.
It was a little unclear to me about computer engineering.
And I didn't really know really the difference between them other than what I just explained where one had a little more software and one had a little less software.
So I went to Purdue.
I grew up in Indiana. So that was, Purdue is a state school,
has a good reputation and an excellent engineering program. I went there, and, you know, I had
talked with other people, and one of the big pieces of feedback that I got from people going before
me was like, try to find an internship, try to find a co-op, something like that. And so, uh, Purdue has
a neat co-op program where you sign up for this like rotational
co-op program and you rotate in and out and do, uh, the equivalent of like five semesters of work,
uh, throughout, throughout the, um, the course of your studies there. And, uh, Purdue is a little
bit flexible about how you go about doing that. Um, but you can sign up like these companies sign up
for, you know, coming on campus, interviewing people and you get connected up with a company
and you know, there's some rules, you got to keep your GPA above a certain level and that kind of
thing. But, um, they sign up to like host you for the five sessions. And so, you know, over the
course of your, your career and that turns your four year BS double E degree into a five year
degree, right? Because you're rotating out to go do work.
But you get the opportunity to do work.
And I would say, like, that's probably the biggest impact that anything had on me.
I mean, I'm not sure I would have stayed in engineering without having done that.
Really?
Because, like, a lot of the engineering classes were very heavy, theoretical, very heavy math.
And there's nothing wrong with the theory or the math, but it like, at least for me,
doesn't end up being very fun. And so, you know, you're like, wow, well, I, you know, I can do,
you know, Euler's method, or I can do like all of these, you know, like multivariate calculus,
like how cool am I? But, but like, how does that translate into real things? And so getting the
opportunity to go and work for a company, I got to see, oh, like, you don't actually spend your whole day, you know, sitting there with a calculator solving differential equations, you actually get to go build real things. And GE Medical Systems was one of the companies in the co-op program there at Purdue. And so, um, you know, I was looking at different companies and I interviewed with them. My, uh,
my dad was actually a business administrator for a group of radiologists and was like, Hey,
that GE Medical Systems company, like they make neat equipment and they do like some neat stuff
that really impacts people's lives. And, uh, so I interviewed with them and, you know, like the
long story short is it worked out,
and so I was up there for five rotations during my school career,
and then as I graduated Purdue, I got to rotate onto one of the teams that I had worked on
and just start my full-time career there.
Awesome, awesome. Well, that's a great experience to have.
I kind of, in a different way, identify with somewhat of the frustration with the theoretical side.
I studied computer science in a very theoretical computer science program.
I kind of feel like there's a bit of a spectrum there between hands-on, like we're going to build applications or learn kernel development, depending on what level you're at there.
And then there's more of the academic side. And I definitely enjoyed it. But it wasn't, it wasn't necessarily the greatest preparation for industry, right? Right. But I
definitely, you know, one of the commonalities I find when I'm talking to folks who, you know,
have worked in hardware or worked in, you know, chip design a lot on this podcast is that kind of strong foundational background in electrical engineering or maybe computer engineering.
But that's one of the things when I look back, you know, I think that everything kind of builds on that.
So I think it's a really useful background to have, even if it is, you know, you're not solving differential equations all day, per se.
Right, right.
So, yeah, but that's really cool.
And certainly I found the theory to have paid off long term.
It's just that, like, it's not a good taste of what, like, your day-to-day life is like.
Right.
But it's a lot of stuff that you do actually need to know in order to, like, understand how, you know, these circuits do actually work.
Right. And so at GE, did you have any kind of like background or interest in the medical field?
Or was that just kind of where it happened to have an engineering job?
So that was kind of, you know, I knew them very well after my five rotations of co-op
time.
And so I realized, you know, I like the technology.
I like the people, you know, as we'll get into it later, but like, I got exposed to FPGAs there. And so like,
I was pretty convinced, you know, once I learned about an FPGA and, and, you know, we'll go through
that, but I was like, that feels, I mean, that feels like I'm living in the future. You know,
you have these like programmable chips and all this cool stuff. And so, uh, and they're pretty
big, uh, FPGA users in most of their applications
there. And so it was, it was something that, you know, I knew whatever I wanted to do, I wanted to
do something related to FPGAs at the time. And it was just like, this, this is a good opportunity.
And, and it's a neat product because, you know, you need CT scanners, like CT scanners have
certainly made people's lives a lot better over the course of the last 40 years or so.
Yeah, absolutely.
That's really interesting.
I guess it was just the domain, kind of like you were saying, that you got exposed to FPGAs kind of early on in that experience.
Kind of talk about that fascination. I know just kind of from reading things you've written, looking at, you know, your social profiles and things like that, you're definitely enthusiastic
about FPGAs, which is something I share. I mean, they feel, as someone who kind of
started to educate myself around digital logic design, an FPGA feels pretty magical, right?
Because you can do it in your home office.
And so, yeah, tell me a little bit about like getting exposed to that and maybe some of that like fascination with it.
Yeah.
So I think, you know, like all good double E's, you go through a class, you know, that does digital logic.
And so, you know, maybe you make an ALU or something with like some PALs and GALs or like a bunch of discrete logic chips.
But realizing that you can drop a chip like an FPGA down on a board and, you know,
shoot some code into it after some, you know, magical software stuff happens.
Right.
And that thing behaves exactly like that chip.
I mean, it was just, I mean, it was just really fascinating. And, you know, you can kind of see like the older generation of engineering, you know, some of my mentors grew up in the era where like GALs and PALs, which were like early programmable devices, came out.
And so that was something, you know, you might get 10 gates in a chip or something.
And so you could totally customize that chip and then drop that customized chip down.
And it would, you know, be a few and gates in an or in an or gate or you know some flip-flops or that kind of thing and like the kind of power that
you get from being able to do that without having to drop down all this discrete logic and if you
mess up your, uh, you know, your Karnaugh map as you, you know, optimize all your logic away, or you forgot,
you know, that, uh, like De Morgan says, you know, like AND gates and OR gates can do different things in
active-low circuits, and, you know, all of those, all of those things. It's very expensive and painful to change those,
but like in an FPGA, you just change it and, you know, you compile again and you go.
And assuming you didn't do something really bad, like, you know, turn outputs to inputs and blow
up your pins and, you know, like it is still hardware and like, you can still break things. But, uh, the other, the other thing that I saw that was
super fascinating is the thing that you go out with on day one for a product, especially a long
life product. So you think, you know, a CT scanner is a, an investment for a hospital, uh, you know,
and they want to, they want to have long life and longevity on that thing. Uh, the manufacturer wants to sell
new features. And so the ability to go out with something where you, you have a set of features
and then in the future you can download more features into it. And, and it, it's like downloading
hardware like that, that is, that does feel magical because it's, it's stuff that didn't
exist on day one and you know, on day 400 or day 300 or whatever, you can come back in and say, oh, I'm going to just totally change this or add this totally new feature.
And the hardware actually, in some respects, changes.
I mean, it behaves differently than it did before.
I mean, obviously, the physical hardware is still, you know, exactly how you shipped it, but right.
But that, that was a big feature because you realize you can monetize that and you can provide bug fixes.
And so, you know, there's, there's a lot of tension between, um, with FPGAs and like FPGAs versus ASICs, right. There's a lot of tension there in the like cost and, you know, we can talk about some of the trade-offs there. But the trade-offs, um,
in some ways, like, it's very expensive to like tape out a chip, right? And then once you've
taped out that chip, uh, even on like yesteryear's process, it's still pretty expensive. And once
you've taped it out like that's all the thing does and so you you have to have all of the
functionality in up front you have to have everything tested, everything validated, anything that you messed up, you basically can't use,
or you have to find a workaround or, you know, whereas an FPGA, you have some flexibility there.
And so you can say, well, I didn't even know I wanted this feature two years ago, but now I'm
going to download this new feature in and or we have this totally unforeseen bug that occurs and we can work around that in an FPGA.
So we just change the FPGA. So those things are pretty powerful.
Yeah, absolutely. I think one of the things that's interesting, I feel like when I was first exposed to FPGAs,
one of the things that was maybe a misconception to me was like, oh, we're reprogramming these all the time, right?
And maybe we are, but they have a lot of value even if you never reprogram them. So it might just be economic value, right? It might be a life cycle of product development, or it could be like,
you know, a break glass in case of emergency kind of situation. And so, yeah. So I I'd love to learn a little bit more about
how FPGAs were being used, right? What, what context, why were they, um, the optimal or good
solution, uh, for what y'all were doing at GE? Maybe, maybe when you initially got introduced
to them. Uh, you know, I think, so, I mean, some of it, I think is the longevity there and the ability to, you know, change and morph the design.
I would say, like, we kind of went through a transition, you know, from when I first started or maybe even before I first started, you know, back in like, you know, 06, 07 into now where like the, it was hard to find processors that were like fast enough to do some of the things that we needed to do as well.
So if you need like hard real time stuff like data processing or you know,
I mean,
we can talk a little bit briefly about like a CT scanner is basically has an
x-ray tube and the x-ray tube shoots x-rays through the patient who's kind of
in the middle of the tube and it spins around
the patient. And, you know, apologies to any physicists, but like the, you know, the really
simple version is you basically take pictures at, you know, a bunch of different angles around the
patient and then put those all back together into a 3D model of the patient's body. And like the
detector can tell, you know, how much stuff the x-rays went
through given their attenuation and that kind of thing. Um, and so, uh, the way that reconstruction
is done is by, uh, it's a process called filtered back projection. And so when I first joined,
or when I was first there as an intern, um, we were, uh, the team was using FPGAs to do filtered
back projection because the data rates were so high.
And, you know, we're talking 850 megabit or, you know, and so like not, I mean, not high in today's
world, but high at the time. And so you're not like, and we didn't really have, I mean, you know,
I mean, gosh, in, in, you know, 2006, 2005, 2004, like what was your, what was your video card? Like a GeForce two,
maybe, or GeForce three, like the, like we didn't have this big GPU offload. And so in order to,
to make those, uh, to reconstruct those images in a reasonable amount of time,
you sort of need dedicated hardware. And so some people would go off and make ASICs,
but like the algorithms change. And, and, uh, that's a big spot where like, you want to be able to iterate on your algorithm and you want to be able to change things and you may find,
you know, enhancements. And so they used FPGAs for filtered back projection.
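For a sense of what back projection means computationally, here is a rough NumPy sketch of the unfiltered version of the idea described above; a real scanner filters each projection first (hence "filtered" back projection) and does it at data rates that pushed the work into FPGAs and later GPUs. The function, array shapes, and geometry here are illustrative assumptions, not GE's implementation.

```python
# Rough, hypothetical sketch of (unfiltered) back projection with NumPy.
import numpy as np

def back_project(sinogram, angles_deg, size):
    """sinogram: (n_angles, n_detectors) attenuation profiles;
    angles_deg: gantry angle for each profile;
    size: width/height of the reconstructed image in pixels."""
    image = np.zeros((size, size))
    # Pixel coordinates centered on the rotation axis.
    ys, xs = np.mgrid[0:size, 0:size] - (size - 1) / 2.0
    n_det = sinogram.shape[1]
    for profile, theta in zip(sinogram, np.deg2rad(angles_deg)):
        # Where each pixel projects onto the detector for this view.
        t = xs * np.cos(theta) + ys * np.sin(theta)
        idx = np.clip(np.round(t + (n_det - 1) / 2.0).astype(int), 0, n_det - 1)
        # "Smear" (back project) this view's profile across the image.
        image += profile[idx]
    return image / len(angles_deg)
```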
So there was a, you know, PCI-X board back in the PCI-X days that had an FPGA on there and the
computer would catch all the data and then run it through the FPGA to do filtered back projection. The same thing on the acquisition side. So up on the rotating side,
you have, you need to somehow like tell the system to take all those pictures and that becomes a
fairly real time thing. And you, you know, you have an encoder basically that tells you where,
what angle you're at as you spin around the patient and you need to somehow correlate that to, uh, your image snapshots
and make sure that your periods are all lined up. And, you know, depending on your algorithms,
like, uh, you want them to be fairly precise because if you're assuming the angles are perfect,
any imperfection in the actual sample of the data ends up turning into like poor images, right? So,
you know, if you're,
if you think you're going every, you know, every two degrees and you're really going every 2.1 degrees, like your image is going to be a little bit messed up when you reconstruct that. And so
there's some real-time applications there. And then again, like the data rates you're running,
the other thing is the, the CT scanner spins. And so like very early versions of ct scanners would spin once to like
get up to speed spin around the person once to act to acquire and then spin once slowly to stop
and they had a cable that basically like spooled up and unspooled when they did this and then they'd
reverse that process backwards right um it turns out like that has a lot of problems right and so and then you also have like a lot of spinning things are easy when you don't
have a patient in the center, but because you have a patient in the center, there's
no like coaxial place to like, you know, put like butt up to optics or anything.
So you have to somehow get all of this data from the rotating side to the stationary side.
And there's a bunch of like, you know, somewhat proprietary technology that most of the
CT scanner systems use these days. And we call that a slip ring, but basically you have to shoot
the data across a gap. And in order to get the data across a gap, you need to run a custom protocol.
And so, you know, we can talk about an 8b/10b protocol. So we have, you know, as you're shooting
data across what is effectively a capacitor, it's like an AC coupled link.
And so you need to limit the number of ones and zeros that you run in a row because otherwise you start to bias that capacitor and it stops acting like a short at high frequencies, right?
Because capacitors at high frequencies are shorts.
And so you're looking for like protocols that do some
kind of encoding. 8b/10b basically takes every eight-bit code word that you want to send,
converts it into 10 bits, and it has a positive disparity and a negative disparity so that,
and then the algorithm keeps track of whether you have too many ones or too many zeros going,
and will pick the opposite code word in order to balance the link out. And so with 8b/10b, if I remember right, you get like five,
you can have five bits of a one or five bits of a zero
before you're guaranteed to see a transition,
but that helps keep your link, you know, DC balanced
at the cost of a 20% efficiency hit.
So, right, so for every, like, eight bits I'm meant to send,
I'm actually having to send 10 bits, so I'm paying a 20% penalty in order to get this encoding.
And so, you know, as you look at other systems, um, like Ethernet, I think, uses 64b/66b, so
they're paying something like, yeah, what is that, 3% of a penalty for their encoding.
So it's a lot more efficient.
But the run-length limit is something like 80 bits instead of five.
And so that didn't work for the physical ring technology.
So you had FPGAs up on the rotating side in order to encode that data and shoot it across this link.
And FPGAs on the receiving side to decode that data and, you know,
split it all back out and, uh, you know, kind of all kinds of, of control. I mean, most of the
subsystems I think had FPGAs and, you know, they were used for various things. Um, we also did a
lot of soft-core processors there. And so, like, rather than buy a, you know, like 68000 microcontroller or, you know, buy a microcontroller, we would stick one in the FPGA.
And so we had a number of subsystems that would run on soft core processors.
And, you know, they didn't need to be super performant.
They're mostly like you kind of let the FPGA do the real time stuff.
And they just kind of monitor and report and do like setup and cleanup at the end.
And, you know, so there's, there it's not to minute, there's a lot of software there,
but it doesn't have to be, you know, you're okay on a 150 megahertz processor, you know,
and as long as it can talk ethernet or, you know, what have you. So.
Right. And so you mentioned, uh, the, uh, PCI-X, uh, board. So this was, y'all were not actually
designing the boards that the FPGAs were
on, you were just integrating them into your system. Is that right? Nope. That,
that was a custom board. And at the time, I'm not sure what nowadays you can go out to
an integrator and go find like a compute module, right? So you can get a PCIe card with an FPGA on
there to do general purpose compute.
I'm not sure what existed back in that time, but that was definitely custom hardware.
And I mean, so it was like that was a custom design.
And then, you know, you put that into a PC that it was, you know, mostly off the shelf.
And that would oftentimes those like those were, you know, they had a custom heatsink on there and everything.
So, you know, they're they're in there doing that as we progressed.
You know, I kind of alluded to this before.
Some of that stuff got eaten by GPUs because it's easier to just go buy a commodity GPU and do your back projection on a GPU based algorithm.
You know, you can use OpenCL or what have you and do that a little more efficiently.
So in later generations of the system, you see the FPGA is kind of moving out of the actual like back projection and the image processing flow and more, but staying and even increasing in the like the data acquisition and the data capture side. Right. And so on a later gen system, for example, we had an
FPGA that would catch all of the data from the slip ring and then turn that into TCP packets.
And so we had our own custom like TCP offload engine there where the we can buy a commodity
computer and the commodity computer can can connect with 10 gigabit Ethernet, one or more
10 gigabit Ethernet to this card,
open up a TCP stream, and then just catch all of the data basically flying in the system
and keep track of it and store it to disk and everything.
So that got us kind of out of some of our like, can we put our custom stuff inside this
other computer?
And you get into like warranty challenge, logistics challenges, and all kinds of stuff there right that's really interesting were you all using um fpgs pretty
much from all specific vendor or did y'all have a number of different vendors that you're we were
mostly uh an altera who's now intel but now going to become their own thing again. Right. Shop, you know, through over like GE does use a
bunch of them, but like my team specifically was mostly Altera just because we had a good working
relationship with them. And, you know, what, what we found is like, you know, over the course of my
career there, you notice, we were like cutting-edge technology, you know, early on, you know,
we were having a hard time getting, you know, 10 gigabit transceivers. And so like we were using external parts and a XAUI interface to go from an FPGA
that didn't have internal transceivers to get a 10 gig interface. Cause that was cutting edge at
the time. And then, you know, fast forward a few years and it's all the FPGAs have, you know,
you can buy an FPGA with 10 gig, like 10 gig, that's no problem. You know, what about 28 or, you know, 32 or so. So like kind of over the course of my career there, you see, we saw kind of move from
like the high end part families to often the more mid range part families, just because technology
and especially in the communication space, just really just keeps going up and up. And like,
at some point, you know, there's only so
much reasonable amount of data that you can like capture. And there's only so much you can ship
across a slip ring and that kind of thing. So you have some like other limits that don't allow you
to just like, continue chasing the like forever technology curve. Right. And you mentioned your
team there. What was the composition like of your team? And, you know, were y'all directly
interfacing with, you know, all the folks that, you know, were designing the mechanical parts of
the CT scanner as well, or what was the kind of team structure like there? Yeah, so we were
organized there, uh, on a hardware team. So I was on a team of about, uh, I mean it varied, you know,
somewhere in the 30 to 50 range but that included all of your like electrical engineers and all of your mechanical engineers.
So basically all of the like traditional, you know, like hard hardware engineering roles were all on the same team.
And we all reported to like one or more managers kind of in the same organization.
And then our embedded team, you know, had, you know, they were sometimes with us and sometimes not dependent, you know, like all these big companies like to reorg frequently to, you know, change things up a little bit.
But we worked, I mean, it's certainly the, those of us on the FPGA side worked very closely with our embedded software engineers because, you know, a lot of our interfaces and their interfaces are like, you know, tied together, you know, very, very closely. And so I can't just go change
registers without letting my software guy know, like, Hey, we're gonna, we need to make this
change. And like, we have to do it together so that, you know, we don't break the world. So.
Right. Absolutely. Were the soft cores that you mentioned that were, uh, on the
FPGAs, were those, uh, programmed by the embedded team, or was that more, things that,
um, you know, you mentioned it's more like reporting and that sort of thing, would y'all,
would your team be writing the software that kind of, uh, ran on the soft cores? The embedded team wrote most of the software on the microcontrollers. So we did,
uh, we had a number of ways to test our hardware, you know, some through using some, uh, APIs that
those guys provided for us. As well as, uh, you know, in our later gen products, we could get in there with, we had a little shim, basically, that would run, and we could get in there with a Python tool and, you know, like, exercise registers or do various testing. And so, you know, we would write a lot of that, because that's a lot of like, boring, low-level functionality that's mostly just a smoke test that, you know, everything
got done exactly right. But all of the application code was done by the embedded team.
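As a flavor of the kind of register smoke test being described, here is a minimal, hypothetical sketch; the register addresses, expected values, and the shim's read/write interface are all invented for illustration and are not the GE tooling.

```python
# Hypothetical register smoke test in the spirit of the Python shim described above.
EXPECTED_VERSION = 0x0102   # invented values for illustration
VERSION_REG = 0x0000
SCRATCH_REG = 0x0004

def smoke_test(board):
    """board is assumed to expose read(addr) and write(addr, value)."""
    # Is the FPGA alive and reporting the version we expect?
    assert board.read(VERSION_REG) == EXPECTED_VERSION, "unexpected FPGA version"
    # Walking-ones test on a scratch register: is every bit readable and writable?
    for bit in range(32):
        pattern = 1 << bit
        board.write(SCRATCH_REG, pattern)
        assert board.read(SCRATCH_REG) == pattern, f"bit {bit} stuck"
    print("register smoke test passed")
```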
Gotcha. And how was the, uh, what was the programming process like? So
you kind of like alluded to that with, you know, getting in there and being able to debug and that
sort of thing, but obviously with FPGAs you have to store a bitstream somewhere.
What would that look like in this architecture? Yeah, so, uh, so with FPGAs, you know, uh, most of
these designs, um, had onboard flash of some type. And so, you know, it varied. I mean, early
on, parallel NOR flash was kind of the like popular thing and, in fact, a lot of those early designs had a CPLD,
which is basically a little tiny FPGA that, um, that often has its own internal flash.
And so it would have a little like shim loader that would know how to go master the, um,
the NOR flash, the parallel NOR flash, and load bitstreams into the bigger FPGA.
Um, in later gen devices, I mean, parallel flash becomes kind of expensive in terms
of pins, because you know, you have a whole address bus and a whole data bus out there.
And you start burning up a lot of pins. And it actually isn't very fast either in, you know,
in the grand scheme of things. And so a lot of the later gen designs started using serial flash.
And so, but in the same way, it's serial NOR, often quad SPI. And the,
the processor inside, like the soft-core processor, or in some cases a hard-core processor
sitting outside, would be able to have access to getting to that part, either through FPGA logic,
like a SPI core in the FPGA, or it would have, you know, physical wires on the bus. So it could
reprogram that. And then we'd reboot the board
or tell it to reconfigure, and it would reconfigure itself out of Flash.
Gotcha, gotcha. That makes sense.
Yeah, that means anytime you want to,
like when you're debugging in a system,
anytime you want to try a new bitstream,
you have to burn a new bitstream and get it into the software load
or patch the software somehow,
and then run Flash download on the system and get it all patched in there and then reboot the system and bring it all back up so you can test it.
So, you know, it ends up being, especially at the system level,
kind of an expensive thing to do time-wise, because you have to like bring the whole system down and bring it back up every time you do it. So we would often have, um, I had a, like a CI environment, uh, with, uh, a copy of each of the boards from the system that were all connected up on the same network together. Um, but weren't,
I mean, obviously there's no like table to move up and down and there's no system to spin. And so,
you know, but with FPGAs, we could build fake things.
And so I could build a hardware module that looked like an encoder, right?
And so it could spin.
It could generate pulses.
And so I could drive my product design with this fake encoder in the FPGA by, you know, wiggling some bits and turning some things on that we didn't use in product.
But I could simulate that there on the bench.
And so we did a lot of that as well.
And kind of we got to, you know, on the later gen stuff, we had a lot of hardware in the loop testing.
So, you know,
we had CI that would run and build our FPGAs and it would auto deploy down to,
you know, our test bed and we'd run some tests against it. And, and, you know,
if you find something,
it's nice to find it there before you hand it off to the software team so that
they don't have to come back to you and tell you, Hey,
you messed something up or like it doesn't work or,
you know, that register you told me you put in here, I can't read it, you know, things like that.
So, right. Yeah. That's one of the things that I noticed kind of, uh, when, when looking at some
of the things you've written and some of the things you've worked on, it feels like, um, you
know, me coming from a bit of a software background, it feels like you've kind of had this
pattern of applying, uh, software development principles into hardware, or, you know, if you want to call FPGAs hardware
as well, into the FPGA development lifecycle as well. Was that something that, you know, you were exposed to?
Or is it more just like, hey, these are obviously good principles that we should apply here?
I mean, I think it really was something I just saw. There's all this capability. And like, I mean, I don't know how many billions of dollars are being spent on like software developer productivity, right? Like these tools and CI and, you know, all this stuff. And you're looking at it, you know, I mean, it's interesting, the whole electrical engineering world, I think, has gone through kind of a transition over the past, you know, 20 or so years, where a lot of times it was, you know, originally it was very like hardware-oriented people.
And now like you need hardware and software to do almost anything. And so like being able to steal some of the like good work there, you know, whether it's like, let's have, you know, our builds automated.
I mean, I remember, you know, when I first started, you'd get to a spot in the day where it's like, well, if I don't push build now, then I'm not done until, you know, like 6:30 PM.
And I would like to go home and eat dinner.
And so, I mean, we had engineers who like were driving home with their laptop
on the front seat of their car
while they're doing a build,
hoping that you don't get a license pull
during that portion of the build
where you don't have internet connectivity
and stuff like that.
And it was like, you know, this feels a little silly
and like, can't we do a little better? And so that was one of the things where,
you know, we look at, look at the software, like software teams have, I mean, computers are pretty
cheap. Let's go buy some computers and figure out how to automate these builds. And so when you
check in the source control, like a build should just kick off. And, um, and then, you know, if
it's, if it's a build for an hour, like that's no big deal. If it's a build for five hours, like that's a lot bigger deal.
And so, you know, one of the designs I was working on near the end of my time there had a five and a half or six hour build, which in the grand scheme of FPGAs, like, I feel like that's like, that's a middling build time.
I mean, you can have much, much worse in much bigger designs, but that was a 900,000 LE design.
So it was a big, big design.
And, um, that's one where you realize like I can make a change in the morning and be able to test
it in the afternoon. And if I'm, if I'm really lucky, I can get a fix in, in the afternoon and
get another build before I have to go home. And, and then your, your computer is just sitting
there all day, just like, you know, chewing on the thing. And so like being able to offload that
to a server somewhere and, you know, even if it's just in the basement or in the electrical lab or
whatever, and let it go sit there and chew on it while I can go do other things is super powerful.
And, and then you don't have this like fear, like, well, if I don't, you know, if I don't get a build
going by, you know, 1:15, then like, there's no way, you know, I'm making it to pick my kid up from school, or, you know, whatever. You know, especially over the last 10 or so years, there are just things that we need to do.
But, you know, when you look at the EDA tools and that kind of thing, they're really not structured to make life easy that way.
And so it's just like kind of different than like a cloud native software development where, you know, like, oh, no, this is just like I have a GitHub action and it just like does the thing.
It's like it's not so much like that. And you look at, I mean, you know, as you well know, like a Vivado install or a Quartus install is like, I don't know, 40 gigabytes, 60 gigabytes.
So like, that's not something that, you know, you can just like download every day and run.
And so, you know, you need to think about like, how do we set this up in a way that makes our
developers productive? And, but it's been fun to be able to look and see all of the different
things that the software teams are doing to, you know, get better test and get better, better build times and that kind of thing and try to apply those into the hardware workflow and, you know, fight with the tools a little bit to do that.
But that's an area where I'm, I'm super interested in, like, I kind of lean, you know, I'm a hardware guy that leans towards software.
And so I like the software stuff.
And so, you know, like the flip side is there are a lot of software people kind of lean toward hardware and like
FPGAs are a cool place to play too. So we kind of like live in this like Goldilocks zone, I think.
Yeah, absolutely. That's a, it, it brings up a lot of, um, ideas that, um, I won't go too deep
on this, but in my, my day job, I work for a company that makes a lot of firmware. And we've started using GitHub Actions self-hosted runners. And so we all have like Raspberry Pis.
It's a remote company. And then, you know, a bunch of dev boards plugged into them. And
it is really magical to see kind of services like that that make it really easy to have your own
self-hosted runners, whether it, you know, is at your house or at your place of work or whatever, because you get that same
configuration and workflow for both the software and the hardware. And obviously,
we're not having to deal with... Well, we do have to deal with some very onerous
microcontroller programming frameworks and tools, but not quite to the degree of working with FPGAs. But I'd love to learn a
little bit more about, and maybe also like describing for the audience to talk about
maybe what that time, you know, that five to six hours is made up of. Because, you know,
when you're working with FPGAs, it's a little bit different than compiling software, right?
There's a number of steps you go through, and there's things you can do earlier in that process
to maybe catch
something, you know, before you've invested five to six hours. Um, so what, what were some of the
methodologies that y'all used and what, what tooling as well? I mean, so, uh, I mean, just
in general on the process, you know, you figure like an FPGA build kind of goes through, you know,
it depends, but I mean, call it three or four phases basically. And, you know, you can carve those up into like different ways if you want, but you essentially
have to take the design as written.
So you have some kind of RTL design that you've written in Verilog or VHDL or some other HDL.
And you have to, you have to compile that and run it through, you know, syntax analysis,
like they call it analysis and
elaboration in some tools. And basically, you have to go through that and make sure no typos,
no like, you know, stupid problems, you know, no output pins, driving inputs, and you know,
like things, things like that, make sure all your blocks are wired up, everything, you know,
is a okay. And so that that takes, you know, a reasonable amount of time, I feel like, you know,
like the tools aren't super fast, but, you know, you're talking, you know, some number of minutes, you know,
I don't know, like maybe 15 minutes for a very large design, and not so bad, um, for stuff
underneath that. And so like if the syntax all checks out and everything, you know, no gross
errors there, then it moves on into synthesis. And so that, that takes your, your description and turns
it into an effectively like a logic map that is like, this is the logic function that you're
creating with all of this stuff. Right. And so that's where, you know, if you remember doing
truth tables back in, you know, EE, you know, whatever, 200 or whatever, where you do your
Karnaugh maps and your logic optimizations and all of that, the tool is doing all of that for you.
So it's, it's taking your basic design, doing some, some synthesis optimization and trying
to come out with like, you know, here's the number of, of things that you need and here's
how all this stuff is connected.
And then, uh, then you go into, uh, like a place and route.
And so, you know, you, you drop those blocks down and like those things have
to map into the FPGA that you're, you know, the, the FPGA is kind of like a big array of logic.
And so you have to map that logic, synthesize logic design into the technology that you have.
And like, there are a few different things that go into that. You have, you know, a certain number
of flip-flops and lookup tables and that kind of thing where all of this stuff has to, um, has to get, uh, mapped into, but, uh, you also
have then the interconnect between each one of those, you know, a lot of these, a lot of the,
uh, the vendors will have something like a CLB or an ALM, which is some collection of like,
it's like a lookup table and a couple of flip flops and maybe some like clock
routing and some stuff. And then all of those share routing amongst, you know, between them.
So you have to, you have to both lay this whole logic design down on that array and get everything
wired up. And then you have to say, okay, so my clock happens every 10 nanoseconds or 15 nanoseconds
or whatever. Can all the signals make it between all of the
different places in the right amount of time? And if not, then we're going to pick up some of these
blocks and, like, move them around and drop them back down and try again. And so there's kind of
this, you know, uh, there's like a simulated annealing process that happens there, because
you don't want to get stuck. Like, they kind of throw it all down sort of randomly-ish and then, like, try to
optimize.
Uh, and they don't want to, you don't want to get stuck in a local minimum where you're
like, Oh, everything I do is like a little worse.
You want to make sure you can like toss a bigger change in there.
Cause you might be able to find a much, much better fit.
And so, you know, all the tools have different algorithms for doing
this.
And there are, um, some strategies for, you know, making that smoother.
And then, you know, the way I like to think of it is like each, each of these phases has like so many credits to spend.
And at a certain point, the tool like runs out of credits and stops.
Right.
And so it's like, it has to move on.
And so it's like, I've gotten enough, or I can't get any more, and, you know, it moves on.
And so then, you know, after you get there, so now you have like a design, it's all mapped down onto your chip.
Everything is kind of put together.
Then you got to go build the bit streams and, you know, all of that stuff.
So that all happens kind of at the end.
And a lot of the tools then will tell you like, hey, you know, all those timing constraints that you put in upfront to say, you know, I have to make all this logic work, yep, you passed, or, uh, you know, sometimes you did not pass and I tried really hard, but sorry. And, uh, so those things happen.
Um, and then, you know, like, so all of that basically is dependent on the design, the amount
of logic you're inferring, right?
So when you build all this stuff, you build a small design, a few thousand LEs, it takes a few minutes.
You build something that's 100,000 LEs, it takes more minutes.
And you build something that's a million LEs, it's hours.
But those phases happen for each of them, and they kind of, you know, each part scales, I would say, with the design. But the analysis and, uh, synthesis often happen much faster than kind of the rest of it. So getting
the rest of it all laid down is kind of a, like, it's a multivariate optimization problem that I
think is really challenging, and, you know, there's a lot of research going on to make that better.
Uh, so that all happens, and, and like your fans are running on your laptop and, you know, whatever, if you're doing all of that.
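To make those phases concrete, a build script for an Altera/Intel flow might look roughly like the sketch below, assuming the Quartus command-line executables are on the PATH; the project name and the fail-fast structure are illustrative, not a description of the actual GE scripts.

```python
# Rough sketch of scripting the build phases described above with the Quartus
# command-line tools; the project/revision name is hypothetical.
import subprocess
import sys

PROJECT = "slip_ring_top"

PHASES = [
    ["quartus_map", PROJECT],   # analysis, elaboration, and synthesis
    ["quartus_fit", PROJECT],   # place and route (the "fitter")
    ["quartus_asm", PROJECT],   # assemble the bitstream
    ["quartus_sta", PROJECT],   # static timing analysis: did we make timing?
]

def build():
    for cmd in PHASES:
        print("running:", " ".join(cmd))
        result = subprocess.run(cmd)
        if result.returncode != 0:
            # Fail fast: there is no point fitting a design that didn't elaborate.
            sys.exit(f"{cmd[0]} failed with exit code {result.returncode}")

if __name__ == "__main__":
    build()
```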
So the way we would do our workflow there,
we were using Jenkins to do our builds.
So we had a server that had our Quartus versions.
Quartus is the, like, software,
like Vivado,
from Intel.
And it had them installed and it would,
you know,
monitor the source changes.
And so when you push a change of source control, then it would go pull down the design and rebuild.
There are things you can do.
A lot of the, because it's so expensive, there's incremental compilation flows where you try to save the results from your last build and only like only change the like little bit of things that you change. We didn't do a lot of that, both because it was hard.
It was hard to figure out how to like on a,
like on a cloud of multiple machines,
how to save the artifacts in an appropriate way so that they could be
recovered.
And then you,
we,
we also liked reproducible builds.
And so what that means is every time you,
you push the button,
go,
you get the same binary coming back out.
And with incremental builds,
that's not true, because, like, some portion of that binary's history is your last however many builds.
And, um, you know, it can save you some time and it's useful for certain things, but we liked
to do clean builds from source control. So, if I built the same design with the same version
twice, I would get the same, you know, like, binary checksum
coming out. So that's what we liked to see there.
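Checking that kind of reproducibility can be as simple as hashing the bitstream from two clean builds of the same source; a minimal sketch, with hypothetical output paths:

```python
# Minimal sketch of a reproducible-build check: two clean builds from the same
# source should produce byte-identical bitstreams. The file paths are hypothetical.
import hashlib

def sha256_of(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

a = sha256_of("build_a/output_files/top.sof")
b = sha256_of("build_b/output_files/top.sof")
print("reproducible" if a == b else "builds differ")
```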
Um, let's see, what else did I miss? Well, one of the questions I had was, uh, so you mentioned that the tooling, right, kind
of, like, runs out of credits at some point, but you also mentioned this binary reproducibility.
Are builds always deterministic from the same source with the tooling you were using?
And is that true across all tooling for synthesis and place and route?
So I'm not sure I can answer that generally.
I don't know that that's true.
I know that for us on a clean checkout, we would get the same fit and route, assuming that you didn't change the seed. And so there was some algorithmic,
uh, determinant, like deterministic ness to, to how this worked. And so, uh, you,
some, some places, uh, I know do seed sprays too. And so you can do, uh, so like, say you,
you miss miss timing means like my timing constraint says my you know
i have to meet my setup and hold times on a flip-flop and so i have you know so much budget
to do that and i miss by like a tiny tiny bit there are certain kinds of designs where you're
really pushing the envelope of the fabric and so you know that you're going to see failures
occasionally. And so, uh, they'll do, like, you know, a seed spray technique
where they'll kick off multiple builds at the same time, but with different, uh, starting seeds.
And, uh, and then that will give you different fits for each one of the, for, you know, even
though it's the same source, because you've manually altered like some of the starting
conditions with the seed, it basically alters how it drops everything down originally.
And so that'll get you a different fit.
And you can find designs where,
you know,
you can characterize that you'll reliably meet timing,
you know,
some portion of your builds and that might be okay for a certain application.
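As a rough sketch of the seed spray idea: kick off several fits of the same source with different starting seeds and see which runs close timing. The --seed option and project name below are assumptions about how the Quartus fitter is typically driven, not a drop-in script.

```python
# Rough sketch of a "seed spray": same source, several fitter seeds.
import subprocess

PROJECT = "slip_ring_top"   # hypothetical project name
SEEDS = [1, 7, 23, 42]

for seed in SEEDS:
    # Re-run place and route with a different starting seed.
    subprocess.run(["quartus_fit", PROJECT, f"--seed={seed}"], check=True)
    # Timing analysis writes a report; in practice you would parse it to decide
    # whether this particular seed met timing.
    subprocess.run(["quartus_sta", PROJECT], check=True)
    print(f"finished fit with seed {seed}; check the timing report")
```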
So I think you see that in like compute type applications a lot where you have
an algorithm that's very like tight to fit or difficult to, to work with. You'll see that kind of thing where
they just need one that works and it's okay that it, you know, it takes three or four tries to get
there. Um, for the stuff that we were doing, uh, generally we didn't live like that. Uh,
we were more apt to change the design to make it easier to fit on the first try, every try.
And so that was kind of where we would sit.
But we had the luxury of having fairly decent-sized parts with some extra room, and we weren't running right up to the hairy limit of what the parts could do.
And so then adding a pipeline stage or doing something just makes your life easier, and so we'd be willing to do that generally. Gotcha. So it is very nice that y'all had, not unlimited logic elements,
but you know, you did have some headroom there. Were there cases that you either needed to
optimize a design to fit within the constraints? I guess whether it be logic elements or timing
or something like that.
And what was kind of your strategy and approach?
So, I mean, when I failed timing, the timing tools are pretty good, especially in your top-tier tools.
So, you know, you look at Xilinx or Intel's or, I guess, AMD and Intel now.
You look at their tools.
They're really good at telling you, you like where your critical path is. And so you can often go and just look at your design and say, okay, like, yeah, I have a lot of logic going on in this critical path. And maybe I can pipeline this
better or change something or, you know, add a register stage. One of the tricks though, is like
going back to the, like, the concept of it having credits and giving up.
Sometimes the critical path that fails is a critical path that fails, but it's not the one that it was working the hardest on.
Right. And so like some other thing might be causing like, you know, it's like squeezing a balloon.
Right. Where, you know, you get like this other bulge just like pops out somewhere else and um
and so like those are a little bit trickier problems to solve because uh you don't exactly
know what's going on there um but you know you can look at some of the fitter reports and that
kind of thing and try to figure out where it is, uh, where it's spending its time, and decide, maybe, oh,
maybe I have like something that's not very
optimized over here. And it's spending way too much time trying to get that to fit. And because
it worked so hard over there, it just gave up on this, this thing it should have been able to do
easily. So that that's an area where, you know, it just, it requires some play. And, you know,
there are tricks that you can play there. I mean, oftentimes, uh, you use, uh, design partitioning and stuff. So you can go like build certain portions of the design and
lock those down and then see if, you know, like, does the, does the bulge in the balloon come out
somewhere else? And so, you know, you can kind of like, you know, characterize with it. And then,
and then if you're really having trouble, oftentimes you can get help from your FAE
and, you know, they'll take a look at your design and, you know, they can run some metrics on it and try to figure out like, here, try this,
or, you know, they, they get the experience of meeting lots of customers and seeing lots of
designs. And so they have, you know, try this optimization or try this kind of thing on the
logic side. Um, you know, one of the, uh, one design I worked on, so the, uh, we had a CPLD
that was loading the FPGAs and the CPLDs are pretty
small. So, you know, they're 1200 LE kind of parts. So there, you know, it's a very fast build,
but there's only so much space. And on, on some of those designs, um, we were, we were struggling
to like, we wanted to have a core, like a core FPGA or CPLD design that would run on all of our boards
and do the loading for everybody. Um, and, but, you know, there are a few different configure,
you know, as you, you go through the different designs, you know, somebody used active low
resets and somebody did, you know, like the board is like a little tiny bit different over here.
And so you have all of these things that you have to like, you know, you have all these knobs that
you have to turn. And, you know, we got into a spot there where, you know, it really actually mattered. Like we were running out of flip flops. And so
what you end up doing in those cases is, you know, you start thinking like, okay,
do all my counters need to be 32 bits? What if I can right-size all my counters? So you make a,
you make a function that then takes the parameters from the user and shrinks
everything down the way you exactly want. And so like, there are some games you can play there.
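The right-size-your-counters trick comes down to deriving the flip-flop count from the largest value the counter actually needs to hold, the usual clog2-on-a-parameter idea in RTL; a quick sketch of the arithmetic:

```python
# Quick sketch of the "right-size your counters" arithmetic.
def counter_width(max_count):
    """Bits needed to represent values 0..max_count."""
    return max(1, max_count.bit_length())

print(counter_width(1000))        # a timeout that counts to 1000 needs 10 bits, not 32
print(counter_width(2**32 - 1))   # 32 bits only when you really need them
```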
And, you know, can I run certain things? You know, can I share hardware in a certain place or,
you know, like those kinds of things. And so like, but it's all very design dependent. I think
there it's, it's really hard to have kind of a one size fits all optimization story. And I think that's true, you know, in software too, but you like, it's one of those
things you have to measure it, right? Cause you don't know where sometimes you don't know where
the problem is. Right. And you can look and depending on how the design is being built and
depending on how much IP there is, you might find places where maybe you wired something up to a pin but you know that pin is
strapped to, uh, like, a logic high level, right, but the design doesn't know that. And so the
synthesizer and optimizer don't know that. You could potentially stick that to a zero or a one
in your RTL and watch a whole bunch of things just optimize out, because now it can make choices
about things that it knows.
And so those are like, you know, some areas that you can look at in there to see what's going on and how to save a little space. Yeah, that makes a lot of sense. We also talked a little bit
in our meeting yesterday about the use of kind of like vendor IP, right?
So there's, you know,
obviously you could write all of your RTL,
but there's also, you know,
maybe soft cores are a good example
that you can take off the shelf.
I'm curious how much you all use soft cores
and also just like maybe generally taking that
to talk about modularity
and being able to have reusable components
across these various devices.
Yeah. So, I mean, I think some of it depends on your, like your corporate strategy for what you
want to do around IP. Sometimes, I mean, like we'll talk about Serial RapidIO possibly,
and like Serial RapidIO is a moderately complicated protocol. And there's a kind of, you know, I mean, you have, say, 10,000 LEs worth of stuff in there.
It's nice to be able to go grab one off the shelf and like drop it into your design and have a data sheet that says this is how it works and, and use it.
And like, that's, that's pretty good.
And so it, and so long as it works the way you want and the way it's advertised, like that's usually fine. You know, there are lots of things like, I mean, you can find IP for just about anything. I mean, the vendors do a pretty good job of providing some base IP for stuff that you need. So you have UARTs and interrupt blocks and that kind of thing. Uh, and then, I mean, beyond that, you can go out
and buy IP. You want a TCP offload engine, you can buy that. You want an NVMe controller, you can go buy that. You want a CAN core, you can go buy that. There are lots of these things. What I have found is that there are reasons to do that, and those can be economic reasons: sometimes it just makes sense to pay the money, go buy a known thing, and get support for it.
But you just have to understand that you are integrating somebody else's stuff, especially when it's an encrypted IP core where you don't get to see the source. When there's a problem, your third-party support person is going to be involved in getting you a solution, and you're kind of at their mercy for what can happen. Sometimes that works out great, and sometimes that doesn't work out so great. So generally, philosophically, I like to design as much of the IP as I can, but there are certainly places where you need to use IP, especially if you need hardware resources. A lot of the FPGAs now have PCIe blocks and stuff inside there, and there's no way to just magically get access to them.
You have to go through some of the IP wizards. But for stuff like flip-flops and FIFOs, I would prefer to infer them if I can, so I would like to write RTL of some sort that actually does the correct thing and gets the implementation that you want. Dual-port RAMs are a good example of that. I don't like using vendor wizards for dual-port RAMs if I don't have to, because they're annoying to simulate; oftentimes you have to have encrypted simulation models and all of this stuff. If you can find a way to write your own IP in your RTL and get it to infer the correct structure, so you're actually using RAM when you think you're using RAM and not a billion registers instead of the RAM that you intended (which can happen, you have to be careful about that), that's certainly the preferred path,
I think for me. And so I try to limit the IP to places where it really
makes business sense. And then if you have to do that, you also have to realize that when you hitch your wagon to that, you're sort of locking yourself into that IP, unless you take steps to make sure that you have a clean interface break, right? And interfaces are all hard,
but if you expect to be portable, say you're on an Intel chip today and you'd like to go to a Xilinx chip tomorrow, you really need to think about how you architect your design so that where you use IP is properly separated and you have nice, clean, somewhat standard interfaces to get in and out of that stuff, so that it's easier to move your design to a different device or a different technology. One of the things we found is that even within the same vendor, among different device families, you can have variances in the IP that they generate. It's maybe like the stereo in your car, where the models come out in slightly different years: you look at an older version and it still has the old model, and a newer version has the new model, and they're mostly the same. But how do you build modules that can use either one without really caring? So we built wrapper modules, basically a shim that had the same interface to the designs, and then you'd swap in a different shim for a different core. That way the rest of the design can all look the same, and you can simulate and do all the things that way.
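To make that shim idea concrete in software terms, here is a hypothetical Rust sketch: the trait plays the role of the common interface the rest of the design codes against, and each impl plays the role of a wrapper around one vendor's generated core. The names and behavior are invented for illustration, not taken from the actual designs.

```rust
/// The stable interface the rest of the design (and its testbenches) uses.
trait DualPortRam {
    fn write(&mut self, addr: usize, data: u32);
    fn read(&self, addr: usize) -> u32;
}

/// Shim around "vendor A"'s generated RAM core (hypothetical).
struct VendorACore { mem: Vec<u32> }

/// Shim around "vendor B"'s core, whose native interface might differ.
struct VendorBCore { mem: Vec<u32> }

impl DualPortRam for VendorACore {
    fn write(&mut self, addr: usize, data: u32) { self.mem[addr] = data; }
    fn read(&self, addr: usize) -> u32 { self.mem[addr] }
}

impl DualPortRam for VendorBCore {
    // A real shim would adapt differing port names, handshakes, or latencies
    // here; the point is that those differences stay inside the wrapper.
    fn write(&mut self, addr: usize, data: u32) { self.mem[addr] = data; }
    fn read(&self, addr: usize) -> u32 { self.mem[addr] }
}

/// Everything downstream takes the trait, so swapping device families is a
/// one-line change where the core is instantiated.
fn exercise(ram: &mut dyn DualPortRam) {
    ram.write(0, 0xDEAD_BEEF);
    assert_eq!(ram.read(0), 0xDEAD_BEEF);
}

fn main() {
    exercise(&mut VendorACore { mem: vec![0; 16] });
    exercise(&mut VendorBCore { mem: vec![0; 16] });
}
```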
Right, that makes a lot of sense. For the Serial RapidIO, can you talk a little bit about what that is, why y'all were using it, and what the alternative to using it would have been?
Oh yeah. So Serial RapidIO, I think, is a very interesting protocol; we were really happy with it. It's a SerDes-based protocol, so it's high speed. We were sitting on Gen 1.3 switches, so that was 2.5 gigabit. I think Gen 1.3 maybe also did 3.125 or something, but we did 2.5 gigabit, and we used it as our general networking.
So essentially, between subsystems we would use a Serial RapidIO link. Every subsystem's entrance onto the system, so to speak, was that you have a Serial RapidIO link and you connect up to a Serial RapidIO switch. From a networking topology standpoint, it looks a lot like Ethernet: you have a star topology, basically, where all of your nodes connect to switches, the switches are all crossbar switches, and you can do all kinds of fancy stuff.
that's in contrast to something like a PCIe,
which is also transceiver
based but PCIe and PCIe does have switches and things too but when you look at a PCIe typology
there's typically like a host controller and uh that's kind of like managing the whole system
and um when you think about um the way that the system was architected. We really wanted more of a like peer to peer, uh, networking
architecture. So PCIe, uh, didn't feel great for that. Um, and then, uh, you look then ethernet,
I mean, it's the obvious one. Cause like either it's way cheaper and everybody uses that, right?
Right. The big thing that drove us to Serial RapidIO is that we actually wanted transceivers, and we wanted transceivers because the way that system worked was we had a synchronized clock running in each of the FPGAs on all of the subsystems, so we had a concept of global time. Once a node joined the network and got set up and synchronized to the global time, its counter was frequency-locked to all of the other boards in the system, so you could schedule events and say, at system time 42, do this. Some subsystems were mechatronic assemblies, so things had to move; if they know they have to be in position by system time 42, then they'd better start their move at system time 27 so they're ready to go at 42. And you can synchronize all of your nodes that way.
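A hedged sketch of that "be in position by system time 42" arithmetic in Rust; the tick rate, type names, and numbers here are assumptions chosen only to make the scheduling math concrete.

```rust
/// System-time ticks. In the system described, every board's counter is
/// frequency-locked to the recovered RX clock, so tick N means (to within a
/// couple of ticks) the same instant on every subsystem.
type SysTime = u64;

/// Assumed tick rate for illustration: 125 MHz, i.e. 8 ns per tick.
const TICK_HZ: u64 = 125_000_000;

/// If an operation takes `duration_ns` nanoseconds and must be complete by
/// `deadline`, this is the latest tick at which it can be started.
fn latest_start(deadline: SysTime, duration_ns: u64) -> SysTime {
    let duration_ticks = (duration_ns * TICK_HZ).div_ceil(1_000_000_000);
    deadline.saturating_sub(duration_ticks)
}

fn main() {
    // "Be in position by system time 42": an 80 ns move needs 10 ticks,
    // so it has to start no later than tick 32.
    assert_eq!(latest_start(42, 80), 32);
}
```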
When we were looking at building that architecture, Ethernet didn't have a great story around that. Time-synchronous Ethernet is a thing now; it was kind of coming onto the scene at the time.
But with all of the transceivers, you get an RX clock coming out of the CDR (CDR is clock data recovery), so you basically get a frequency-locked clock recovered from the bits coming in on the receive port; you use the bit stream as your clock. Because all the transceivers have that intrinsically, it's just a matter of pulling that RX clock out, running it into a counter, and then doing a little bit of synchronization when a node joins the network to help get it to about the right time. A couple of clock ticks, at 125 megahertz maybe, or 62.5 megahertz, doesn't really matter; you're talking a few nanoseconds here or there. But once you get them within a couple of clock ticks, then any board in the system has a concept of global time, and you can schedule future events or do some safety-critical stuff that way, because you have some concept of time. And if you send packets with that time embedded in there, then you can figure out, oh, this node is behind, or this node is ahead. There's a lot of magic that you can do.
And then, in combination with an FPGA, the benefit that you have is that I can build FPGA hardware blocks that can multiplex into that Serial RapidIO stream and, on the receiving side, demultiplex right out of that receive stream without the upstack software having to see or do anything with it. So I can get hard-real-time FPGA-to-FPGA communication that's multiplexed in band; it's just a different packet format that goes through the wire. And then Serial RapidIO has this concept of message passing, a whole standard for how you send messages. We built a core that did that, and it does all the retransmission and everything, and our embedded software team would run Ethernet frames over that. So you get TCP and UDP, and they get to live in a world that makes sense to them: they can talk sockets between things and all the standard stuff. But we get this magic FPGA-to-FPGA side-channel communication right through the same interface, which was pretty cool.
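A hedged Rust sketch of that in-band multiplexing idea. The enum variants and field names are invented for illustration (real Serial RapidIO has its own packet formats); the point is that two kinds of traffic share one link and get peeled apart at the receiver.

```rust
/// Two kinds of traffic sharing the same serial link (names made up here).
enum SrioPayload {
    /// Hard real-time FPGA-to-FPGA traffic, peeled off in logic and never
    /// seen by the host software stack.
    HardwareChannel { channel: u8, data: Vec<u8> },
    /// An encapsulated Ethernet frame, handed up so the embedded software
    /// can keep speaking TCP/UDP over ordinary sockets.
    EthernetFrame(Vec<u8>),
}

fn demux(packet: SrioPayload) {
    match packet {
        SrioPayload::HardwareChannel { channel, data } => {
            // Route straight to the hardware block listening on this channel.
            println!("hw channel {channel}: {} bytes", data.len());
        }
        SrioPayload::EthernetFrame(frame) => {
            // Hand the frame to the network stack, which never needs to know
            // it travelled over Serial RapidIO.
            println!("ethernet frame: {} bytes", frame.len());
        }
    }
}

fn main() {
    demux(SrioPayload::HardwareChannel { channel: 3, data: vec![0; 16] });
    demux(SrioPayload::EthernetFrame(vec![0; 64]));
}
```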
And the final benefit, which is way different than Ethernet, is that Serial RapidIO has guaranteed packet delivery between two nodes. Every packet that goes out on a link gets acknowledged; there's a hardware handshake that occurs. If there's a bit error that messes the packet up, or the packet disappears or whatever, the two link partners go through a recovery process and it gets retransmitted. That's all hardware-layer, physical-layer link retry, basically. It's super nice, because once you ship a packet out, it's either getting to the destination, or maybe the destination dropped off the network and it'll never get there, but at that point it kind of doesn't matter anyway. There's no way for the switches to really drop packets unless you tell them to, so you get this really robust communication framework that you don't actually see. If I fired off UDP packets, which is pretty easy to do with an FPGA, that UDP packet can get dropped basically anywhere along the path, because that's just how UDP works. So that guaranteed delivery was cool.
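As a loose illustration of that acknowledge-and-retry idea, here is a toy Rust model; this is not the actual Serial RapidIO handshake, and the structure and names are invented.

```rust
/// Toy model of a link that acknowledges every packet and retransmits on a
/// bit error. In the real protocol this happens in hardware at the physical
/// link layer, invisibly to the software above it.
struct Link { delivered: u32 }

impl Link {
    /// Returns true if the link partner acknowledged this attempt.
    fn try_send(&mut self, _payload: &[u8], corrupted: bool) -> bool {
        !corrupted // a corrupted packet gets no acknowledgement
    }

    /// Keep retrying until the partner acks; the caller never sees the retry.
    fn send_reliable(&mut self, payload: &[u8], mut corrupt_first_attempt: bool) {
        while !self.try_send(payload, corrupt_first_attempt) {
            corrupt_first_attempt = false; // recovery, then retransmit
        }
        self.delivered += 1;
    }
}

fn main() {
    let mut link = Link { delivered: 0 };
    link.send_reliable(b"telemetry", true); // first attempt hit a bit error
    assert_eq!(link.delivered, 1);
}
```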
that's cool i would say one of the things we learned doing that is that when you're running
ethernet frames on top ethernet does have some assumptions about like congestion and like and
when like a node drops off the network and everything like backs up you can actually like
deadlock your whole network if you're not careful.
And so you have to have a way to like figure out that nodes disappeared and
like drain FIFOs and stuff. Cause otherwise you can back stuff up.
So that, you know, those, those are like exciting various, you know,
things that we discovered there, but, but generally it's 30s based,
very reliable. It was pretty easy to build RTL transmitters for that.
And we could send, you know, packets and that kind of thing. So it was a really build RTL transmitters for that. And we could send,
you know, packets and that kind of thing. So it was a really good fit, I think, for that
application. I believe that it's used in a lot of base stations as well. So you can find some,
you can find some DSPs and some stuff with some rapid IO on there. So.
Interesting. Yeah. I was going to ask how it, how it impacted the software, but the fact that you can run Ethernet over it is definitely a huge benefit there.
And I imagine that one reason why it may not be more widely adopted is you all have the benefit of having kind of like control of the whole system, right?
And that massively increases the value there, it sounds like.
Yes, it sure does. We own the whole network, and the only things that plug into that network are our own subsystems, so we don't have to interop with other things. It's also more expensive than Ethernet, so if you just need Ethernet, just use Ethernet; that's the right answer. But in this application it was neat, and because we had FPGA-to-FPGA communication, we could effectively download new wires. One subsystem and another subsystem that never talked before could get an update, and now they can have some kind of virtual communications channel between them. In a previous-generation system, if that didn't work, you either had to have software shuffle it through a bunch of different bridges, or you'd have to add a physical set of wires to the system as part of an upgrade to allow these two subsystems that didn't used to need to talk, and now want to. With everybody being on Serial RapidIO, so long as everyone has the LEs available to build new catchers or new transmitters, you have a lot of flexibility.
So it was a really neat protocol. And because we were able to abstract a lot of that into a common core platform, the users of it really only had to deal with the packets that they wanted to deal with. The transmitters were all pretty baked, so you'd put in a few different things and could pretty easily come up to speed. You get a lot of stuff for free, basically: the packet is going to show up on the network and get caught by the person who wants to get it, with almost no work. You get to cookie-cutter it down. So that was pretty fun.
That kind of leads to one of the last things I wanted to ask about. Working on these CT scanners, once the systems are out the door and in use, how are y'all updating them? Did y'all do things like over-the-air updates, and what was the frequency at which firmware or bit streams were being updated?
Oh yeah. It varied a lot. Like most big capital purchases, you often have a maintenance contract where you're getting updates as software patches come out, or in some cases you'll have bugs or things where you have to update everything, and so that stuff would go out. And that's the whole system software as a general thing, including bit streams and all the embedded code.
Update frequency probably depended on the program. A whole software package for a medical device is a huge undertaking; you have a lot of validation and verification that has to occur to make that happen. Probably somewhere around two or three a year would be a pretty rapid pace for deployments, I would say. I have no idea the speed at which they're doing updates now, but for the most part that required service personnel to be on site for the upgrade, because you're really upgrading the whole thing: you're doing the host Linux OS, you're putting a DVD or a USB drive or something into a computer and doing a full, clean installation. I do think they had some ability to push patches as well, and that would happen for very minor kinds of things. The other benefit with big software releases is that you often have the opportunity to sell new features, right? You might get new features, you might get some new covers, or you might get some new accessory or some other thing. Those become saleable features, and they work with the customers to get those upgrade plans in place.
That's fascinating. Yeah, with these kinds of systems, you mentioned the importance of them at the beginning of the show, but all the work and effort and the complexity of them is not something we usually think about, right? I've never experienced using one of them before, but even if you had, you'd probably just not be thinking about it. So it's really interesting to hear kind of behind the scenes. And it sounds like, because of the complexity of it, there was a wide breadth of experience that you got from being in that field and learning about it.
Yeah, it was super neat to be on that team. We had complicated field debug, complicated service and upgrades, and a lot of these installations are targeted for 10 or more years of life. These products are going to go into a hospital somewhere and work for a really long time, and they're going to need upgrades and maintenance, and if a chip vendor obsoletes a chip that you're using, then you're going to be building replacement boards that swap that chip out, with some software updates. So a whole wide breadth, but it was a really great place to cut my teeth on a very complicated system. I feel like a CT is at least as complicated as a car is today with all of the stuff going on, possibly more so in some ways and probably less so in others. And it was neat: we had a team that was doing everything from high-power electronics to high-speed digital to mechanical stuff, so you get a lot of exposure, like, oh, this is how sheet metal gets built. That stuff is super valuable, especially as I moved to Oxide, where we get to wear a lot of hats because we're a small team here.
Our whole team when I started at Oxide was smaller than my hardware team was at GE; I think I'm employee 32, maybe, something like that. So everybody gets to do all kinds of stuff, and having some experience, like, oh yeah, I know what sheet metal is, I can at least pretend to talk to people about this stuff and maybe get us to where we need to get, was super good. But yeah, being on a team that makes physical things is just a really exciting experience, I think. And it's fun to be able to hold the thing that you worked on and watch it work.
Right. You were at GE for a long time, so I imagine Oxide was a pretty compelling place to join. Talk to me a little bit about the decision to do that, and also, I guess, for folks who may not be familiar with Oxide, what y'all are doing.
Sure. So yeah, Oxide is making rack-scale computers for people who want to have on-prem cloud infrastructure.
So there are reasons why people might want to have hardware on-prem.
And when they want it on-prem, they would still like the cloud-native, API-driven feel: no VMware, that kind of thing. So Oxide's building something for those kinds of companies, companies that need hardware in their data center and need to own their hardware, but would like the cloud experience for managing it. They don't want to have a whole team plugging in a bunch of commodity boxes and building out network things. When you buy an Oxide rack, the rack comes out of the crate and you unwrap it, and we've had customers who were up and powered on their network in a matter of hours; they're ready to go. Not a data center person spending four days doing network routing and that kind of thing. The rack is all built: essentially, it's got two big switches and up to 32 sleds, which are Milan-based. They're big server-class machines, so a terabyte of RAM, multi-core processors, lots of SSDs, that kind
of thing.
So, but you buy it and you buy it like at the rack level.
So you buy a rack and you get a rack or you buy four racks and you get four racks.
So like that, that's kind of the, this is not the, you know, it's not like, oh, I need
two servers kind of use cases.
It's the like, I need 32 or I need 64 servers.
I need, you know, 4,000 servers, that kind of, that kind of a use case.
So that's, that's what we're building.
As far as moving over, I mean, you know, I, I discovered Oxide on Hacker News actually.
So I was reading Hacker News and one of the blog posts, you know, hit there and I read
it and, you know, it's, it was interesting.
I read a lot of Hacker News.
So it's like, oh yeah yeah, you know, some some other
startups doing this, like interesting thing. Cool. And so I, you know, read it and kind of looked at
it. And it was like, Okay, like, that's neat. But But as I, it kind of got stuck in my brain,
you know, like this, this like little oxide thing. And it was like, that company is like,
kind of interesting. Like, there's something weird about, you know, like, uh, Brian, our CTO has his, uh, the compensation as a reflection of values. That
was, uh, that was a blog post that I discovered. And, you know, it was just like this, this company
feels like they're doing something different and, and like, that's kind of interesting.
And so, you know, that, that was like, I wasn't looking for a job right then, but like that kind of, that seed got planted in my head. And then, um, somewhere along the, along that, that same year,
um, one of my close coworkers, uh, left and took another job somewhere else. And like,
he and I were, were super close and, you know, I like really enjoyed working with him.
And that was kind of a sad day, uh, you know, there at, I was just like, Oh man, like,
you know, like we've been working so closely together and like, he's off, you know, move,
move to a different area and, you know, like all this. So I think, uh, you know, those two things
really kind of like put me, I think more in the mood to like, look a little bit. And as I, as I
started, you know, thinking more about it and, you know,
talking to my family about, you know, well, do I, do I want to do this? It was just like, you know,
yes. Like I really do want to try this. Like this is something different and it's something,
it's a, it's an opportunity I have. And, you know, when I, when I looked on their website,
it was like, oh, they're hiring electrical engineers. Like that's, I'm an electrical
engineer. Like I might be able to do
this. And so, so, you know, I applied and went through the process and the process is like a
little bit different. You know, you do a lot of writing and a lot of storytelling in your
application process and kind of, kind of worked my way through that and got the opportunity to
talk to Brian. And one of the first things Brian said was, you know, you've done a lot of FPGA stuff; we don't have that much to do for FPGAs.
Like we don't have that much to do for FPGAs.
And, you know, it was like, okay, you know, that's okay, I think.
You know, I've worked on a hardware team for, you know, 16 or 17 years.
So, like, I know how to schematic review and I know how to, you know, do this stuff.
And, like, I'm kind of looking for something, you know, a little different.
And so, anyway, that all worked out. In May of 2021, I got the opportunity to join Oxide. And it's funny, because my former coworker who left actually ended up applying and coming over to Oxide eventually, so we're working together again, which is kind of fun, as well as another one of my ex-GE coworkers. So we have a group of people who have really enjoyed working together with each other for quite a number of years over here at Oxide now, doing computer servers and network switches and stuff.
Yeah, that's awesome. And
you mentioned joining early on in the company's history, just 32 people, and especially given the scale of what y'all are doing, that is a very small number. What was it like coming in? What was the state of the product at that point? What did you start working on, and how has that evolved over time?
Yeah. So, state of the project: we were late. Like all good engineering projects, right? You're late, I feel like, when you start. I was the third or fourth electrical engineer on the team; they had started hiring electrical engineers in 2021, but the company was formed in 2019, I believe, so they had had some amount of time, and they were trying to work with a contractor to get some of the electrical design done. But one of the ethos, I think, at Oxide is really doing things from the ground up: understanding where you're coming from, why you're doing the thing, really from first principles. And I think it was a challenge to instill that into a partner or a contractor. We want to own this stuff, and it's hard to do that when you don't have employees, I think.
And I think that's a challenge generally.
And I'm sure that there are some times that that can work out.
But in this case, we didn't want to take AMD's reference design and just plop it down on a different circuit board; we wanted to do things really differently. You need people engaged in that, people who are ready to join that kind of movement, I feel like, and without that it's just really hard to get successful outcomes. So when I joined, we joke, I say that when I joined the house was still on fire. The guys who came before me, Tom and RFK especially, had started a little earlier and were trying to get the train back on the tracks, but we had a lot of schematic work and a lot of stuff to do. We were right in the middle of it: theoretically we were going to tape out the Gimlet server board, the first board that we were going to build, and we were supposed to tape that out, I think, in June, maybe. It kind of dragged on a little bit. We got in there and there was a lot of cleanup and a lot of changes, and we realized some things weren't communicated very clearly; anyway, lots of stuff. Then, thankfully, we had Eric and Aaron join, and Ian eventually, so a lot of good people joined along this process. But we had to right the ship and get the schematic out the door so we could build these parts and then do bring-up. I think you mentioned Oxide and Friends; we have a couple of different episodes that talk through some of the war stories of bringing some of that hardware up. It's been interesting, because we're a remote company, so we're not all able to go to the office every day, and there isn't an electronics lab that we all share, which is kind of how life was before. So, bringing up Gimlet, I remember: our contract manufacturer and partner
is in Minnesota. I'm up in Wisconsin, so it's about four-ish hours away. We had been there in Minnesota to do bring-up and had gotten a lot of problems figured out on the board, and because the board had kind of been handed off and handed off, there were just problems at all of the seams, right? You could see that had we done this differently, with one team owning the thing, some of that would have gone better. But anyway, a bunch of people came in from all over the country to get the thing built up, and when our two weeks was up, the server still didn't boot, but it kind of mostly powered on, and we had a pretty big ding list of rework to do. So I remember, a week or two later, I had to drive halfway back there, and I met one of the employees from Benchmark, our ODM, with our two minivans. We pull up and I pull all of these servers out in the parking lot of a cheese store or something, which is a very Wisconsin thing, right? We loaded all of those in, I brought them back to my house, and Eric is local here, so he and I finished some of the rework on that stuff and shipped them out as we got things to boot and everything. But yeah, you end up doing a lot of
strange stuff. I had done some rework and some soldering at GE, just as part of the normal debug process on a circuit board and that kind of thing, and I got even more of that here. The servers were piled up behind me in my office for a couple of weeks until we got them functional enough that we could legitimately hand them out to our teammates and they might actually boot and do things.
Right. Well, that is quite the story of the cheese store handoff. You mentioned kind of owning the full stack, and, I don't know if I'm saying this or admitting this, but I've also read takes on that outlook that I think are short-sighted, because, yes, it's more effort upfront to do things a little bit differently, or understand the whole system, or maybe build some parts of the system yourself, but y'all have to support these racks once they go out, and you are going to see lots of behavior. So while skipping that might accelerate the process of developing the initial product, the long-term burden it's going to impose is actually going to be pretty big. So I think, in the context that you are working in, it's a very valid outlook to have. And also, people externally get to benefit from the interesting things you are doing, too.
Well, yeah. And I hope that our customers
feel like we're the one they have to go to, right? There are no excuses; we need to own our problems to the extent that we can. That's where, you know, Brian and Steve, having run data centers and other things, have lots of stories about getting involved with third-party vendors, and there's a lot of finger pointing and a lot of stuff. One of the core things for us at Oxide is really just that we want to own the stuff, and we want to own all of it, to the extent that we can. There are places where we just can't own it, because someone won't give us the code for that, and there are lots of places like that, but we want to have ownership of as much of it as we possibly can, so that we can both understand it and make sure that we have the right visibility into the design, and then also be able to fix problems
and support our customers in the best way possible.
Yeah, that makes a lot of sense.
You also mentioned the bring-up experience, and I'll make sure to link to some of those episodes, as well as something you didn't mention: your blog post about working with remote hardware teams, which I think is really excellent and details some of the strategies that you all have employed that could be useful for other folks. One of the things I think is really illustrative of understanding an entire system is kind of going from power-on to getting a terminal prompt or something like that. Can you kind of walk
me through what that looks like for the Oxide computer?
So, in a functional system, we power it on, and in under a minute, maybe a minute and a half, the system is up and running and you can get to a serial port through our web API and that kind of thing. Bring-up was not that smooth. Some of the things we went through: I think December 1st of 2021 was when we first saw the characters that we intended to see come out of the serial port on the board, and I think we had done bring-up in Minnesota sometime in early October. There are a lot of good stories about the power supplies; we spent a lot of time with AMD and Renesas trying to get the power supplies to handshake appropriately.
There's a bunch of stuff that went in there. But it was so exciting. And this is the remote company thing: I don't have a bunch of engineers living in my house, right? My coworker Eric was over here a bunch during that time as we went through it, and it's sometimes helpful to have two people in a spot to debug. But I remember, December 1st, setting up a Google Meet and getting the whole team on there to watch serial port characters come out the serial port for the first time, and it was so exciting. You get all the way to the end and it's like, oh, illumos booted. Or maybe it wasn't illumos booting first; it was our nano bootloader, our little bootloader, and one of our engineers had put a nice little banner there, so you could see "Oxide." Anyway, it was just super exciting to see this thing that we'd been working so hard on for so many months, and even some of the team members for years at this point, booting.
Right. And this is a system that doesn't have a BIOS or anything like that. What is the process for booting there, and what parts of the system are involved?
So to boot this stuff, there's a bunch of complicated things, and this gets back to FPGAs, even. In order to get all the power supplies up in the proper order to make the AMD processor happy and start, there's a whole handshake process that happens. We have a sequencer FPGA; it's a little Lattice iCE40, 8,000 LEs, but it's probably only a quarter full, it's not doing a whole lot. It does all of the power handshaking to get all of the different rails up in the right order: you've got to bring your DDR rails up in a specific order, you have to bring some of your core power supplies up last, and then at the very end you send a PMBus message to one of the power supplies that says go, and then the thing goes.
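A hedged sketch of that kind of rail-sequencing state machine in Rust; the rail groupings, ordering, and names below are illustrative assumptions, not the actual Gimlet sequence.

```rust
/// Illustrative power-sequencing states: bring rail groups up in a fixed
/// order, then send a final "go" over PMBus. Names and order are made up.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SeqState {
    Off,
    EarlyRails, // always-on / early rails
    DdrRails,   // DDR rails have to come up in a specific order
    CoreRails,  // core supplies come up last
    PmbusGo,    // final PMBus "go" to the core supplies
    Done,
}

/// Advance only when the current rail group reports power-good.
fn next(state: SeqState, rail_good: bool) -> SeqState {
    use SeqState::*;
    if !rail_good {
        return state; // hold here until the rails are up
    }
    match state {
        Off => EarlyRails,
        EarlyRails => DdrRails,
        DdrRails => CoreRails,
        CoreRails => PmbusGo,
        PmbusGo => Done,
        Done => Done,
    }
}

fn main() {
    let mut s = SeqState::Off;
    while s != SeqState::Done {
        s = next(s, true); // pretend every rail reports power-good
    }
    println!("sequencing complete: {:?}", s);
}
```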
And at that point, the AMD processor has an internal core called the PSP. It has some firmware that it loads out of flash, and it runs its own little binary that we don't get to see; that's something AMD provides us. It does some wake-up stuff, figures out what DIMMs are installed, does DDR training, and then hands over to the main x86 CPU. At that point we start running, at the time, a little shim called the nano bootloader; we're now running a slightly different version of that called the pico bootloader. It's basically a Rust-based bootloader written by a couple of the software team members. So we start executing x86 code, and that does just enough to start Helios, which is our illumos-based operating system, and then Helios starts. So that's the whole boot process, really, with no BIOS. There's no UEFI, none of that stuff is in there. Our code starts with the first x86 instruction, and we do just enough to set memory up and set the hardware up in such a way that the OS can run. Then the OS takes over and does the rest of the setup, gets us into multi-core mode, and
all the different things that happen.
Gotcha, that's really interesting. I know there's also a service processor, right, and that's the one that y'all have written your own small OS for. What role does it play? I imagine it's primarily playing a role before that boot-up process, or is it ongoing?
Yeah, it's kind of the conductor for the boot-up process, right? So, let's see. In the sled architecture, we basically have three main power states. There's the state where you're just connected to the rack and, I'll say, nothing is up, but actually a few things are up: we have Ignition, which is a little tiny iCE40 FPGA, and it talks an 8b/10b-encoded custom protocol to the Sidecar switches. That just provides sled detection and basic sled power control, so that is one power domain. And then,
when it has been instructed to turn on, or, I believe in most cases, once Ignition configures itself out of flash it turns it on automatically, we start the SP power domain. So the SP comes up, the SP loads the FPGA, and the SP starts its management network stuff. We have all of our core management functionality there: Hubris boots, and the SP shows up on the management network. So now we can actually talk to the sled, almost like a BMC in a traditional server, right? It's sitting there and we can get to it; even though the AMD processor is off, we have the little Arm-based SP, and it's up and running and can talk on the management network.
And then the SP decides to turn the AMD processor on. The SP has to tell the FPGA to go wiggle all the power supplies and do the thing, so it goes and wiggles all the power supplies, the AMD processor starts to boot, and the SP monitors what the FPGA is doing. Once the FPGA has enough stuff sequenced, the SP does that final PMBus command to the core supplies to tell them to start operating.
Then it has a serial link between it and the AMD processor; it actually has two. It can see the traditional serial port that you would see, and it also has a separate serial port for intercommunication, so that's how it can tell what OS is booting, and there are a few kinds of power-control and sideband things that aren't necessarily user terminal stuff, and that all goes through there. So the SP is really the thing that coordinates. When the control plane wants to, say, take a sled out, we can have the SP cycle the sled, or we can go upstream of that and have Ignition cycle the sled, depending on what we want to do. But the SP's job is basically to sit there and be a management interface for the sled, so that the rack can treat it as a resource and we can get debugging information out of it and monitor it. The thermal loop is also running on the SP: we have fans that keep everything cool and temperature sensors on the board, and the SP is doing all of that as well.
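Summarizing the three sled power states he describes as a hedged Rust sketch; the state names and the simple transition function are invented for illustration, not Oxide's actual naming.

```rust
/// The three main sled power states as described (names made up here).
enum SledPower {
    /// Only Ignition (the tiny iCE40) is alive: sled detection and basic
    /// power control over its 8b/10b link to the Sidecar switches.
    IgnitionOnly,
    /// The SP power domain is up: Hubris is running and the SP is on the
    /// management network, acting like a BMC while the host CPU is off.
    SpUp,
    /// The SP has had the sequencer FPGA walk the rails and sent the final
    /// PMBus command, and the AMD host processor is booting or running.
    HostOn,
}

/// The SP acts as the conductor, driving the sled from one state to the next
/// (the control plane can also ask Ignition to power-cycle from upstream).
fn power_up(state: SledPower) -> SledPower {
    match state {
        SledPower::IgnitionOnly => SledPower::SpUp,
        SledPower::SpUp => SledPower::HostOn,
        SledPower::HostOn => SledPower::HostOn,
    }
}

fn main() {
    let sp_up = power_up(SledPower::IgnitionOnly);
    let _host_on = power_up(sp_up);
}
```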
Gotcha. Another thing that I've heard about the FPGAs in the system, and I think you mentioned two different FPGAs that are currently in the system, is that I believe you are using open source toolchains for your bitstream development and programming. What has that experience been like? Lots of folks I talk to have very differing opinions on the state of open source FPGA toolchains.
Sure.
And you obviously have lots of experience with proprietary toolchains and are now using open source alternatives. What's your take on the current state of them?
I mean, I think it's complicated. The open source toolchains are really cool; I think they're awesome. But they're in this weird spot because they're always playing with at least one hand tied behind their back, because the information about the chips isn't open. I feel like most companies would not make a processor today that isn't documented well enough that you could get LLVM support, right? But FPGAs aren't like that. Everything that the open source toolchains have done is amazing, but it's all had to be manually reverse engineered, and it's just super painful as far as how to get new chip support and that kind of thing. I'm sad, because when you look at the software world and all of the explosion that we have there with open source compilers, they mostly just make sense. I don't know that there are a whole lot of people out there thinking, well, I would like to hide my information from LLVM because I don't want the open toolchain to know it. You want to use these professional-grade open source toolchains, by and large; you can also buy proprietary compilers, for sure, and there may be reasons to do that. But my hope is that over time the FPGA companies embrace a more open interface, so that we can better have something like an LLVM story with this stuff. Now, I would say we're using Yosys and the iCE40 toolchains there, and they were great; we haven't had any real major issues with them at all. But you are
missing certain things. I think this is kind of where you end up on a lot of stuff: you're missing features that either aren't needed, aren't wanted, or maybe just haven't been developed yet, or are difficult to develop. A big missing feature to me is that there's no ChipScope or SignalTap just integrated into the tool. Being able to do an incremental fit, add SignalTap or ChipScope, and see what you're looking at inside the chip is a super critical feature that you see in the proprietary toolchains. Partly, I think, because if they give you those tools, they get fewer support cases, right? If you can go fish a little bit for yourself, then it's not a big deal. But those tools are super powerful, and I can't tell you the number of times you have some tricky problem that you need to go find.
Being able to go stick a logic analyzer in your design is critical, and being able to stick it in and run an incremental compile is also very important for debug. If I just went and coded my own logic analyzer, which I can totally do in all of these tools, and dropped it in wherever I want, it's a little more annoying, because I have to change my design and stick stuff in there, and I'm going to get a massively different fit. Depending on what kind of problem I'm looking for, I may chase that problem away, especially if it's a timing problem or a clock-crossing problem and you're trying to figure out what the heck is going on. A lot of times SignalTap isn't even necessarily the tool that gets you to the answer, but it allows you to see what's going on, and you get ground truth: okay, I'm getting corrupted data here; how could I be getting corrupted data here? Oh, I have botched a clock crossing, or I have done something wrong somewhere else, or you're sending me incorrect data. It gets you ground truth, so you can see the bits or the packet or whatever, and that's a critical feature. Additionally, I think critical features are around timing constraints and
timing analysis in any moderately complicated FPGA. I have to laugh, because in college they would say, don't worry about designs that have more than one clock domain; when you get to a design that needs more than one clock domain, you'll know how to do it. And it's just totally not true. When you look at a modern design today, especially with transceivers, every transceiver gives you at least two clock domains right there. I was working on a design where I had multiple transceiver interfaces going into a system clock domain and then back out another transceiver interface, so you might have eight or twelve clock domains in play in a moderately small piece of logic. Those are real problems, so being able to get to ground truth in the tools for whether you meet timing is critical there, especially with multiple clock domains and I/O timing constraints.
A lot of the I/O that we're doing is super fast, and being able to conclusively prove, build over build over build, that I have constrained this interface, maybe it's a XAUI interface or some kind of high-speed SPI interface or something like that, I've constrained it with these constraints, my timing analyzer runs, I get a pass, and it's going to work, so I don't have to go chase that every single time and I'm not worried about it falling apart every time I build it: that is critical. And I think those are areas where it's challenging with the open source toolchains, because those features sometimes require more intimate knowledge of the chip and better timing models, and that stuff just isn't available. I don't know what the answer is there, but I do feel like there is kind of a gap there in my experience, and I hope that gap goes away.
I hope that companies start to embrace the open toolchains. I think QuickLogic is supporting an open toolchain kind of from their genesis, and that's pretty cool. I believe it might be GreenPAK that's also doing something like that, but those are little tiny things; the GreenPAKs are cute little devices from Renesas, not quite the big-size FPGAs that we need. But I'm hopeful that those are the opening salvo to more openness in the FPGA space, so that we can get better toolchains and better tools.
And being able to go and inspect things: when I have a compiler bug or an IP problem, I'm tied to a vendor and their response cycle and their understanding of my problem, and the fact is that they're not here, they don't necessarily have a reproducer, and they don't have my hardware. Being able to go look in an open toolchain is super awesome, because you can go there and say, look, I might not be an expert here, but I can change this thing, rebuild it, and see if it does something. You get a whole different kind of investigation that you can do. So I'm hopeful that the industry is headed that way. I don't know; I hope I see it before I retire.
Yeah, absolutely. I did see the QuickLogic stuff; I think that's been a few years ago now, so hopefully some of the bigger players do go that direction. But kind of wrapping up here: y'all shipped a rack product, a rack or multiple racks, and there's obviously ongoing work from the things that you work on. There are firmware updates, and the FPGAs could have updates as well. What does that look like for you now that the product is baked to some degree? Ongoing support of that product, and then potentially moving on to a next product iteration? What are your responsibilities, and how do they evolve there?
Yeah, so we're doing a lot of scale-up activity right now. Making a rack, making two racks, making three racks has been a great learning process for us, especially on the manufacturing side, but we need to get to a spot where we don't have engineers troubleshooting manufacturing line failures as frequently, or anything like that. So we're trying to get better testing upstream and better fixturing upstream, and be able to scale up so we can build way more of these efficiently. That's been a big focus for our whole team. And then, when you look at processor life cycles, we're on Milan, right? Well, they've got new ones coming, and new ones come every couple of years. So we're looking at what a next-gen sled looks like that fits in the same rack, and going through what about the current architecture we're really happy with and keeping, and what minor tweaks we can make now that we've been through this whole cycle once: are there things that we could do differently that would make debug or visibility easier?
And then you have the standard thing where a new processor family is going to bring new DDR and new power supplies and new power supply topologies.
So there are some things that are kind of forced on us that we're going to have to change, and some minor architectural tweaks. None of it is really surfaced out to the user, but it's stuff that will make manufacturing better. One of the big challenges, and this is kind of silly, is that we have a bunch of dongles that go in during manufacturing in order to bootstrap this thing and program one of these up, and we realized at a certain point that having multiple dongles connected is kind of annoying from a manufacturing standpoint. But given where we were in the program, and the resources and timelines we needed to hit, there wasn't really an opportunity to go redesign or do something smarter there. So we're working on a single plug-in interface where we can get all of our stuff into one consolidated dongle, so we can get manufacturing programming and tests done a little more efficiently. That helps our manufacturing partner, so there are fewer places for things to go wrong, and it improves your cycle times. When you think about wanting to make a hundred or a thousand servers, you need to do things that scale better than what works for 10 or even a hundred racks, maybe. So those are areas that we're actively working on now, for sure. We're looking at what FPGAs our next sleds are going to use, that kind of thing.
Awesome. Well, I'll definitely be following along
as y'all continue building new things, and recently you've been open sourcing a lot of different things, so that's awesome to see.
So Nathaniel, thanks for joining for this episode.
It's definitely super informative.
We covered a lot of different topics and I hope I can have you back on in the future
to talk more about what y'all do in the next iterations at Oxide.
Yeah, awesome. Thanks very much. It's been fun to talk about all this stuff. This stuff is so exciting, and it's just nice to be able to talk about it and help get other people excited about it. And the stories are just fun, even so.
Absolutely. Absolutely agreed. Um, all right, well have a, have a good rest of your day, Nathaniel.
Great. Thanks.