Signals and Threads - Programmable hardware with Andy Ray
Episode Date: September 9, 2020
The ever-widening availability of FPGAs has opened the door to solving a broad set of performance-critical problems in hardware. In this episode, Ron speaks with Andy Ray, who leads Jane Street’s ...hardware design team. Andy has a long career prior to Jane Street shipping hardware designs for things like modems and video codecs. That work led him to create Hardcaml, a domain-specific language for expressing hardware designs. Ron and Andy talk about the current state-of-the-art in hardware tooling, the economics of FPGAs, and how the process of designing hardware can be improved by applying lessons from software engineering. You can find the transcript for this episode along with links to related work on our website.
Transcript
Welcome to Signals and Threads, in-depth conversations about every layer of the tech stack from Jane Street.
I'm Ron Minsky.
So today, we're going to have a conversation about hardware, and in particular about how
you can take the tools that come out of the world of chip design and apply them to a much
broader space of problems than people
typically think they can be applied to.
And I'm joined in this conversation by Andy Ray.
Hello, Ron.
Good to be chatting with you.
Andy is a longtime veteran of the hardware industry.
He spent over a decade building real shippable hardware designs, working on things like modems
and video codecs.
And along that time, he also
did a lot of interesting work exploring and eventually designing his own alternative
languages for expressing hardware designs. The final one was called Hardcaml, which is a hardware
design language embedded in OCaml, which itself is the primary programming language we
use here at Jane Street. And that work actually led him to us,
and today he works here and leads
Jane Street's hardware design team.
And so, to start with, Andy,
maybe you can tell us a little bit about
why hardware is useful for a technology organization
and an organization like ours,
and what advantages it has over
traditional software style approaches.
Sure.
So, hardware allows you to build
customized architectures for a specific problem, which can be tuned to trade off at a very fine
level, lots of things like performance and cost and power usage.
That lets you design a range of different products.
Whereas with a CPU,
you're very much more limited in the software world
to the CPU design that can meet the performance
of the problem domain.
I think the sort of problems it can solve
are very, very broad.
And you can see that just because, well,
a CPU is a hardware design, in fact. And you can create all sorts of hybrid designs with multiple CPUs or digital
signal processors or custom hardware blocks that make up your final solution. So I think that's
why hardware exists, why it will always exist. It's the fact you can build architectures entirely suited to your
problem domain that optimize along these sort of areas.
So that description, on the face of it, sounds awesome. And in fact, it sounds, from what you said so far, strictly superior to writing
software. I don't think that's quite true. Can you say more about what the downsides are of
operating inside of a hardware context?
Oh my goodness. Yes, there are a lot. So it's fundamentally this: hardware designs are
much more difficult to write than equivalent software. So all that flexibility in choosing
the architecture for your problem domain, you actually have to implement that. In software,
you have reams and reams of support libraries that either your organization has developed,
or that you can pull in from open source or that you can go and purchase. To some extent, that infrastructure
works in hardware with an idea of intellectual property suppliers. They're basically just
companies who supply a hardware design for you to integrate into your system. That's actually
the job that I used to do when we were developing video codecs.
And just to interrupt for a second,
that was a bit of terminology that really confused me
when I first encountered the hardware world.
When people in hardware say IP,
they mean something like when a software person says library.
Correct.
Which is to say some component that somebody else wrote
that you get to integrate.
Except in this case, the component is a bundle of wires
that you kind of plop into your design
rather than something that looks sort of like a module or a library.
Yeah, that's right.
I don't know why that terminology came about,
but it's just always been called IP
when you buy a hardware library design.
So as I say, there's some sort of infrastructure there
for buying external blocks to integrate with your
hardware. It's a vastly smaller ecosystem than we have in software. It's vastly more expensive.
There is in the last maybe 10 years more of an open source community around providing
hardware blocks that you can integrate, but it's still absolutely minuscule compared to software.
And then just the process of writing hardware is slow and detailed. And I'm going to say difficult,
I'm not so sure it is really technically that difficult. It's just that it's so detailed and you're dealing with such big systems that it becomes a real problem trying to manage the complexity of all these very simple bits that sit together.
Right. I think of that as one of the paradoxes of hardware.
Hardware is, in the micro, in many different ways, simpler than software.
Yes.
The thing that you're generating in a hardware design is essentially some layout of the circuits of the kind of individual gates and wires that connect them. And it's some
kind of fairly static graph that represents the structure of the computation and is converted into
like, you know, when you actually get one of these fabricated, actual bits of material laid out on a
physical surface. And understanding how those individual pieces work, at least logically how
they work, leaving the physics aside, is relatively simple.
But then having a big design that does a lot of these things is enormously hard to reason about.
It is.
And unfortunately, the abstraction tools we have, they take us some way.
So, you know, you talked about a chip design, which is like a, you can think of it as a layout of just two things, really.
Lots and lots of NAND gates.
So a Boolean AND function with the output NOTed
and metal wires that connect them together.
And it's interesting because this is a universal Boolean function.
Any other Boolean function can be computed with the NAND function.
That's not true of AND, for example.
You can't create an OR with an AND, but you can create it with a NAND.
And there are, I think, 16 Boolean functions, and four of them are universal.
I think NAND and NOR are quite often the basis of technology.
NOR being an OR gate with the output inverted. We don't actually think about writing
circuits at the level of just interconnected NAND gates. An interesting aside, I believe the first
ARM processor was basically designed that way, but actually even lower. They were drawing the
transistors for the NAND gates in just a graphics package. That's how they created the very first
ARM processor, but that's not how they do it now. So we're a little bit above that. We work with a tool
called a synthesizer and it takes a slightly more abstract notion of a hardware design in which
we can think about components like adders and multiplexers. And the job of the synthesizer
will be to turn those components into the actual low-level hardware components for the chip,
which might be NAND gates if you're doing an ASIC, might be lookup tables for an FPGA.
But that being said, it's not massively above building it with NAND gates.
But really, a lot of the industry just works at this sort of level of putting together these macros, which represent adders and multiplexers and multipliers and registers, and just wiring them together and getting them to form some function.
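To make the NAND point above concrete, here is a minimal Python sketch, not anything from the episode. It builds NOT, AND, OR and XOR purely out of a NAND function, then wires those into a small ripple-carry adder of the kind a synthesizer would produce from an "adder" component. All function names and the 4-bit default width are illustrative assumptions.

```python
from itertools import product

def nand(a: int, b: int) -> int:
    """The only primitive gate: AND with the output inverted."""
    return 1 - (a & b)

# Every gate below is built purely from nand().
def not_(a):
    return nand(a, a)

def and_(a, b):
    return not_(nand(a, b))

def or_(a, b):
    return nand(not_(a), not_(b))

def xor_(a, b):
    n = nand(a, b)
    return nand(nand(a, n), nand(b, n))

def full_adder(a, b, carry_in):
    """One-bit adder made of NAND-derived gates: returns (sum, carry_out)."""
    partial = xor_(a, b)
    total = xor_(partial, carry_in)
    carry_out = or_(and_(a, b), and_(partial, carry_in))
    return total, carry_out

def add(x, y, width=4):
    """Ripple-carry adder: chains full adders bit by bit, result modulo 2**width."""
    carry = 0
    result = 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result

# Check the derived gates against Python's own bitwise operators.
for a, b in product((0, 1), repeat=2):
    assert and_(a, b) == (a & b)
    assert or_(a, b) == (a | b)
    assert xor_(a, b) == (a ^ b)
```

Calling `add(9, 5)` walks four full adders in sequence, which is roughly the level of abstraction described here: components like adders on top, NAND gates (or LUTs on an FPGA) underneath.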
So you were talking there about ASICs and FPGAs. Can you just quickly explain what those are? An ASIC is a custom-made chip that can perform
a single function. In contrast, an FPGA is a reprogrammable chip that can be programmed to
perform many different functions. Got it. So when you talked about what the advantages are of
hardware, you talked about how by having much more control, you get to really optimize
the things you care about, be it power consumption or performance or latency or whatever it is of the
system. Can you put a little bit of meat on the bones of that? What is the scale of the
improvements that you can get by taking something that you might do in software and moving it into
a hardware design? It's obviously going to depend on the sort of problem you're trying to solve.
But an area I know really well, video coding, I used to work on H.264 a lot, and there was
this really good software implementation called x264, which was almost entirely written in
assembly, used the SSE instruction set, which is a vector instruction set
on the x86 processor. And it could just about manage like real-time 1080p on modern Intel
processors of the time, which were like four gigahertz processors. In order for it to achieve
that sort of performance, you had to turn off lots and lots and lots of
codec features. If you're willing to go non-real time, you could turn on all sorts of features
that would compress it better. Now there's an extremely large standard for H.264. I used to
read a lot. There's a lot of features on that thing, but you just can't do them all. And so
we used to target different markets. There was like a sort of low-end video codec,
which could fit in smaller FPGAs,
could be used for like internet-based communication.
And there was like really high-end video codecs,
which were built over multiple FPGAs,
had a really high-end feature set
and was used for professional-grade encoding.
So that's the sort of video that you get
over your satellite link or over your cable link.
That bit stream is like compressed
as much as it possibly can be
so they can fit more channels into that link.
We didn't have to compromise so much on the features
to do that with hardware.
We could pick the features that made the difference
that got us to the bit rate
and they could run in real time.
And you just couldn't do that
real time in software. In that world, you were looking at an order of magnitude more computation
being done by these FPGAs. These sort of things can scale massively. There are chips which do
packet processing, for example. The idea here is you've got a switch and you want to do
packet processing to detect threats in those packets, to route these packets, all that
sort of thing.
Right, so-called deep packet inspection, right, is the term of art here.
Yep. And there are absolutely gigantic hardware designs which ingress hundreds of
network ports, process all this through one chip, clean up the network packets
and pass them on into big organizations. Can you imagine just how many x86s you would need
to do that job? Seems to be like a lot. Part of the advantage is there's like an
order of magnitude or more improvement in the bulk throughput of the system that you get there.
There's also a big improvement in latency. The time in and out of these things is much lower.
And I think another thing that's interesting is that it's much more deterministic.
Yeah.
Which is to say, you can build one of these designs such that it can simply consume everything that is presented to it
over a 10 gig network, and you know in advance that if your design essentially compiles, if you
can like lay it out on the chip, that it just works.
Which means the whole thing is simpler. In a software design, you get much less predictability,
which means you essentially have to improve the reliability by adding layers of buffering
so that you can hold onto packets for a while if you can't quite get to them in time.
That's true. And I think it's a problem that we have to deal with here at Jane Street
a lot. And that's non-determinism of our software processing systems. And there's really very
little you can do about it. That's sort of not true. There are some drastic solutions
to this with processors where you strip out your operating system. But, you know, there's
an awful lot of infrastructure you'd have to build to do that. Whereas, you know, designing a hardware architecture that's
specifically for a particular task, you get some really nice graphs out of this when you
compare an equivalent hardware system to a software system. Latencies drop right down.
And then determinism, you know, we have wonderful ones where
we do the 10th percentile, the 20th percentile, up to the 99th percentile. And the latency
variance for the hardware system will be like 20 nanoseconds across that entire range. Whereas
in software, it's really good up to the 99th percentile, and then it goes into the microseconds.
And there just tends not to be anything you can do about that in software.
You mentioned that some of the ways of getting rid of the non-determinism involve
tweaking your operating system and things like that.
But there's also some aspect of the non-determinism,
which is just fundamental to the trade-off between hardware and software.
A thing that I just didn't appreciate about ordinary software, as it were, is what a bizarre
magic trick a modern CPU is playing, which is to say CPUs are fundamentally parallel machines.
They have all these circuits that can fire up to power constraints at the same time,
so you can do lots and lots of things in parallel. And then somehow we feed it this incredibly sequential programming language,
the machine language of the architecture that you're in.
And then the chip is doing a lot of work
to execute that as quickly as possible
and essentially trying to leverage
all of these parallel resources that it has.
And some of that is done by doing speculation
where you essentially make guesses
as to what the software is going to
decide to do in the future. So you can dispatch operations ahead of time in parallel. And some
of that is done by prefetching information. But this is essentially an enormously complicated
pile of heuristics, which means that you really don't have a good tool set for reasoning ahead
of time about the performance. Whereas
a hardware design, if it works, it does every clock cycle exactly the thing that you expect
it to do in that clock cycle.
Yeah, that's generally true. I mean, I should say that there is non-determinism
possible in hardware designs. So we've done a few designs here where there has
been basically no non-determinism in the system.
That's not quite true. There's a tiny bit around interaction between clocks in the design,
a tiny bit around how like 10 gig Ethernet data is packed at the lowest electrical layer, but that
adds up to like plus or minus 10 nanoseconds or something.
So one way to think about what's happening in hardware is you've got this enormously, massively threaded system
where each thread just does one very simple operation.
And when I say enormous, I mean millions of these threads.
But unlike software, they update in a very simple way,
which is they all have their current value,
which they send to each other as necessary,
depending on how they're connected.
And a clock tick happens.
And on that clock tick,
all the threads will read the old values of everything else,
compute new values,
and then the thing steps again and again and again.
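The clock-tick model just described can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions (a single clock, a hypothetical two-register design, made-up names), not a real simulator: on each tick every register computes its next value from the old state, and only then do all of them update together.

```python
def step(state, next_fns):
    """Advance one clock tick.

    Every next-value function reads the OLD state; the new state is built
    as a separate dict, so no register sees another's updated value early.
    """
    return {name: fn(state) for name, fn in next_fns.items()}

# Hypothetical design: a 4-bit counter and a register that lags it by one tick.
next_fns = {
    "count": lambda s: (s["count"] + 1) % 16,
    "delayed": lambda s: s["count"],  # captures count's value from before the tick
}

state = {"count": 0, "delayed": 0}
trace = []
for _ in range(4):
    state = step(state, next_fns)
    trace.append((state["count"], state["delayed"]))

print(trace)
```

Because `step` builds the new state from the old one in a single pass, `delayed` always trails `count` by exactly one tick, regardless of the order in which the entries are evaluated.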
And so there's just this much simpler sequencing
of all these parallel
operations happening within hardware. That is a massive simplification because we have
multiple clock domains and other horrible things to deal with in reality. But locally,
that is kind of how it is. That being said, we're now starting to look at designs which
use DDR memory. And there you start to introduce a small amount of non-determinism
because there's a bunch of rules
necessary to access memory these days.
We model them in our mind as just this 2D array
and you go and get one cell,
takes this long, go and get another cell,
takes this long, and it's all equal
and we just don't worry about it too much.
But that's just not true.
That's not how these things work at all.
They're little chip designs themselves,
which you have to send commands to.
You say like, go and open the address range over there.
And there are rules about how many of these address ranges
you can have open at one time
and a big banking structure around it.
And so ordering your accesses to these memories is really, really important. It
could be the difference between getting like 80% of the potential bandwidth out of them compared to
like 5% of the potential bandwidth out of them. And the other thing they do is they just
occasionally shut off and do this operation called refresh, and you just can't access it then. And so we are going to hit these sort of non-determinisms. They're going to start adding
variance into our numbers. But I still think it's going to be immensely more manageable than
the numbers we get out of operating systems, which have so many other non-determinisms than
just accessing memory, like switching processes, switching cores, all that sort of thing.
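The effect of access ordering on memory bandwidth can be illustrated with a toy model. The sketch below assumes a single bank with one open row at a time, and the row size and cycle costs are invented for illustration; real DDR timing has many more rules (multiple banks, refresh, command spacing) that are not modeled here.

```python
ROW_SIZE = 8    # words per row (hypothetical)
HIT_COST = 1    # cycles to read a word from the already-open row (hypothetical)
MISS_COST = 10  # extra cycles to close one row and open another (hypothetical)

def access_cost(addresses):
    """Total cycles to read the addresses in the given order."""
    open_row = None
    cycles = 0
    for addr in addresses:
        row = addr // ROW_SIZE
        if row != open_row:
            # The wanted row is not open: pay to switch rows first.
            cycles += MISS_COST
            open_row = row
        cycles += HIT_COST
    return cycles

addresses = list(range(64))

# Walk each row completely before moving on to the next one.
sequential = access_cost(addresses)

# Same 64 addresses, ordered so that consecutive reads land in different rows.
strided_order = sorted(addresses, key=lambda a: (a % ROW_SIZE, a // ROW_SIZE))
strided = access_cost(strided_order)

print(sequential, strided)
```

Both runs read exactly the same data; only the order differs, and in this toy model the row-hopping order costs several times more cycles.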
In some sense, it's an issue of what the defaults are. The core language that you're working in when
you're building hardware is a deterministic language. And then in various places, you have
to interact with other systems. And weirdly, we think of the RAM in the same box that the FPGA is
as another system. You have to reach out over the network inside the computer, essentially,
to interact with it.
And that thing might be non-deterministic
and that adds non-determinism to your system.
And also you might on purpose as an optimization
add non-determinism to a design.
But the core language that you're starting with
is deterministic at its heart.
Whereas running on a CPU just is like in the presence
of other things running on that CPU
is non-deterministic and hard to reason about the timing in a way that is to some degree
just unavoidable.
Yeah, that's right.
So this overall story of why hardware is different and why it's useful and why it lets you
achieve goals that are really hard to achieve in software seems in many ways very compelling.
But I think if you've never heard of this world before, there's one enormous problem that sounds like it comes up,
which is, do you actually have to fabricate custom hardware every time you want to make a change?
One of the great things about software is you write it, and then when you change your mind
about how it works, you update the code, you compile, you roll a new version, and poof, you have
a modified version of the system. And it turns out you can get some of this in the
hardware world through various forms of what are called reconfigurable hardware. Can you tell us a
little bit about that, and I guess in particular FPGAs, which are the kind of most common form of
this and the one that we use? Yes. So to start with an FPGA, it stands for Field Programmable Gate Array.
An FPGA consists of a matrix of elements called LUTs, which stands for Lookup Table. And each of
these LUTs can implement an arbitrary Boolean function. Alongside that is what's called programmable routing, which allows these
LUTs to be connected together in an almost completely arbitrary way. And so an FPGA design
is effectively a static configuration of these LUTs wired together to perform some function.
Now, it's actually a bit more complicated than that. There are other components involved,
but roughly speaking, they work kind of the same way.
They're laid out on the chip in a grid fashion,
and they're wired together with the programmable routing.
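As a rough picture of what "a LUT can implement an arbitrary Boolean function" means, here is a small Python sketch. A LUT is modeled as nothing more than a precomputed truth table indexed by its input bits, and "routing" is just feeding one LUT's output into another. The helper names and the example functions are assumptions made for illustration.

```python
def make_lut(fn, inputs):
    """'Program' a LUT: precompute fn's output for every combination of input bits."""
    table = []
    for index in range(2 ** inputs):
        bits = [(index >> i) & 1 for i in range(inputs)]
        table.append(fn(*bits) & 1)
    return table

def lut_eval(table, *bits):
    """Evaluate a LUT: the input bits simply form an index into the table."""
    index = sum(bit << i for i, bit in enumerate(bits))
    return table[index]

# Two LUTs configured with different functions.
majority = make_lut(lambda a, b, c: (a & b) | (a & c) | (b & c), inputs=3)
parity = make_lut(lambda a, b: a ^ b, inputs=2)

def circuit(a, b, c, d):
    """'Routing': the majority LUT's output is wired into the parity LUT."""
    return lut_eval(parity, lut_eval(majority, a, b, c), d)
```

Changing the design means changing the table contents and the wiring, not the silicon, which is the sense in which the chip is reprogrammable.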
It's kind of a chip platform for,
how should I put it, emulating circuit designs.
By that, I mean they're programmed in the same way
that a proper fabricated application-specific circuit
would be done.
So you start with a hardware design,
you would go through this extremely complicated
set of tools that create some sort of technology
representation of your input circuit,
and in the ASIC world,
that would get sent off to a fab where it would be cooked and immersed in acid and lasers fired
at it and magic would happen and you'd get back this chip. That whole process could take anywhere from, you know, a few weeks if you're in mass production to six months to get your first example chip back.
With FPGAs, the big advantage is you can just reset the FPGA and load a new design and then reset it again and load another new design.
And that's what the field programmable part of its name means. It means you can deploy this thing, and then maybe you find a bug, maybe you do a version
two, and you can just deploy a new version of your chip and it can be running in the field the next
day. Whereas with an ASIC, you would have to go and refabricate an entirely different chip. You'd have
to pull back the old hardware, send out new hardware.
And by the way, you've been using the term ASIC.
So it stands for Application Specific Integrated Circuit. It's the term we tend to use for hardware designs that have gone to a foundry. Now, a foundry is just an enormous factory
which takes customer-initiated designs and puts them through, as I say, lots of complicated
chemical and physical processes to embed a hardware design on a piece of silicon.
So it sounds like the key advantage of FPGAs is that they are reconfigurable.
They are.
What do you lose for that? In what way are ASICs better than FPGAs?
On basically every performance front, ASICs are superior. The power that an ASIC will
use could be three to 10 times less for the same design. The amount of area that we use will be an
order of magnitude less. The frequency that you can run your design at will be significantly higher.
ASICs are really good if you could, first of all, afford to make them,
you don't want to upgrade your design,
and you've got decent volume for them.
You're not going to get like 40 ASICs made.
That would be utterly ridiculous.
You need to be thinking of like 40 million ASICs being made for it to start making commercial sense.
Right, and then the economics are in your favor
in that the cost per unit is much, much smaller than the FPGA.
Oh, yeah.
The economics of this is all very interesting in the sense that one thing one can be struck by is how big the gap is between FPGAs and ASICs. The thing that
has always struck me is how small the gap is, that this borderline ridiculous thing of,
I'm just going to lay down a bunch of stuff on a chip in advance and then have it configure
itself to look like some circuit. The fact that you can get anywhere near what a real fabricated ASIC can do,
I think part of it is this economic point you were making about
the FPGAs are one of these very high volume things,
so they can be built with the absolute best technology.
They cost a lot more.
They cost quite a lot per unit.
They really do.
But if you want to have a small number of them, there's no comparison.
It's way better to get a small number of FPGAs, which you can make all individual,
make them do whatever they want, change them whenever you feel like changing them.
It's a kind of threshold issue.
It's transformative.
Without this kind of flexibility, you essentially couldn't use hardware designs
for a wide variety of technology problems.
And with them, you can.
Even the economics of just FPGAs is really interesting, actually.
If I went to try and buy the chip that we're using currently in the office, it would cost
me maybe eight grand.
But if you go and set up a deal with Xilinx that says you will take this many chips a
month for the next two years, that thing would cost you 500 bucks.
Wow.
An absolutely outrageous difference. It's all built into them, you know, because they're customers
of foundries as well. They're buying time at foundries six months in advance. And the more
they know about the volumes they have to produce, like the FPGA to produce it is probably not that
expensive. You know, maybe the end of the day, tens of dollars, something like that.
But I guess it's very expensive for them to sit in warehouses doing nothing.
So let's switch gears a little.
I'd like to actually just understand a little bit more about your own background
and talk about how you got involved in hardware in the first place.
Can you give us a capsule summary of your involvement in the field?
So the first time I was ever introduced to hardware was at university.
I did a computer science and physics degree.
And in my final year, one of the elective courses I could take on the CS side was about
hardware design.
And it was only like a 12 week course, I think.
And we did a couple of projects which were in VHDL. I think one was
designing a multiplier, the other designing like this micro digital signal processor.
And I'm not sure I enjoyed the course so much, but I really enjoyed the project work.
I really, really liked that. And so after university, I had a set of career goals,
listed the things I wanted to work in. And one of them was actually hardware design.
So I promptly left university, went off and did games programming instead. Didn't like that so much. And then after a couple of years I ended up getting a job at an absolutely tiny embedded
software and hardware IP company. I think I joined as the fourth employee, and I did some work there on like a C++ video
codec for a few months. And then when we were going off to lunch, I mentioned to my boss that I was
kind of interested in this hardware design stuff. He was like, oh yeah, that's good, yeah, we'd
like to do more of that. And then I found a couple of weeks later, he'd gone out and got a contract for me to write a JPEG encoder and decoder on a Xilinx Virtex XCV800, the first range of Xilinx FPGAs.
And I was young, I was stupid, so I was like, yeah, this is going to be fun.
And it turned out it was, I enjoyed it.
I'm not entirely proud of the code I wrote there, but it met all the design goals.
It hit the frequencies and the performance that it needed. So the customer couldn't complain.
And I just kept doing that. I really, really loved doing that. I still remember the first time
we took this thing and put it on an FPGA, which I think this was a card with an FPGA that actually
sat on an ISA bus, if you remember them,
and just brought this thing up. Well, it didn't work properly, properly, but it was doing real
stuff that we expected. It was just like, wow, that is incredible. Months of work just sitting
there thinking about how to build this thing, and then it's actually live on an FPGA. That is some
feeling. I still love that. I love designing FPGAs.
I love bringing them up into real systems and seeing them work.
So there's obvious delight in your voice describing this work.
And I'm curious, what is it about hardware that you find so engaging as opposed to software work?
I mean, I should say I enjoy writing software as well.
But I do prefer writing hardware. I think it's the
satisfaction you get when you have a working system. I think it's a function of the amount
of effort you have to put in up front. There's this big, long time of coding and potential
frustration and fixing stuff and doing simulations.
And finally, you've got something that you can put onto hardware.
It's like, there aren't many shortcuts in hardware design.
It's like with software,
you can maybe get a bit of it written
and do a bunch of testing
and check it into your repository
and have a little library for other people to use.
There's none of that in FPGA design.
Like not until you've basically got the whole thing written,
can you get any
sort of payoff for this project. I don't know, that works for me. I like it. I get a big
buzz out of getting FPGA designs working.
I think there's just another aspect to it. I find that I have to build these mental models
of what I'm creating. I'm writing it in code, but the code is an expression of a mental
model I have of individual pieces of a hardware design, and then the systems scaled out and viewed
as a whole. And that's something I just enjoy, a way I like thinking, I think.
Right, so it's this kind of graph-structured computation that you have floating around in your head.
Yeah, something like that.
Something like that.
The models are kind of interesting.
So as you sort of scale out,
you're thinking about how components fit together.
You're not thinking about the hundreds
of individual signals that are connecting them.
You're thinking about, right, there's a data bus.
It's that wide.
This end's running at this clock speed.
That end's running at that clock speed.
What is the bandwidth I'm getting across that?
And then you scale out with other components
and you're trying to hit your constraints of the clock speeds
of the RAM and of the PCIe bus and making this thing such that,
you know, data can flow in the front end and out to the back end
with nothing stalling, but using just the right
amount of resources.
And that's like system level modeling.
And it's all done in my head and Visio occasionally.
So that's how you got into this business of doing hardware.
So what led you from there to start experimenting with alternative hardware design languages?
It was frustration with the tools that we were using to build hardware, in particular
testing stuff.
And so languages like Verilog and VHDL, which are these hardware design languages that most
chips in the world are built with, there are a few other options.
I guess these days, really, Verilog is the dominant hardware design language.
And you do two tasks in this language. The first one is you write down the hardware design that
you want. The second thing you do is build little tests for that hardware design. And we have to do,
like, in hardware, we do this at every level of abstraction of the design, or every, like,
layer of the design, from the smallest components all the way up through the hierarchy to the biggest components. We're writing
these test benches all the time and testing the corner cases of bits of hardware, then testing how
multiple bits fit together and what happens. And that was hard work in Verilog. It's not a real programming language. The writing of test benches is a
software task, but Verilog does not have a software core. I think they've kind of improved
this a little bit with languages like SystemVerilog, but I still think fundamentally it's
not a very good software language. Yet over half your job is testing and it's trying to build
these little software frameworks to test your hardware design. And I just got very frustrated
with that. I thought there must be better ways. So I started looking around, seeing what was out
there. I came across actually, I think this must be about 2003, this guy had written a compiler in OCaml for a language
he called Confluence, which was a new style RTL or hardware design language, which was based on
functional concepts. And because this is the hardware design world, almost nobody looked at it.
He tried to set up a business.
It was clearly better in many respects than the stuff that was there.
Nobody bought it.
And I think out of frustration, after a while, he gave up on that.
And he threw together this little OCaml module package and stuck it on the internet.
And that was called HDCaml. And I saw that come out and something clicked in my head.
I was like, I really like the concepts of Confluence. This is the same thing in a real programming language.
That is amazing. I was straight on there. I want to do it this way. And so I started digging
into that thing and I produced a bunch of external libraries for it. I added wires into HDCaml. And then after a while, I just basically
rewrote it. That was in OCaml still. I then did a version in F#. That was largely just a work-related
thing. We used Windows systems, and OCaml to this day is not the most friendly Windows-based
compiler in the world. That is a true fact. Although people are working on it still. I know they are. So I used F#. I liked F#, actually. And then I managed to get
switched over to Linux again at work. And then I started writing in earnest, I guess,
what is Hardcaml today, which is basically the third version of it that I've worked on.
And I think by far the best designed version
of it. And so, yeah, what did Hardcaml give me? Well, I love writing hardware in it. I think it
provides some really nice abstractions for designing stuff, but that's not primarily
why I love it. I really love it because I can write my test benches in OCaml.
And I find OCaml an extremely pleasant language to write stuff in, for a number of reasons.
It gives great abstractions. It's actually incredibly simple.
It's like the core of OCaml is put together by just a few really, really orthogonal concepts
that you can stick together and do really powerful stuff with. And I think I'm massively more productive in writing and testing hardware
using OCaml than I was in Verilog.
So it sounds like part of the advantage there is just having the same functionality in some ways
available in a really good general purpose programming language. I guess another thing that has struck me about Hardcaml is that it does a really good job
of giving a point around which to coordinate lots of different kinds of tooling.
How is Hardcaml structured?
And what are the kind of flows that you can build on top of it?
Hardcaml is basically an OCaml library.
Now, technically, computer scientists
would call it an embedded domain-specific language, which it is. But at the end of the day,
that doesn't really matter. It's a library which exposes a bunch of functions for describing
hardware designs. And those functions are things like hardware adders of four bits, a 10 input multiplexer,
32 bit register, that sort of thing.
And then we basically use the host language, which is OCaml, to take these individual components
and wire them together into a graph.
So fundamentally, the design API of Hardcaml produces this graph. We then supply a bunch of tools which
can work on this graph and do some interesting things.
The most important one is we can produce a simulation
model of the hardware.
So this is basically instantaneous.
We need to be able to model the design while we're developing.
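To make the idea of a component graph plus a simulator concrete, here is a minimal sketch in plain Python. It is not Hardcaml's API: the `Node` class, the `evaluate` and `step` functions, and the node kinds are all hypothetical names chosen for illustration. It wires an adder, a two-input multiplexer and a 4-bit register into a counter with an enable input, then steps the graph one clock cycle at a time.

```python
from dataclasses import dataclass

@dataclass(eq=False)
class Node:
    """One component in the circuit graph (hypothetical, for illustration only)."""
    kind: str            # "input", "const", "add", "mux" or "reg"
    width: int           # number of bits this node produces
    args: tuple = ()     # upstream nodes feeding this one
    name: str = ""
    value: int = 0       # constant value, or the state held by a register

def evaluate(node, inputs, cache):
    """Compute the value of a node for the current clock cycle."""
    if id(node) in cache:
        return cache[id(node)]
    if node.kind == "input":
        result = inputs[node.name]
    elif node.kind in ("const", "reg"):
        result = node.value              # a register shows the state from the last cycle
    elif node.kind == "add":
        a, b = (evaluate(n, inputs, cache) for n in node.args)
        result = a + b
    elif node.kind == "mux":
        select, *cases = (evaluate(n, inputs, cache) for n in node.args)
        result = cases[select]
    else:
        raise ValueError(f"unknown node kind: {node.kind}")
    result &= (1 << node.width) - 1      # truncate to the declared width
    cache[id(node)] = result
    return result

def step(outputs, registers, inputs):
    """Simulate one clock cycle: sample the outputs, then update every register."""
    cache = {}
    observed = {name: evaluate(node, inputs, cache) for name, node in outputs.items()}
    next_values = [evaluate(reg.args[0], inputs, cache) for reg in registers]
    for reg, value in zip(registers, next_values):
        reg.value = value
    return observed

# Wire up a 4-bit counter that only increments while `enable` is high.
enable = Node("input", 1, name="enable")
count = Node("reg", 4, name="count")
one = Node("const", 4, value=1)
incremented = Node("add", 4, (count, one))
count.args = (Node("mux", 4, (enable, count, incremented)),)

trace = [step({"count": count}, [count], {"enable": e})["count"] for e in [1, 1, 0, 1]]
```

The `trace` list is the kind of per-cycle data a waveform viewer would draw horizontally for the `count` signal.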
And so Hardcaml provides its own simulator and a wider tool set, including a waveform viewer. So waveforms are a kind of
graphical way of showing what the hardware is doing over multiple clock steps. So you monitor
like multiple, what we tend to call signals within the design. A signal can be one bit, it can be 32 bits,
and these signals will be drawn out horizontally. And within a waveform, you might be able to see
like 100 clock cycles worth of transitions for that signal, and also what that signal is doing
relative to other ones. So one of the things that has been interesting at Jane Street is taking that flow and trying
to make it fit in with the way we actually just develop software generally at Jane Street.
So yes, the design work and the way we're thinking about hardware architectures is kind
of different, but we've leveraged the very good build system technology at Jane Street and the editor
integration that comes along with testing frameworks at Jane Street. We write nearly
all our tests using a framework called Expect Tests. So you write a little test module, you put
this bit of syntax around it, and you write a little test that prints out some result.
And then the framework will take that result
and paste it back into your test code.
And then you can check your test and its result
into the repository.
And our continuous integration systems
will constantly rebuild all our code all the time.
The really useful thing about this
is if something somewhere changes
and it happens to break your test, you know about this immediately.
And in fact, the person who broke it gets to fix your test,
which is even better.
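The workflow described here, where a test prints a result and the framework records that output next to the test, can be sketched in a few lines of Python. This is only an illustration of the idea and not Jane Street's framework: `run_expect_test` and `adder_test` are hypothetical names, and a real framework would rewrite the test's source file rather than keep the recorded text in a variable.

```python
import contextlib
import io

def run_expect_test(test_body, expected):
    """Run a test, capture what it prints, and compare it with the recorded text.

    Returns (passed, actual). On a mismatch a real framework would offer
    `actual` as a correction to paste back into the test source.
    """
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        test_body()
    actual = buffer.getvalue()
    return actual == expected, actual

def adder_test():
    # Print the behaviour of a 4-bit adder, including the wrap-around corner case.
    for a, b in [(1, 2), (15, 1)]:
        print(f"{a} + {b} = {(a + b) & 0xF}")

# First run: nothing has been recorded yet, so the test fails and yields the output.
passed, actual = run_expect_test(adder_test, expected="")
recorded = actual  # stands in for pasting the output into the test file

# Later runs compare against the recorded text; any change in behaviour shows up as a diff.
passed_again, _ = run_expect_test(adder_test, expected=recorded)
```

Because the recorded output lives in the repository, a continuous integration run that rebuilds everything will flag the test as soon as someone changes the adder's behaviour.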
I think this is really interesting because it really does feel like
writing normal OCaml software at Jane Street when you're writing hardware.
And I think this is actually part of a more general phenomenon,
which is there are areas of technology
which have a kind of software mindset,
maybe you'd call it,
where things like continuous integration,
build systems, integrated testing, code review,
all of that is just part of the warp and weft of how you operate.
And there are areas where it's not like that.
And hardware is one of these areas
where that's just not the culture. It's not how people approach the problems. And this totally fits into
the tools. The existing tools are often GUI-based, and they let you do all of these things. You can
look at waveforms and run test benches and all of that. But it's not designed for this kind of
thoroughly integrated quality control process that is relatively common
in the software world. Another totally different area that I think has the same problem is networks.
When you think about how networks are set up, basically in most places, networks are managed
by dint of having extremely careful network engineers who go in and just reconfigure the
damn devices and try and do it
right almost every time. And they're amazingly good at it, but oh my God, is it not a way I
would want to live? And there's in fact a whole movement in the direction of software-defined
networks, which is essentially the same idea of trying to take the configuration and management
of networks and apply to it more or less the regular tool chain
that we are used to applying to software. So I think it's a very powerful and I think in many
ways underapplied way of improving various kinds of technological flows.
I should give some of these high-end hardware design tools their due. Languages like SystemVerilog and tools like ModelSim, if you spend enough
money, they start piling features on you with code coverage and checkerboards for simulation
coverage and automated tools for generating constrained random inputs. By that I mean you can specify the sort of shape of inputs you
want to put into your system and a solver in the tool will go away and generate this for you.
That's all very cool. It's all very, very expensive. I still don't think it's as good
as just having a decent software language in the first place.
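As a rough illustration of constrained random input generation in an ordinary software language, the sketch below uses plain rejection sampling in Python. Commercial tools use a constraint solver instead, so this is a deliberately simpler stand-in, and the message fields and the constraint are invented for the example.

```python
import random

def constrained_random(rng, fields, constraint, count, max_attempts=100_000):
    """Draw `count` random field assignments that satisfy `constraint`.

    This is rejection sampling: draw uniformly, keep what passes. It is a
    stand-in for the solver a commercial tool would use.
    """
    samples = []
    for _ in range(max_attempts):
        candidate = {name: rng.choice(choices) for name, choices in fields.items()}
        if constraint(candidate):
            samples.append(candidate)
            if len(samples) == count:
                return samples
    raise RuntimeError("constraint too hard to satisfy by rejection sampling")

# Hypothetical message shape: a kind and a length in bytes.
fields = {"kind": ["add", "cancel", "trade"], "length": range(0, 256)}

def constraint(message):
    # Lengths are word aligned; cancels are short, everything else is longer.
    if message["length"] % 4 != 0:
        return False
    if message["kind"] == "cancel":
        return message["length"] <= 16
    return message["length"] >= 32

samples = constrained_random(random.Random(0), fields, constraint, count=20)
```

Seeding the generator keeps a failing test reproducible, which matters once these inputs are driving a hardware simulation.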
So another thing that strikes me about the way in which you talk about
Hardcaml and the advantages of Hardcaml and of embedding in a language like OCaml
is you talk almost entirely about testing as opposed to the advantages at the level of the
actual hardware design. Can you say more about why that is? Why is the advantage so focused on the testing side?
I think it's because I consider the abstraction level of designing hardware in Hardcaml to be the same as the abstraction level of designing hardware in Verilog. We are working with the
same sorts of components. There is, however, a very big difference, which is with Verilog,
we just have a couple of primitives, like what are called parameters, basically numbers you can
use to configure your circuit, and special for loops, which can be used to generate multiple
copies of some part of your circuit, perhaps based on parameters,
and special ifs which can conditionally generate
parts of your circuit.
And you can get surprisingly far with those primitives
for creating configurable logic.
You certainly can go way further with Hardcaml.
That being said, there are parts of the design
where configuration doesn't really matter.
So I think the overall point was more, what is the abstraction level of designing circuits
in Hardcaml?
I kind of don't focus on Hardcaml as being especially better than Verilog because the
abstraction level is basically the same.
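The point about configurable logic can be illustrated with a small generator written in a host language. In this Python sketch, a `width` parameter and an ordinary `for` loop play the roles of a Verilog parameter and a generate loop, producing a ripple-carry adder netlist of any size. The netlist format, the wire names and the `simulate` helper are assumptions made for the example, not something taken from Hardcaml or Verilog.

```python
GATES = {
    "xor": lambda x, y: x ^ y,
    "and": lambda x, y: x & y,
    "or": lambda x, y: x | y,
}

def full_adder_cells(a, b, carry_in, total, carry_out):
    """Netlist for one full adder as (gate, output wire, input wires) tuples."""
    return [
        ("xor", f"{total}_x", (a, b)),
        ("xor", total, (f"{total}_x", carry_in)),
        ("and", f"{total}_g", (a, b)),
        ("and", f"{total}_p", (f"{total}_x", carry_in)),
        ("or", carry_out, (f"{total}_g", f"{total}_p")),
    ]

def ripple_adder(width):
    """Generate `width` chained full adders; the loop stands in for a generate loop."""
    cells = []
    for i in range(width):
        carry_in = "cin" if i == 0 else f"c{i}"
        cells += full_adder_cells(f"a{i}", f"b{i}", carry_in, f"s{i}", f"c{i + 1}")
    return cells

def simulate(cells, wires):
    """Evaluate the netlist; cells are emitted in dependency order, so one pass is enough."""
    wires = dict(wires)
    for gate, output, (x, y) in cells:
        wires[output] = GATES[gate](wires[x], wires[y])
    return wires

def add(width, a, b):
    """Drive the generated adder with two integers and read back sum plus carry out."""
    wires = {"cin": 0}
    for i in range(width):
        wires[f"a{i}"] = (a >> i) & 1
        wires[f"b{i}"] = (b >> i) & 1
    result = simulate(ripple_adder(width), wires)
    total = sum(result[f"s{i}"] << i for i in range(width))
    return total | (result[f"c{width}"] << width)
```

The generated circuit is still just gates and wires, which matches the claim that the abstraction level of the design itself is unchanged; what the host language adds is freedom in how the circuit gets generated.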
It's a really interesting problem though.
I mean, trying to raise the abstraction level of
hardware design has been on academia's mind since the seventies. They've been trying to do it for
like 50 years and there are only really two successful outcomes from all that work. I would
say one is like high level synthesis, which I'm not sure I define as particularly successful,
but the other is a language called Bluespec, which has a whole new model for writing hardware,
which I think is absolutely fascinating and a brilliant idea.
And they tried for 15 years to get people to use this thing, and they finally just gave
it away free, which I think reflects really poorly,
actually, on hardware designers in general. Like, here are new good ideas. If this was software, we'd
be all over them, right? Why aren't we using these good ideas when they come up?
In defense of hardware engineers, I feel like there are lots of great ideas in software that take an abominably long
time to be picked up. I think my favorite example of this
is garbage collection, which was invented in the mid-50s and hit the mainstream in, say,
1995 with Java. So that's a good 40-year gap. So perhaps we should give the hardware engineers a
break. Maybe you're right. Maybe we're all not trying hard enough.
And the problem of coming up with these languages for hardware is a harder problem, and it has taken longer to achieve kind of reasonable things to
point to.
Yes. So I'll describe a little bit about how these systems work. So high-level synthesis,
basically what it does is it takes C code and then creates parallel hardware designs from that C code.
I just fundamentally think that's a bad idea. It's analyzing a serial
instruction stream and trying to extract parallelism from it. But why would you choose C
to do that with? There is one reason why it's taken off. It's that the hardware design engineers
will tend to know C. And so you're giving them a language they can actually use to create hardware
designs with. And while I knock it, I think in its domain,
it can be incredibly good.
Streaming DSP style designs,
things where you're doing a lot of operations
like additions within for loops.
The fact that this can be turned into hardware
is still very, very cool.
And you can put all sorts of compiler hints within your input C code
so that you can achieve different sets
of performance targets from the same input code.
And I think that is actually quite powerful,
the fact that for certain types of designs,
it can create a range of architectures for you for free.
It's just that it's not clear
that there are that many sort of design domains
it's that good at.
On the other hand, you've got Bluespec, which is based on this notion of atomic actions.
An atomic action is basically a rule which has a predicate and a function which reads the current state and updates it.
And this rule will fire when its predicate is true.
And the entire system is basically a big long list of these rules.
And the model it follows is that it will non-deterministically choose one of the rules
that can currently fire and execute it. Once that's done, it will non-deterministically
choose another rule, execute it and go back. The compiler is super smart and it knows the dependencies between the rules and will create a scheduler which can fire these rules in blocks.
So you get hardware parallelism.
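The one-rule-at-a-time model described above can be sketched directly in Python. This is an illustration of the semantics only, not Bluespec syntax or its compiler: `run_rules` and the two example rules are hypothetical, and the random choice stands in for the non-deterministic pick.

```python
import random

def run_rules(state, rules, rng):
    """Repeatedly pick any one enabled rule and apply it atomically, until none can fire."""
    while True:
        enabled = [action for guard, action in rules if guard(state)]
        if not enabled:
            return state
        state = rng.choice(enabled)(state)  # the non-deterministic choice

# Each rule is a (predicate, action) pair. An action reads the current state and
# returns the updated state in one atomic step.
rules = [
    (lambda s: s["a"] > 0, lambda s: {**s, "a": s["a"] - 1, "total": s["total"] + 1}),
    (lambda s: s["b"] > 0, lambda s: {**s, "b": s["b"] - 1, "total": s["total"] + 1}),
]

# Different seeds interleave the two rules differently.
final_states = [
    run_rules({"a": 2, "b": 3, "total": 0}, rules, random.Random(seed))
    for seed in range(5)
]
```

In this example every interleaving ends in the same state, because each rule runs atomically and neither disables the other. That property is what lets a compiler fire several rules in the same clock cycle without changing the result.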
That's sort of the basic underlying set of technology for executing Bluespec-style circuits. On top of that, what they've done is built something that looks like an object-oriented
programming model for different modules within your design to interact with each other.
So in Hardcaml, we basically have signals that we send to and from modules. Quite often,
signals are related to each other. You might have a valid signal that is related to a data bus.
And it's really important that those signals align properly. In Bluespec, you can just call a method on a module and it does all the wiring for you. Underneath, it's still going to produce
a hardware architecture at the level of Hardcaml. But when you're programming with it,
you can just like call functions. That's just incredibly powerful. Your function
might be, add this data to the FIFO. What you don't have to care about when you do that is
whether the FIFO is full. The hardware will deal with all of that stuff for you. It will just hold
off the rule until it can actually be executed. So maybe a way of describing the difference is
that Hardcaml is built around this little core calculus in the middle of it, which is this thing that kind of represents the heart of something like Verilog or VHDL, which more or less has the core circuit design.
And then you write a bunch of OCaml code for generating stuff in this language, in this underlying calculus.
And the code for doing the generation can be very modular and generic. So for example, we have protocol specification languages
where we write down in a different domain-specific language
some specification for some hardware packet we want to parse.
And then we emit OCaml code for interfacing with that data
off of the back end of that.
And then we can take that same specification
that we used in a software context
and use it to emit hardware.
So that's like a highly leveraged,
very generic thing that you can do.
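A minimal sketch of that idea in Python: one field specification drives both a software parser and the bit-range layout that a hardware generator would need in order to pick the right wires off a bus. The message format, field names and helper functions here are all hypothetical, not the internal specification language being described.

```python
# Hypothetical packet specification: (field name, width in bits), first field first.
SPEC = [("msg_type", 8), ("symbol_id", 16), ("price", 32), ("quantity", 16)]

def layout(spec):
    """Bit range (high, low) of each field, with the first field in the top bits.

    A hardware generator could use these ranges to select wires from a bus.
    """
    high = sum(width for _, width in spec) - 1
    ranges = {}
    for name, width in spec:
        ranges[name] = (high, high - width + 1)
        high -= width
    return ranges

def make_parser(spec):
    """Derive a software parser from the same specification (assumes whole bytes)."""
    ranges = layout(spec)
    size = sum(width for _, width in spec) // 8

    def parse(data):
        word = int.from_bytes(data[:size], "big")
        return {
            name: (word >> low) & ((1 << (high - low + 1)) - 1)
            for name, (high, low) in ranges.items()
        }

    return parse

parse = make_parser(SPEC)
packet = (
    bytes([0x41])
    + (7).to_bytes(2, "big")
    + (12345).to_bytes(4, "big")
    + (100).to_bytes(2, "big")
)
message = parse(packet)
```

Because both outputs come from `SPEC`, a change to the packet format only has to be made in one place.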
But the things that you emit in the end are not composable. You can generate them using a modular
and composable system, but the thing at the bottom is this kind of messy circuit thing.
And then with something like Bluespec, you have essentially a higher level of representation
that's more abstraction friendly. It's easier to build components that can be combined together,
but there's still some extra computation
that has to be done that takes that kind of representation
and converts it down to essentially
the wire level representation
that you need to really generate an FPGA,
which is equivalent to the representation
that Hardcaml uses natively.
So one thing that makes me nervous about this whole story
is the thing that you said at the heart of this description
of these atomic guarded actions is non-determinism. They non-deterministically apply rules. Does that
mean that when you design something in a Bluespec-style system, you end up with something that has
fewer deterministic guarantees? I'm guessing here a little bit. I have not written a lot of
Bluespec. I've mainly read papers on it. But it tends to be that it doesn't have to make many
non-deterministic choices. The model is one rule fires at a time and you choose it
non-deterministically. The reality is hundreds of rules fire at a time, all the ones that are
currently enabled. And then it builds schedulers, which guarantee fairness among rules.
So it will, like the same circuit,
you can't really have non-determinism in an FPGA, right?
It's going to be deterministic at some level.
And the compiler will make it deterministic.
And then you have pragmas,
which you apply in your source code
to guide the compiler to make the right choices
when it's
picking amongst rules. So when there was like a non-deterministic choice for it, it'll either
pick a fair schedule or you can guide it to make certain rules more important than other ones.
It's where, actually, the practical reality of Bluespec is not quite as beautiful as the core
calculus suggests. There are like these little hacks that have to go on to make it work in reality,
but then everything's a compromise, right?
Nothing new there.
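Here is a rough Python sketch of how a deterministic schedule might resolve those choices. It is a deliberate simplification and not how the Bluespec compiler works: each hypothetical `Rule` declares what it reads and writes, every enabled rule that does not conflict with a higher-priority rule fires in the same cycle, and the order of the rule list plays the role of a priority hint. The FIFO example also shows a guard holding a rule off until it can execute.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    guard: Callable[[dict], bool]
    action: Callable[[dict], dict]   # returns only the state elements it updates
    reads: frozenset
    writes: frozenset

def run_cycle(state, rules):
    """Fire every enabled rule that does not conflict with a higher-priority one.

    `rules` is in priority order, highest first. Two rules conflict when one
    writes something the other reads or writes (a conservative assumption).
    """
    fired, written, touched = [], set(), set()
    for rule in rules:
        if not rule.guard(state):
            continue
        footprint = rule.reads | rule.writes
        if rule.writes & touched or footprint & written:
            continue
        fired.append(rule)
        written |= rule.writes
        touched |= footprint
    updates = {}
    for rule in fired:
        updates.update(rule.action(state))  # every rule sees the start-of-cycle state
    return {**state, **updates}, [rule.name for rule in fired]

def run(state, rules, cycles):
    for _ in range(cycles):
        state, _fired = run_cycle(state, rules)
    return state

# A two-entry FIFO: the guards hold each rule off until it can actually execute.
produce = Rule("produce", lambda s: len(s["fifo"]) < 2,
               lambda s: {"fifo": s["fifo"] + (s["next"],), "next": s["next"] + 1},
               frozenset({"fifo", "next"}), frozenset({"fifo", "next"}))
consume = Rule("consume", lambda s: len(s["fifo"]) > 0,
               lambda s: {"fifo": s["fifo"][1:], "out": s["out"] + (s["fifo"][0],)},
               frozenset({"fifo", "out"}), frozenset({"fifo", "out"}))
tick = Rule("tick", lambda s: True, lambda s: {"ticks": s["ticks"] + 1},
            frozenset({"ticks"}), frozenset({"ticks"}))

start = {"fifo": (), "next": 0, "out": (), "ticks": 0}
producer_first = run(start, [produce, consume, tick], cycles=4)
consumer_first = run(start, [consume, produce, tick], cycles=4)
```

Both priority orders are fully deterministic, but they produce different behaviour, which is the kind of choice the pragmas are there to make. The independent `tick` rule fires every cycle alongside whichever FIFO rule wins.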
So are there any advantages that you see
of the approach that Hardcaml takes
over the Bluespec approach?
It seems like there are clear benefits
to the Bluespec-style system.
Is there anything, I mean,
one obvious advantage of Hardcaml
is it's embedded in OCaml
and that makes for very smooth integration
with the rest of our software stack.
But I'm wondering, just kind of at the more abstract design level, if there are any benefits of the Hardcaml-style approach.
I think even designers using Bluespec would have a language like Verilog or Hardcaml for the cases where they need absolutely precise control over the function of a bit of logic.
So it seems to me the only way that you can improve on the standard model of hardware design,
where you have absolutely full control, is to hand some control to the compiler.
And what that tends to mean is you no longer control precisely the wiggling of the signals.
And there are cases where you have to control it,
like when you're interfacing with a DDR memory, for example.
It is the case that Hardcaml is at a level where it's like what you design is what you get.
It's exact.
You control everything.
You know, I think you need that ability.
And you give up some of that when you go to a
higher level of abstraction for sure. But for a lot of logic inside the design, I think generally
we will end up using Bluespec at Jane Street because of the abstraction, whether that's
directly using Bluespec, the open source compiler, or trying to build some model of that technology ourselves. I'm not too sure.
But I think we really do care about abstractions here. And the fact we have none in hardware is
just annoying. Everyone else has them. I want them too. Here's another language-related question
about all of this. If you look at the world of hardware design, we are not unique in having
something that looks sort of like Hardcaml.
So there's a library in Scala called Chisel, which has similar goals and aspirations.
There's also a bunch of work on doing similar things in Python.
And Scala is a language which is in any case relatively similar to OCaml.
But I'm kind of curious what you think about the trade-off between
using OCaml for a system like this and using Python.
They're both software systems.
Both, I think, are better approaches to designing hardware than trying to use VHDL and Verilog.
I think the frameworks that I've seen there, MyHDL, that's a useful system.
I've seen it more used in the test space than in the actual design space. There's a new
one called PyMTL. They've actually done something quite interesting in sort of building a framework
in which you can plug models at different levels of abstraction into your system. So you can start
with basically a high-level Python implementation of a system and refine parts of it, but not all of
it at once. You can work on one part down to the gate level and then work on another part and move
it down through levels of abstraction. I think that's actually quite interesting. It's something
we can also do in Hardcaml, although we haven't kind of formalized doing that with an API, but we
have all the sort of hooks that we would need to do something like that. Chisel is another example, which is a system that's very like Hardcaml,
but written in Scala as the host language. One of the big advantages of Chisel is that it's
actually taught at a university, at Berkeley. And there's quite a lot of IP around Chisel, especially to do with RISC-V CPU designs.
I think there's another area where functional programming particularly
shines for hardware design. Actually, a lot of the problem in designing is creating what we
call combinational logic. And that is basically functional logic. When you write it, all you're
doing is take a function which
takes some inputs, does something with them, transforms them to some outputs, and you compose
them all together. OCaml is extremely elegant at expressing that sort of thing.
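The observation that combinational logic is just function composition can be shown with a small Python sketch (the conversation is about OCaml, but the shape of the idea is the same). The `mux2` and `mux_tree` functions are hypothetical examples: a two-input multiplexer is a pure function, and a larger multiplexer is built purely by composing it.

```python
def mux2(select, a, b):
    """Two-input multiplexer: returns a when select is 0 and b when select is 1."""
    return b if select else a

def mux_tree(select_bits, inputs):
    """Build a 2**n input multiplexer by composing layers of mux2.

    `select_bits` is least significant bit first, and `inputs` must have
    exactly 2 ** len(select_bits) entries.
    """
    layer = list(inputs)
    for bit in select_bits:
        layer = [mux2(bit, layer[i], layer[i + 1]) for i in range(0, len(layer), 2)]
    return layer[0]

def to_bits(value, width):
    """Split an integer into `width` bits, least significant first."""
    return [(value >> i) & 1 for i in range(width)]

inputs = ["i0", "i1", "i2", "i3", "i4", "i5", "i6", "i7"]
selected = [mux_tree(to_bits(select, 3), inputs) for select in range(8)]
```

Nothing here holds state, so each function maps directly onto a block of combinational logic, and composing functions corresponds to wiring the blocks together.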
A thing you mentioned along the way there was that there's not yet any university
churning out Hardcaml engineers. Hardcaml is an open source project. There's
some amount of public communication and discussion that you've done over the years about it. We
continue to release new versions of it, reflecting the work that we've done here. What are your hopes
and aspirations about Hardcaml as an open source project? Well, I would like people to use it for
sure. I think it's going to be hard to get people to use it. Over the years, there have been maybe
three people who have come along, looked at it, thought, wow, that's cool, and actually used it
in anger and contributed stuff back. I'd like more people to use it, but I think we're lacking,
well, we're lacking a couple of things. First of all, we haven't put out enough of our libraries,
although that should be changing literally this week. I've just opened about 11 new Hardcaml libraries, which gets basically our internal tooling for Hardcaml
out into the real world. But where we're still lacking a little bit is actually realistic
designs built with Hardcaml that are sitting out there for people to learn from, to make decisions as to whether
they think the framework is worthwhile using. And we would like to release a lot more of this stuff,
but it becomes a bit harder to unwind what code we want to open source, what code we don't want
to open source. But we will try and do that. So yeah, I think the onus is really on me to provide a bit better open source set of
libraries so that people can really come along and use it in anger.
Chisel does better than us, as they've got this enormous RISC-V hardware design framework
that they've released open source.
So if you want to go and learn how Chisel works, there's an enormous body of code for
you to go out and do that.
And most of the stuff that we build internally is just not stuff that's of general interest.
Yeah, unfortunately.
I think that's true.
We've talked some about how hardware can be useful more generally.
How does hardware come up and what kind of problems do you see us addressing in the kind
of financial and trading context in which we operate?
So really the focus for us is around network packet processing.
I'd be surprised if we end up writing an FPGA design that isn't at some fundamental level
connected to a 10 or 25 gig network and processing packets.
A platform that we've been working with the last year or so is basically like a special
network interface card with an FPGA on it.
And the packets flow into it.
It can flow up to
the host. We've got full standard driver layers for this card, but we can also put our custom
logic in it. Now, when you think about like a generic network interface card, all it can really
know about are the generic protocols of networking. So the IPv4 protocol, the TCP protocol, the Ethernet
layer, and they can do some work for you here.
They'll insert checksums for you. They might route packets or filter packets based on IPv4 fields.
But when you get into the actual data within the packet, they can't do anything generic with it
because it can be anything. However, we can write designs which can look into that data because we know what we're
connected to. And one example of that is for a specific exchange with a specific packet format,
we can actually ingress their market data, pull it apart, not just the IP level, but actually all
the way into the packet data, and then do some filtering or
splitting or reconstruction of that data in various different ways.
One is we split it into different groups and send it out of different interfaces, which
means that downstream systems have to see less data.
Another way we could do that is conditionally sending parts of packets up through the PCIe bus to some
host software, which reduces the load on both the bus and on the software and the amount of data it
has to look at. And there are a number of other, sort of similar styles of system where,
because we can customize it for the specific link, we get to choose
what we do with the data in there.
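As a software model of the kind of splitting and filtering described here, the Python sketch below looks inside each message and decides which output interface it should leave on. Everything about the format is invented for illustration: the header layout, the `b"H"` heartbeat type and the rule of splitting by symbol id are assumptions, not a real exchange protocol or the actual design.

```python
import struct

# Hypothetical message header: 1-byte type, 4-byte symbol id, 2-byte payload length.
HEADER = struct.Struct(">cIH")

def route(message, interfaces):
    """Return the output interface for a message, or None to drop it."""
    kind, symbol_id, _length = HEADER.unpack_from(message)
    if kind == b"H":                  # hypothetical heartbeat: nothing downstream needs it
        return None
    return symbol_id % interfaces     # split the symbols into groups, one per interface

def split(messages, interfaces):
    """Distribute messages over the output interfaces, dropping filtered ones."""
    outputs = [[] for _ in range(interfaces)]
    for message in messages:
        port = route(message, interfaces)
        if port is not None:
            outputs[port].append(message)
    return outputs

messages = [
    HEADER.pack(b"A", 10, 0),   # symbol 10 goes to interface 2 of 4
    HEADER.pack(b"H", 0, 0),    # heartbeat is dropped
    HEADER.pack(b"T", 7, 0),    # symbol 7 goes to interface 3 of 4
]
outputs = split(messages, interfaces=4)
```

Each downstream system then only sees the group of symbols it cares about, which is where the reduction in load comes from.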
And I think the background fact about trading that justifies all this is that
trading systems have to consume an enormous amount of market data that comes from a bunch
of different exchanges at shockingly high rates.
The US markets will peak at several million messages per second at the busiest moment
in the day.
And you want to be able to chew through those messages quite quickly and being able to have
different processes that see different subsets of the data, have some of that transformation
happen off of the CPU and on more specialized and more efficient hardware is just a big
step up in the kind of
performance miracles that you can reasonably achieve.
Yeah, and the problem is worse than that,
right? Because you're not connected to one of these data sources. You might be connected to eight of
these data sources. And, you know, there are issues there. Like, eight data sources, well, that's multiple
cards for a start. You've got to parse all these things in software. That's like a hundred gig of data near enough.
And it's just so easy to see CPUs getting behind in that case. And they do. And it's annoying
because it always happens when the markets are busiest, which is when you do your best trading.
And so hardware, yeah, definitely can make a real difference there because it can chew through 100 gig of data.
It doesn't care.
You just put eight cores down there, one for each of the connections, and they're just
going to chew through it.
They won't slow down.
They'll just do their job.
One of the magic tricks here of hardware is that the determinism is such that if you understand
the size of your problem, you can just know that your design will be able to successfully
get through all that
data no matter what they throw at you. There's just an upper limit of how much information can
come across a 10 gig NIC. There is a bound and you can be confident that you can chew through all
that data at line rate. And so you're just not going to be surprised. You're just not going to
fall over when things get busy, at least at the hardware level itself. Yeah. So when we
build our tests for these systems, we basically test them
always at line rate, just back-to-back packets all the time.
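The upper bound mentioned here can be worked out with a little arithmetic. On Ethernet, every frame on the wire is accompanied by 8 bytes of preamble and start-of-frame delimiter and a 12-byte inter-frame gap, so back-to-back minimum-size 64-byte frames bound the packet rate. The sketch below does that calculation in Python; the 156.25 MHz clock is an assumption chosen for illustration, not a figure from the conversation.

```python
PREAMBLE_AND_SFD_BYTES = 8    # sent before every Ethernet frame
INTERFRAME_GAP_BYTES = 12     # minimum idle time between frames, in byte times

def max_frames_per_second(link_bits_per_second, frame_bytes):
    """Worst-case frame rate when frames arrive back to back at line rate."""
    wire_bytes = frame_bytes + PREAMBLE_AND_SFD_BYTES + INTERFRAME_GAP_BYTES
    return link_bits_per_second / (wire_bytes * 8)

def cycles_per_frame(clock_hz, link_bits_per_second, frame_bytes):
    """Clock cycles a design has per frame if it must keep up at line rate."""
    return clock_hz / max_frames_per_second(link_bits_per_second, frame_bytes)

# Minimum-size (64-byte) frames on a 10 gigabit link.
worst_case_rate = max_frames_per_second(10e9, 64)

# Assumed clock of 156.25 MHz, purely for illustration.
budget = cycles_per_frame(156.25e6, 10e9, 64)
```

A design that can handle one minimum-size frame within that cycle budget, every time, cannot be overrun by the link, which is why testing with back-to-back packets covers the worst case.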
I wonder if we could be doing a better job of actually stress testing our software systems
to really see where they start to fall over.
So I guess at this point I just want to thank you for joining me.
This has been a really fun conversation.
My pleasure.
You can find links to some of the things we talked about, as well as a full transcript of the episode at signalsandthreads.com. You can also find some blog posts that Andy has written
about Hardcaml on blog.janestreet.com, and the core libraries and tools are open source
and available for you to try out on GitHub. Thanks for joining us, and see you next week.