Signals and Threads - Programmable hardware with Andy Ray

Starting point is 00:00:00 Welcome to Signals and Threads, in-depth conversations about every layer of the tech stack from Jane Street. I'm Ron Minsky. So today, we're going to have a conversation about hardware, and in particular about how you can take the tools that come out of the world of chip design and apply them to a much broader space of problems than people typically think they can be applied to. And I'm joined in this conversation by Andy Ray. Hello, Ron.

Starting point is 00:00:31 Good to be chatting with you. Andy is a longtime veteran of the hardware industry. He spent over a decade building real shippable hardware designs, working on things like modems and video codecs. And along the way, he also did a lot of interesting work exploring and eventually designing his own alternative languages for expressing hardware designs. The final one was called Hardcaml, which is a hardware design language embedded in OCaml, which itself is the primary programming language we

Starting point is 00:01:00 use here at Jane Street. And that work actually led him to us, and today he works here and leads Jane Street's hardware design team. And so, to start with Andy, maybe you can tell us a little bit about why hardware is useful for a technology organization and an organization like ours, and what advantages it has over

Starting point is 00:01:21 traditional software style approaches. Sure. So, hardware allows you to build customized architectures for a specific problem, which can be tuned to trade off at a very fine level, lots of things like performance and cost and power usage. That lets you design a range of different products. Whereas with a CPU, you're very much more limited in the software world

Starting point is 00:01:53 to the CPU design that can meet the performance of the problem domain. I think the sort of problems it can solve are very, very broad. And you can see that just because, well, a CPU is a hardware design, in fact. And you can create all sorts of hybrid designs with multiple CPUs or digital signal processors or custom hardware blocks that make up your final solution. So I think that's

Starting point is 00:02:19 why hardware exists, why it will always exist. It's the fact you can build architectures entirely suited to your problem domain that optimize along these sort of areas. So that description on the face of it sounds awesome. And in fact, it sounds from what you said so far, strictly superior to writing software. I don't think that's quite true. Can you say more about what the downsides are of operating inside of a hardware context? Oh my goodness. Yes, there are a lot. So it's fundamentally this, like hardware designs are much more difficult to write than equivalent software. So all that flexibility in choosing the architecture for your problem domain, you actually have to implement that. In software, you have reams and reams of support libraries that either your organization has developed,

Starting point is 00:03:04 or that you can pull in from open source or that you can go and purchase. To some extent, that infrastructure works in hardware with an idea of intellectual property suppliers. They're basically just companies who supply a hardware design for you to integrate into your system. That's actually the job that I used to do when we were developing Video Codecs. And just to interrupt for a second, that was a bit of terminology that really confused me when I first encountered the hardware world. When people in hardware say IP,

Starting point is 00:03:35 they mean something like when a software person says library. Correct. Which is to say some component that somebody else wrote that you get to integrate. Except in this case, the component is a bundle of wires that you kind of plop into your design rather than something that looks sort of like a module or a library. Yeah, that's right.

Starting point is 00:03:52 I don't know why that terminology came about, but it's just always been called IP when you buy a hardware library design. So as I say, there's some sort of infrastructure there for buying external blocks to integrate with your hardware. It's a vastly smaller ecosystem than we have in software. It's vastly more expensive. There is in the last maybe 10 years more of an open source community around providing hardware blocks that you can integrate, but it's still absolutely minuscule compared to software.

Starting point is 00:04:32 And then just the process of writing hardware is slow and detailed. And I'm going to say difficult, I'm not so sure it is really technically that difficult. It's just that it's so detailed and you're dealing with such big systems that it becomes a real problem trying to manage the complexity of all these very simple bits that sit together. Right. I think of that as one of the paradoxes of hardware. Hardware is, in the micro, in many different ways, simpler than software. Yes. The thing that you're generating in a hardware design is essentially some layout of the circuits of the kind of individual gates and wires that connect them. And it's some kind of fairly static graph that represents the structure of the computation and is converted into like, you know, when you actually get one of these fabricated, actual bits of material laid out on a

Starting point is 00:05:18 physical surface. And understanding how those individual pieces work, at least logically how they work, leaving the physics aside, is relatively simple. But then having a big design that does a lot of these things is enormously hard to reason about. It is. And unfortunately, the abstraction tools we have, they take us some way. So, you know, you talked about a chip design, which is like a, you can think of it as a layout of just two things, really. Lots and lots of NAND gates. So a Boolean AND function with the output NOTed

Starting point is 00:05:50 and metal wires that connect them together. And it's interesting because this is a universal Boolean function. Any other Boolean function can be computed with the NAND function. That's not true of AND, for example. You can't create an OR with an AND, but you can create it with a NAND. And there are, I think, 16 Boolean functions, and four of them are universal. I think NAND and NOR are quite often the basis of technology. NOR being an OR gate with the output inverted. We don't actually think about writing

Starting point is 00:06:26 circuits at the level of just interconnected NAND gates. An interesting aside, I believe the first ARM processor was basically designed that way, but actually even lower. They were drawing the transistors for the NAND gates in just a graphics package. That's how they created the very first ARM processor, but that's not how they do it now. So we're a little bit above that. We work with a tool called a synthesizer and it takes a slightly more abstract notion of a hardware design in which we can think about components like adders and multiplexers. And the job of the synthesizer will be to turn those components into the actual low-level hardware components for the chip, which might be NAND gates if you're doing an ASIC, might be lookup tables for an FPGA.
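As a rough illustration of the two ideas above (NAND as a universal gate, and a synthesizer lowering something like an adder to NAND gates or to lookup tables), here is a minimal Python sketch. It is not how any real synthesizer works; the function names, the 4-bit default width and the dictionary used as a lookup table are all assumptions made for the example.

```python
from itertools import product


def nand(a: int, b: int) -> int:
    """The only primitive gate: AND with the output inverted."""
    return 1 - (a & b)


# Every other gate below is wired purely from NAND.
def not_(a: int) -> int:
    return nand(a, a)


def and_(a: int, b: int) -> int:
    return not_(nand(a, b))


def or_(a: int, b: int) -> int:
    return nand(not_(a), not_(b))


def xor_(a: int, b: int) -> int:
    n = nand(a, b)
    return nand(nand(a, n), nand(b, n))


def full_adder(a: int, b: int, carry_in: int) -> tuple:
    """One-bit adder built from the gates above: returns (sum, carry_out)."""
    partial = xor_(a, b)
    total = xor_(partial, carry_in)
    carry_out = or_(and_(a, b), and_(partial, carry_in))
    return total, carry_out


def add(x: int, y: int, width: int = 4) -> int:
    """Ripple-carry adder: chains `width` full adders; the result wraps to `width` bits."""
    carry, result = 0, 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result


# FPGA-flavoured alternative: the same full adder stored as a lookup table,
# i.e. a precomputed truth table indexed by the three input bits.
FULL_ADDER_LUT = {bits: full_adder(*bits) for bits in product((0, 1), repeat=3)}
```

Calling `add(5, 6)` exercises only NAND gates underneath, while `FULL_ADDER_LUT[(1, 0, 1)]` answers the same one-bit question by table lookup, which is roughly the ASIC versus FPGA distinction being drawn here.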

Starting point is 00:07:15 But that being said, it's not massively above building it with NAND gates. But really, a lot of the industry just works at this sort of level of putting together these macros, which represent adders and multiplexers and multipliers and registers, and just wiring them together and getting them to form some function. So you were talking there about ASICs and FPGAs. Can you just quickly explain what those are? An ASIC is a custom-made chip that can perform a single function. In contrast, an FPGA is a reprogrammable chip that can be programmed to perform many different functions. Got it. So when you talked about what the advantages are of hardware, you talked about how by having much more control, you get to really optimize the things you care about, be it power consumption or performance or latency or whatever it is of the system. Can you put a little bit of meat on the bones of that? What is the scale of the

Starting point is 00:08:18 improvements that you can get by taking something that you might do in software and moving it into a hardware design? It's obviously going to depend on the sort of problem you're trying to solve. But an area I know really well, video coding, I used to work on H.264 a lot, and there was this really good software implementation called x264, which was almost entirely written in assembly, used the SSE instruction set, which is a vector instruction set on the x86 processor. And it could just about manage like real-time 1080p on modern Intel processors of the time, which were like four gigahertz processors. In order for it to achieve that sort of performance, you had to turn off lots and lots and lots of

Starting point is 00:09:05 codec features. If you're willing to go non-real time, you could turn on all sorts of features that would compress it better. Now there's an extremely large standard for H.264. I used to read a lot. There's a lot of features on that thing, but you just can't do them all. And so we used to target different markets. There was like a sort of low-end video codec, which could fit in smaller FPGAs, could be used for like internet-based communication. And there was like really high-end video codecs, which were built over multiple FPGAs,

Starting point is 00:09:38 had a really high-end feature set and was used for professional-grade coding. So that's the sort of video that you get over your satellite link or over your cable link. That bit stream is like compressed as much as it possibly can be so they can fit more channels into that link. We didn't have to compromise so much on the features

Starting point is 00:09:56 to do that with hardware. We could pick the features that made the difference that got us to the bit rate and they could run in real time. And you just couldn't do that real time in software. In that world, you were looking at an order of magnitude more computation being done by these FPGAs. These sort of things can scale massively. There are chips which do packet processing, for example. The idea here is you've got a switch and you want to do

Starting point is 00:10:26 packet processing to detect threats in those packets, to route these packets, all that sort of thing. Right, so-called deep packet inspection, right, is the term of art here. Yep. And there are absolutely gigantic hardware designs which ingress hundreds of network ports, process all this through one chip, clean up the network packets and pass them on into big organizations. Can you imagine just how many x86s you would need to do that job? Seems to be like a lot. Part of the advantage is there's like an order of magnitude or more improvement in the bulk throughput of the system that you get there.

Starting point is 00:11:00 There's also a big improvement in latency. The time in and out of these things is much lower. And I think another thing that's interesting is that it's much more deterministic. Yeah. Which is to say, you can build one of these designs such that it can simply consume everything that is presented to it over a 10 gig network, and you know in advance that if your design essentially compiles, if you can lay it out on the chip, it just works. Which means the whole thing is simpler. In a software design, you get much less predictability, which means you essentially have to improve the reliability by adding layers of buffering so that you can hold onto packets for a while if you can't quite get to them in time.

Starting point is 00:11:40 That's true. And I think it's a problem that we have to deal with here at Jane Street a lot. And that's non-determinism of our software processing systems. And there's really very little you can do about it. That's sort of not true. There are some drastic solutions to this with processors where you strip out your operating system. But, you know, there's an awful lot of infrastructure you'd have to build to do that. Whereas, you know, designing a hardware architecture that's specifically for a particular task, you get some really nice graphs out of this when you compare an equivalent hardware system to a software system. Latencies drop right down. And then determinism, you know, we have wonderful ones where

Starting point is 00:12:26 we do the 10th percentile, the 20th percentile, up to the 99th percentile. And the latency variance for the hardware system will be like 20 nanoseconds across that entire range. Whereas in software, it's really good up to the 99th percentile, and then it goes into the microseconds. And there just tends not to be anything you can do about that in software. You mentioned that some of the ways of getting rid of the non-determinism involve tweaking your operating system and things like that. But there's also some aspect of the non-determinism, which is just fundamental to the trade-off between hardware and software.
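The percentile comparison described here can be sketched in a few lines of Python. This is only an illustration of the measurement, using the nearest-rank definition of a percentile; the function names are hypothetical and any samples fed to it below are made up, not figures from the conversation.

```python
def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it (p is an integer 1..100)."""
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, avoiding floating-point rounding.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def latency_spread(samples, low=10, high=99):
    """Difference between a high and a low percentile, as a crude measure
    of how deterministic a system's latency is."""
    return percentile(samples, high) - percentile(samples, low)


def percentile_table(samples, points=(10, 20, 50, 90, 99)):
    """Latency at each requested percentile, in the same unit as the samples."""
    return {p: percentile(samples, p) for p in points}
```

Feeding `latency_spread` a list of measured latencies in nanoseconds gives a single number to compare between two systems; a tight spread is the behaviour being attributed to the hardware design.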

Starting point is 00:13:00 A thing that I just didn't appreciate about ordinary software, as it were, is what a bizarre magic trick a modern CPU is playing, which is to say CPUs are fundamentally parallel machines. They have all these circuits that can fire up to power constraints at the same time, so you can do lots and lots of things in parallel. And then somehow we feed it this incredibly sequential programming language, the machine language of the architecture that you're in. And then the chip is doing a lot of work to execute that as quickly as possible and essentially trying to leverage

Starting point is 00:13:35 all of these parallel resources that it has. And some of that is done by doing speculation where you essentially make guesses as to what the software is going to decide to do in the future. So you can dispatch operations ahead of time in parallel. And some of that is done by prefetching information. But this is essentially an enormously complicated pile of heuristics, which means that you really don't have a good tool set for reasoning ahead of time about the performance. Whereas

Starting point is 00:14:05 a hardware design, if it works, it does every clock cycle exactly the thing that you expect it to do in that clock cycle. Yeah, that's generally true. I mean, I should say that there is non-determinism possible in hardware designs. So we've done a few designs here where there has been basically no non-determinism in the system. That's not quite true: there's a tiny bit around interaction between clocks in the design, and a tiny bit around how 10 gig Ethernet data is packed at the lowest electrical layer, but that adds up to like plus or minus 10 nanoseconds or something. So one way to think about what's happening in hardware is you've got this enormously, massively threaded system

Starting point is 00:14:48 where each thread just does one very simple operation. And when I say enormous, I mean millions of these threads. But unlike software, they update in a very simple way, which is they all have their current value, which they send to each other as necessary, depending on how they're connected. And a clock tick happens. And on that clock tick,

Starting point is 00:15:11 all the threads will read the old values of everything else, compute new values, and then the thing steps again and again and again. And so there's just this much simpler sequencing of all these parallel operations happening within hardware. That is a massive simplification because we have multiple clock domains and other horrible things to deal with in reality. But locally, that is kind of how it is. That being said, we're now starting to look at designs which

Starting point is 00:15:40 use DDR memory. And there you start to introduce a small amount of non-determinism because there's a bunch of rules necessary to access memory these days. We model them in our mind as just this 2D array and you go and get one cell, take this long, go and get another cell, take this long, and it's all equal and we just don't worry about it too much.

Starting point is 00:16:03 But that's just not true. That's not how these things work at all. They're little chip designs themselves, which you have to send commands to. You say like, go and open the address range over there. And there are rules about how many of these address ranges you can have open at one time and a big banking structure around it.
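To make the "open an address range first" behaviour concrete, here is a toy Python model. It is a sketch under loud assumptions: one open row per bank, a flat cost for a hit and a miss, and invented values for `row_size`, `banks`, `hit_cycles` and `miss_cycles`. None of these are real DDR timings, and refresh is not modelled.

```python
def access_cost(addresses, row_size=1024, banks=8, hit_cycles=1, miss_cycles=10):
    """Count cycles for a sequence of accesses under a toy open-row model.

    Each bank keeps at most one row open. An access to the currently open
    row is cheap (a hit); an access to any other row in that bank has to
    open it first (a miss). All parameters are illustrative placeholders.
    """
    open_row = {}  # bank index -> row currently open in that bank
    cycles = 0
    for addr in addresses:
        row = addr // row_size
        bank = row % banks
        if open_row.get(bank) == row:
            cycles += hit_cycles
        else:
            cycles += miss_cycles
            open_row[bank] = row
    return cycles


# Same number of accesses, very different cost depending on ordering.
sequential = list(range(4096))   # walks through four rows in order
thrashing = [0, 8192] * 2048     # alternates between two rows in the same bank
```

Comparing `access_cost(sequential)` with `access_cost(thrashing)` shows the effect of access ordering in this toy model: the second pattern pays the row-open cost on every single access.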

Starting point is 00:16:21 And so ordering your accesses to these memories is really, really important. It could be the difference between getting like 80% of the potential bandwidth out of them compared to like 5% of the potential bandwidth out of them. And the other thing they do is they just occasionally shut off and do this operation called refresh, and you just can't access it then. And so we are going to hit these sort of non-determinisms. They're going to start adding variance into our numbers. But I still think it's going to be immensely more manageable than the numbers we get out of operating systems, which have so many other non-determinisms than just accessing memory, like switching processes, switching cores, all that sort of thing. In some sense, it's an issue of what the defaults are. The core language that you're working in when

Starting point is 00:17:09 you're building hardware is a deterministic language. And then in various places, you have to interact with other systems. And weirdly, we think of the RAM in the same box that the FPGA is as another system. You have to reach out over the network inside the computer, essentially, to interact with it. And that thing might be non-deterministic and that adds non-determinism to your system. And also you might on purpose as an optimization add non-determinism to a design.
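The deterministic clock-tick model described a little earlier, where every register reads the old values of everything else and then all of them update together, can be sketched like this in Python. It is a simplified single-clock model, not Hardcaml or any real simulator; the register names and the `tick` and `run` helpers are hypothetical.

```python
def tick(state, logic):
    """Advance one clock cycle.

    `state` maps register names to their current values. `logic` maps each
    register name to a function computing its next value from the OLD state.
    Every next value is computed before any register is updated, so the
    order in which registers are listed does not matter.
    """
    return {name: next_value(state) for name, next_value in logic.items()}


def run(state, logic, cycles):
    """Apply `tick` a fixed number of times and return the final state."""
    for _ in range(cycles):
        state = tick(state, logic)
    return state


# Hypothetical two-register design: a 4-bit counter, plus a register that
# always holds the counter's value from the previous cycle.
counter_logic = {
    "count": lambda s: (s["count"] + 1) % 16,
    "delayed": lambda s: s["count"],
}

# Two registers that exchange values on every tick; this only works because
# both read the old state before either one is overwritten.
swap_logic = {
    "a": lambda s: s["b"],
    "b": lambda s: s["a"],
}
```

Running the same logic from the same starting state for the same number of cycles always gives the same result, which is the sense in which the core model is deterministic.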

Starting point is 00:17:36 But the core language that you're starting with is deterministic at its heart. Whereas running on a CPU just is like in the presence of other things running on that CPU is non-deterministic and hard to reason about the timing in a way that is to some degree just unavoidable. Yeah, that's right. So this overall story of why hardware is different and why it's useful and why it lets you

Starting point is 00:17:57 achieve goals that are really hard to achieve in software seems in many ways very compelling. But I think if you've never heard of this world before, there's one enormous problem that sounds like it comes up, which is, do you actually have to fabricate custom hardware every time you want to make a change? One of the great things about software is you write it, and then when you change your mind about how it works, you update the code, you compile, you roll a new version, and poof, you have a modified version of the system. And it turns out you can get some of this in the hardware world through various forms of what are called reconfigurable hardware. Can you tell us a little bit about that, and I guess in particular FPGAs, which are the kind of most common form of

Starting point is 00:18:39 this and the one that we use? Yes. So to start with an FPGA, it stands for Field Programmable Gate Array. An FPGA consists of a matrix of elements called LUTs, which stands for Lookup Table. And each of these LUTs can implement an arbitrary Boolean function. Alongside that is what's called programmable routing, which allows these LUTs to be connected together in an almost completely arbitrary way. And so an FPGA design is effectively a static configuration of these LUTs wired together to perform some function. Now, it's actually a bit more complicated than that. There are other components involved, but roughly speaking, they work kind of the same way. They're laid out on the chip in a grid fashion,

Starting point is 00:19:30 and they're wired together with the programmable routing. It's kind of a chip platform for, how should you put it, emulating circuit designs. By that, I mean they're programmed in the same way that a proper fabricated application-specific circuit would be done. So you start with a hardware design, you would go through this extremely complicated

Starting point is 00:19:58 set of tools that create some sort of technology representation of your input circuit, and in the ASIC world, that would get sent off to a fab where it would be cooked and immersed in acid and lasers fired at it and magic would happen and you'd get back this chip. That whole process could take anywhere from, you know, a few weeks if you're in mass production to six months to get your first example chip back. FPGAs, the big advantage is you can just reset the FPGA and load a new design and then reset it again and load another new design. And that's what the field programmable part of its name means. It means you can deploy this thing, and then maybe you find a bug, maybe you do a version two, and you can just deploy a new version of your chip and it can be running in the field the next

Starting point is 00:20:56 day. Whereas with an ASIC, you would have to go and refabricate an entirely different chip. You'd have to pull back the old hardware, send out new hardware. And by the way, you've been using the term ASIC. So it stands for Application Specific Integrated Circuit. It's the term we tend to use for hardware designs that have gone to a foundry. Now, a foundry is just an enormous factory which takes customer-initiated designs and puts them through, as I say, lots of complicated chemical and physical processes to embed a hardware design on a piece of silicon. So it sounds like the key advantage of FPGAs is that they are reconfigurable. They are.

Starting point is 00:21:37 What do you lose for that? In what way are ASICs better than FPGAs? On basically every performance front, ASICs are superior. The power that an ASIC will use could be three to 10 times less for the same design. The amount of area that we use will be an order of magnitude less. The frequency that you can run your design at will be significantly higher. ASICs are really good if you could, first of all, afford to make them, you don't want to upgrade your design, and you've got decent volume for them. You're not going to get like 40 ASICs made.

Starting point is 00:22:13 That would be utterly ridiculous. You need to be thinking of like 40 million ASICs being made for it to start making commercial sense. Right, and then the economics are in your favor in that the cost per unit is much, much smaller than the FPGA. Oh, yeah. The economics of this is all very interesting in the sense that one thing one can be struck by is how big the gap is between FPGAs and ASICs. The thing that has always struck me is how small the gap is, that this borderline ridiculous thing of, I'm just going to lay down a bunch of stuff on a chip in advance and then have it configure

Starting point is 00:22:41 itself to look like some circuit. The fact that you can get anywhere near what a real fabricated ASIC can do, I think part of it is this economic point you were making about the FPGAs are one of these very high volume things, so they can be built with the absolute best technology. They cost a lot more. They cost quite a lot per unit. They really do. But if you want to have a small number of them, there's no comparison.

Starting point is 00:23:04 It's way better to get a small number of FPGAs, which you can make all individual, make them do whatever they want, change them whenever you feel like changing them. It's a kind of threshold issue. It's transformative. Without this kind of flexibility, you essentially couldn't use hardware designs for a wide variety of technology problems. And with them, you can. Even the economics of just FPGAs is really interesting, actually.
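The volume threshold being discussed can be written down as simple arithmetic. The sketch below assumes a cost model that is not spelled out in the conversation: the ASIC route has some one-off upfront cost plus a low per-unit cost, and the FPGA route has only a higher per-unit cost. The function name and all parameters are hypothetical, and any numbers passed in are placeholders rather than real prices.

```python
def break_even_units(asic_upfront_cost, asic_unit_cost, fpga_unit_cost):
    """Smallest volume at which the ASIC route costs no more in total than FPGAs.

    Total cost model (an assumption for illustration):
        ASIC: asic_upfront_cost + units * asic_unit_cost
        FPGA: units * fpga_unit_cost
    Costs are whole numbers in any one currency unit.
    Returns None if the ASIC never catches up.
    """
    saving_per_unit = fpga_unit_cost - asic_unit_cost
    if saving_per_unit <= 0:
        return None  # each ASIC costs at least as much as an FPGA
    # Integer ceiling division, avoiding floating-point rounding.
    return -(-asic_upfront_cost // saving_per_unit)
```

Below the returned volume the FPGA route is cheaper in this model, which matches the point that for a small number of units there is no comparison.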

Starting point is 00:23:27 If I went to try and buy the chip that we're using currently in the office, it would cost me maybe eight grand. But if you go and set up a deal with Xilinx that says you will take this many chips a month for the next two years, that thing would cost you 500 bucks. Wow. An absolutely outrageous difference. It's all built into them, you know, because their customers are foundries as well. They're buying time at foundries six months in advance. And the more they know about the volumes they have to produce, like the FPGA to produce it is probably not that

Starting point is 00:24:02 expensive. You know, maybe the end of the day, tens of dollars, something like that. But I guess it's very expensive for them to sit in warehouses doing nothing. So let's switch gears a little. I'd like to actually just understand a little bit more about your own background and talk about how you got involved in hardware in the first place. Can you give us a capsule summary of your involvement in the field? So the first time I was ever introduced to hardware was at university. I did a computer science and physics degree.

Starting point is 00:24:31 And in my final year, one of the elective courses I could take on the CS side was about hardware design. And it was only like a 12 week course, I think. And we did a couple of projects which were in VHDL. I think one was designing a multiplier, the other designing like this micro digital signal processor. And I'm not sure I enjoyed the course so much, but I really enjoyed the project work. I really, really liked that. And so after university, I had a set of career goals, listed the things I wanted to work in. And one of them was actually hardware design.

Starting point is 00:25:09 So I promptly left university, went off and did games programming instead. Didn't like that so much. And then after a couple of years, I ended up getting a job at an absolutely tiny embedded software and hardware IP company. I think I joined as the fourth employee. And I did some work there on like a C++ video codec for a few months. And then when we were going off to lunch, I mentioned to my boss that I was kind of interested in this hardware design stuff. He was like, oh yeah, that's good, yeah, we'd like to do more of that. And then I found a couple of weeks later, he'd gone out and got a contract for me to write a JPEG encoder and decoder on a Xilinx Virtex XCV800, the first range of Xilinx FPGAs. And I was young, I was stupid, so I was like, yeah, this is going to be fun. And it turned out it was, I enjoyed it. I'm not entirely proud of the code I wrote there, but it met all the design goals.

Starting point is 00:26:05 It hit the frequencies and the performance that it needed. So the customer couldn't complain. And I just kept doing that. I really, really loved doing that. I still remember the first time we took this thing and put it on an FPGA, which I think this was a card with an FPGA that actually sat on an ISA bus, if you remember them, and just brought this thing up. Well, it didn't work properly, properly, but it was doing real stuff that we expected. It was just like, wow, that is incredible. Months of work just sitting there thinking about how to build this thing, and then it's actually live on an FPGA. That is some feeling. I still love that. I love designing FPGAs.

Starting point is 00:26:46 I love bringing them up into real systems and seeing them work. So there's obvious delight in your voice describing this work. And I'm curious, what is it about hardware that you find so engaging as opposed to software work? I mean, I should say I enjoy writing software as well. But I do prefer writing hardware. I think it's the satisfaction you get when you have a working system. I think it's a function of the amount of effort you have to put in up front. There's this big, long time of coding and potential frustration and fixing stuff and doing simulations.

Starting point is 00:27:27 And finally, you've got something that you can put onto hardware. It's like, there aren't many shortcuts in hardware design. It's like with software, you can maybe get a bit of it written and do a bunch of testing and check it into your repository and have a little library for other people to use. There's none of that in FPGA design.

Starting point is 00:27:41 Like not until you've basically got the whole thing written, can you get any sort of payoff for this project. I don't know, that works for me. I like it. I get a big buzz out of getting FPGA designs working. I think there's just another aspect to it. I find that I have to build these mental models of what I'm creating. I'm writing it in code, but the code is an expression of a mental model I have of individual pieces of a hardware design, and then the systems scaled out and viewed as a whole. And that's something I just enjoy, a way I like thinking, I think.

Starting point is 00:28:21 Right, so it's this kind of graph-structured computation that you have floating around in your head. Yeah, something like that. Something like that. The models are kind of interesting. So as you sort of scale out, you're thinking about how components fit together. You're not thinking about the hundreds

Starting point is 00:28:34 of individual signals that are connecting them. You're thinking about, right, there's a data bus. It's that wide. This end's running at this clock speed. That end's running at that clock speed. What is the bandwidth I'm getting across that? And then you scale out with other components and you're trying to hit your constraints of the clock speeds

Starting point is 00:28:53 of the RAM and of the PCIe bus and making this thing such that, you know, data can flow in the front end and out to the back end with nothing stalling, but using just the right amount of resources. And that's like system level modeling. And it's all done in my head and Visio occasionally. So that's how you got into this business of doing hardware. So what led you from there to start experimenting with alternative hardware design languages?
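The kind of back-of-the-envelope check described here (bus width times clock speed at each end, and whether data can flow without stalling) might be sketched as follows. The function names are hypothetical and the example figures in the usage note are illustrative only, not taken from any design mentioned in the conversation.

```python
def bandwidth_bits_per_second(width_bits, clock_hz):
    """Peak bandwidth of a bus that moves `width_bits` on every clock cycle."""
    return width_bits * clock_hz


def can_sustain(producer, consumer):
    """True if the consumer side can absorb everything the producer sends.

    Each side is a (width_bits, clock_hz) pair. This ignores protocol
    overhead, bursts and buffering, so it is only a first sanity check.
    """
    return bandwidth_bits_per_second(*consumer) >= bandwidth_bits_per_second(*producer)


def utilisation_percent(producer, consumer):
    """How much of the consumer's peak bandwidth the producer would use, rounded down."""
    return 100 * bandwidth_bits_per_second(*producer) // bandwidth_bits_per_second(*consumer)
```

For example, `can_sustain((64, 156_250_000), (256, 250_000_000))` asks whether a 64-bit bus at 156.25 MHz can feed into a 256-bit bus at 250 MHz, and `utilisation_percent` on the same pair says how much headroom is left, which relates to the goal of using just the right amount of resources.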

Starting point is 00:29:20 It was frustration with the tools that we were using to build hardware, in particular testing stuff. And so languages like Verilog and VHDL, which are these hardware design languages that most chips in the world are built with, there are a few other options. I guess these days, really, Verilog is the dominant hardware design language. And you do two tasks in this language. The first one is you write down the hardware design that you want. The second thing you do is build little tests for that hardware design. And we have to do, like, in hardware, we do this at every level of abstraction of the design, or every, like,

Starting point is 00:30:03 layer of the design, from the smallest components all the way up through the hierarchy to the biggest components. We're writing these test benches all the time and testing the corner cases of bits of hardware, then testing how multiple bits fit together and what happens. And that was hard work in Verilog. It's not a real programming language. The writing of test benches is a software task, but Verilog does not have a software core. I think they've kind of improved this a little bit with languages like SystemVerilog, but I still think fundamentally it's not a very good software language. Yet over half your job is testing and it's trying to build these little software frameworks to test your hardware design. And I just got very frustrated with that. I thought there must be better ways. So I started looking around, seeing what was out

Starting point is 00:30:58 there. I came across actually, I think this must be about 2003, this guy had written a compiler in OCaml for a language he called Confluence, which was a new style RTL or hardware design language, which was based on functional concepts. And because this is the hardware design world, almost nobody looked at it. He tried to set up a business. It was clearly better in many respects than the stuff that was there. Nobody bought it. And I think out of frustration, after a while, he gave up on that. And he threw together this little OCaml module package and stuck it on the internet.

Starting point is 00:31:41 And that was called HDCaml. And I saw that come out and something clicked in my head. I was like, I really like the concepts of Confluence. This is the same thing in a real programming language. That is amazing. I was like straight on there. I want to do it this way. And so I started digging into that thing and I produced a bunch of external libraries for it. I added wires into HDCaml. And then after a while, I just basically rewrote it. That was in OCaml still. I then did a version in F#. That was largely just a work-related thing. We used Windows systems, and OCaml to this day is not the most friendly Windows-based compiler in the world. That is a true fact. Although people are working on it still. I know they are. So I used F#. I liked F#, actually. And then I managed to get switched over to Linux again at work. And then I started writing in earnest, I guess,

Starting point is 00:32:37 what is Hardcaml today and was basically the third version of it that I've worked on. And I think by far the best designed version of it. And so, yeah, what did Hardcaml give me? Well, I love writing hardware in it. I think it provides some really nice abstractions for designing stuff, but that's not primarily why I love it. I really love it because I can write my test benches in OCaml. And I find OCaml an extremely pleasant language to write stuff in. For a number of reasons, it gives great abstractions. It's actually incredibly simple. It's like the core of OCaml is put together by just a few really, really orthogonal concepts

Starting point is 00:33:17 that you can stick together and do really powerful stuff with. And I think I'm massively more productive in writing and testing hardware using OCaml than I was in Verilog. So it sounds like part of the advantage there is just having the same functionality in some ways available in a really good general purpose programming language. I guess another thing that has struck me about Hardcaml is that it does a really good job of giving a point around which to coordinate lots of different kinds of tooling. How is Hardcaml structured? And what are the kind of flows that you can build on top of it? Hardcaml is basically an OCaml library.

Starting point is 00:34:04 Now, technically, computer scientists would call it an embedded domain-specific language, which it is. But at the end of the day, that doesn't really matter. It's a library which exposes a bunch of functions for describing hardware designs. And those functions are things like hardware adders of four bits, a 10-input multiplexer, a 32-bit register, that sort of thing. And then we basically use the host language, which is OCaml, to take these individual components and wire them together into a graph. So fundamentally, the design API of Hardcaml produces this graph. We then supply a bunch of tools which

Starting point is 00:34:47 could work on this graph and do some interesting things. The most important one is we can produce a simulation model of the hardware. So this is basically instantaneous. We need to be able to model the design while we're developing. And so Hardcaml provides its own simulator and a wider tool set, including a waveform viewer. So waveforms are a kind of graphical way of showing what the hardware is doing over multiple clock steps. So you monitor like multiple, what we tend to call signals within the design. A signal can be one bit, it can be 32 bits,

Starting point is 00:35:28 and these signals will be drawn out horizontally. And within a waveform, you might be able to see like 100 clock cycles worth of transitions for that signal, and also what that signal is doing relative to other ones. So one of the things that has been interesting at Jane Street is taking that flow and trying to make it fit in with the way we actually just develop software generally at Jane Street. So yes, the design work and the way we're thinking about hardware architectures is kind of different, but we've leveraged the very good build system technology at Jane Street and the editor integration that comes along with testing frameworks at Jane Street, which we write nearly all our tests using a framework called Expect Tests. So you write a little test module, you put

Starting point is 00:36:20 this bit of syntax around it, and you write a little test that prints out some result. And then the framework will take that result and paste it back into your test code. And then you can check your test and its result into the repository. And our continuous integration systems will constantly rebuild all our code all the time. The really useful thing about this

Starting point is 00:36:43 is if something somewhere changes and it happens to break your test, you know about this immediately. And in fact, the person who broke it gets to fix your test, which is even better. I think this is really interesting because it really does feel like writing normal OCaml software at Jane Street when you're writing hardware. And I think this is actually part of a more general phenomenon, which is there are areas of technology

Starting point is 00:37:07 which have a kind of software mindset, maybe you'd call it, where things like continuous integration, build systems, integrated testing, code review, all of that is just part of the warp and weft of how you operate. And there are areas where it's not like that. And hardware is one of these areas where that's just not the culture. It's not how people approach the problems. And this totally fits into

Starting point is 00:37:30 the tools. The existing tools are often GUI-based, and they let you do all of these things. You can look at waveforms and run test benches and all of that. But it's not designed for this kind of thoroughly integrated quality control process that is relatively common in the software world. Another totally different area that I think has the same problem is networks. When you think about how networks are set up, basically in most places, networks are managed by dint of having extremely careful network engineers who go in and just reconfigure the damn devices and try and do it right almost every time. And they're amazingly good at it, but oh my God, is it not a way I

Starting point is 00:38:11 would want to live? And there's in fact a whole movement in the direction of software-defined networks, which is essentially the same idea of trying to take the configuration and management of networks and apply to it more or less the regular tool chain that we are used to applying to software. So I think it's a very powerful and I think in many ways underapplied way of improving various kinds of technological flows. I should give some of these high-end hardware design tools their due. Languages like SystemVerilog and tools like ModelSim, if you spend enough money, they start piling features on you with code coverage and checkerboards for simulation coverage and automated tools for generating constrained random inputs. By that I mean you can specify the sort of shape of inputs you

Starting point is 00:39:08 want to put into your system and a solver in the tool will go away and generate this for you. That's all very cool. It's all very, very expensive. I still don't think it's as good as just having a decent software language in the first place. So another thing that strikes me about the way in which you talk about Hardcaml and the advantages of Hardcaml and of embedding in a language like OCaml is you talk almost entirely about testing as opposed to the advantages at the level of the actual hardware design. Can you say more about why that is? Why is the advantage so focused on the testing side? I think it's because I consider the abstraction level of designing hardware in Hardcaml to be the same as the abstraction level of designing hardware in Verilog. We are working with the

Starting point is 00:40:00 same sorts of components. There is, however, a very big difference, which is with Verilog, we just have a couple of primitives, like what are called parameters, basically numbers you can use to configure your circuit, and special for loops, which can be used to generate multiple copies of some part of your circuit, perhaps based on parameters, and special ifs which can conditionally generate parts of your circuit. And you can get surprisingly far with those primitives for creating configurable logic.
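As a rough illustration of what generating configurable logic from a host language can look like, here is a minimal sketch in plain OCaml. It is not the real Hardcaml API: the `signal` type and the functions `adder_tree`, `eval`, `sum_of_inputs` and `count_adders` are hypothetical names invented for this example. The idea it shows is that a function argument can play the role of a Verilog parameter, and ordinary recursion can play the role of a generate loop, while the result is still just a graph of adders and multiplexers that a simple simulator can walk.

```ocaml
(* A toy circuit graph, NOT the real Hardcaml API: every name here is
   hypothetical and chosen only for illustration. *)
type signal =
  | Input of string * int            (* name, width in bits *)
  | Const of int * int               (* value, width in bits *)
  | Add of signal * signal           (* adder node; result wraps at the width *)
  | Mux of signal * signal * signal  (* select, value if 0, value otherwise *)

let rec width = function
  | Input (_, w) | Const (_, w) -> w
  | Add (a, _) -> width a
  | Mux (_, a, _) -> width a

(* Ordinary recursion plays the role of a Verilog generate loop: it builds a
   balanced adder tree for any non-zero number of inputs. *)
let rec adder_tree = function
  | [] -> invalid_arg "adder_tree: no inputs"
  | [ s ] -> s
  | signals ->
    let rec pair = function
      | a :: b :: rest -> Add (a, b) :: pair rest
      | leftover -> leftover
    in
    adder_tree (pair signals)

(* A tiny simulator: evaluate the graph given values for the named inputs. *)
let rec eval env s =
  let mask w v = v land ((1 lsl w) - 1) in
  match s with
  | Input (name, w) -> mask w (List.assoc name env)
  | Const (v, w) -> mask w v
  | Add (a, b) -> mask (width a) (eval env a + eval env b)
  | Mux (sel, a, b) -> if eval env sel = 0 then eval env a else eval env b

(* The "parameters" are just function arguments. *)
let sum_of_inputs ~n ~bits =
  adder_tree (List.init n (fun i -> Input (Printf.sprintf "x%d" i, bits)))

(* Count the adder nodes in the tree, to see what was generated. *)
let rec count_adders = function
  | Input _ | Const _ -> 0
  | Add (a, b) -> 1 + count_adders a + count_adders b
  | Mux (s, a, b) -> count_adders s + count_adders a + count_adders b

let () =
  let circuit = sum_of_inputs ~n:5 ~bits:8 in
  let env = List.init 5 (fun i -> (Printf.sprintf "x%d" i, 60)) in
  Printf.printf "adders=%d result=%d\n" (count_adders circuit) (eval env circuit)
```

Changing `~n` or `~bits` yields a different circuit from the same description, and the test bench (here just `eval` plus a print) is written in the same language as the generator.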

Starting point is 00:40:36 You certainly can go way further with Hardcaml. That being said, there are parts of the design where configuration doesn't really matter. So I think the overall point was more, what is the abstraction level of designing circuits in Hardcaml? I kind of don't focus on Hardcaml as being especially better than Verilog because the abstraction level is basically the same. It's a really interesting problem though.

Starting point is 00:41:02 I mean, trying to raise the abstraction level of hardware design has been on academia's mind since the seventies. They've been trying to do it for like 50 years and there are only really two successful outcomes from all that work. I would say one is like high-level synthesis, which I'm not sure I define as particularly successful, but the other is a language called Bluespec, which has a whole new model for writing hardware, which I think is absolutely fascinating and a brilliant idea. And they tried for 15 years to get people to use this thing, and they finally just gave it away free, which I think reflects really poorly

Starting point is 00:41:46 actually on hardware designers in general. Like, here's new good ideas. If this was software, we'd be all over them, right? Why aren't we using these good ideas when they come up? In defense of hardware engineers, I feel like there are lots of great ideas in software that take an abominably long time to be picked up. I think my favorite example of this is garbage collection, which was invented in the mid-50s and hit the mainstream in, say, 1995 with Java. So that's a good 40-year gap. So perhaps we should give the hardware engineers a break. Maybe you're right. Maybe we're all not trying hard enough. And the problem of coming up with these languages for hardware is a harder problem and has taken longer to achieve kind of reasonable things to

Starting point is 00:42:29 point out. Yes. So I'll describe a little bit about how these systems work. So high-level synthesis, basically what it does is it takes C code and then creates parallel hardware designs from that C code. I just fundamentally think that's a bad idea. It's analyzing a serial instruction stream and trying to extract parallelism from it. But why would you choose C to do that with? There is one reason why it's taken off. It's that the hardware design engineers will tend to know C. And so you're giving them a language they can actually use to create hardware designs with. And while I knock it, I think in its domain, it can be incredibly good.

Starting point is 00:43:07 Streaming DSP style designs, things where you're doing a lot of operations like additions within for loops. The fact that this can be turned into hardware is still very, very cool. And you can put all sorts of compiler hints within your input C code so that you can achieve different sets of performance targets from the same input code.

Starting point is 00:43:33 And I think that is actually quite powerful, the fact that for certain types of designs, it can create a range of architectures for you for free. It's just that it's not clear that there are that many sort of design domains it's that good at. On the other hand, you've got Bluespec, which is based on this notion of atomic actions. An atomic action is basically a rule which has a predicate and a function which reads the current state and updates it.

Starting point is 00:43:58 And this rule will fire when its predicate is true. And the entire system is basically a big long list of these rules. And the model it follows is that it will non-deterministically choose one of the rules that can currently fire and execute it. Once that's done, it will non-deterministically choose another rule, execute it and go back. The compiler is super smart and it knows the dependencies between the rules and will create a scheduler which can fire these rules in blocks. So you get hardware parallelism. That's sort of the basic underlying set of technology for executing Bluespec-style circuits. On top of that, what they've done is built something that looks like an object-oriented programming model for different modules within your design to interact with each other.
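The one-rule-at-a-time model described above can be sketched in a few lines of plain OCaml. This is a toy model only, not Bluespec syntax and not how its compiler builds a scheduler: the `rule` record, the `step` and `run` functions and the FIFO example are all hypothetical names made up for illustration. Where the abstract model chooses non-deterministically among enabled rules, this sketch simply takes the first enabled rule in list order, which is one legal schedule among many.

```ocaml
(* A toy model of guarded atomic actions. This is not Bluespec syntax and not
   its compiler's scheduler; all names are hypothetical. *)
type 'state rule =
  { name : string
  ; guard : 'state -> bool      (* the predicate: may this rule fire? *)
  ; action : 'state -> 'state   (* atomically read the state and update it *)
  }

(* One step of the reference semantics: pick one enabled rule and fire it.
   The abstract model picks non-deterministically; here the choice is fixed
   to the first enabled rule in list order. *)
let step rules state =
  match List.find_opt (fun r -> r.guard state) rules with
  | None -> None
  | Some r -> Some (r.name, r.action state)

(* Example state: a bounded FIFO between a producer and a consumer. *)
type state =
  { pending : int list   (* items still to be produced *)
  ; fifo : int list      (* oldest item first *)
  ; received : int list  (* items consumed so far, newest first *)
  }

let capacity = 2

(* The guard holds the rule off while the FIFO is full. *)
let produce =
  { name = "produce"
  ; guard = (fun s -> s.pending <> [] && List.length s.fifo < capacity)
  ; action =
      (fun s ->
        match s.pending with
        | [] -> s
        | x :: rest -> { s with pending = rest; fifo = s.fifo @ [ x ] })
  }

(* The guard holds the rule off while the FIFO is empty. *)
let consume =
  { name = "consume"
  ; guard = (fun s -> s.fifo <> [])
  ; action =
      (fun s ->
        match s.fifo with
        | [] -> s
        | x :: rest -> { s with fifo = rest; received = x :: s.received })
  }

(* Fire rules until none is enabled, recording which rule fired at each step. *)
let rec run rules state trace =
  match step rules state with
  | None -> (state, List.rev trace)
  | Some (name, state') -> run rules state' (name :: trace)

let () =
  let init = { pending = [ 1; 2; 3 ]; fifo = []; received = [] } in
  let final, trace = run [ produce; consume ] init [] in
  Printf.printf "trace: %s\n" (String.concat " " trace);
  Printf.printf "received: %s\n"
    (String.concat " " (List.map string_of_int (List.rev final.received)))
```

Note that the code calling `produce` never checks whether the FIFO is full; the guard does that. A real compiler would go further and fire all non-conflicting enabled rules in the same clock cycle rather than one at a time.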

Starting point is 00:44:53 So in Hardcaml, we basically have signals that we send to and from modules. Quite often, signals are related to each other. You might have a valid signal that is related to a data bus. And it's really important that those signals align properly. In Bluespec, you can just call a method on a module and it does all the wiring for you. Underneath, it's still going to produce a hardware architecture at the level of Hardcaml. But when you're programming with it, you can just like call functions. That's just incredibly powerful. Your function might be, add this data to the FIFO. What you don't have to care about when you do that is whether the FIFO is full. The hardware will deal with all of that stuff for you. It will just hold off the rule until it can actually be executed. So maybe a way of describing the difference is

Starting point is 00:45:39 that Hardcaml is built around this little core calculus in the middle of it, which is this thing that kind of represents the heart of something like Verilog or VHDL, which more or less has the core circuit design. And then you write a bunch of OCaml code for generating stuff in this language, in this underlying calculus. And the code for doing the generation can be very modular and generic. So for example, we have protocol specification languages where we write down in a different domain-specific language some specification for some hardware packet we want to parse. And then we emit OCaml code for interfacing with that data off of the back end of that. And then we can take that same specification

Starting point is 00:46:17 that we used in a software context and use it to emit hardware. So that's like a highly leveraged, very generic thing that you can do. But the things that you emit in the end are not composable. You can generate them using a modular and composable system, but the thing at the bottom is this kind of messy circuit thing. And then with something like Bluespec, you have essentially a higher level of representation that's more abstraction friendly. It's easier to build components that can be combined together,

Starting point is 00:46:43 but there's still some extra computation that has to be done that takes that kind of representation and converts it down to essentially the wire-level representation that you need to really generate an FPGA, which is equivalent to the representation that Hardcaml uses natively. So one thing that makes me nervous about this whole story

Starting point is 00:47:01 is the thing that you said at the heart of this description of these atomic guarded actions is non-determinism. They non-deterministically apply rules. Does that mean that when you design something in a Bluespec-style system, you end up with something that has fewer deterministic guarantees? I'm guessing here a little bit. I have not written a lot of Bluespec. I've mainly read papers on it. But it tends to be that it doesn't have to make many non-deterministic choices. The model is one rule fires at a time and you choose it non-deterministically. The reality is hundreds of rules fire at a time, all the ones that are currently enabled. And then it builds schedulers, which guarantee fairness among rules.

Starting point is 00:47:48 So it will, like the same circuit, you can't really have non-determinism in an FPGA, right? It's going to be deterministic at some level. And the compiler will make it deterministic. And then you have pragmas, which you apply in your source code to guide the compiler to make the right choices when it's

Starting point is 00:48:06 picking amongst rules. So when there was like a non-deterministic choice for it, it'll either pick a fair schedule or you can guide it to make certain rules more important than other ones. It's where actually the practical reality of Bluespec is not quite as beautiful as the core calculus suggests. There are like these little hacks that have to go on to make it work in reality, but then everything's a compromise, right? Nothing new there. So are there any advantages that you see of the approach that Hardcaml takes

Starting point is 00:48:34 over the Bluespec approach? It seems like there are clear benefits to the Bluespec-style system. Is there anything, I mean, one obvious advantage of Hardcaml is it's embedded in OCaml and that makes for very smooth integration with the rest of our software stack.

Starting point is 00:48:47 But I'm wondering, just kind of at the more abstract design level, if there are any benefits of the Hardcaml-style approach. I think even designers using Bluespec would have a language like Verilog or Hardcaml for the cases where they need absolutely precise control over the function of a bit of logic. So it seems to me the only way that you can improve on the standard model of hardware design, where you have absolutely full control, is to hand some control to the compiler. And what that tends to mean is you no longer control precisely the wiggling of the signals. And there are cases where you have to control it, like when you're interfacing with a DDR memory, for example. It is the case that Hardcaml is at a level where it's like what you design is what you get.

Starting point is 00:49:38 It's exact. You control everything. You know, I think you need that ability. And you give up some of that when you go to a higher level of abstraction for sure. But for a lot of logic inside the design, I think generally we will end up using Bluespec at Jane Street because of the abstraction, whether that's directly using Bluespec, the open source compiler, or trying to build some model of that technology ourselves. I'm not too sure. But I think we really do care about abstractions here. And the fact we have none in hardware is

Starting point is 00:50:11 just annoying. Everyone else has them. I want them too. Here's another language-related question about all of this. If you look at the world of hardware design, we are not unique in having something that looks sort of like Hardcaml. So there's a library in Scala called Chisel, which has similar goals and aspirations. There's also a bunch of work on doing similar things in Python. And Scala is a language which is in any case relatively similar to OCaml. But I'm kind of curious what you think about the trade-off between using OCaml for a system like this and using Python.

Starting point is 00:50:47 They're both software systems. Both, I think, are better approaches to designing hardware than trying to use VHDL and Verilog. I think the frameworks that I've seen there, MyHDL, that's a useful system. I've seen it more used in the test space than in the actual design space. There's a new one called PyMTL. They've actually done something quite interesting in sort of building a framework in which you can plug models at different levels of abstraction into your system. So you can start with basically a high-level Python implementation of a system and refine parts of it, but not all of it at once. You can work on one part down to the gates level and then work on another part and move

Starting point is 00:51:31 it down through levels of abstraction. I think that's actually quite interesting. It's something we can also do in Hardcaml, although we haven't kind of formalized doing that with an API, but we have all the sort of hooks that we would need to do something like that. Chisel is another example, which is a system that's very like Hardcaml, but written in Scala as the host language. One of the big advantages of Chisel is that it's actually taught at a university, at Berkeley. And there's quite a lot of IP around Chisel, especially to do with RISC-V CPU designs. I think there's another area where functional programming particularly shines for hardware design. Actually, a lot of the problem in designing is creating what we call combinational logic. And that is basically functional logic. When you write it, all you're

Starting point is 00:52:23 doing is taking a function which takes some inputs, does something with them, transforms them to some outputs, and you compose them all together. OCaml is extremely elegant at expressing that sort of thing. A thing you mentioned along the way there was that there's not yet any university churning out Hardcaml engineers. Hardcaml is an open source project. There's some amount of public communication and discussion that you've done over the years about it. We continue to release new versions of it, reflecting the work that we've done here. What are your hopes and aspirations about Hardcaml as an open source project? Well, I would like people to use it for

Starting point is 00:52:59 sure. I think it's going to be hard to get people to use it. Over the years, there have been maybe three people who have come along, looked at it, thought, wow, that's cool, and actually used it in anger and contributed stuff back. I'd like more people to use it, but I think we're lacking, well, we're lacking a couple of things. First of all, we haven't put out enough of our libraries, although that should be changing literally this week. I've just opened about 11 new Hardcaml libraries, which gets basically our internal tooling for Hardcaml out into the real world. But where we're still lacking a little bit is actually realistic designs built with Hardcaml that are sitting out there for people to learn from, to make decisions as to whether they think the framework is worthwhile using. And we would like to release a lot more of this stuff,

Starting point is 00:53:51 but it becomes a bit harder to unwind what code we want to open source, what code we don't want to open source. But we will try and do that. So yeah, I think the onus is really on me to provide a bit better open source set of libraries so that people can really come along and use it in anger. Chisel does better than us, as they've got this enormous RISC-V hardware design framework that they've released open source. So if you want to go and learn how Chisel works, there's an enormous body of code for you to go out and do that. And most of the stuff that we build internally is just not stuff that's of general interest.

Starting point is 00:54:27 Yeah, unfortunately. I think that's true. We've talked some about how hardware can be useful more generally. How does hardware come up and what kind of problems do you see us addressing in the kind of financial and trading context in which we operate? So really the focus for us is around network packet processing. I'd be surprised if we end up writing an FPGA design that isn't at some fundamental level connected to a 10 or 25 gig network and processing packets.

Starting point is 00:54:53 A platform that we've been working with the last year or so is basically like a special network interface card with an FPGA on it. And the packets flow into it. They can flow up to the host. We've got full standard driver layers for this card, but we can also put our custom logic in it. Now, when you think about like a generic network interface card, all it can really know about are the generic protocols of networking. So the IPv4 protocol, the TCP protocol, the Ethernet layer, and they can do some work for you here.

Starting point is 00:55:27 They'll insert checksums for you. They might route packets or filter packets based on IPv4 fields. But when you get into the actual data within the packet, they can't do anything generic with it because it can be anything. However, we can write designs which can look into that data because we know what we're connected to. And one example of that is for a specific exchange with a specific packet format, we can actually ingress their market data, pull it apart, not just at the IP level, but actually all the way into the packet data, and then do some filtering or splitting or reconstruction of that data in various different ways. One is we split it into different groups and send it out of different interfaces, which

Starting point is 00:56:15 means that downstream systems have to see less data. Another way we could do that is conditionally sending parts of packets up through the PCIe bus to some host software, which reduces the load on both the bus and on the software and the amount of data it has to look at. And there are a number of other similar styles of system where, because we can customize it for the specific link, we get to choose what we do with the data in there. And I think the background fact about trading that justifies all this is that trading systems have to consume an enormous amount of market data that comes from a bunch

Starting point is 00:56:57 of different exchanges at shockingly high rates. The US markets will peak at several million messages per second at the busiest moment in the day. And you want to be able to chew through those messages quite quickly, and being able to have different processes that see different subsets of the data, have some of that transformation happen off of the CPU and on more specialized and more efficient hardware is just a big step up in the kind of performance miracles that you can reasonably achieve. Yeah, and the problem is worse than that,

Starting point is 00:57:31 right? Because you're not connected to one of these data sources. You might be connected to eight of these data sources. And, you know, there are issues there. Like, eight data sources, well, that's multiple cards for a start. You've got to parse all these things in software. That's like a hundred gig of data near enough. And it's just so easy to see CPUs getting behind in that case. And they do. And it's annoying because it always happens when the markets are busiest, which is when you do your best trading. And so hardware, yeah, definitely can make a real difference there because it can chew through 100 gig of data. It doesn't care. You just put eight cores down there, one for each of the connections, and they're just

Starting point is 00:58:11 going to chew through it. They won't slow down. They'll just do their job. One of the magic tricks here of hardware is that the determinism is such that if you understand the size of your problem, you can just know that your design will be able to successfully get through all that data no matter what they throw at you. There's just an upper limit of how much information can come across a 10 gig NIC. There is a bound and you can be confident that you can chew through all

Starting point is 00:58:35 that data at line rate. And so you're just not going to be surprised. You're just not going to fall over when things get busy, at least at the hardware level itself. Yeah. So like where we build our tests for these systems, we basically test them always at line rate, just back-to-back packets all the time. I wonder if we could be doing a better job of actually stress testing our software systems to really see where they start to fall over. So I guess at this point I just want to thank you for joining me. This has been a really fun conversation.

Starting point is 00:59:02 My pleasure. You can find links to some of the things we talked about, as well as a full transcript of the episode at signalsandthreads.com. You can also find some blog posts that Andy has written about Hardcaml on blog.janestreet.com, and the core libraries and tools are open source and available for you to try out on GitHub. Thanks for joining us, and see you next week.
