Hardware-Conscious Data Processing (ST 2023) - tele-TASK - Field Programmable Gate Arrays (2)
Episode Date: July 12, 2023...
Transcript
Welcome everybody to our next session. Today we're really going to talk about FPGAs.
But before that, just an overview of where we are. This is the second-to-last topic that we're
going to discuss. The last topic will be CXL, which is something very new, fresh off the
press: a technology that's just been started, and we're fortunate to
already have some resources to work on it, so we can tell
you a bit about this new protocol and interconnect. Marcel will do that in one of the next sessions.
Here's kind of the overview of what we're going to do.
We'll see, maybe I can finish up FPGAs today, maybe not.
If I don't finish, we're going to finish on the 19th
and then end with some Q&A, or maybe I'll already do
the summary there, because, as I said, the last
week I'm not going to be here. We will have one more task to do, the lock-free data structure:
a lock-free skip list, which should actually be quite fun and interesting to implement. And, I mean, we talked about this, there are many different ways of implementing
lock-free data structures.
We'll give you one way, one example of how to do this, and then you can
test yourself whether you can make it more efficient, et cetera.
And then, I mean, I know not everybody's here today,
but who would actually be interested in the data center
tour?
Ah, cool.
So I'll probably ask Marcel to organize this,
because I'm not here.
But this will be fun.
You'll see all the hardware.
So maybe reach out to Biosyn soon to set this up.
It's getting increasingly complicated
to get into the data center because there's
lots of security measures in there.
But we have good access.
So you can actually go there and see the hardware,
but better not touch it.
It's not necessarily that you're not allowed to touch it,
but you might get hurt if you do.
And we can also show you a CXL server there.
But for that, I guess you shouldn't take pictures, right?
But you can see it.
So that's actually also cool. Right. And yeah, we might swap around the last couple of sessions a bit.
If something changes there, you will get the info through Moodle, etc. Okay, so today, we'll see how far we get.
For sure, I'll do the FPGA introduction,
so you'll see what an FPGA is,
and then we'll also see the architecture,
get a bit of an overview in programming.
So, unfortunately, I haven't gone as deep there
as I would have liked.
So, I mean, we will iterate on this further in the next
sessions, and there's a lot of good work and a lot of material on this that you can find.
But you will get a hint of how this works and a bit of understanding. This session is really more about the architecture.
So you get an idea how this architecture
or this kind of processor is different
from a regular processor.
So an understanding of how you program this
or how it works internally.
The way you program it is a bit more high level.
Of course, you could program it at this low level,
but that would be super hard.
There's a lot of tooling around it, but you should get an understanding of why it works the way it works. That's my idea. Then we'll talk briefly about the design flow, and then about how to use it
in data processing. This might run over to next session, we'll see. Okay, so some of this
you already know, but this is kind of always the same reasons why we
have to do what we have to do, right? So you already know we're getting
bound by physics. We cannot keep scaling down the circuitry; it's just not economical
anymore. And we basically cannot put ever more power into the CPUs because we have increasing
leakage, meaning that the denser the CPU gets, the more power we have
to put in just to keep it running.
This is shown by these trends.
For a long time, we were just using single-core, we were just scaling down the circuitry in size, basically,
and just increasing the frequencies,
and with that would get better and better performance.
And this, to some degree, stopped in the early 2000s. So then, all of a sudden, we started increasing the number of logical cores
instead of increasing the frequency further.
What made the frequency scaling work before is called Dennard scaling.
By just keeping the same chip space
and just decreasing the circuitry,
we can keep the current constant
and use the same kind of power, but pack
everything more densely.
But at a certain point, this stops because of the leakage current.
So then there's basically the problem that we cannot decrease the size anymore because
we just have to put more power into it.
And a current CPU basically is like a heating plate, right?
If you have a stove at home,
this is basically what's going on in the CPU.
And we're just putting like cooling system on top of it
to just remove this kind of power.
So it's really like you turn on your stove at home
and then put a cooler on top, or a ventilator,
to somehow keep it at room temperature.
That's basically your CPU today in terms of power density.
And that's not super economical, of course.
So we somehow need to move somewhere else.
So one idea initially was more cores, but this also has, as I said,
some limitations because of the end of Dennard scaling. The idea of Dennard scaling is that we double
the transistor density, but if we keep the same chip area, the power consumption stays the same.
And this stops because, as I said, of the power leakage, or the current leakage.
And so this means we cannot increasingly reduce the size.
We cannot increasingly increase the density.
So one way is just add more cores.
But at a certain point, this also doesn't work,
because we cannot keep up with the power consumption.
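As a back-of-the-envelope sketch of why this worked and why it stopped (standard textbook formulas, not shown in the lecture):

```latex
% Dynamic power of switching circuitry
% (activity factor \alpha, capacitance C, supply voltage V, clock frequency f):
P_{\mathrm{dyn}} \approx \alpha \, C \, V^2 \, f

% Dennard scaling: shrink the feature size by 1/\kappa, so per transistor
%   C \to C/\kappa, \quad V \to V/\kappa, \quad f \to \kappa f,
% and the power per transistor becomes
P' \approx \alpha \, \frac{C}{\kappa} \left(\frac{V}{\kappa}\right)^2 \kappa f
   = \frac{P_{\mathrm{dyn}}}{\kappa^2}

% With \kappa^2 times more transistors in the same area, power density
% stays constant. This holds until leakage current, which does not scale
% down with the voltage, breaks the assumption.
```

The point is that only the dynamic part of the power scales down with the transistors; the leakage part does not, which is exactly the problem described above.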
So, and one way to deal with this
is basically just not use all of the cores at the same time.
So you've seen this in the M1, for example, right?
We have the efficiency cores
and we have the performance cores,
and we can do either or,
and we then basically,
we don't use everything at the same time.
And there's also like specific architectures
like ARM Big Little,
where you have also big cores and small cores,
and we're just using part of the chip space,
and by that, basically also reducing the overall power consumption.
The unused parts of the chip are called dark silicon.
Using this dark silicon, we can still improve things:
we can specialize part of the hardware,
but we're not really using the chip that efficiently.
So this basically means we have a problem,
and we need an alternative.
And the alternative is basically going
into ever more specialized devices.
So I showed you this last time already, right?
So we see we have this very flexible CPUs
and multi-core CPUs today. Then we have more specialized hardware and accelerators like GPUs
and there's other accelerators, especially for machine learning today, like TPUs. And then we
can get even more specialized. An FPGA is a kind of hardware that you can really specialize
for a certain type of application, but that you can also reconfigure. Then, at the far end,
there would be the ASIC, the application-specific integrated circuit. This means you build a chip for exactly one application. And if you're into
real high-frequency stuff, like high-frequency trading, Bitcoin mining, whatever, all the good
stuff: wherever there's a lot of money, people
will put these very tightly integrated chips in there.
Or cases where you just need high throughput networking.
So for networking, you will also build specialized circuitry just to do that fast.
But this will be only for one type of processing.
And this is also often a problem.
So if you have networking, your protocols change, you need new hardware because you cannot change it.
An FPGA is kind of a nice middle ground because it can be highly
specialized but it can also be adapted. You can also change it. This is what
we're going to talk about today. The field programmable gate array type of processors.
So here's a comparison between an ASIC and an FPGA. As we said, the ASIC is hardware, a processor, that's specifically designed for one type of application.
Of course, you need some kind of flexibility to adjust things,
but you will really tightly integrate the hardware or tightly specialize the hardware for one application.
And with that, of course, you get very high
performance. You're not sending any instructions, you
don't do all this decoding, all of the stuff that the CPU has, these
pipelines, pipeline breakers, etc. You don't have any of that, because
this is all in hardware. But the circuit is fixed and you cannot
reconfigure it.
You can really see this in the design flow,
in the diagram
at the bottom here. Let me use the laser pointer again.
So this is basically how you're designing an ASIC.
So you have like a basic design,
hardware design, right?
Real chip design, how the circuitry is.
Then you have your tape out,
which is basically you're printing this on some kind of,
I mean, traditionally you would have some kind of film,
like projector film,
where you would put this and then make an image.
Today, this will be digital.
You send it to the manufacturer,
and there it will then actually be put in silicon.
There will be tests, but after you've already done the tape-out,
so after you've produced the image for the chip,
so the layout of the circuitry, any kind of bug fixing, any kind of changes are super
expensive.
And you're basically stuck with the hardware for some time.
You really have to produce many of the chips in order for this to pay off. With an FPGA, you have hardware that's reconfigurable.
You can design your hardware, you implement it and you test it,
and you can do bug fixing and you can redesign.
You can basically go back, put a new image or a new design and layout on the chip and then run it again and test it again.
And this is why this is really also used for prototyping.
A lot of the hardware that you're buying today, some kinds of controllers for example, will initially
be built on FPGAs to figure out how they work and what good designs
look like.
How do you design this in hardware?
Once you've figured everything out, once all of the bugs are out, then you might produce
an ASIC from this.
Or you just keep the FPGA because you can later on reconfigure it again.
So we'll see some use cases for this.
So, the FPGA has the benefit of giving you free choice of the architecture.
Internally, of course, there is an architecture in an FPGA. We'll look at some details there,
at what it actually looks like internally.
What are the basic building blocks?
So there is predefined circuitry in the FPGA,
but it's built in a generic way,
so you can actually implement something else on top of it.
So you can design basic logical gates and build more advanced things out of those.
You have a very free choice of how your processing is done in the architecture.
You're doing very fine-grained pipelining,
communication, and you have distributed memory in the chip.
They're much slower than regular processors.
They're in the 100 MHz range, something like that,
not in the GHz range as you would have with CPUs.
This one, for example, runs at 12 MHz with a clock multiplier,
so it can go a bit faster, but the base frequency on this one would be 12 MHz.
But also you have a very low power consumption.
So this thing here, you can see there's no cooling, nothing.
If you have a Raspberry Pi, for example, the better ones or the modern ones
will at least have a cooler on top, maybe even a fan, because they get warm.
And if you really run a lot on a Raspberry Pi, which is about the same size, right,
also chip-wise, if you run, I don't know, some data processing
performance benchmark, something like that, then the Raspberry Pi will frequently overheat and
shut down. That won't happen with this kind of FPGA, because the power consumption is really low.
So the FPGA is really a middle ground between dedicated hardware and
a more flexible CPU that you can use for anything.
And it's really good to try out things.
And it's really good to have very specialized applications where it doesn't really make sense to get an ASIC yet, or because you want to try something out, for example.
So how is the architecture different?
From a high level to a low level.
We're going to start very high level.
So if you think about the classical von Neumann model,
which is still the mental model that we use
when we're thinking about a CPU,
it's not completely accurate anymore,
but you can see that at least the programming
is designed that way.
So in there, the memory
is physically separated from the processing unit, the CPU.
And to some degree, this is still true, right?
So the RAM is like a separate entity that's not on
the chip. You have a pipeline of instructions and the information is processed in this pipeline.
And there's a lot of logic and cycles, like a lot of processing, that just deals with getting new instructions
and decoding them and pipelining them, etc.
So if you remember how do we execute instructions on the CPU,
there's a lot of work that just deals with this.
And we really have to think about this, like how do we pipeline stuff, etc.
CPU will deal a lot with these micro operations etc.
And although a lot of this you don't have to deal with,
a lot of chip space is used for this.
So I mean we're writing C++,
this will be decoded into some assembly,
but then even the assembly will be broken down into micro-operations.
And this is done on the CPU. There is specialized hardware just for this.
And the FPGA doesn't have any of this. The FPGA doesn't deal with instructions.
Basically, in the FPGA, the design of the hardware, the layout of the circuitry, defines the function,
defines what the processor does.
You have basically flip-flop registers and RAMs that are distributed across the chip
and are wired to logic blocks. I'll show you exactly what this looks like.
And with this, it's highly parallel, right? You have many of these building blocks,
and each of those can work independently.
Essentially, every logic block can be addressed
and can work completely independently. So, all of the
information can be processed in parallel. You can also build long pipelines, right?
So there's circuitry, you can connect these logic blocks and then you can say, well,
I'm going to implement my instruction decoding on the FPGA. You can do that.
But of course, it's not going to be super efficient. So you really want to make sure that
you use this parallelism.
And the latency, like how much time it takes to process something,
really just depends on the signal propagation.
So depending on how you wire the circuits,
the flip-flops and the lookup tables will have a certain delay.
And how fast these can switch
determines how fast your program will run.
And this means more complex programs,
more complex circuitry will take a bit more time.
Easy circuitry will almost be instant, right? So it just depends on how fast you get the data into the device.
Typically, an FPGA is built into some kind of card and it has some additional logic on there.
So, even if you look at this one here, it's not just the FPGA.
So, this has the FPGA on here, but it also has lots of other controllers on there
that help you with the input and output.
There might be actual DRAM modules; there is some memory on the FPGA itself, but there might also be some additional
memory. Then you might have some flash storage, or a network interface, since a lot of
FPGAs are actually used for networking. Or you use it for some video encoding or decoding,
and then again you might already have some video and peripheral ports that you don't have to program yourself every time.
And that you don't spend chip space on.
So this will already be provided.
So there's different options on how to integrate this.
So I said like many FPGAs and especially the FPGAs that you could program here in the data lab.
These are all accelerator cards.
So then usually they will be connected through PCI Express.
But the FPGA can also sit in other places.
So one thing that we also have is disks that have FPGAs.
So you can do some encoding, decoding,
pre-processing already on the data path.
So, say for example in SATA, but of course also PCI Express, etc.
And then you have an acceleration there.
And there's even co-processors that use FPGAs or that are FPGAs.
So, you have some special processors,
for example Intel had a processor
where there is a regular CPU and then an FPGA coprocessor.
And you have UPI as a connection between the two.
So you are basically on the internal fabric of the CPU and the sockets,
and you have the connection there.
FPGAs are used in research and industry,
so I have a few examples before we go into the details
of how the FPGAs are built up.
And one prominent example is the Microsoft data centers.
So if you think about the Azure cloud,
so that's like Microsoft's cloud offering.
There, they have an FPGA in every single server for the network.
So all of the networking there is basically built on FPGAs and you can use the FPGAs to reduce data movement.
So all of the virtual network functions are basically offloaded to the FPGA. And while there are smart NICs and everything,
using an FPGA makes this very flexible for the data center.
So they can basically have all kinds of different operations.
They can reuse this for different purposes and also specialize this.
And here the FPGA, again, is managed through a PCIe interface.
So this is basically just like a network card, but the network card is actually an FPGA.
So this basically gives you all kinds of flexibility to pre-process.
I don't think that Microsoft actually exposes this,
so you cannot really change it as a customer,
but Microsoft can do a lot with this,
basically adapting their networking very flexibly.
Then there are also, as I said, FPGAs for storage,
and basically any kind of new storage device, etc., is typically prototyped
on an FPGA, because you can implement all of the protocols for the interconnects in the FPGA, and
the FPGA can be quite fast. I mean, if you implement it correctly, you can get very low latencies.
You're not going to get super high bandwidth; you're always bound by the interconnect that it's connected with.
And depending on the FPGA,
you might not be able to reach kind of memory type bandwidth.
But still, you can flexibly build new kinds of layouts. So there was the Ibex example
from ETH, so ETH Zurich, the Systems Group. They do a lot of research on
FPGAs. They also did a lot of research with Catapult, the Microsoft
FPGA interface in the network. And here they basically tried using the FPGA in between the hard drive and the workstation or the database.
And then you can basically offload some of the computation to the FPGA.
And there was even a database startup in Berlin that built special accelerators for MySQL, but they were
bought by Altera or something,
one of the FPGA vendors, I think.
That was really just low-level acceleration. If you go back to our lecture series,
you can actually find their talk.
So we had them here at some point present some details
about their architecture.
But unfortunately, they're gone now.
So I don't know actually if there's still a lab here
or not in Berlin.
And so here, for example, Ibex implemented projection,
selection, and group-by aggregation in the FPGA.
Then the database would basically offload this to the FPGA and the database can deal with the rest.
It just gets already the pre-projected data from the FPGA.
Today, you can even buy disks which already have an FPGA built in.
With those, you can basically do the same thing yourself. Samsung
sells a disk with an FPGA that you can reprogram and then use as a regular
disk, but already do some of the computation on the FPGA.
The last example that I want to show is the Intel processor. There was an Intel
Xeon processor which had an FPGA built into the processor. So there you basically have a coprocessor: on the one hand, you have the
regular Xeon CPU, and then you have an FPGA, and they're connected via UPI
and PCIe lanes.
So you can basically use the FPGA directly as a coprocessor within the same socket, essentially.
With two sockets, you would just have UPI, or in this case even QPI, in between,
but they also built one where the FPGA was directly within the chip.
And that means you can basically reprogram this and you have very tight integration into the processor.
You can offload some of the computation that would make more sense or that's more efficient on the FPGA than on the regular processor.
So say, for example, networking.
So this is something that you can directly do, even IP networking.
You can directly implement in hardware and then the CPU doesn't have to do anything anymore.
Or any kind of encoding and decoding if you're doing video processing, for example.
Or if you're doing simple database workloads, you can also offload those. The FPGA takes some time to reprogram,
and I will talk a bit more about this,
especially about designing the mapping,
so the layout on the FPGA,
but it takes reasonable amounts of time.
Especially if you already have different kinds of layouts,
just mapping and programming the FPGA is not that expensive.
Okay.
So with this, let's dive a bit deeper into the FPGA architecture, right?
So what does the chip really look like internally? The key to
reprogrammability is lookup tables.
This is actually quite simple, right?
So the idea is that you basically have logic tables.
You basically just build these lookup tables,
like you probably know from your math classes or something,
where you want, say, an AND gate.
You have a lookup table where you define all of the
binary inputs, and you define the correct single-bit output.
So if you want an AND gate, for example,
your table needs to look like this.
So if you have two zeros, the output will be zero.
Or two false, the output will be false.
If it's false and true, the output will be false.
If it's true and false, the output will be false.
And then, if it's true and true, the output will be true.
So this is how, just by having a lookup table,
we can program an AND gate.
Same with an OR gate, an XOR gate, a NOR, etc.
All of that is basically just built with these lookup tables.
You're just setting up lookup tables, or something really custom-designed.
And then you get the correct output.
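The truth-table idea above can be sketched in a few lines of Python (Python only as a model of the hardware; the helper name `make_lut2` is my own):

```python
# A 2-input LUT is just a 4-entry truth table. Which gate it implements
# depends only on the bits we load into the table.

def make_lut2(truth_bits):
    """truth_bits[i] is the output for input pair (a, b), where i = a*2 + b."""
    def lut(a, b):
        return truth_bits[a * 2 + b]
    return lut

# Same hardware, different configuration bits -> different gates:
AND = make_lut2([0, 0, 0, 1])  # output is 1 only for input (1, 1)
OR  = make_lut2([0, 1, 1, 1])
XOR = make_lut2([0, 1, 1, 0])

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", AND(a, b), "OR:", OR(a, b), "XOR:", XOR(a, b))
```

The point is that the "program" is purely the list of configuration bits; the lookup mechanism itself never changes.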
And this is built into the hardware.
So this is really one of the basic logic blocks,
is you have these lookup tables in hardware.
So you basically map such a lookup table into an SRAM.
Remember, this is one type of storage cell.
You're mapping it directly into the storage cells
and then you're using a multiplexer to basically switch the correct storage cells as an input.
A lookup table is abbreviated LUT. If you have an n-input LUT, then you need 2^n bits of SRAM to store
the lookup table.
You then basically build a tree of multiplexers to read a given configuration.
So the way this works is: here in the SRAM, we're storing the lookup table, and then we're using multiplexers.
These are basically switches, and we're switching them based
on the input. Here, this would be a four-input lookup table. For two inputs, we would only
need this subpart here, for example. I'll have another example in a bit where I use exactly
this. Then we're switching based on the input. So here I'm
writing 0, 1, 1, 0, whatever. And then, based on the input that I have, I'll look this up, right?
So I'll basically see which cells the input chooses: this input will choose the first entry of each of the two switches on the first level,
and then this one will choose between those two outputs.
So you get the same structure that you would have in the lookup table, essentially,
Let's say reversed. So if we look at this input 0, 1, 2, 3,
then this 0 would basically be this row here.
And by switching here, we can just do exactly this lookup
and our output will be the output that we have here.
So basically, if I put in 0 0 here, for example,
then I go 0 here and 0 here: I switch to this route here and to this one there, and I get
this output. So that would be this lookup, for example. And this is the basic principle of how
we build the different kinds of logic blocks in the FPGA, just
using this. Here I have an example of a four-input lookup table. This one here,
for example, has these kinds of four-input lookup tables. Modern, larger FPGAs will typically have six-input lookup tables.
So meaning you can have six inputs in there.
And based on this, each additional input basically doubles the size of the table.
So if you have five inputs, you need to double the SRAM again;
with six inputs, you need four times as much.
Then, in order to keep the output stable, because all of this works by switching:
we have a clock running through the system,
and the lookup table
works based on the input coming in, right?
So we write an input here
in one clock cycle,
and then we need to keep the signal stable for a clock cycle.
And this is what the D flip-flop does.
And this is what the D flip-flop does.
So this is basically a simple memory cell
where the info will go in.
So this is basically our input here.
Then we still have the clock.
And then we get basically the output which is just
the same input, but it will be stable.
So even if the input briefly glitches, say due to a rising current,
the output will stay stable here until the next clock cycle.
And we can also, rather than getting the same
input, we can also get the inverted input. So if this is 1, the input is 1, then
we'll have a 1 coming out here and a 0 coming out here. And then we also have a
set and a reset port, where we can basically override this. With the set port
we can say everything should be one, no matter what the input is,
and with the reset port, everything should be zero, no matter what the input is.
And then there's also, I forgot exactly what that does,
but you can put one and one, and then I think the whole logic is switched, for example.
So we need this to keep the input stable
and it gives us additional configurability
because we can just negate the output
of the lookup table here, or we can overwrite it.
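The behavior described above can be modeled in a few lines (a toy Python model of my own, not timing-accurate hardware; the class and method names are illustrative):

```python
# Toy model of a D flip-flop: the stored value only changes on a rising
# clock edge, so a glitch on D between edges never reaches Q. Set/reset
# override the stored value regardless of D.

class DFlipFlop:
    def __init__(self):
        self.q = 0
        self._prev_clk = 0

    def tick(self, d, clk, set_=0, reset=0):
        if set_:                          # force Q to 1, whatever D is
            self.q = 1
        elif reset:                       # force Q to 0, whatever D is
            self.q = 0
        elif clk and not self._prev_clk:  # rising clock edge: capture D
            self.q = d
        self._prev_clk = clk
        return self.q, 1 - self.q         # Q and the inverted output

ff = DFlipFlop()
ff.tick(d=1, clk=1)          # rising edge: Q captures 1
out = ff.tick(d=0, clk=1)    # clock stays high: the change on D is ignored
print(out)                   # (1, 0): stable output plus its negation
```

This matches the two roles mentioned in the lecture: keeping the signal stable for a clock cycle, and providing the inverted output for free.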
So with this, we can then basically build the whole cell.
So this is a mini example of a two-input lookup table, where we have
two inputs. Again, it's the same as before: in this case,
because it's only a two-input LUT, we only need four memory cells, which can each be zero or one.
Then we have two multiplexers that switch on one of the inputs,
and one multiplexer that switches on the other input.
So this here would basically be the lookup table:
these are all of the inputs that we can have, and this is the lookup
table, so this is what we actually need to store.
And then you can see, if we put this here just as we would have it in the
lookup table, we store this in the SRAM. Then if we switch, say, for example, our
input is 1, 0: in this multiplexer, we go to 1,
so that would mean the second input, for example.
In this multiplexer, we set 0.
So going through this multiplexer, we get 0,
which is exactly what we're looking up here.
So this is basically really how this works.
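The multiplexer-tree walk-through above can be sketched like this (again just a Python model; `mux2` and `lut2_read` are my own illustrative names):

```python
# Sketch of how a 2-input LUT reads its SRAM through a multiplexer tree.

def mux2(sel, in0, in1):
    """A 2:1 multiplexer: passes in0 when sel == 0, in1 when sel == 1."""
    return in1 if sel else in0

def lut2_read(sram, a, b):
    """Four SRAM cells: two first-level muxes switched by b, one final mux by a."""
    lo = mux2(b, sram[0], sram[1])  # candidate output for a == 0
    hi = mux2(b, sram[2], sram[3])  # candidate output for a == 1
    return mux2(a, lo, hi)

# Load the truth table of XOR into the SRAM cells:
sram = [0, 1, 1, 0]
print(lut2_read(sram, 1, 0))  # selects sram[2] -> 1
```

Nothing in `lut2_read` knows anything about XOR; the function is entirely determined by the four stored bits, which is exactly the reconfigurability argument from the lecture.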
This is what the internal
logic does in the FPGA. And of course, this is really simple. But building this, you have these
basic logical building blocks that you can use. And it's very flexible, right? So right now,
we're talking about two inputs. But as we said, in a big FPGA this would be six inputs.
Multiple of those will be actually connected within a logical block.
You're not going to do the programming on these two-input LUTs individually.
Of course, you could really just use these small inputs, but you will have larger groupings. So in, let's say, the current
Xilinx architecture or the Altera architecture
(Xilinx is now AMD, I think,
and Altera belongs to Intel; of course, everything got bought up
in different ways at some point,
but these are the two predominant vendors, I would say),
these groupings would be called slices or adaptive logic modules,
and I think in OpenCL they're called elementary logic units.
And there you're packing different parts together.
So it could, for example, be two lookup tables with four inputs,
plus the flip-flops, and you might additionally have some carry logic. If you're doing
some arithmetic, you basically want to carry a value over if a multiplication or an addition overflows.
Then you can carry this over to other lookup tables,
so you don't have to implement additional carry logic yourself;
it's already built in.
So you have some arithmetic carry logic,
you have the multiplexers for getting the output here,
and you have the lookup tables.
These might then be connected again into logical islands: you might have two to four
elementary logic units connected into a logical island, which is then connected to the switch
matrix. So you basically have many logical islands in here, connected
with some lookup tables and connected to some input/output and some memory blocks. And again, these are called
differently in different architectures, depending on the
vendor. I'll also show you how this works internally. So here we can see we have
these logic blocks, which are built from the lookup tables, the carry logic, and the flip-flops.
These are packed together, and then we have this switch matrix, which
is basically the networking on the chip: it connects all of these logic blocks together. So we have a set of connections on the chip in
between these logic blocks. We have hundreds to thousands of logic blocks,
and then we have this matrix, and it is again configurable. How we configure it
determines which logic block is connected
to which other logic block.
And this means we want to have multiple connections in between the logic blocks, so it's not just like what is shown here, right? It's not just a single connection; we'll have multiple connections, and then we can switch which ones are actually active and which ones are not, and how they are connected. So if we have, say, three connections here, we'll have switches or programmable links in here, and these again just work with memory cells. So we can realize all the different kinds of connections that are possible: if we have a crossing here, we can switch the wires straight through, or route left to upper, left to lower, upper to right, etc. So all of these individual
switches are possible, and for each of the connections there's a small memory cell, which again we can program at programming time. So when we write the design onto the chip, we can say, okay, I want this connection to be activated and the other connections not. It's all done by these memory cells: we basically say, okay, we switch this one on, and then this means these two logic blocks, for example, these two outputs, are connected together.
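As a toy model of what those routing memory cells do (the class and wire names are invented for illustration; real tools represent this very differently):

```python
# Hypothetical sketch: a switch point in the routing matrix. A few memory
# cells (configuration bits) decide which of the possible wire-to-wire
# connections at a crossing are active.

class SwitchPoint:
    def __init__(self):
        # Each key is a (source, destination) wire pair; the bool is the
        # configuration memory cell written at programming time.
        self.config = {}

    def program(self, src, dst, enabled=True):
        self.config[(src, dst)] = enabled

    def route(self, signals):
        """Propagate signal values along all enabled connections."""
        out = {}
        for (src, dst), enabled in self.config.items():
            if enabled and src in signals:
                out[dst] = signals[src]
        return out

sp = SwitchPoint()
sp.program("left", "upper")   # e.g. route the left wire to the upper wire
print(sp.route({"left": 1}))  # -> {'upper': 1}
```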
That's not everything on there.
So we have all the logic blocks,
and they can store a little bit of data in the flip-flops, etc.
But then, some stuff we always need. So if this is a PCI Express board, we need to communicate through PCI Express. Of course, we could implement this in the FPGA fabric, but it would already take up some space there, and we'd always have to place and route all of it. So it doesn't really make sense to put this stuff in the reconfigurable logic; instead, there is some dedicated hardware, hard logic, on the FPGA or on the accelerator board that is not implemented using the reconfigurable logic. This is then called hard IP. There can also be soft IP: some kind of pre-designed implementation, like firmware, that's provided by the manufacturer and placed on there so you can easily use it, but you don't have any influence on it. And that's basically how you communicate with the system.
Then you also have I/O blocks, so input/output. If, for example, on this one, you want to communicate with the LEDs, to show something here, then there needs to be an input/output block that will send something to the LEDs.
I'm going to switch this on in a bit, so then you can see.
And then there's also block RAM. In order to have fast little memories on there, you have small blocks of memory that you can also connect to the individual logical islands, so that you can store some intermediate state, store some input, or do some random access, basically. But this is really small, right? This is kilobytes per individual block. And then you might have some floating-point units, some digital signal processors, anything where it doesn't really make sense to build it out of the logical blocks, these elementary units, because dedicated hardware can just be much more efficient.
If there's special hardware that can do floating-point operations very efficiently, you don't really want to build this out of the general FPGA logic resources. I mean, this one, for example, doesn't have that, but a larger FPGA specialized for video processing will have something like a digital signal processor in there.
And then you might even have a complete processor on there. But the actual
layout of the chip will be something like this. So you have the logical islands, you have the
block RAM and you have some other kind of units there interconnected. And then using the switch
matrix, you can connect them however you want. And of course it makes sense, if you want to use the block RAM from logical units, to place them close to the block RAM rather than far away, because otherwise you're going to use a lot of the switch matrix just for these connections. And you might run out of switch matrix, so your program cannot be placed.
Again, something that you don't really have to do yourself: technically you could, but practically this will be done for you, by algorithms. So which logical block is placed where, there's a place-and-route optimization that's done for you. And this is actually what costs most time in programming the FPGA. Not in the sense that, of course, programming whatever you want will probably cost you the most time, but in terms of what the compiler etc. has to do, this is where all the time goes: making sure that the placement on the chip is actually good.
Okay, so I'll show you two examples again, just so you get an idea of size. This would be an older Intel card, an Intel Arria 10 GX 1150. We might have that in the data center, I don't fully remember. And here you can see these elementary logical units; you have roughly 400,000 of these. So these would be the sub-parts of the logical islands. There you have two lookup tables with six inputs and four flip-flops, plus, of course, the carry logic. And a logical island would then be 10 of these elementary logical units, so this means we have about 40,000 logical islands, roundabout. And we have 2,500 BRAM blocks with 20 kilobits each, so you can see we're in the megabyte range in terms of memory in there that's directly placed inside. We will probably also have some DRAM on the board itself, I don't have numbers for this, but inside the chip, the memory that's in there is in the megabyte range. And then I have this one here. This is what I have here, right?
So this is this thing. That's a Lattice iCE40-HX8K breakout board; the breakout board is a prototyping board. So iCE40-HX is the FPGA family, and 8K because it has close to 8,000 logical cells. Here a logical cell is one lookup table with four inputs and a flip-flop. And then we have programmable logic blocks, which are basically our logical islands again, which consist of eight logical cells. And we have 32 BRAM blocks with 4 kilobits each, so 128 kilobits in total in the whole thing.
And that's actually all we have in there.
So there's no additional memory on this.
Everything else has to go through the USB connection here.
And just as an overview, this is basically how this is set up internally. This is actually the 1K version, with 1,000 logical cells, that's what they're called here. And you can see, again, it's this matrix layout. So we have a two-dimensional layout with the programmable logic blocks, each of those with a four-input lookup table, some carry logic, a D flip-flop, and then kind of the switch. So that's the layout; the schematic looks a bit different. Here we basically have the input from this side. We have the switch here that controls the multiplexer, which selects between taking the lookup table output through the D flip-flop or routing it out directly. Here we only have the direct output, not the negated output, but we can also reset this. We have the clock, and we have carry logic here as well.
And then we can see eight of those will be in a programmable logic block. We have the memory in between. Again, we have the switch matrix, and we have some I/O banks, which on the one hand connect us to this USB driver and on the other hand have some external connectors. You can see here I could connect a lot of stuff if I wanted to, and I have the LEDs. So this is what would be connected through these input/output banks here.
So with this, do we have questions so far?
No? Then we'll do a five-minute break here and then look into how we program this in the next part, after the break.
So, let's talk about FPGA programming. So how do you actually program this?
So you saw how the internals work and that you have these individual memory cells.
Basically that will define your hardware and the circuitry.
And I mean, the very basic way to program this, if you're adventurous: you have to know exactly the layout of the chip, and then there's a small program that just tells the chip to write the memory cells as you want them. So you would basically say this memory cell is 1, this memory cell is 0, etc., exactly in the layout that the chip has. But of course, you don't want to do this, because that's going to be much too hard and very error-prone to do yourself. So you need a bit more high-level way of doing this. And there's a couple of different ways of programming these.
There's the hardware description language, and there are two prominent languages.
So one is Verilog and one is VHDL, and typically you can use both on the different FPGAs. They do more or less the same thing, it's just a different syntax. And it's this register-transfer-level abstraction: you have a low level of abstraction, but you have very full control of the hardware. You're basically defining signals and defining connections between the signals, and what to do, what to store, how, etc., but not the placement. So you have a generic circuit description, and then your tooling has to figure out whether this fits on your chip and how to best place it in order to get good timing. And because this is tricky, because it needs a lot of understanding of the circuitry and thinking in signals, there are also more high-level languages.
There's OpenCL, which is a language that you can use even to program GPUs and just regular CPUs,
but also FPGAs,
and you can also translate code that you wrote for a GPU to an FPGA.
So that's very generic.
Then there's HLS, so high-level synthesis from Vivado,
so that's vendor-specific.
That's where you also have kind of a C++-like interface
to programming the FPGA.
And they, on the one hand, are much easier to program, and they provide all of the I/O components. So they will provide libraries of components that will then also be placed in your circuit to make the programming much easier. That's of course also true for Verilog and VHDL: there are components that you can reuse, so you don't have to program all the details.
So then, well, syntactically, VHDL is related to Ada and Verilog to C, so they look similar to a basic programming language, but the semantics are different. The statements that you write are not executed sequentially, but all executed concurrently.
So a statement takes effect whenever its input values change; that's how you have to think about it. This directly maps to logical blocks in the circuitry. The statements that you're writing will be evaluated exactly when there is a new signal at the logical unit. And if you have multiple statements, they will be executed in parallel, depending on how they're connected to each other. There are no function calls. You have modules instead, and these modules will be connected, depending on how you define the modules and the logic inside them, through the switch matrix.
Depending on whether you're using VHDL, Verilog, or high-level synthesis, this code will be translated to specific hardware structures. And you really need to know the relations between the hardware constructs and the code that you're writing to get efficient code and correct designs.
I mean, there is a lot of tooling around this.
But essentially, the challenge is: you're programming your logic, and then the circuitry needs to be routed through this, and you need to make sure that the clock frequency is not too high, so that the signals can actually flow through your circuitry fast enough. So it's really thinking in signals: you can have many signals and many inputs in parallel, and then a pipeline of signals, and if this pipeline is too long, then you either have to reduce the frequency of the clock, or you have to insert some memory in between, in order to store some intermediates and then use the next clock cycle to continue.
And of course you don't have to do all of this yourself: there's tooling that will give you the timings, how the different gates in the different circuits can be connected, and then you can tweak based on that, saying, okay, I need to insert something here; this I can do with extra clock cycles, this I can do without, because I know the input will be processed within a clock cycle, no problem.
Okay, so here's a simple example, basically just in terms of translation. You don't have to do this translation, it will be done for you, but it helps to see how it's done. If you write something like a conditional statement, this is equivalent to a multiplexer in hardware: your different inputs are the two options that you have, and your selector is the switch that decides which condition is actually taken. So this is one way of thinking about this: multiplexers give you conditionals, and any signal changes on clock events are done through these flip-flops. So if you're using the clock, then you'll be using a flip-flop to keep the value stable during state changes, for one clock cycle. And you can also program this explicitly; you will see this in an example later on.
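As a rough software analogy, not the actual hardware, the multiplexer-plus-flip-flop picture can be sketched like this (names are my own illustration):

```python
# Hypothetical sketch: an if/else in HDL maps to a multiplexer, and a
# clocked assignment maps to a D flip-flop that holds its value stable
# between rising clock edges.

def mux(select, a, b):
    # The selector chooses between the two inputs, like the hardware mux.
    return a if select else b

class DFlipFlop:
    def __init__(self):
        self.q = 0          # stored (stable) output

    def clock_edge(self, d):
        self.q = d          # capture the input on the rising edge
        return self.q

ff = DFlipFlop()
ff.clock_edge(mux(1, 5, 9))  # select=1 picks the first input
print(ff.q)  # -> 5
```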
I have two examples.
Unfortunately, I couldn't fully consolidate this.
So I have a Verilog and a VHDL example.
The Verilog example on the chip and the
VHDL example here in the slides, where you can basically see that we're actually working with
these kinds of signals. So this is VHDL, and like all of the hardware stuff, people like weird nested abbreviations. VHDL is basically VHSIC Hardware Description Language, and VHSIC is Very High Speed Integrated Circuit, so in full it's the very high speed integrated circuit hardware description language.
And here you have basically three things that you're using. You have the signals: this is really the input, say if I'm writing something to the chip through USB, this will be the input that goes in, or my clock could be an input, for example. Then I can have inputs and outputs: I have the input from the USB, the input from the clock, inputs from other circuits, and I have outputs that again go to other circuits. This basically defines how things are connected. And then we have the architecture itself, which tells us how the logic, the lookup tables, will be specified. And of course, as it gets more complex, we'll have to combine multiple lookup tables. You can use arithmetic operations, things like that, and that will then be translated into these logical blocks.
So I have a simple example here, a multiplier: you have two 16-bit integers as input, multiply those to get a 32-bit output, which will go into a register, be stored there for a clock cycle, and then be moved on to whatever next processing you might want to have.
And if you want to program this in VHDL, then you need these kinds of signals. So you have the clock, the two input vectors, and the output vector. And then you have an architecture, which is really just, on the clock event, applying the multiplication and putting the result on the output. That's all we're doing, and this will then really be translated into the hardware.
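What that clocked process computes can be mimicked with a small Python simulation (the class name and masking are my own illustration, assuming unsigned 16-bit inputs):

```python
# Hypothetical simulation of the registered multiplier described above:
# two 16-bit inputs are multiplied combinationally, and the 32-bit result
# is captured in an output register on each clock edge.

class RegisteredMultiplier:
    def __init__(self):
        self.out_reg = 0                     # the 32-bit output register

    def clock_edge(self, a, b):
        assert 0 <= a < 2**16 and 0 <= b < 2**16
        self.out_reg = (a * b) & 0xFFFFFFFF  # result always fits in 32 bits
        return self.out_reg

m = RegisteredMultiplier()
print(m.clock_edge(3, 7))  # -> 21
```

The largest possible product, 65535 * 65535 = 4294836225, still fits into 32 bits, which is why the output vector is exactly twice as wide as the inputs.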
This is basically what VHDL looks like, with these basic blocks. Verilog, I'll show you a very simple example, looks quite similar.
Maybe I'll show you the example right now. So, this is a very simple counter. This is what I will flash on here in a second, and this is Verilog in this case. So here you can see we have these modules. The modules have inputs and outputs, which again correspond to the circuitry. Here, as an input, we're using the clock; the FPGA has a clock signal. Our output is the output block, which goes to the LEDs. The LEDs are up here, and you'll see this in a bit. Then we have a register, so we're using some of the flip-flops, and then, on the positive edge of the clock, let me find my mouse, where's my mouse? Here. So on the positive edge of the clock, whenever the clock goes to one, we're increasing the counter. The clock is 12 megahertz, so we're counting quite fast, and because of that we're only using the upper bits of the counter, because otherwise this would flash too fast and you wouldn't really see it. So we're using eight bits of the counter register and assigning them to the LEDs. So the input is the clock; the clock goes into the register, on every positive edge the counter is increased, and eight bits of that counter register are assigned to the LEDs.
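Here is a hypothetical simulation of that blinker in Python; the counter width and which bits drive the LEDs are assumptions, since I'm only going from the description above:

```python
# Hypothetical simulation of the Verilog blinker: a counter increments on
# every clock edge, and only some upper bits drive the eight LEDs so the
# blinking is slow enough to see. The exact bit range is an assumption.

class Blinker:
    WIDTH = 26       # assumed counter width
    LED_LSB = 18     # assumed lowest counter bit shown on the LEDs

    def __init__(self):
        self.counter = 0

    def clock_edge(self):
        self.counter = (self.counter + 1) % (1 << self.WIDTH)
        # Eight upper bits of the counter drive the LEDs.
        return (self.counter >> self.LED_LSB) & 0xFF

b = Blinker()
leds = 0
for _ in range(1 << 18):   # after 2^18 clock ticks...
    leds = b.clock_edge()
print(leds)  # -> 1: the lowest displayed LED has just switched on
```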
So this is really, really basic circuitry; it will not take much of the chip space. That's all we do, and then, I'm going to show you this. Let me find the mouse again. You can see it basically has no state right now, that's why the LEDs are not doing much. So now it's doing the synthesis, and now it's programming the chip, writing the programming on. Now the program is on, and it's just going to continue to count.
You can see that it's just counting up the bits in binary and starting over again; we're not using the whole counter. The counter is actually larger, but we're just using a few bits out of it, and that's just going to continue. Yes, 12 megahertz. So it's basically... It should be 3 seconds, right? So the whole thing looks more like 8. Sorry?
It should be 3 seconds till the counter is full, right?
So it looks more like it's... Well, we're only using a subset of the counter, bits 19 to 26, which is... I can't do the math in my head right now. So 2 to the power of 19 is, it should be 500,000 roundabout, since 2 to the power of 20 would be 1 million, right? So 500,000 divided by the 12 megahertz would be how frequently the lowest shown bit is updated. So that's 0.04 seconds, roundabout. And the lowest one would be the one that updates most often, and then the highest one, bit 26, is 1 million, 2 million, 4 million, 8 million, 16, 32 million cycles. Well, let me leave the exact calculations to Marcel, but roundabout, this should correlate with the 12 megahertz. Makes sense? So that one will be sub-second, but the other ones, you can see, are in the seconds range. Yeah, I mean, with megahertz it's roundabout right: we're in the millions, so megahertz means millions of cycles per second. So that sounds roundabout right.
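The back-of-the-envelope numbers from above can be checked quickly, again assuming a 12 MHz clock and, as a 0-indexed reading of the bit range mentioned, that counter bits 18 to 25 drive the LEDs:

```python
# Sanity check of the blink rates discussed above, assuming a 12 MHz
# clock and that counter bits 18..25 drive the eight LEDs.

CLOCK_HZ = 12_000_000

def toggle_period(bit):
    """Seconds between consecutive toggles of a given counter bit."""
    return (2 ** (bit + 1)) / CLOCK_HZ

print(round(toggle_period(18), 3))  # lowest LED bit: ~0.044 s
print(round(toggle_period(25), 1))  # highest LED bit: ~5.6 s
```

So the lowest displayed bit flickers at roughly the 0.04 s the lecture arrives at, and the top bit takes a few seconds per toggle, matching what you see on the board.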
And that should be exact. I mean, this is actually the thing: if the 12 megahertz are exact, we should be able to count exactly how frequently this is updated. And this is also how this works in the end when we're synthesizing: we really have to know how the signal propagates through there, how often this is updated, in order to then be able to predict how much we can put into one circuit.
So let me move this.
We have to...
Okay.
So, a nice little adventure. If you ever want to try this, come by; I can surely lend it to you. Okay, so let's look at the design flow in more detail.
And that's just going to be where we're going to stop today after the design flow, actually.
So, we're basically, I mean, what we have to do is we have to specify the hardware.
We have to somehow tell the chip or the compiler, etc., what exactly should be put on the hardware, how the layout should be.
And from a logical level or from a programmer level, we have to adapt our algorithm to the parallelism of the FPGA and to the design of the FPGA.
So, we do this in a hardware description language or a high-level language.
So, here I showed you Verilog.
I showed you VHDL a bit.
I actually wanted to do this with VHDL, but apparently the pipeline is not there yet.
So, this doesn't 100% work.
It's a bit more complicated.
So this is why we're using Verilog on this one.
So we're writing this code.
And then this needs to be translated.
So the code that we wrote, the code that you saw,
is not a low-level description of circuits.
So this basically needs to be further translated into logic-level representation.
So meaning into this lookup table kind of block representation.
And this is called synthesis.
So this part is actually not that expensive, because it's a simple translation to logic blocks that then somehow have to be put on the FPGA. But then the expensive part is the place and route. First we just have these logic blocks that are somehow connected, but now we actually have to place them on the FPGA itself. The FPGA has very specific logic blocks, and the FPGA has the switch matrix, and we have to make sure that we connect the switch matrix correctly and place everything correctly, so we're using the chip space efficiently. We have to make sure that the signal transmission is fast enough through the whole chip, etc. Because, of course, we can design a very long pipeline of things, but then the signal won't go through fast enough; even at 12 megahertz, even on this chip, we can design something where the signal won't make it through within one cycle. And then we would have to clock down, and again, there we need to know what clock frequency we can actually use. So this is a lot of work: these are basically NP-hard problems, and this takes a lot of time. And the larger the FPGA and the more complex your circuit, the longer this takes; this can easily go into days of processing time.
Then, this place and route, this mapping that gives us the two-dimensional arrangement of the circuits, needs to be turned into a configuration and put on the FPGA. So we're basically building a binary representation; often it's actually an ASCII representation that really mirrors the chip layout. For all of the memory cells on there, we're saying this is one, this is zero, etc.: the memory cells for the lookup tables, the memory cells for the switch matrix, and so on. All of these we can independently configure, and this needs to be written into this format and then be flashed onto the FPGA. And again, this is not super expensive; the most expensive part is the place and route, the implementation for the FPGA itself.
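Conceptually, the bitstream is just every configuration memory cell serialized in the fixed order the chip expects. A toy illustration (not any real bitstream format; the cell names are invented):

```python
# Toy illustration: the configuration is conceptually every memory cell's
# value, serialized in a deterministic order the chip expects.

config_cells = {
    "lut_0":    [0, 1, 1, 0],  # truth-table bits of one LUT
    "switch_3": [1, 0],        # routing-switch enable bits
}

def to_bitstream(cells):
    # Serialize all cells, sorted by name, into one ASCII 0/1 string.
    return "".join("".join(map(str, bits)) for _, bits in sorted(cells.items()))

print(to_bitstream(config_cells))  # -> "011010"
```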
So here, a quick example of the Xilinx workflow. We're starting with our hardware description language, Verilog for example, which we turn into a collection of netlists: basically logical blocks that could be placed on an FPGA but are not laid out yet. So we have a basic representation of the logical hardware without any decision on how to put it onto the FPGA, just how we designed it.
Then the actual implementation toward the FPGA is done. This basically means translating and combining the input netlists: initially these are individual modules, then we combine them into a generic database file, one big generic file, which is still not exactly how we would put it on the FPGA, but it's one big schematic of the implementation. Then we need to actually map this to the exact specification of the FPGA: do we have four-input lookup tables or six-input lookup tables, what are the logic blocks, etc. Again, not super expensive, but already some work to make it more hardware-specific.
And then we do the place and route, so we place these individual components and connect them through the matrix. This is, as you can imagine, quite expensive, because all of the blocks are identical, so there are many different ways of placing this and many different ways of connecting it, and we need to find one that is actually most efficient, or at least one that actually works. If we really want to fill up the FPGA, it's a lot of work to move things around and figure out where to place them. If the FPGA has many more logic blocks than we actually use, then it's easy, because there's a lot of space to move around. If we're filling it up, then it's going to be very costly.
And in modern tooling there are even ways to divide this up into submodules and then reuse these submodules and also reconfigure them. Because this is quite expensive, we can say: well, I'm using this subpart, and I'm replacing this subpart, and I have different implementations for this subpart which adhere to these timing constraints, etc. Then I can reconfigure these parts more efficiently, because doing the whole image, the whole layout for everything, as I said, can take a day or two for a larger FPGA.
Then, once we have the design, we have to generate the bitstream. This is just generating the FPGA-readable format; often this is really ASCII. And then it just needs to be programmed onto the FPGA. This is usually called flashing, and this is where you saw that the lights were kind of low: this is where the program is really written onto the FPGA, so it's reconfigured. We write all the individual memory cells, we're in an undefined state for a moment, and then, once the clock starts again and regular processing resumes, we can actually do the processing. And then the FPGA will just continuously do whatever it's programmed for.
This is Xilinx. So Xilinx is one of the big vendors; it was bought up by AMD. You can find lots of FPGA cards from Xilinx. They have lots of complicated and expensive licenses for everything. This is why I actually bought this one, because for it there's also an open-source toolchain. So there's a synthesis tool called Yosys, where you can basically do all the synthesis, and there's an open-source place-and-route tool that's called nextpnr. And then, specifically for each FPGA, you need the tools that map to exactly this hardware. So we need to know what the logic blocks are in here, what the programmable logic blocks are, where the BRAM is, etc. So, the exact layout of the model: first the basic blocks, depending on the hardware, then the exact layout of the chip, saying here I have 8,000 lookup tables, where are they, how can I connect them? For this I need hardware-specific tools, and that's in this case the IceStorm tools, of which I at least need icepack for creating the bitstream and iceprog for writing to the FPGA. So iceprog is the tool that just writes the program onto the FPGA. And Yosys is Verilog-based.
There's also GHDL, which is an open-source VHDL compiler. You can also use this for simulating the design: there's a complete simulator that compiles this into a binary, and you can run it on your regular CPU and try it out. But there's also a connector to Yosys, so you can then use the same kind of toolchain. There are of course also alternatives to this, but this seems to be the most prominent way of doing it. That connector is still prototypical, though, which is why I didn't really try it out in this case.
And this is what I showed you here today: using Yosys for the compilation, the basic translation into the netlist, then nextpnr, which integrates with the Yosys flow, for the place and route, and then the IceStorm tools for actually programming the FPGA.
Okay, questions so far? Yes?
How good is the vendor support for this open-source software?
For this one it's good, because they particularly built it for that. For the others, not much. I think they really live off these expensive licenses, and you need license servers and everything, so it's quite complicated. Even for academia, we have to go through a lengthy process to get access to licenses for the hardware. But for Yosys and nextpnr, there is some tooling for Xilinx hardware. It's experimental, so I think it cannot do everything, but at least you can try it out. We could try it out at some point to see if this works.
But of course, the regular tools are much more convenient; they have a lot of other stuff. But all of the timing stuff, etc., you can also do with this toolchain.
Okay.
No more questions? Great.
Then thanks a lot for listening. Next time I'm going to finish this up, and I'm also going to summarize the whole course, since we'll have some additional time and can do some Q&A. That's going to be on Wednesday next week. On Tuesday you're going to get the next exercise. And then the final topic will be by Marcel. Unfortunately, I'm not going to be here, but this is really hot stuff, so I really recommend going there and listening in, because, at least from what we see right now, this is where all of the accelerator connections will go in the future.