Hardware-Conscious Data Processing (ST 2024) - tele-TASK - Field Programmable Gate Arrays (2)
Episode Date: July 10, 2024...
Transcript
Darmstadt again, I think this is the last one for this semester.
And Andreas Kipf from TU Nuremberg will present today.
He's an expert on learned indices and learned components for database engines.
And he will talk about what we learned and what we will learn.
I think Nils also posted the abstract in the Moodle,
so you can see if that's interesting to you.
The Zoom link is also in Moodle.
So if you're curious about that, feel free to join.
That should be fun.
Then Tuesday, we're going to have our guest,
Piotr Ratusniak from Intel Poland.
I found an image on his LinkedIn page.
So he's an expert in FPGA systems, and he will talk about FPGA in practice.
I don't have the title yet.
He sent me an email today and is going to send the title.
As soon as I have the title and the abstract,
I'm going to put them into Moodle.
But I'm pretty sure this will be interesting and fun.
And I hope you will join, have a lot of questions for him,
and see how this works in practice.
We actually also have Intel FPGAs in the data lab,
so in our data center.
So we can also bug him about these things.
These are actually quite complicated to program,
so maybe he can shed some light on how to make this easier.
Then there is a survey on the data center visit in Moodle.
If you want to visit the data center on Tuesday,
I mean, I've already said this last time,
so the two of you probably heard this last time as well.
Please fill in the Moodle.
So fill in the survey in Moodle, because we need some numbers.
Basically, the question is:
if we're more than a certain number of people,
we'll have to split into groups.
We should not have more than a certain number
of people in the cages, so the area
with the servers, at any point in time,
because of the safety measures in there.
There's a fire extinguishing system that triggers automatically, and you then have,
I think, 20 or 30 seconds to leave the room. If we're too many people, this might get close.
So then that might be annoying.
And it's kind of getting just a bit crowded.
So it's good if we know some numbers.
If you fill in the survey, then we know:
if it's close to or above the number
where we should split groups,
then I will organize this a
bit more structured. If we're a smaller group, we can just walk there and see everything, etc.;
otherwise we'll have to split. Then, you maybe saw the email: there was a bit of a mistake on our end in what we sent out, or how we structured the repository.
And yeah, so we sent out a bit more than we wanted initially.
One person was kind enough to make us aware of that immediately, but at that point
at least three people had already checked out the repository.
And because we want a level playing field for everybody,
we said, okay, let's not treat this in some odd way.
So we'll provide the advanced test and the benchmarks for everybody for this one.
Basically means everybody gets all the info.
I mean, we accidentally sent out even a bit more than that,
but if you use that info,
it's very easy to get caught by our plagiarism tool.
So we're not going to send out our solutions. But everybody gets the
advanced tests and the benchmarks, so everybody has the same chance to implement it.
That was unfortunate, but again, I think with this everybody has the same chances, and you get to see the internals of one of the tasks.
Maybe that's also a good thing for you. Okay, so I think that's almost it. Yeah, so these are all the
announcements that I had. So just a reminder where we are: we're all the way at the bottom, right?
Everything covered so far. We're in... forgot to switch this on, just a second. So we're here, and you know your last task.
We're going to have Piotr's presentation on Tuesday,
and then I'll do a very brief summary next week,
so maybe give a bit of an outlook on what else there is
and walk through quickly what we discussed. If you have any questions about the topics we discussed,
then on Wednesday, of course, you can ask them here.
As we said, on top of this, there will be one more interview round for the last task.
So if you haven't presented your task yet and you're not presenting your task tomorrow, so the buffer manager, then you're in the skip list round.
This will happen later, in the semester break, right after you've finished the task.
And then we said we're also going to do one session, so Marcel will coordinate this, where we present
solutions. This was one of the pieces of feedback that we got, multiple times.
So what do very good solutions look like?
Right.
And then we said, okay, let's try to find solutions of yours that were quite good,
ask you to present them,
and discuss them in the audience.
This will be a separate session sometime later,
probably once the last task is finished.
So, more like a plenary session. There's nothing scheduled here yet. Marcel will coordinate this.
Okay, so with this, let's go to the second half of FPGAs, which is basically programming.
So how do you program an FPGA?
And I'm going to say again, I'm not an expert, as you will notice while I'm presenting this.
This is also something that's quite tedious.
This is why I brought this small device last time,
because with it you at least don't have to deal with all of the licensing issues, etc.
If you're curious, feel free to stop by and try it out with that device.
We also have regular devices at the data center,
but the setup there is really tricky: just getting the licenses right, et cetera, and getting a
pipeline set up that you can develop with. That needs quite a bit of effort. It has a steep
learning curve, let's say. There are some nice books on this as well, but those are more on the high level. There are also lots of practical books on how to implement FPGAs.
We have some at the chair.
If you ever should be curious about it, feel free to stop by.
With this, let me switch gears and go back to where we left off last time,
which is FPGA programming.
Okay, so there's multiple ways how we can program an FPGA.
I mean, technically we could also write the
schematics directly, basically the image that is flashed onto the FPGA. This is more or
less like the photo image that is projected on the die during manufacturing,
with the difference that here you're filling out the lookup tables,
right? So you're saying: okay, I want these lookup tables to contain this and this,
and I want these connections in between,
so how the switching network
connects things.
Of course, you don't want to do that,
because it's very specific to just a single FPGA.
It's super error-prone, etc.
So you want some more high-level abstraction.
And there are, again, multiple levels of abstraction.
There's the hardware description language,
typically Verilog or VHDL.
So even though this is an abstraction,
it's still quite low-level.
So we're still talking about signals here
and different modules that are connected to each other.
And you have full control here
over what is actually put on the FPGA,
but you don't have control over the layout.
So the layout, this is something
that the tool chain
will do for you.
So basically, you build these small submodules, et cetera.
But then how this is actually stitched together,
how this is best placed on the FPGA,
how the lookup tables exactly look like,
this is done by the compiler, essentially,
or the compiler tool chain.
Still, it's fairly low level and you have lots of control,
but it's also quite hard.
It's more like thinking like an electrical engineer rather than thinking like a programmer, in terms of how do I
pass the signal through my device,
essentially. But there's also high-level synthesis,
so a higher level of abstraction,
which looks more like C or C++.
This is OpenCL or Vivado HLS,
so a high-level synthesis language.
Here the code looks like regular programs, essentially.
I mean, you still have to be aware of how the FPGA works.
I mean, as always, right?
So you need to be aware of how the hardware internally works.
Otherwise, your program might work, but it's not going to be efficient.
And so it's much easier to program.
And you have lots of pre-configured components.
So you have some I/O components.
You also have some components pre-configured in Verilog or VHDL,
so you don't have to build all kinds of gates or multipliers yourself,
or structure them yourself.
And depending on what logic is already on the FPGA,
the synthesis will then pick those
rather than you defining this in hardware yourself.
But here, you might have even more libraries, et cetera,
that then will be translated into modules that then
will be mapped into substructures on the FPGA.
OK. So there's lots of text there.
This is basically about getting from code to circuit.
You have a higher-level programming language,
again at one of the two abstraction levels:
either the hardware description language,
the more low-level abstraction,
or the high-level abstraction,
where you actually write more like regular programs.
If you have these hardware description languages, they have fundamentally different semantics, right?
So this is not just writing loops and object-oriented code; you're describing signal passing. You're writing
modules that have input signals and output signals, and these you then connect to
each other in different ways. And if you're writing code within a module,
it's not executed in sequence; everything that can run in parallel
will be executed in parallel.
Everything that's independent of each other
is a separate part of the chip essentially
being activated at the same time.
Whenever you have an input value that goes into your module,
or that might be spread out to multiple
modules at the same time, all of that is executed completely in parallel, just as the signal passes
through. You might have function calls, but they don't really have the usual meaning.
It's really just how things are connected on the chip, essentially. You can do some abstraction,
in essence, by organizing your code in modules, which then
relate to hardware sub-modules. So you have some sub-structures on your
FPGA that you're then designing:
a couple of lookup tables or logic blocks, or logic islands,
however you want to call them, that then work as one functional unit.
So you could basically, again, create your floating point arithmetic unit.
So this is something that you could define. And either it's used by multiple different parts, or you can replicate this on the FPGA
multiple times.
So this is basically what might be happening.
And so then these have input ports and output ports, basically data coming in to this module
and data coming out of this module.
And this is basically where then this wiring happens.
And so again, if you need this in different areas or different parts, either you can share the same module,
but then, well, you need to make sure that this is actually somehow shared across time,
or you can have multiple of these modules.
Each sub-part of the code is then translated
to a hardware structure.
So these modules that you program will be translated.
There's still some higher-level functionality:
you're not writing lookup-table code,
but you're writing a multiplier or something like this, or you're
writing some kind of switching logic, so a case statement, and
this will then be translated into something like a multiplexer, etc.
And well, if you want to be efficient on this, I mean you can always program everything,
right? But if you're just doing this without knowing what's going to happen, it probably
won't be very efficient. So as a designer, you really need to know the relation between the
hardware description language, so VHDL or Verilog, and the hardware constructs that come out of it,
in order to produce correct and efficient designs.
So this is basically what we're doing, right?
We're not producing a binary or something.
We're producing a design on the chip,
basically a layout of how this will later be executed.
In the tool chains, you can actually analyze a lot;
I don't have images for that, but it might actually be interesting.
On the one hand, you can simulate everything
in your FPGA design.
Also during the design, while you're programming this, there are frequent steps
where you're simulating and analyzing different
parts of your code in order to figure out how much space
this will use on the FPGA. This is a limiting factor: the more code you have, the more space
this will need. The more you parallelize, the more space you will need, et cetera. And then you can
analyze how fast the signal passes through this, for example. Does this work within a single
clock step, or do I need multiple, i.e., do I need to break this up into smaller
pieces?
And so there's lots of tooling around this.
And it's, of course, also an iterative process.
And then eventually, you will basically
create the actual design and then flash it onto the FPGA.
So I brought a couple of small examples of how code actually translates then into hardware
designs.
So this is fairly simple.
So this is a simple conditional statement, which will basically give you a multiplexer.
Here it's shown without lots of the additional details.
And I've mixed this up in the slides a bit:
you cannot take the code that I have on my slides,
put it together, and get a working
program, because I'm mixing Verilog and VHDL on different slides.
But here you can see, say for example,
this is a functional module.
So we have the process with an input sel,
which here would be at the input.
And you can also see this in the hardware structure.
Here we have basically a selection,
or a conditional statement:
if this input sel for selection is 0, then our output,
I've mixed up the subscripts here, is i0.
Otherwise, the output is i1.
And this will be directly implemented
as a multiplexer.
So we have our inputs to this process.
Essentially we have three signals coming in here,
but the one that we're looking at is sel.
And then we're choosing between i0 and i1.
And this is done immediately, right?
So as soon as we have something lying here,
like some kind of input here,
then this is basically switched through, right?
So there's no real time in between.
There's no clock involved here. As soon as there's input here, here, and here, we
get an output. If an input changes, the output changes immediately too. Of course, it takes a
certain amount of time, but it's below a clock cycle.
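As a small illustration, a 2:1 multiplexer like this could be written in Verilog as follows; a minimal sketch, not the exact code from the slides (the port names follow the sel/i0/i1 naming above):

    // Combinational 2:1 multiplexer: no clock involved,
    // the output follows the inputs directly.
    module mux2 (
        input  wire sel,  // select signal
        input  wire i0,   // chosen when sel == 0
        input  wire i1,   // chosen when sel == 1
        output wire o     // selected output
    );
        assign o = sel ? i1 : i0;
    endmodule

Because this is pure combinational logic, any change on sel, i0, or i1 propagates to o after just the gate delay, with no clock edge involved.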
A bit more complicated is the D flip-flop. This is one of the structures that you already saw in these logic islands, also in the small FPGA. It's a
really central part of the FPGA as soon as we're working with the clock. In general, we can also write un-clocked code,
meaning that we're not using the clock at all,
but then it basically, as soon as we're putting stuff,
like we're putting information on the inputs,
it will take some time for the signal
to traverse the overall circuitry.
Right?
And for this time, basically, we have to wait.
This is usually where the clock comes in, where we ask: how long does
it take to get through this whole module?
That should be one clock cycle, essentially. And we can build our modules or substructures
as large as can be done within a clock cycle.
And then, in order to basically keep this information
somewhere and not have different parts of the circuit
being in different state, but basically say, OK, within this clock cycle, for example,
I'm doing my arithmetic.
So I'm doing some calculations within a clock cycle.
And in order to synchronize somewhere,
we're using these D flip-flops.
So essentially, they basically switch,
or they keep their state until they are in a new clock cycle.
And there are two different versions of the flip-flops,
meaning they switch either on a positive edge
or a negative edge:
either when the clock changes from 0 to 1,
or when the clock changes from 1 to 0.
These are two different instantiations
of these D flip-flops, in how you build them internally.
And then you, I mean, we already heard, right?
There's a couple of different other things.
So there's a set and a reset input typically.
And we have like the input and we have the output
and the negative output,
just to make like some switching logic easier.
So with the reset, we can basically say,
well, whatever comes, this will always be zero.
With the set, we can say,
whatever comes, this will always be one.
And with D, basically, Q will change to the value of D
with the next clock edge.
And it will keep this value until it switches again.
And this we can basically use to store information.
Or we can use it to also have this kind of switching logic
based on this clock event.
So this is the important part:
with the clock event,
we do something.
It might also be the case
that we don't need the clock.
Say we have our logic unit or logic island,
and we don't want to use the clock in there. Then we can just
disable this part, or short-circuit around it in the logic unit, so it's not based
on the clock but just passes the signal through as fast as possible.
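To make this concrete, here is a hedged sketch of a D flip-flop in Verilog, roughly matching the behavior described above; a positive-edge-triggered variant with asynchronous set and reset is assumed, the exact set/reset behavior varies between devices:

    // Positive-edge-triggered D flip-flop with asynchronous reset and set.
    module dff (
        input  wire clk,
        input  wire rst,  // reset: forces Q to 0
        input  wire set,  // set: forces Q to 1
        input  wire d,
        output reg  q,
        output wire qn    // negated output
    );
        assign qn = ~q;

        always @(posedge clk or posedge rst or posedge set) begin
            if (rst)
                q <= 1'b0;
            else if (set)
                q <= 1'b1;
            else
                q <= d;   // Q takes the value of D on the rising clock edge
        end
    endmodule

A negative-edge-triggered version would use negedge clk instead; that's the other instantiation mentioned above.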
So now with this in a bit more elaborate example,
so essentially, I already mentioned this,
all code is eventually turned into registers and gates
in these logic blocks.
And we organize those in modules.
And the modules have inputs, and the modules have outputs. In between
these modules, we're going to put these flip-flops in order to keep the
state for a clock cycle. So with the next clock cycle we have
a stable signal as an input. Once the clock has changed and we have a stable signal,
the signal can pass through all of the components,
all the way to the next flip-flop,
and with the next clock cycle,
it will be available at the next module, or at the output.
I mean, it's available at the output as soon as it passes through,
but the flip-flop will synchronize this.
The flip-flop basically just makes sure that we're adhering to the clock
frequency of the chip.
And so here you can see, again, a simple example.
So we have a conditional statement, an if statement:
if A equals B, then we add 1 to C and put the result on D.
Otherwise, we add 2 to C and put that on D.
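In Verilog, this example could look roughly like the following sketch; the 8-bit width and the module name are assumptions for illustration, and the flip-flops discussed next would be placed around this combinational module:

    // Combinational module: compare a and b, then add 1 or 2 to c.
    // The comparison becomes a comparator feeding a multiplexer
    // that selects between the two adder results.
    module cond_add (
        input  wire [7:0] a,
        input  wire [7:0] b,
        input  wire [7:0] c,
        output wire [7:0] d
    );
        assign d = (a == b) ? (c + 8'd1) : (c + 8'd2);
    endmodule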
And this selection statement, again,
we know would be built
with a multiplexer.
And this we can see here, right?
So here we have our multiplexer.
Then we have some addition logic.
So this is something that Verilog or VHDL
will give us already as predefined building blocks.
We don't have to define this in logical
AND and OR gates, etc.; this will be directly translated onto these logic islands already.
And this we can do in one module, right? So this is then one module with three inputs, A, B, and C, and one output.
And then we'll put the flip-flops here in between, basically for all of the inputs.
I mean, here we need three flip-flops times the number of bits that we want to keep here.
And then you need one flip-flop times the number of bits for
the output down here. So at one clock cycle we have the input here,
with the clock the signal passes all the way through,
and after one clock cycle we'll have it at the output.
Whenever the input changes, with the
next clock cycle, through these flip-flops, we'll get the new result. And the simulation
will determine this: while we're programming this, it is translated into hardware structures,
and then it's simulated. The compiler
checks how fast the signal can actually
go through this whole module,
and then, based on how fast this is,
it will determine the maximum frequency
that we can use for this.
In this kind of very small structure,
this is probably a single cycle.
So this is basically at the maximum frequency
that the FPGA can give you,
maybe 600 megahertz or something like this.
But if this is a very complex structure,
so we have multiple additions. They work on top of each other,
we have something else that's happening in there,
this might actually take longer.
So this might be too long for a single cycle.
And what happens then is that flip-flops
will not necessarily be added here automatically;
instead, the frequency will just
be scaled down.
So the clock frequency of the design
will just be halved or reduced even further.
We don't get the same frequency,
just so the signal can get through the overall structure.
And there's a trade-off again.
I mean, it might make sense to add more flip-flops
and be able to better pipeline the processing, et cetera.
Or it might actually be more efficient to just have
like a slower clock cycle or less frequency, essentially,
and have a larger structure just to have everything processed
in a single clock cycle.
So, in the end, maybe the latency is lower if we have a larger structure,
but the throughput is probably also lower, because we're just doing one item at a time, right?
So one input per clock cycle. While if we're splitting this up into multiple stages,
we can pipeline across the different stages,
so we can feed in new input while we're already doing
some calculations.
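As a sketch of this trade-off, here is the earlier example split into two pipeline stages; again a hypothetical illustration assuming 8-bit values, not the slide code:

    // Two-stage pipeline: stage 1 compares, stage 2 adds.
    // Each stage only has to fit in one clock cycle, so the clock can
    // run faster and a new input can enter every cycle.
    module cond_add_pipelined (
        input  wire       clk,
        input  wire [7:0] a,
        input  wire [7:0] b,
        input  wire [7:0] c,
        output reg  [7:0] d
    );
        reg       eq_r;  // pipeline register: comparator result
        reg [7:0] c_r;   // pipeline register: c carried along

        always @(posedge clk) begin
            // Stage 1: compare and carry c forward.
            eq_r <= (a == b);
            c_r  <= c;
            // Stage 2: add, using the previous cycle's stage-1 results.
            d <= eq_r ? (c_r + 8'd1) : (c_r + 8'd2);
        end
    endmodule

The latency is now two cycles instead of one, but a new pair of inputs can be accepted every cycle, so the throughput goes up.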
Clear so far?
It's not super hard, right?
OK.
So let's look at some basic building blocks again in a bit more detail.
We have VHDL, which is the VHSIC Hardware Description Language, where VHSIC in turn stands for
Very High Speed Integrated Circuit; these hardware people like to nest abbreviations
inside each other. And I took this because it has this nice layout. Verilog is very
similar. Again, you've already seen this, right? So you have mainly three basic building blocks. First, there are fundamental signal units.
So you have signals coming in and signals going out.
And then you have entities, which are like modules that I explained earlier, which are input and output definitions.
So you define what goes into my module or my entity and what comes out of my entity.
And then you have the architecture which describes the actual functionality inside.
So those are the three kinds of definitions. With these, you can derive for an entity, for example, how many flip-flops you will need at the input and at the output.
This can be seen in the input and output definitions of the entity:
how much wiring needs to go into a module,
how many bits there are, etc.
And the architecture internally says: what kind of switching do we need inside?
So what happens actually with the signals that are going in and that are going out?
And so there's a simple 16-bit multiplier example.
So we have an entity.
I don't have the signals here,
but we have the entity input-output definition,
where we have three input ports
and one output port.
We have the clock itself, so this is clocked, right?
This will be done
on a clock-cycle basis.
And then we have two numbers coming in,
16-bit unsigned integers,
and we also declare them right away as 16 bits,
so 15 down to 0 as a vector.
So two vectors coming in and one vector coming out of 32 bits
in order to not have an overflow.
And then internally we have some logic and
here we have basically our register or flip-flops that will store
the intermediate or the output for one clock cycle and then pass it on
in the next clock cycle.
So this is what we define here: we define this input, we define this output, and we define that we're going to work with the
clock here.
And then the actual architecture in here is a process that uses the
clock.
If we have a clock event and the clock is 1,
then we multiply these two integers
and put the result on the output.
And that's it, right?
And then the compiler and the synthesis will decide
what frequency I can run this with,
and if you're using this somewhere else,
it will plug this module
into the other parts.
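The slide shows this in VHDL; as a hedged sketch, a rough Verilog equivalent of this multiplier would look like this (the port names are assumptions):

    // 16-bit multiplier with a registered 32-bit output,
    // so the product cannot overflow.
    module mult16 (
        input  wire        clk,
        input  wire [15:0] a,
        input  wire [15:0] b,
        output reg  [31:0] prod
    );
        always @(posedge clk) begin
            prod <= a * b;  // multiply and register the result each clock cycle
        end
    endmodule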
Okay, good so far?
Very good. Okay, so let's go
through the design flow. So now you saw, at the very low level, a few details of how this works.
Of course, if you want to build more exciting programs
on the FPGA, then there's a lot more.
You need to have a lot of modules,
and the things these are
used for might be protocols, as I said earlier, like network protocols, or some stream processing,
something like this.
I'll show some examples later, not at this low level of detail, but more on a high level,
how this is structured.
And well, if you're building an FPGA design, then you
somehow need to get your code to this parallelism on the FPGA.
Right.
So you're either defining this in the hardware description language or in some higher-level language. Then comes the synthesis, which produces a logical view of this
code, a logic-gate-level representation, meaning we have logical units, like logic
islands, and they are connected in this or that way, right? There are lookup tables in
there, and there are flip-flops in there,
and this is how we would connect them.
And that will be done in a generic fashion.
So this basically then can be mapped onto any FPGA.
And then comes the step of place and route,
which means we're putting the circuit onto a specific FPGA.
So we're saying: okay, where do which parts go on the FPGA? I mean, if you remember, these FPGA designs look
quite peculiar. If we briefly go back here, right, this is the design that we had on the small
FPGA that I brought last time.
Then you can see, this is the actual layout of the chip, right?
You have these individual programmable logic blocks, as they're called here, which are
on the chip; then you have a bit of block RAM mixed
in between; you have the I/O coming in from outside; and within each logic block you have eight different flip-flops
and multiplexers and four-input lookup tables. And then it's basically: okay, I know these are connected in this way.
I have these inputs and outputs.
And this is basically then what actually needs to be correctly configured,
depending on where these lookup tables, et cetera, are and how these are connected, right?
And then, once the place and route is done,
so we have something that fits this layout,
it needs to be generated into a format that
can be read by the FPGA.
The FPGA itself has some small hardware to actually flash
these lookup tables, right, to reprogram them and the routing network in there. And that
needs to be encoded. So that's just a simple formatting step, essentially. And then you
flash it onto the FPGA, so it will be transferred to the FPGA.
There it will be checked, of course,
whether it's still correct after transmission,
and then put into these lookup tables.
That's done internally.
So here, as an example, is the Xilinx flow.
If you're working with Xilinx,
that is AMD now;
Xilinx was bought by AMD at some point,
so these are AMD FPGAs.
Here, typically, you're writing your code
in something like Verilog.
And then there is the Xilinx synthesizer.
That's the first step, which translates, as we said, this code into the
gate-level and register-level design.
So this first step
turns the HDL specification into a gate-level netlist.
There's a specific format for this,
let me see,
the Native Generic Circuit format.
The next step, and this is actually multiple steps,
translates these into a Native Generic Database
file.
I already said that.
So there are multiple steps, where we first have some rough logic blocks, then something more specific to the individual FPGA, which will then be mapped onto the FPGA design or
rough overlay design.
Then we're mapping the native design that we have to the actual
resources that are available on the FPGA.
Every FPGA has a slightly different number of lookup tables and logic
blocks, configured differently.
In the placing step, we're basically checking how and what we can actually
place.
And then we're trying to figure out the optimal placement on there.
It depends on how the signal needs to flow: for example, where the RAM
is on the FPGA if we need to store something in between. We might need to place logic units closer to the RAM,
or, depending on which modules are connected,
we want to place them close to each other.
But in the end, we just have a 2D map,
basically, with certain connections.
And we need to solve this in a way such
that we still get the maximum throughput that we can get.
So the signal doesn't have to travel too far,
and we're using the space optimally.
Because if we want to, say, for example, replicate something on there,
or we have multiple modules that need to be connected to each other,
we just have to see how we can actually connect them to each other.
And, well, then we're generating a bitstream, which is basically this map translated to whatever
can be flashed onto the FPGA.
Then we're actually programming the FPGA. And this is done through all these individual tools and libraries and compilers,
essentially. At this stage right now, this is all closed source. So this is all very specific to individual FPGA vendors:
if you program an Intel FPGA, you have a different tool chain than if you program an AMD FPGA.
And all of these individual tools are quite expensive. You need special licenses.
It's quite tricky to set this up. And usually these tools only work with very specific FPGAs.
So you have a certain version that works with a certain FPGA.
And there are also version conflicts between these different things, right?
The bitstream generation and the mapping
are very specific to the FPGA,
so you need a very specific version to get this done.
So this is a bit annoying.
It also means that support moves on quickly:
if they build new FPGAs, we need the newest workflow,
and then older FPGAs might not be supported anymore.
So you need the very specific version; at least for us,
that's always been a bit tricky.
But there is some hope.
So there are some efforts to make this more open source also in order to gain
more traction, to make it easier for people to program. And one tool chain that's out there,
that's what I've used for this small FPGA here.
This is an open-source tool chain
that's based on Yosys.
So there's a synthesis tool that can do Verilog
and that's open source.
And then there's a tool called nextpnr for place and route.
And then, of course, at a certain point, we're again getting hardware specific. First the code: Verilog and VHDL are open, I mean, not open source, but open
descriptions.
So these are more or less standardized programming languages;
you have an idea of what they're supposed to do. And the native
descriptions, so the abstract logical descriptions of how logic
blocks should be connected to each other,
these can also be sort of generic. But at a certain point, as soon as we really want
to place this on a certain FPGA, this gets very hardware specific.
And for this, fortunately, for the FPGA that I've shown you, there's also open source tools.
So there are the Project IceStorm tools, named after the iCE FPGA family.
There you have a bitstream tool,
which is icepack,
and then you have iceprog for flashing the FPGA.
The standard workflow
with these tools that I showed you uses Verilog,
but you can also use VHDL
with a GNU tool,
GHDL.
So you can use this to translate the VHDL, and then use
Yosys, then nextpnr for the place and route,
and then again the IceStorm tools to actually put it
onto the FPGA.
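To give an idea, a typical open-source iCE40 flow might look like the following; this is a hedged sketch, the file names like top.v are placeholders, and the exact device flag (here --hx8k) depends on your board and tool versions:

    # Synthesize Verilog into a JSON netlist
    yosys -p 'synth_ice40 -top top -json top.json' top.v
    # Place and route for an iCE40 HX8K, using the pin constraints in top.pcf
    nextpnr-ice40 --hx8k --json top.json --pcf top.pcf --asc top.asc
    # Pack the textual layout into a binary bitstream
    icepack top.asc top.bin
    # Flash the bitstream onto the FPGA
    iceprog top.bin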
OK, questions so far?
Not so tough, right?
Okay, good.
Then we're gonna do a quick break here,
four minutes or so, and then we're gonna go
into how this is actually used in data processing.
So how do we use the FPGAs in data processing?
So, first some more details about parallelism before we really dive into
examples.
We remember there are essentially a couple of different ways of building parallelism
into our system.
We can have data parallelism, as we said here.
We can have pipeline parallelism. And we can have task parallelism,
where different parts of the program, different tasks,
run at the same time.
And data parallelism means that we
apply the same task on multiple data items.
And that, of course, is very simple in an FPGA, right?
So this just means we're replicating logic on the FPGA.
So instead of doing one multiplication on the FPGA,
we want to do two multiplications at a time.
We just need the same circuitry twice.
And that's basically it.
And the amount of parallelism that we can get is just limited by chip space
plus the amount of data we can get in, like within a certain amount of time, essentially.
So I mean, this is also to some degree limited, right? So remember, this is usually connected
through PCI Express. And that's basically the amount of data that you can move in and out of the FPGA.
So whatever you can get in there, plus there might be other connections as well, right?
So there might be network connections on the FPGA.
So that's the other way of getting data into the FPGA.
But again, this limits the amount of data that you can get through.
So say we have an InfiniBand HDR connection,
4x, that would be a few hundred gigabit, we said, right?
If I remember correctly; I would have to look it up,
I always forget these numbers.
But say on the order of maybe 50 gigabytes
per second or something like that.
That might be something we can move into
the FPGA if we have really
current hardware. In most cases, the cards that we have here do less,
so something on the order of 10 gigabytes per second.
But that's what we can move in.
And with this, I mean, this is not gonna be
like a single byte at a time.
It's gonna be multiple at a time.
So then we can use some parallelism.
And for this, we can replicate stuff.
And so we can have something like map functions
that we build by replicating logic. We can build SIMD units on there.
So we just process vectors of data rather than individual data items.
An example would be vector addition.
So this is kind of what this would look like.
So we have data coming in.
This is dispatched to the different replicas.
So it could be a vector, for example.
And then it's collected again.
We can process, go on with our lives, and continue from there.
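As a hedged sketch of this kind of replication, a vector addition with four replicated adders in Verilog could look like this; the packing of four 32-bit elements into one wide bus and the module name are illustration assumptions, not slide code:

    // Data parallelism: four 32-bit adders side by side, so four
    // vector elements are added in the same clock cycle.
    module vec_add4 (
        input  wire         clk,
        input  wire [127:0] a,    // four 32-bit elements, packed
        input  wire [127:0] b,
        output reg  [127:0] sum
    );
        integer i;
        always @(posedge clk) begin
            for (i = 0; i < 4; i = i + 1)  // unrolled into four parallel adders
                sum[i*32 +: 32] <= a[i*32 +: 32] + b[i*32 +: 32];
        end
    endmodule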
We also can use pipeline parallelism.
And this is basically this trade-off that I told you, right?
So we can have one large chunk of circuitry
and wait until the signal goes all the way through,
or we break it up into smaller parts.
And if we break it up into a sequence of subtasks,
then they are directly wired to each other.
So we just need these flip-flops essentially in between,
but we don't need any other communication in between.
These subtasks are directly wired to each other, right? This is on-chip networking
or wiring, which is programmed at flash time.
Right?
So we're flashing this on.
We saw there's these small gates with small SRAM cells.
So these are programmed, and then they
are directly connected.
So we can build the pipeline parallelism on hardware level, where
we have one large computation task that we break down into sub-circuits and then we put
these registers or flip-flops in between that then remember the outcome of the computation.
And at the next cycle, so this is basically then all of a sudden we're using a clock here.
At the next clock cycle, this becomes available at the next level, on the next pipeline step
essentially.
And the benefit of this is we're going to get a higher bandwidth, right?
So we can basically produce, well, I mean, if we have one large block,
basically, we have to wait until everything went through.
And the signal needs to be on this end until the signal has gone all the way
through.
Otherwise, we're in some kind of undefined state.
If I put my signal here, and while it's still being processed I switch the
signal here, then the output will change again.
So we're getting into a weird state.
This is where our synthesizer,
the translation, will tell us: well, this won't work, right? You're going to get into
problems. You can still build something like this, but you're going to get undefined output,
basically. It's not working, essentially. So what happens is: we have to put the signal here,
wait until it went all the way through,
and then we can change the signal on the beginning again.
We can, of course, split this up into multiple, but if they're just directly connected,
it's still the same problem.
We still have to wait for the signal passing here, passing here, passing here, etc.,
until all the way to the end in order to have the final output.
We can change this to this pipeline setup with registers in between, which means we're moving
the data through the first step all the way to the
register. And as soon as the clock cycle changes, I can put a new signal here. This means the throughput will be higher. The latency most likely will also
be higher, because we might be very fortunate that this step
takes exactly as long as a cycle, but most likely it won't. Because if we're
really at the cycle border,
well, we're going to get into the same problem
that every now and then it will be too slow,
and we'll have undefined behavior.
So this means each stage needs to be something
that can be processed under a cycle, right?
Under a clock cycle, faster than a clock cycle.
The unpipelined structure might take just a bit more than a single clock cycle,
so there we just have to wait a bit longer
before we change the signal.
Here, in the pipelined version, we'll have to wait basically three steps for the result.
But unlike here: say the unpipelined version might be done in two clock cycles;
then we can only process one data item every two clock cycles.
The pipelined version will have a latency of three clock cycles, but we can process three items in three clock cycles, right?
So this gives higher throughput at a higher latency, essentially.
Again, the length of these pipelines and the amount of
pipeline parallelism that we have is
just basically limited by our program and the size of the FPGA.
So we could build one long pipeline all the way through the FPGA and just fill it up with computation, putting in new input every clock
cycle, and once it went through all of
the computation, we have the output. That might have a high
throughput but very long latency, because we have to wait
all the cycles until this goes through.
And then of course we also have task parallelism.
And this is also what we definitely want to use at an FPGA, right?
So basically we want to do different things on the FPGA, different parts of our program,
different types of operations.
If you do query processing,
different operators,
and we will do them in parallel.
So we can build these different operators as modules.
You have the input to the operators
and we can basically apply the different operations
either on the same kind of data item,
but also on different kinds of data items in parallel.
And well, we can do completely different tasks in parallel.
We could even, say, map two queries onto the FPGA
at the same time, or have networking plus query processing plus
something else on the FPGA at the same time, just because whatever fits on the FPGA, we can
put together and wire together, essentially. A good example, at least one used a lot in
research, is evaluating different
regular expressions in parallel, right?
This also makes sense for network protocols or protocols in general:
we might have some regular expressions to express the routing,
whatever you want to do.
In databases, one of our standard examples is sorting. This
is something that we've seen before with sorting networks, which we can do in SIMD,
but we can also do on the FPGA. It works very well on the FPGA, basically the same kind of setup. You remember, we can build a sorting network
and then just do compare-and-swap operations
in between, based on the inputs.
And if we have this kind of sorting network,
logically, in SIMD, for example, or something like that,
we can do the same thing in hardware on the FPGA.
So we can really build the same kind of architecture,
the complete network, on the FPGA,
and do this sorting in hardware.
Meaning, here on the bottom, I have a fundamental circuit element
that would do this kind of sorting. We have our inputs, 32-bit integers, for example,
or whatever, 32-bit numbers. And then we have a comparator operator.
So this is actual circuitry
that needs to be mapped onto these logical units.
We cannot do it with a single bit;
it needs to be done with multiple bits,
so a single multiplexer won't be enough.
We need more than a single multiplexer to do this.
But in the end, we need to compare
across all the bits.
And then, based on the overall comparison,
we have multiplexers that basically say: OK,
if A is larger, then set this multiplexer to 0
and this one to 1;
if B is larger, set this multiplexer to 1 and this one
to 0, such that we're either keeping the order of A
and B, or we're swapping the
order of A and B. That is all this does, right?
So, this is our basic operation.
We compare the two and if they're in the right order already, we keep it.
If they're in the wrong order, we swap them.
And this is all that we need.
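A hedged Verilog sketch of this compare-and-swap element could look like this (module and port names are assumptions):

    // Compare-and-swap: one comparator plus two multiplexers.
    // The smaller value comes out on lo, the larger one on hi.
    module cas32 (
        input  wire [31:0] a,
        input  wire [31:0] b,
        output wire [31:0] lo,
        output wire [31:0] hi
    );
        wire swap = (a > b);       // comparator over all 32 bits
        assign lo = swap ? b : a;  // multiplexer: keep or swap
        assign hi = swap ? a : b;
    endmodule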
And out of those, then we need multiple of those.
Out of those comparators, we can then build this whole network.
Each of these lines basically is one of these comparators.
We could count them now.
I think we said 13.
This is what we need.
I'm not going to count it now.
Basically, all numbers will be compared to each other
eventually.
And then after doing all of the comparisons,
we have the totally sorted output.
This works as a whole network, meaning
we're just putting the numbers on the input,
and we wait until the signal traverses all the way through.
So we could do that.
Essentially, we're just putting the numbers here,
but it's a fairly deep circuit already,
so this might take a bit long.
So what we can also do is we can basically put some pipeline stages in here.
And this then basically means with these pipelines,
like putting the registers in between,
this means we need for every number,
we basically again need some storage
where after a clock cycle, it will be stored.
And at the next clock cycle,
it will be available at the next sorting stage.
And here you can see there's at most one
comparison per number in every stage, with between two and four comparisons per stage. So we
get all of the comparisons done in six stages, right? This means that six clock cycles after getting the input,
I will have the complete sorting done for eight numbers here.
And of course, I could also build this bigger.
I could do this for 16 numbers.
But at a certain point, I just need so much chip space to do this
that it gets slow.
So it might be more efficient to
use something else for continuing the sorting after this. And yeah, that's one way of doing sorting,
and that actually works quite well, right? So for this, you get an efficient sort operator
with different ways of doing the pipelining.
And again, where to put these pipelines depends on how fast the signal goes through the network
and how fast then the overall clock rate will be.
If you have other parts of your design that make it slower already, then you might actually consider
reducing the number of pipeline stages here in order to not wait so long for the throughput.
You have to wait for other parts anyway, so this might not be the bottleneck here.
Another thing we can do is aggregations.
What we can do is aggregate multiple values at the same time, where we build an
aggregation setup. An aggregation might be minimum, maximum,
average, the sum, et cetera.
I mean, sum is very simple, right?
Or minimum or maximum.
Minimum and maximum are even simpler, if you think about it,
right?
You basically just need to compare the different values
to each other, and you always keep
the minimum across pairwise comparisons of two. So you just do it step by step.
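As a hedged sketch, a streaming minimum aggregation in Verilog can be as simple as one comparator and one register; the module name and reset scheme are assumptions:

    // Streaming minimum: each clock cycle, the running minimum
    // is updated with the next input value.
    module min_agg (
        input  wire        clk,
        input  wire        rst,      // start a new aggregation
        input  wire [31:0] in,
        output reg  [31:0] min_val
    );
        always @(posedge clk) begin
            if (rst)
                min_val <= 32'hFFFFFFFF;  // neutral element for minimum
            else if (in < min_val)
                min_val <= in;            // keep the smaller value
        end
    endmodule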
What you can also do on an FPGA: rather than always putting the whole program on the FPGA
and flashing the whole FPGA with your new program, modern FPGAs make
it possible to have partial sub-areas that you change.
In this case, you could say:
well, I need a new type of aggregation, so let me replace this aggregation block
with a different aggregation block. So I don't need to implement
all aggregations in my aggregation operator; I'm only flashing the one
that I actually need right now. That
gives you some additional functionality and some additional
flexibility to make this faster. Say, for example, here we have an example
where we have a selection query with two different conditions. These different conditions
could then
be different implementations of a partial area that
will be used within the FPGA.
So then we don't need to flash the whole FPGA.
We only need to flash the part of the FPGA.
So that's a bit faster.
Also, the routing might be cheaper for these subparts.
And we can produce multiple layouts already
with just different subpart change
if these different kind of aggregations
fit into the same subpart.
And the benefit is that the flashing is then a bit faster.
The flashing actually is not really the part that takes the most time;
the place and route is the part that takes the most time.
But if we do it this way, then we know that the rest of the chip will stay the same,
so we can exchange just certain modules.
If I do the place and route, etc., from scratch again,
most likely the resulting layout will be completely different,
or at least other parts of the FPGA will be different as well.
So with partial reconfiguration we can make sure that this is a bit more efficient
overall and more configurable.
And so this means we're updating only parts,
without necessarily stopping the flow of data,
so we can continue processing in other parts.
And we can do hardware-software co-designs.
So we can also just do restrictions and aggregations
inside the FPGA and we could do the rest
of the query processing on the CPU.
So we just send basically whatever we want to aggregate,
we send to the FPGA.
The FPGA does the aggregation very fast, sends the aggregation
results back, and we just plug the result back into our query. This, of course, only makes
sense if what needs to be processed on the FPGA is large enough and involves enough
computation that it's worth moving the data onto the FPGA
and back, because that adds a significant amount of latency again, since we typically need
to go through the PCI Express bus.
But if that makes sense, then this is, for example, one way of utilizing this.
And I've put like an example of a paper
where they tried exactly this.
There was a lot of work on this in SAP research
where they tried to somehow plug in FPGAs
into the database system.
As far as I know, not successfully.
So I mean, successfully in research,
not necessarily successfully in practice.
And, well, what you can also build is
a fully pipelined hash table.
There's some work that people did on this:
Zsolt István and Gustavo Alonso, at ETH Zurich.
Zsolt is now at TU Darmstadt and regularly in our seminar,
which is also why we see some FPGA topics there every now and then.
Something that they built is a fully pipelined hash table,
meaning they built all the hash functionality
in the FPGA. Rather than doing
a lookup in the hash table and then comparing
all of the elements in a certain hash bucket one by one,
you can do this in parallel:
you can do all of the comparisons in parallel, especially for all
of the elements that are within the same hash bucket.
You can compare them directly.
And here they also tried to scale the key-value store, again doing a co-design with the memory,
so with DRAM.
And in this case with a 10-gigabit data center application,
so this is part of this Catapult idea,
where you have network cards that are FPGAs
in the Azure data centers.
So there basically you can get the data
or the lookups through networking
and then directly do the comparisons on the FPGA
rather than going through the CPU
and doing the comparison here.
And here they had some, I mean, for the details
you will have to go to the paper.
But you can see that it depends on the key size.
A larger key size basically means more
chip space that we need.
And with more chip space, on the one hand
we get less parallelism, and we get more complex logic,
which will decrease the frequency that we can use.
And you can see from key sizes of 20 bytes to 160 bytes, they are still, let's say, a factor of 10 to 30 faster than x86
by basically having this kind of lookup inside the hardware
with also a very low latency in terms of clock cycles.
Although the clock cycle latency here, you can also see,
is basically going up, or the number of clock cycles required
is going up with the key length, basically
because we have to either split up the keys
or reduce the clock cycle frequency.
Okay, so as a quick summary, for FPGAs, FPGAs are slower than CPUs or GPUs in terms of frequency.
It's still in the megahertz range, so hundreds
of megahertz maybe in comparison to gigahertz
that you can get on the CPU or GPUs.
But they make up for it with parallelism.
Everything that goes in is super parallel,
so you don't need to sequentialize your program in any way.
Remember, we have parallelism on the CPU and the GPU too, but your program is essentially
still processed per core in a sort of sequential manner.
There's some reordering and whatnot, lots of interesting logic, but the FPGA does everything at the same time.
Only once we're putting these clock cycles,
once we're putting the registers in between,
we basically sequentialize stuff.
Otherwise, all of the signals are always processed in parallel.
And because of that, because they're slower,
and typically also not as large,
at least not as large as GPUs today, they're also more energy efficient. Because the clock
frequency is much lower, they just require a lot less energy, essentially.
The design flow is quite involved: programming is complicated,
and placing and routing is complicated.
If you have a complex program,
it can take more than a day to actually produce the layout
that goes onto an FPGA,
which is also annoying if it doesn't work in the end.
So if your application changes a lot,
FPGAs are not a good fit, right?
You need to make sure
that your application
is something that can run stably for a while.
But there's some help essentially by having this replacement, like partial reconfiguration.
So if you can basically have different modules and you can replace parts of your configuration,
you can pre-compile those and change it.
That's fast again, right?
So flashing the FPGA doesn't take a lot of time.
This is in seconds, essentially.
Even probably for the larger FPGAs, the modern one, this is probably sub-second.
The place and route, so getting the actual schematic,
this is something that takes hours and days.
And this is really the larger the FPGA, the larger your program, the longer it takes.
I mean, you remember, right?
My very small program, this already took some time.
It's not instantaneous, even though it was basically nothing that the FPGA needed
to do.
And it still takes some time to get this all the way through.
If the programming is more complex,
then this compilation step will take a very long time.
But of course, you can also create some hardware
with some reconfiguration options at runtime.
So I mean, essentially, you can simulate a CPU on the FPGA.
So you could have a program counter.
You could have a program or instructions coming to the FPGA
that lead to certain behavior on the FPGA:
you have different functional units,
and you can switch on
different functional units, et cetera.
At a certain point, it doesn't make sense anymore,
because then it's much more efficient to use the CPU.
But if you have dedicated instructions
that you want to do, dedicated parts of the hardware,
then again, it might make sense to get
some reconfigurability there or modifiable behavior at runtime.
Okay, and with this, that's all for today. Questions about FPGAs? We'll hear
more about FPGAs next week on Tuesday from Piotr.
On the tour we won't be able to see an FPGA directly; I'll show you the servers where an FPGA is inside.
But I think that's the best I can do, unfortunately,
in the data center.
It looks something like this:
just a card, and one large chip like this
would then be the FPGA.
OK.
No questions?
Yes, there is a question.
I was wondering, for the example you showed
with the smaller-than comparator on 32 bits,
how many logic cells would that need?
Would that take up much space?
So I mean, it's a good question.
So essentially, we needed something
like a multiplexer for this.
So, as we said earlier, let me show; I don't want to hypothesize too much,
because there might be much better ways to implement this.
But essentially, this is the simplest way
we can do this in hardware, right?
For every bit, we compare.
Then you need some carry-over logic,
which basically says: what was the previous comparison?
We start with the most significant bit and do the comparison there.
If it's equal, we go to the next one,
and again to the next one,
until we find a difference.
And then we know the result. And this basically
means for every bit, we need something
like this multiplexer here.
So that would then be one logic cell?
Probably, in our case,
if we wanted to do it for 8 bits,
I would say in this example it would be one logic block,
basically, for 8 bits,
because we have 8-bit input, right?
I mean, the lookup table has four inputs;
there's a four-input lookup table.
I don't know exactly, but we have four inputs here,
and we have the carry logic.
So it might be enough that
we get four bits out of a single lookup table.
Then for one byte,
it would be two lookup tables.
For comparing two bytes it would be four,
four of these lookup tables, right, so four logic cells, essentially.
How many of those were there on the chip?
On this one, there were eight
in a programmable logic block, and overall 7,680 logic cells,
so close to 8,000 logic cells on this one.
So not that much on this one, but that's a very small one. On the larger ones, we have 400,000
elementary logical units, which
each have basically 12 bits.
So that's a lot more space.
There's a lot more space on those,
yes.
On the small one, you're not going to get a very complex program.
The larger one fits a lot, depending on how much parallelism you use. Of course,
you can easily fill it up by just replicating stuff. But the smaller one
fills up quite quickly.
As soon as you have a more complex program or layout,
it will probably already be full.
You're welcome.
Good.
Okay.
So then, thank you very much.
Thanks for coming despite the weather today.
And I'll see you all next week for the presentation.
And then hopefully the data center tour, if you're curious.
Thanks.