Hardware-Conscious Data Processing (ST 2024) - tele-TASK - Field Programmable Gate Arrays (2)
Episode Date: July 10, 2024...
Transcript
Darmstadt again, I think this is the last one for this semester.
And Andreas Kipf from TU Nuremberg will present today.
He's an expert on learned indices and learned components for database engines.
And he will talk about what we learned and what we will learn.
I think Nils also posted the abstract in the Moodle,
so you can see if that's interesting to you.
The Zoom link is also in Moodle.
So if you're curious about that, feel free to join.
That should be fun.
Then Tuesday, we're going to have our guest,
Piotr Ratusniak from Intel Poland.
I found an image on his LinkedIn page.
So he's an expert in FPGA systems, and he will talk about FPGA in practice.
I don't have the title yet.
He sent me an email today and is going to send the title.
As soon as I have the title and the abstract,
I'm going to put them into Moodle.
But I'm pretty sure this will be interesting and fun.
And I hope you will join, have a lot of questions for him,
and see how this works in practice.
We actually also have Intel FPGAs in the data lab,
so in our data center.
So we can also bug him about these things.
These are actually quite complicated to program,
so maybe he can shed some light on how to make this easier.
Then there is a survey on the data center visit in Moodle.
If you want to visit the data center on Tuesday,
I mean, I've already said this last time,
so the two of you probably heard this last time as well.
Please fill in the Moodle.
So fill in the survey in Moodle, because we need some numbers.
Basically, the question is:
if we're more than a certain number of people,
we'll have to split into groups.
We should not have more than a certain number
of people in the cages, so the area
with the servers, at any point in time,
because of the safety measures in there.
There's a fire extinguishing system that triggers automatically, and you then have,
I think, 20 or 30 seconds to leave the room. If we're too many people, this might get close.
So then that might be annoying.
And it's kind of getting just a bit crowded.
So it's good if we know some numbers.
If you fill in the survey, then we know:
if it's close to or above the number
where we should split groups,
then I will organize this a
bit more structured. If we're a smaller group, we can just walk there and see everything, etc.;
otherwise we'll have to split. Then, you maybe saw the email: there was a bit of a mistake on our end in what we sent out, or how we structured the repository.
And yeah, so we sent out a bit more than we wanted initially.
One person was kind enough to make us aware of that immediately, but at that point
at least three people had already checked out the repository.
And because we want a level playing field for everybody,
we said, okay, let's not treat this in some odd way.
So we'll provide the advanced test and the benchmarks for everybody for this one.
Basically means everybody gets all the info.
I mean, we accidentally sent out even a bit more than that,
but if you use that info,
it's very easy to get caught by our plagiarism tool.
So we're not going to send out our solutions. But everybody gets the
advanced tests and the benchmarks, so everybody has the same chance to implement it.
That was unfortunate, but again, I think with this everybody has the same chances, and you get to see the internals of one of the tasks.
Maybe that's also a good thing for you. Okay, so I think that's almost it. Yeah, so these are all the
announcements that I had. So just a reminder where we are: we're all the way at the bottom, right?
Everything covered so far. We're in... forgot to switch this on, just a second. So we're here, and you know your last task.
We're going to have Piotr's presentation on Tuesday,
and then I'll do a very brief summary next week,
so maybe give a bit of an outlook on what else there is
and walk through quickly what we discussed. If you have any questions about the topics we discussed,
then on Wednesday, of course, you can ask them here.
As we said, on top of this, there will be one more interview round for the last task.
So if you haven't presented your task yet and you're not presenting your task tomorrow, so the buffer manager, then you're in the skip list round.
This will happen later, in the semester break, right after you've finished the task.
And then we said we're also going to do one session, so Marcel will coordinate this, where we present
solutions. This was one of the pieces of feedback that we got, multiple times.
So what do very good solutions look like?
Right.
And then we said, okay, let's try to find solutions of yours that were quite good,
ask you to present them,
and discuss them in the audience.
This will be a separate session sometime later,
probably once the last task is finished.
So, more like a plenary session. There's nothing scheduled here yet. Marcel will coordinate this.
Okay, so with this, let's go to the second half of FPGAs, which is basically programming.
So how do you program an FPGA?
And I'm going to say again, I'm not an expert, as you will notice while I'm presenting this.
This is also something that's quite tedious.
This is why I brought this small device last time,
because with it you at least don't have to deal with all of the licensing issues, etc.
If you're curious, feel free to stop by and try it out with that device.
We also have regular devices at the data center,
but the setup there is really tricky: just getting the licenses right, et cetera, and getting a
pipeline set up that you can develop with. That needs quite a bit of effort. It has a steep
learning curve, let's say. There are some nice books on this as well, but those are more on the high level. There are also lots of practical books on how to implement FPGAs.
We have some at the chair.
If you ever should be curious about it, feel free to stop by.
With this, let me switch gears and go back to where we left off last time,
which is FPGA programming.
Okay, so there's multiple ways how we can program an FPGA.
I mean, technically we could also write the
schematics directly, basically the image that is flashed onto the FPGA. This is more or
less like the photo image that is projected on the die during manufacturing,
with the difference that here you're filling out the lookup tables,
right? So you're saying: okay, I want these lookup tables to contain this and this,
and I want these connections in between,
so how the switching network
connects things.
Of course, you don't want to do that,
because it's very specific to just a single FPGA.
It's super error-prone, etc.
So you want some more high-level abstraction.
And there are, again, multiple levels of abstraction.
There's the hardware description language,
typically Verilog or VHDL.
So even though this is an abstraction,
it's still quite low-level.
So we're still talking about signals here
and different modules that are connected to each other.
And you have full control here
over what is actually put on the FPGA,
but you don't have control over the layout.
So the layout, this is something
that the tool chain
will do for you.
So basically, you build these small submodules, et cetera.
But then how this is actually stitched together,
how this is best placed on the FPGA,
how the lookup tables exactly look like,
this is done by the compiler, essentially,
or the compiler tool chain.
Still, it's fairly low level and you have lots of control,
but it's also quite hard.
It's more like thinking like an electrical engineer rather than thinking like a programmer, in terms of how do I
pass the signal through my device,
essentially. But there's also high-level synthesis,
so a higher level of abstraction,
which looks more like C or C++.
This is OpenCL or Vivado HLS,
so a high-level synthesis language.
Here the code looks like regular programs, essentially.
I mean, you still have to be aware of how the FPGA works.
I mean, as always, right?
So you need to be aware of how the hardware internally works.
Otherwise, your program might work, but it's not going to be efficient.
And so it's much easier to program.
And you have lots of pre-configured components.
So you have some I/O components.
You also have some components pre-configured in Verilog or VHDL,
so you don't have to build all kinds of gates or multipliers yourself,
or structure them yourself.
And depending on what logic is already on the FPGA,
the synthesis will then pick those
rather than you defining this in hardware yourself.
But here, you might have even more libraries, et cetera,
that then will be translated into modules that then
will be mapped into substructures on the FPGA.
OK. So there's lots of text there.
This is basically about getting from code to circuit.
You have a higher-level programming language,
again at one of the two abstraction levels:
either the hardware description language,
the more low-level abstraction,
or the high-level abstraction,
where you actually write more like regular programs.
If you have these hardware description languages, they have fundamentally different semantics, right?
So this is not just writing loops and object-oriented code; you're describing signal passing. You're writing
modules that have input signals and output signals, and these you then connect to
each other in different ways. And if you're writing code within a module,
it's not executed in sequence; everything that can run in parallel
will be executed in parallel.
Everything that's independent of each other
is a separate part of the chip essentially
being activated at the same time.
Whenever you have an input value that goes into your module,
or that might be spread out to multiple
modules at the same time, all of that is executed completely in parallel, just as the signal passes
through. You might have function calls, but they don't really have the usual meaning.
It's really just how things are connected on the chip, essentially. You can do some abstraction,
in essence, by organizing your code in modules, which then
relate to hardware sub-modules. So you have some sub-structures on your
FPGA that you're then designing:
a couple of lookup tables or logic blocks, or logic islands,
however you want to call them, that then work as one functional unit.
So you could basically, again, create your floating point arithmetic unit.
So this is something that you could define. And either it's used by multiple different parts, or you can replicate this on the FPGA
multiple times.
So this is basically what might be happening.
And so then these have input ports and output ports, basically data coming in to this module
and data coming out of this module.
And this is basically where then this wiring happens.
And so again, if you need this in different areas or different parts, either you can share the same module,
but then, well, you need to make sure that this is actually somehow shared across time,
or you can have multiple of these modules.
Each sub-part of the code is then translated
to a hardware structure.
So these modules that you program will be translated.
There's still some higher-level functionality:
you're not writing lookup-table code,
but you're writing a multiplier or something like this, or you're
writing some kind of switching logic, so a case statement, and
this will then be translated into something like a multiplexer, etc.
And well, if you want to be efficient on this, I mean you can always program everything,
right? But if you're just doing this without knowing what's going to happen, it probably
won't be very efficient. So as a designer, you really need to know the relation between the
hardware description language, so VHDL or Verilog, and the hardware constructs that come out of it,
in order to produce correct and efficient designs.
So this is basically what we're doing, right?
We're not producing a binary or something.
We're producing a design on the chip,
basically a layout of how this will later be executed.
In the tool chains, you can actually analyze a lot;
I don't have images for that, but it might actually be interesting.
On the one hand, you can simulate everything
in your FPGA design.
Also during the design, while you're programming this, there are frequent steps
where you're simulating and analyzing different
parts of your code in order to figure out how much space
this will use on the FPGA. This is a limiting factor: the more code you have, the more space
this will need. The more you parallelize, the more space you will need, et cetera. And then you can
analyze how fast the signal passes through this, for example. Does this work within a single
clock step, or do I need multiple, i.e., do I need to break this up into smaller
pieces?
And so there's lots of tooling around this.
And it's, of course, also an iterative process.
And then eventually, you will basically
create the actual design and then flash it onto the FPGA.
So I brought a couple of small examples of how code actually translates then into hardware
designs.
So this is fairly simple.
So this is a simple conditional statement, which will basically give you a multiplexer.
Here it's shown without lots of the additional details.
And I've mixed this up in the slides a bit:
you cannot take the code that I have on my slides,
put it together, and get a working
program, because I'm mixing Verilog and VHDL on different slides.
But here you can see, say for example,
this is a functional module.
So we have the process with an input sel,
which here would be at the input.
And you can also see this in the hardware structure.
Here we have basically a selection,
or a conditional statement:
if this input sel for selection is 0, then our output,
I've mixed up the subscripts here, is i0.
Otherwise, the output is i1.
And this will be directly implemented
as a multiplexer.
So we have our inputs to this process.
Essentially we have three signals coming in here,
but the one that we're looking at is sel.
And then we're choosing between i0 and i1.
And this is done immediately, right?
So as soon as we have something lying here,
like some kind of input here,
then this is basically switched through, right?
So there's no real time in between.
There's no clock involved here. As soon as there's input here, here, and here, we
get an output. If an input changes, the output changes immediately too. Of course, it takes a
certain amount of time, but it's below a clock cycle.
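As a small illustration, a 2:1 multiplexer like this could be written in Verilog as follows; a minimal sketch, not the exact code from the slides (the port names follow the sel/i0/i1 naming above):

    // Combinational 2:1 multiplexer: no clock involved,
    // the output follows the inputs directly.
    module mux2 (
        input  wire sel,  // select signal
        input  wire i0,   // chosen when sel == 0
        input  wire i1,   // chosen when sel == 1
        output wire o     // selected output
    );
        assign o = sel ? i1 : i0;
    endmodule

Because this is pure combinational logic, any change on sel, i0, or i1 propagates to o after just the gate delay, with no clock edge involved.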
A bit more complicated is the D flip-flop. This is one of the structures that you already saw in these logic islands, also in the small FPGA. It's a
really central part of the FPGA as soon as we're working with the clock. In general, we can also write un-clocked code,
meaning that we're not using the clock at all,
but then it basically, as soon as we're putting stuff,
like we're putting information on the inputs,
it will take some time for the signal
to traverse the overall circuitry.
Right?
And for this time, basically, we have to wait.
This is usually where the clock comes in, where we ask: how long does
it take to get through this whole module?
That should be one clock cycle, essentially. And we can build our modules or substructures
as large as can be done within a clock cycle.
And then, in order to basically keep this information
somewhere and not have different parts of the circuit
being in different state, but basically say, OK, within this clock cycle, for example,
I'm doing my arithmetic.
So I'm doing some calculations within a clock cycle.
And in order to synchronize somewhere,
we're using these D flip-flops.
So essentially, they basically switch,
or they keep their state until they are in a new clock cycle.
And there are two different versions of the flip-flops,
meaning they switch either on a positive edge
or a negative edge:
either when the clock changes from 0 to 1,
or when the clock changes from 1 to 0.
These are two different instantiations
of these D flip-flops, in how you build them internally.
And then you, I mean, we already heard, right?
There's a couple of different other things.
So there's a set and a reset input typically.
And we have like the input and we have the output
and the negative output,
just to make like some switching logic easier.
So with the reset, we can basically say,
well, whatever comes, this will always be zero.
With the set, we can say,
whatever comes, this will always be one.
And with D, basically, Q will change to the value of D
with the next clock edge.
And it will keep this value until it switches again.
And this we can basically use to store information.
Or we can use it to also have this kind of switching logic
based on this clock event.
So this is the important part:
with the clock event,
we do something.
It might also be the case
that we don't need the clock.
Say we have our logic unit or logic island,
and we don't want to use the clock in there. Then we can just
disable this part, or short-circuit around it in the logic unit, so it's not based
on the clock but just passes the signal through as fast as possible.
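To make this concrete, here is a hedged sketch of a D flip-flop in Verilog, roughly matching the behavior described above; a positive-edge-triggered variant with asynchronous set and reset is assumed, the exact set/reset behavior varies between devices:

    // Positive-edge-triggered D flip-flop with asynchronous reset and set.
    module dff (
        input  wire clk,
        input  wire rst,  // reset: forces Q to 0
        input  wire set,  // set: forces Q to 1
        input  wire d,
        output reg  q,
        output wire qn    // negated output
    );
        assign qn = ~q;

        always @(posedge clk or posedge rst or posedge set) begin
            if (rst)
                q <= 1'b0;
            else if (set)
                q <= 1'b1;
            else
                q <= d;   // Q takes the value of D on the rising clock edge
        end
    endmodule

A negative-edge-triggered version would use negedge clk instead; that's the other instantiation mentioned above.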
So now with this in a bit more elaborate example,
so essentially, I already mentioned this,
all code is eventually turned into registers and gates
in these logic blocks.
And we organize those in modules.
And the modules have inputs, and the modules have outputs. In between
these modules, we're going to put these flip-flops in order to keep the
state for a clock cycle. So with the next clock cycle we have
a stable signal as an input. Once the clock has changed and we have a stable signal,
the signal can pass through all of the components,
all the way to the next flip-flop,
and with the next clock cycle,
it will be available at the next module, or at the output.
I mean, it's available at the output as soon as it passes through,
but the flip-flop will synchronize this.
The flip-flop basically just makes sure that we're adhering to the clock
frequency of the chip.
And so here you can see, again, a simple example.
So we have a conditional statement, an if statement:
if A equals B, then we add 1 to C and put the result on D.
Otherwise, we add 2 to C and put that on D.
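In Verilog, this example could look roughly like the following sketch; the 8-bit width and the module name are assumptions for illustration, and the flip-flops discussed next would be placed around this combinational module:

    // Combinational module: compare a and b, then add 1 or 2 to c.
    // The comparison becomes a comparator feeding a multiplexer
    // that selects between the two adder results.
    module cond_add (
        input  wire [7:0] a,
        input  wire [7:0] b,
        input  wire [7:0] c,
        output wire [7:0] d
    );
        assign d = (a == b) ? (c + 8'd1) : (c + 8'd2);
    endmodule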
And this selection statement, again,
we know would be built
with a multiplexer.
And this we can see here, right?
So here we have our multiplexer.
Then we have some addition logic.
So this is something that Verilog or VHDL
will give us already as predefined building blocks.
We don't have to define this in logical
AND and OR gates, etc.; this will be directly translated onto these logic islands already.
And this we can do in one module, right? So this is then one module with three inputs, A, B, and C, and one output.
And then we'll put the flip-flops here in between, basically for all of the inputs.
I mean, here we need three flip-flops times the number of bits that we want to keep here.
And then you need one flip-flop times the number of bits for
the output down here. So at one clock cycle we have the input here,
with the clock the signal passes all the way through,
and after one clock cycle we'll have it at the output.
Whenever the input changes, with the
next clock cycle, through these flip-flops, we'll get the new result. And the simulation
will determine this: while we're programming this, it is translated into hardware structures,
and then it's simulated. The compiler
checks how fast the signal can actually
go through this whole module,
and then, based on how fast this is,
it will determine the maximum frequency
that we can use for this.
In this kind of very small structure,
this is probably a single cycle.
So this is basically at the maximum frequency
that the FPGA can give you,
maybe 600 megahertz or something like this.
But if this is a very complex structure,
so we have multiple additions. They work on top of each other,
we have something else that's happening in there,
this might actually take longer.
So this might be too long for a single cycle.
And what happens then is that flip-flops
will not necessarily be added here automatically;
instead, the frequency will just
be scaled down.
So the clock frequency of the design
will just be halved or reduced even further.
We don't get the same frequency,
just so the signal can get through the overall structure.
And there's a trade-off again.
I mean, it might make sense to add more flip-flops
and be able to better pipeline the processing, et cetera.
Or it might actually be more efficient to just have
like a slower clock cycle or less frequency, essentially,
and have a larger structure just to have everything processed
in a single clock cycle.
So, in the end, maybe the latency is lower if we have a larger structure,
but the throughput is probably also lower, because we're just doing one item at a time, right?
So one input per clock cycle. While if we're splitting this up into multiple stages,
we can pipeline across the different stages,
so we can feed in new input while we're already doing
some calculations.
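As a sketch of this trade-off, here is the earlier example split into two pipeline stages; again a hypothetical illustration assuming 8-bit values, not the slide code:

    // Two-stage pipeline: stage 1 compares, stage 2 adds.
    // Each stage only has to fit in one clock cycle, so the clock can
    // run faster and a new input can enter every cycle.
    module cond_add_pipelined (
        input  wire       clk,
        input  wire [7:0] a,
        input  wire [7:0] b,
        input  wire [7:0] c,
        output reg  [7:0] d
    );
        reg       eq_r;  // pipeline register: comparator result
        reg [7:0] c_r;   // pipeline register: c carried along

        always @(posedge clk) begin
            // Stage 1: compare and carry c forward.
            eq_r <= (a == b);
            c_r  <= c;
            // Stage 2: add, using the previous cycle's stage-1 results.
            d <= eq_r ? (c_r + 8'd1) : (c_r + 8'd2);
        end
    endmodule

The latency is now two cycles instead of one, but a new pair of inputs can be accepted every cycle, so the throughput goes up.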
Clear so far?
It's not super hard, right?
OK.
So let's look at some basic building blocks again in a bit more detail.
We have VHDL, which is the VHSIC Hardware Description Language, where VHSIC in turn stands for
Very High Speed Integrated Circuit; these hardware people like to nest abbreviations
inside each other. And I took this because it has this nice layout. Verilog is very
similar. Again, you've already seen this, right? So you have mainly three basic building blocks. First, there are fundamental signal units.
So you have signals coming in and signals going out.
And then you have entities, which are like modules that I explained earlier, which are input and output definitions.
So you define what goes into my module or my entity and what comes out of my entity.
And then you have the architecture which describes the actual functionality inside.
So those are the three kinds of definitions. With these, you can derive for an entity, for example, how many flip-flops you will need at the input and at the output.
This can be seen in the input and output definitions of the entity:
how much wiring needs to go into a module,
how many bits there are, etc.
And the architecture internally says: what kind of switching do we need inside?
So what happens actually with the signals that are going in and that are going out?
And so there's a simple 16-bit multiplier example.
So we have an entity.
I don't have the signals here,
but we have the entity input-output definition,
where we have three input ports
and one output port.
We have the clock itself, so this is clocked, right?
This will be done
on a clock-cycle basis.
And then we have two numbers coming in,
16-bit unsigned integers,
and we also declare them right away as 16 bits,
so 15 down to 0 as a vector.
So two vectors coming in and one vector coming out of 32 bits
in order to not have an overflow.
And then internally we have some logic and
here we have basically our register or flip-flops that will store
the intermediate or the output for one clock cycle and then pass it on
in the next clock cycle.
So this is what we define here: we define this input, we define this output, and we define that we're going to work with the
clock here.
And then the actual architecture in here is a process that uses the
clock.
If we have a clock event and the clock is 1,
then we multiply these two integers
and put the result on the output.
And that's it, right?
And then the compiler and the synthesis will decide
what frequency I can run this with,
and if you're using this somewhere else,
it will plug this module
into the other parts.
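The slide shows this in VHDL; as a hedged sketch, a rough Verilog equivalent of this multiplier would look like this (the port names are assumptions):

    // 16-bit multiplier with a registered 32-bit output,
    // so the product cannot overflow.
    module mult16 (
        input  wire        clk,
        input  wire [15:0] a,
        input  wire [15:0] b,
        output reg  [31:0] prod
    );
        always @(posedge clk) begin
            prod <= a * b;  // multiply and register the result each clock cycle
        end
    endmodule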
Okay, good so far?
Very good. Okay, so let's go
through the design flow. So now you saw, at the very low level, a few details of how this works.
Of course, if you want to build more exciting programs
on the FPGA, then there's a lot more.
You need to have a lot of modules,
and the things these are
used for might be protocols, as I said earlier, like network protocols, or some stream processing,
something like this.
I'll show some examples later, not at this low level of detail, but more on a high level,
how this is structured.
And well, if you're building an FPGA design, then you
somehow need to get your code to this parallelism on the FPGA.
Right.
So you're either defining this in the hardware description language or in some higher-level language. Then comes the synthesis, which produces a logical view of this
code, a logic-gate-level representation, meaning we have logical units, like logic
islands, and they are connected in this or that way, right? There are lookup tables in
there, and there are flip-flops in there,
and this is how we would connect them.
And that will be done in a generic fashion.
So this basically then can be mapped onto any FPGA.
And then comes the step of place and route,
which means we're putting the circuit onto a specific FPGA.
So we're saying: okay, where do which parts go on the FPGA? I mean, if you remember, these FPGA designs look
quite peculiar. If we briefly go back here, right, this is the design that we had on the small
FPGA that I brought last time.
Then you can see, this is the actual layout of the chip, right?
You have these individual programmable logic blocks, as they're called here, which are
on the chip; then you have a bit of block RAM mixed
in between; you have the I/O coming in from outside; and within each logic block you have eight different flip-flops
and multiplexers and four-input lookup tables. And then it's basically: okay, I know these are connected in this way.
I have these inputs and outputs.
And this is basically then what actually needs to be correctly configured,
depending on where these lookup tables, et cetera, are and how these are connected, right?
And then, once the place and route is done,
so we have something that fits this layout,
it needs to be generated into a format that
can be read by the FPGA.
The FPGA itself has some small hardware to actually flash
these lookup tables, right, to reprogram them and the routing network in there. And that
needs to be encoded. So that's just a simple formatting step, essentially. And then you
flash it onto the FPGA, so it will be transferred to the FPGA.
There it will be checked, of course,
whether it's still correct after transmission,
and then put into these lookup tables.
That's done internally.
So here, as an example, is the Xilinx flow.
If you're working with Xilinx,
that is AMD now;
Xilinx was bought by AMD at some point,
so these are AMD FPGAs.
Here, typically, you're writing your code
in something like Verilog.
And then there is the Xilinx synthesizer.
That's the first step, which translates, as we said, this code into the
gate-level and register-level design.
So this first step
turns the HDL specification into a gate-level netlist.
There's a specific format for this,
let me see,
the Native Generic Circuit format.
The next step, and this is actually multiple steps,
translates these into a Native Generic Database
file.
I already said that.
So there are multiple steps, where we first have some rough logic blocks, then something more specific to the individual FPGA, which will then be mapped onto the FPGA design or
rough overlay design.
Then we're mapping the native design that we have to the actual
resources that are available on the FPGA.
Every FPGA has a slightly different number of lookup tables and logic
blocks, configured differently.
In the placing step, we're basically checking how and what we can actually
place.
And then we're trying to figure out the optimal placement on there.
It depends on how the signal needs to flow: for example, where the RAM
is on the FPGA if we need to store something in between. We might need to place logic units closer to the RAM,
or, depending on which modules are connected,
we want to place them close to each other.
But in the end, we just have a 2D map,
basically, with certain connections.
And we need to solve this in a way such
that we still get the maximum throughput that we can get.
So the signal doesn't have to travel too far,
and we're using the space optimally.
Because if we want to, say, for example, replicate something on there,
or we have multiple modules that need to be connected to each other,
we just have to see how we can actually connect them to each other.
And, well, then we're generating a bitstream, which is basically this map translated to whatever
can be flashed onto the FPGA.
Then we're actually programming the FPGA. And this is done through all these individual tools and libraries and compilers,
essentially. At this stage right now, this is all closed source. So this is all very specific to individual FPGA vendors:
if you program an Intel FPGA, you have a different tool chain than if you program an AMD FPGA.
And all of these individual tools are quite expensive. You need special licenses.
It's quite tricky to set this up. And usually these tools only work with very specific FPGAs.
So you have a certain version that works with a certain FPGA.
And there are also version conflicts between these different things, right?
The bitstream generation and the mapping
are very specific to the FPGA,
so you need a very specific version to get this done.
So this is a bit annoying.
It also means that support moves on quickly:
if they build new FPGAs, we need the newest workflow,
and then older FPGAs might not be supported anymore.
So you need the very specific version; at least for us,
that's always been a bit tricky.
But there is some hope.
So there are some efforts to make this more open source also in order to gain
more traction, to make it easier for people to program. And one tool chain that's out there,
that's what I've used for this small FPGA here.
This is an open-source tool chain
that's based on Yosys.
So there's a synthesis tool that can do Verilog
and that's open source.
And then there's a tool called nextpnr for place and route.
And then, of course, at a certain point, we're again getting hardware specific. First the code: Verilog and VHDL are open, I mean, not open source, but open
descriptions.
So these are more or less standardized programming languages;
you have an idea of what they're supposed to do. And the native
descriptions, so the abstract logical descriptions of how logic
blocks should be connected to each other,
these can also be sort of generic. But at a certain point, as soon as we really want
to place this on a certain FPGA, this gets very hardware specific.
And for this, fortunately, for the FPGA that I've shown you, there's also open source tools.
So there are the Project IceStorm tools, named after the iCE FPGA family.
There you have a bitstream tool,
which is icepack,
and then you have iceprog for flashing the FPGA.
The standard workflow
with these tools that I showed you uses Verilog,
but you can also use VHDL
with a GNU tool,
GHDL.
So you can use this to translate the VHDL, and then use
Yosys, then nextpnr for the place and route,
and then again the IceStorm tools to actually put it
onto the FPGA.
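To give an idea, a typical open-source iCE40 flow might look like the following; this is a hedged sketch, the file names like top.v are placeholders, and the exact device flag (here --hx8k) depends on your board and tool versions:

    # Synthesize Verilog into a JSON netlist
    yosys -p 'synth_ice40 -top top -json top.json' top.v
    # Place and route for an iCE40 HX8K, using the pin constraints in top.pcf
    nextpnr-ice40 --hx8k --json top.json --pcf top.pcf --asc top.asc
    # Pack the textual layout into a binary bitstream
    icepack top.asc top.bin
    # Flash the bitstream onto the FPGA
    iceprog top.bin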
OK, questions so far?
Not so tough, right?
Okay, good.
Then we're gonna do a quick break here,
four minutes or so, and then we're gonna go
into how this is actually used in data processing.
So how do we use the FPGAs in data processing?
So, first some more details about parallelism before we really dive into
examples.
We remember there are essentially a couple of different ways of building parallelism
into our system.
We can have data parallelism, as we said here.
We can have pipeline parallelism. And we can have task parallelism,
where different parts of the program, different tasks,
run at the same time.
And data parallelism means that we
apply the same task on multiple data items.
And that, of course, is very simple in an FPGA, right?
So this just means we're replicating logic on the FPGA.
So instead of doing one multiplication on the FPGA,
we want to do two multiplications at a time.
We just need the same circuitry twice.
And that's basically it.
And the amount of parallelism that we can get is just limited by chip space
plus the amount of data we can get in, like within a certain amount of time, essentially.
So I mean, this is also to some degree limited, right? So remember, this is usually connected
through PCI Express. And that's basically the amount of data that you can move in and out of the FPGA.
So whatever you can get in there, plus there might be other connections as well, right?
So there might be network connections on the FPGA.
So that's the other way of getting data into the FPGA.
But again, this limits the amount of data that you can get through.
So say we have an InfiniBand HDR connection,
4x, that would be a few hundred gigabit, we said, right?
If I remember correctly; I would have to look it up,
I always forget these numbers.
But say on the order of maybe 50 gigabytes
per second or something like that.
That might be something we can move into
the FPGA if we have really
current hardware. In most cases, the cards that we have here do less,
so something on the order of 10 gigabytes per second.
But that's what we can move in.
And with this, I mean, this is not gonna be
like a single byte at a time.
It's gonna be multiple at a time.
So then we can use some parallelism.
And for this, we can replicate stuff.
And so we can have something like map functions
that we build by replicating logic. We can build SIMD units on there.
So we just process vectors of data rather than individual data items.
An example would be vector addition.
So this is kind of what this would look like.
So we have data coming in.
This is dispatched to the different replicas.
So it could be a vector, for example.
And then it's collected again.
We can process, go on with our lives, and continue from there.
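As a hedged sketch of this kind of replication, a vector addition with four replicated adders in Verilog could look like this; the packing of four 32-bit elements into one wide bus and the module name are illustration assumptions, not slide code:

    // Data parallelism: four 32-bit adders side by side, so four
    // vector elements are added in the same clock cycle.
    module vec_add4 (
        input  wire         clk,
        input  wire [127:0] a,    // four 32-bit elements, packed
        input  wire [127:0] b,
        output reg  [127:0] sum
    );
        integer i;
        always @(posedge clk) begin
            for (i = 0; i < 4; i = i + 1)  // unrolled into four parallel adders
                sum[i*32 +: 32] <= a[i*32 +: 32] + b[i*32 +: 32];
        end
    endmodule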
We also can use pipeline parallelism.
And this is basically this trade-off that I told you, right?
So we can have one large chunk of circuitry
and wait until the signal goes all the way through,
or we break it up into smaller parts.
And if we break it up into a sequence of subtasks,
then they are directly wired to each other.
So we just need these flip-flops essentially in between,
but we don't need any other communication in between.
These subtasks are directly wired to each other, right? This is on-chip networking
or wiring, which is programmed at flash time.
Right?
So we're flashing this on.
We saw there's these small gates with small SRAM cells.
So these are programmed, and then they
are directly connected.
So we can build the pipeline parallelism on hardware level, where
we have one large computation task that we break down into sub-circuits and then we put
these registers or flip-flops in between that then remember the outcome of the computation.
And at the next cycle, so this is basically then all of a sudden we're using a clock here.
At the next clock cycle, this becomes available at the next level, on the next pipeline step
essentially.
And the benefit of this is we're going to get a higher bandwidth, right?
So we can basically produce, well, I mean, if we have one large block,
basically, we have to wait until everything went through.
And the signal needs to be on this end until the signal has gone all the way
through.
Otherwise, we're in some kind of undefined state.
If I put my signal here, and while it's still being processed I switch the
signal here, then the output will change again.
So we're getting into a weird state.
This is where our synthesizer,
the translation, will tell us: well, this won't work, right? You're going to get into
problems. You can still build something like this, but you're going to get undefined output,
basically. It's not working, essentially. So what happens is: we have to put the signal here,
wait until it went all the way through,
and then we can change the signal on the beginning again.
We can, of course, split this up into multiple, but if they're just directly connected,
it's still the same problem.
We still have to wait for the signal passing here, passing here, passing here, etc.,
until all the way to the end in order to have the final output.
We can change this to this pipeline setup with registers in between, which means we're moving
the data through the first step all the way to the
register. And as soon as the clock cycle changes, I can put a new signal here. This means the throughput will be higher. The latency most likely will also
be higher, because we might be very fortunate that this step
takes exactly as long as a cycle, but most likely it won't. Because if we're
really at the cycle border,
well, we're going to get into the same problem
that every now and then it will be too slow,
and we'll have undefined behavior.
So this means each stage needs to be something
that can be processed under a cycle, right?
Under a clock cycle, faster than a clock cycle.
The unpipelined structure might take just a bit more than a single clock cycle,
so there we just have to wait a bit longer
before we change the signal.
Here, in the pipelined version, we'll have to wait basically three steps for the result.
But unlike here: say the unpipelined version might be done in two clock cycles;
then we can only process one data item every two clock cycles.
The pipelined version will have a latency of three clock cycles, but we can process three items in three clock cycles, right?
So this gives higher throughput at a higher latency, essentially.
Again, the length of these pipelines and the amount of
pipeline parallelism that we have is
just basically limited by our program and the size of the FPGA.
So we could build one long pipeline all the way through the FPGA and just fill it up with computation, putting in new input every clock
cycle, and once it went through all of
the computation, we have the output. That might have a high
throughput but very long latency, because we have to wait
all the cycles until this goes through.
And then of course we also have task parallelism.
And this is also what we definitely want to use at an FPGA, right?
So basically we want to do different things on the FPGA, different parts of our program,
different types of operations.
If you do query processing,
different operators,
and we will do them in parallel.
So we can build these different operators as modules.
You have the input to the operators
and we can basically apply the different operations
either on the same kind of data item,
but also on different kinds of data items in parallel.
And well, we can do completely different tasks in parallel.
We could even, say, map two queries onto the FPGA
at the same time, or have networking plus query processing plus
something else on the FPGA at the same time, just because whatever fits on the FPGA, we can
put together and wire together, essentially. A good example, at least one used a lot in
research, is evaluating different
regular expressions in parallel, right?
This also makes sense for network protocols or protocols in general:
we might have some regular expressions to express the routing,
whatever you want to do.
In databases, one of our standard examples is sorting. This
is something that we've seen before with sorting networks, which we can do in SIMD,
but we can also do on the FPGA. It works very well on the FPGA, basically the same kind of setup. You remember, we can build a sorting network
and then just do compare-and-swap operations
in between, based on the inputs.
And if we have this kind of sorting network,
logically, in SIMD, for example, or something like that,
we can do the same thing in hardware on the FPGA.
So we can really build the same kind of architecture,
the complete network, on the FPGA,
and do this sorting in hardware.
Meaning, here on the bottom, I have a fundamental circuit element
that would do this kind of sorting. We have our inputs, 32-bit integers, for example,
or whatever, 32-bit numbers. And then we have a comparator operator.
So this is actual circuitry
that needs to be mapped onto these logical units.
We cannot do it with a single bit;
it needs to be done with multiple bits,
so a single multiplexer won't be enough.
We need more than a single multiplexer to do this.
But in the end, we need to compare
across all the bits.
And then, based on the overall comparison,
we have multiplexers that basically say: OK,
if A is larger, then set this multiplexer to 0
and this one to 1;
if B is larger, set this multiplexer to 1 and this one
to 0, such that we're either keeping the order of A
and B, or we're swapping the
order of A and B. That is all this does, right?
So, this is our basic operation.
We compare the two and if they're in the right order already, we keep it.
If they're in the wrong order, we swap them.
And this is all that we need.
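A hedged Verilog sketch of this compare-and-swap element could look like this (module and port names are assumptions):

    // Compare-and-swap: one comparator plus two multiplexers.
    // The smaller value comes out on lo, the larger one on hi.
    module cas32 (
        input  wire [31:0] a,
        input  wire [31:0] b,
        output wire [31:0] lo,
        output wire [31:0] hi
    );
        wire swap = (a > b);       // comparator over all 32 bits
        assign lo = swap ? b : a;  // multiplexer: keep or swap
        assign hi = swap ? a : b;
    endmodule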
And out of those, then we need multiple of those.
Out of those comparators, we can then build this whole network.
Each of these lines basically is one of these comparators.
We could count them now.
I think we said 13.
This is what we need.
I'm not going to count it now.
Basically, all numbers will be compared to each other
eventually.
And then after doing all of the comparisons,
we have the totally sorted output.
This works as a whole network, meaning
we're just putting the numbers on the input,
and we wait until the signal traverses all the way through.
So we could do that.
Essentially, we're just putting the numbers here,
but it's a fairly deep circuit already,
so this might take a bit long.
So what we can also do is we can basically put some pipeline stages in here.
And this then basically means with these pipelines,
like putting the registers in between,
this means we need for every number,
we basically again need some storage
where after a clock cycle, it will be stored.
And at the next clock cycle,
it will be available at the next sorting stage.
And here you can see there's at most one
comparison per number in every stage, with between two and four comparisons per stage. So we
get all of the comparisons done in six stages, right? This means that six clock cycles after getting the input,
I will have the complete sorting done for eight numbers here.
And of course, I could also build this bigger.
I could do this for 16 numbers.
But at a certain point, I just need so much chip space to do this
that it gets slow.
So it might be more efficient to
use something else for continuing the sorting after this. And yeah, that's one way of doing sorting,
and that actually works quite well, right? So for this, you get an efficient sort operator
with different ways of doing the pipelining.
And again, where to put these pipelines depends on how fast the signal goes through the network
and how fast then the overall clock rate will be.
If you have other parts of your design that make it slower already, then you might actually consider
reducing the number of pipeline stages here in order to not wait so long for the throughput.
You have to wait for other parts anyway, so this might not be the bottleneck here.
Another thing we can do is aggregations.
What we can do is aggregate multiple values at the same time, where we build an
aggregation setup. An aggregation might be minimum, maximum,
average, the sum, et cetera.
I mean, sum is very simple, right?
Or minimum or maximum.
Minimum and maximum are even simpler, if you think about it,
right?
You basically just need to compare the different values
to each other, and you always keep
the minimum across pairwise comparisons of two. So you just do it step by step.
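As a hedged sketch, a streaming minimum aggregation in Verilog can be as simple as one comparator and one register; the module name and reset scheme are assumptions:

    // Streaming minimum: each clock cycle, the running minimum
    // is updated with the next input value.
    module min_agg (
        input  wire        clk,
        input  wire        rst,      // start a new aggregation
        input  wire [31:0] in,
        output reg  [31:0] min_val
    );
        always @(posedge clk) begin
            if (rst)
                min_val <= 32'hFFFFFFFF;  // neutral element for minimum
            else if (in < min_val)
                min_val <= in;            // keep the smaller value
        end
    endmodule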
What you can also do on an FPGA: rather than always putting the whole program on the FPGA
and flashing the whole FPGA with your new program, modern FPGAs make
it possible to have partial sub-areas that you change.
In this case, you could say:
well, I need a new type of aggregation, so let me replace this aggregation block
with a different aggregation block. So I don't need to implement
all aggregations in my aggregation operator; I'm only flashing the one
that I actually need right now. That
gives you some additional functionality and some additional
flexibility to make this faster. Say, for example, here we have an example
where we have a selection query with two different conditions. These different conditions
could then
be different implementations of a partial area that
will be used within the FPGA.
So then we don't need to flash the whole FPGA.
We only need to flash the part of the FPGA.
So that's a bit faster.
Also, the routing might be cheaper for these subparts.
And we can produce multiple layouts already
with just different subpart change
if these different kind of aggregations
fit into the same subpart.
And the benefit is that the flashing is then a bit faster.
The flashing actually is not really the part that takes the most time;
the place and route is the part that takes the most time.
But if we do it this way, then we know that the rest of the chip will stay the same,
so we can exchange just certain modules.
If I do the place and route, etc., from scratch again,
most likely the resulting layout will be completely different,
or at least other parts of the FPGA will be different as well.
So with partial reconfiguration we can make sure that this is a bit more efficient
overall and more configurable.
And so this means we're updating only parts,
without necessarily stopping the flow of data,
so we can continue processing in other parts.
And we can do hardware-software co-designs.
So we can also just do restrictions and aggregations
inside the FPGA and we could do the rest
of the query processing on the CPU.
So we just send basically whatever we want to aggregate,
we send to the FPGA.
The FPGA does the aggregation very fast, sends the aggregation
results back, and we just plug the result back into our query. This, of course, only makes
sense if what needs to be processed on the FPGA is large enough and involves enough
computation that it's worth moving the data onto the FPGA
and back, because that adds a significant amount of latency again, since we typically need
to go through the PCI Express bus.
But if that makes sense, then this is, for example, one way of utilizing this.
And I've put like an example of a paper
where they tried exactly this.
There was a lot of work on this in SAP research
where they tried to somehow plug in FPGAs
into the database system.
As far as I know, not successfully.
So I mean, successfully in research,
not necessarily successfully in practice.
And, well, what you can also build is
a fully pipelined hash table.
There's some work that people did on this:
Zsolt István and Gustavo Alonso, at ETH Zurich.
Zsolt is now at TU Darmstadt and regularly in our seminar,
which is also why we see some FPGA topics there every now and then.
Something that they built is a fully pipelined hash table,
meaning they built all the hash functionality
in the FPGA. Rather than doing
a lookup in the hash table and then comparing
all of the elements in a certain hash bucket one by one,
you can do this in parallel:
you can do all of the comparisons in parallel, especially for all
of the elements that are within the same hash bucket.
You can compare them directly.
And here they also tried to scale the key-value store, again doing a co-design with the memory,
so with DRAM.
And in this case with a 10-gigabit data center application,
so this is part of this Catapult idea,
where you have network cards that are FPGAs
in the Azure data centers.
So there basically you can get the data
or the lookups through networking
and then directly do the comparisons on the FPGA
rather than going through the CPU
and doing the comparison here.
And here they had some, I mean, for the details
you will have to go to the paper.
But you can see that it depends on the key size.
A larger key size basically means more
chip space that we need.
And with more chip space, on the one hand
we get less parallelism, and we get more complex logic,
which will decrease the frequency that we can use.
And you can see from key sizes of 20 bytes to 160 bytes, they are still, let's say, a factor of 10 to 30 faster than x86
by basically having this kind of lookup inside the hardware
with also a very low latency in terms of clock cycles.
Although the clock cycle latency here, you can also see,
is basically going up, or the number of clock cycles required
is going up with the key length, basically
because we have to either split up the keys
or reduce the clock cycle frequency.
Okay, so as a quick summary, for FPGAs, FPGAs are slower than CPUs or GPUs in terms of frequency.
It's still in the megahertz range, so hundreds
of megahertz maybe in comparison to gigahertz
that you can get on the CPU or GPUs.
But they make up for it with parallelism.
Everything that goes in is super parallel,
so you don't need to sequentialize your program in any way.
Remember, we have parallelism on the CPU and the GPU too, but your program is essentially
still processed per core in a sort of sequential manner.
There's some reordering and whatnot, lots of interesting logic, but the FPGA does everything at the same time.
Only once we're putting these clock cycles,
once we're putting the registers in between,
we basically sequentialize stuff.
Otherwise, all of the signals are always processed in parallel.
And because of that, because they're slower,
and typically also not as large,
at least not as large as GPUs today, they're also more energy efficient. Because the clock
frequency is much lower, they just require a lot less energy, essentially.
The design flow is quite involved: programming is complicated,
and placing and routing is complicated.
If you have a complex program,
it can take more than a day to actually produce the layout
that goes onto an FPGA,
which is also annoying if it doesn't work in the end.
So if your application changes a lot,
FPGAs are not a good fit, right?
You need to make sure
that your application
is something that can run stably for a while.
But there's some help essentially by having this replacement, like partial reconfiguration.
So if you can basically have different modules and you can replace parts of your configuration,
you can pre-compile those and change it.
That's fast again, right?
So flashing the FPGA doesn't take a lot of time.
This is in seconds, essentially.
Even probably for the larger FPGAs, the modern one, this is probably sub-second.
The place and route, so getting the actual schematic,
this is something that takes hours and days.
And this is really the larger the FPGA, the larger your program, the longer it takes.
I mean, you remember, right?
My very small program, this already took some time.
It's not instantaneous, even though it was basically nothing that the FPGA needed
to do.
And it still takes some time to get this all the way through.
If the programming is more complex,
then this compilation step will take a very long time.
But of course, you can also create some hardware
with some reconfiguration options at runtime.
So I mean, essentially, you can simulate a CPU on the FPGA.
So you could have a program counter.
You could have a program or instructions coming to the FPGA
that lead to certain behavior on the FPGA:
you have different functional units,
and you can switch on
different functional units, et cetera.
At a certain point, it doesn't make sense anymore,
because then it's much more efficient to use the CPU.
But if you have dedicated instructions
that you want to do, dedicated parts of the hardware,
then again, it might make sense to get
some reconfigurability there or modifiable behavior at runtime.
Okay, and with this, that's all for today. Questions about FPGAs? We'll hear
more about FPGAs next week on Tuesday from Piotr.
On the tour we won't be able to see an FPGA directly; I'll show you the servers where an FPGA is inside.
But I think that's the best I can do, unfortunately,
in the data center.
It looks something like this:
just a card, and one large chip like this
would then be the FPGA.
OK.
No questions?
Yes, there is a question.
I was wondering, for the example you showed
with the smaller-than comparator on 32 bits,
how many logic cells would that need?
Would that take up much space?
So I mean, it's a good question.
So essentially, we needed something
like a multiplexer for this.
So, as we said earlier, let me show; I don't want to hypothesize too much,
because there might be much better ways to implement this.
But essentially, this is the simplest way
we can do this in hardware, right?
For every bit, we compare.
Then you need some carry-over logic,
which basically says: what was the previous comparison?
We start with the most significant bit and do the comparison there.
If it's equal, we go to the next one,
and again to the next one,
until we find a difference.
And then we know the result. And this basically
means for every bit, we need something
like this multiplexer here.
So that would then be one logic cell?
Probably, in our case,
if we wanted to do it for 8 bits,
I would say in this example it would be one logic block,
basically, for 8 bits,
because we have 8-bit input, right?
I mean, the lookup table has four inputs;
there's a four-input lookup table.
I don't know exactly, but we have four inputs here,
and we have the carry logic.
So it might be enough that
we get four bits out of a single lookup table.
Then for one byte,
it would be two lookup tables.
For comparing two bytes it would be four,
four of these lookup tables, right, so four logic cells, essentially.
How many of those were there on the chip?
On this one, there were eight
in a programmable logic block, and overall 7,680 logic cells,
so close to 8,000 logic cells on this one.
So not that much on this one, but that's a very small one. On the larger ones, we have 400,000
elementary logical units, which
each have basically 12 bits.
So that's a lot more space.
There's a lot more space on those,
yes.
On the small one, you're not going to get a very complex program.
The larger one fits a lot, depending on how much parallelism you use. Of course,
you can easily fill it up by just replicating stuff. But the smaller one
fills up quite quickly.
As soon as you have a more complex program or layout,
it will probably already be full.
You're welcome.
Good.
Okay.
So then, thank you very much.
Thanks for coming despite the weather today.
And I'll see you all next week for the presentation.
And then hopefully the data center tour, if you're curious.
Thanks.