Hardware-Conscious Data Processing (ST 2023) - tele-TASK - Field Programmable Gate Arrays (2)
Episode Date: July 12, 2023...
Transcript
Welcome everybody to our next session. Today we're really going to talk about FPGAs.
But before that, just an overview of where we are. This is the second-to-last topic that we're
going to discuss. The last topic will be CXL, which is something very new, fresh off the
press: a technology that's just been started, and we're fortunate to
already have some resources to work on it, so we can tell
you a bit about this new protocol and interconnect. Marcel will do that in one of the next sessions.
Here's kind of the overview of what we're going to do.
We'll see, maybe I can finish up FPGAs today, maybe not.
If I don't finish, we're going to finish on the 19th
and then end with some Q&A, or maybe I'll already do
the summary there, because, as I said, the last
week I'm not going to be here. We will have one more task to do, the lock-free data structure:
a lock-free skip list, which should actually be quite fun and interesting to implement. And, I mean, we talked about this, there are many different ways of implementing
lock-free data structures.
We'll give you one way, one example of how to do this, and then you can
test yourself whether you can make it more efficient, et cetera.
And then, I mean, I know not everybody's here today,
but who would actually be interested in the data center
tour?
Ah, cool.
So I'll probably ask Marcel to organize this,
because I'm not here.
But this will be fun.
You'll see all the hardware.
So maybe reach out to Biosyn soon to set this up.
It's getting increasingly complicated
to get into the data center because there's
lots of security measures in there.
But we have good access.
So you can actually go there and see the hardware,
but better not touch it.
It's not necessarily that you're not allowed to touch it,
but you might get hurt if you do.
And we can also show you a CXL server there.
But for that, I guess you shouldn't take pictures, right?
But you can see it.
So that's actually also cool. Right. And yeah, we might swap around the last couple of sessions a bit.
If something changes there, you will get the info through Moodle, etc. Okay, so today, we'll see how far we get.
For sure, I'll do the FPGA introduction,
so you'll see what an FPGA is,
and then we'll also see the architecture,
get a bit of an overview in programming.
So, unfortunately, I haven't gone as deep there
as I would have liked.
So, I mean, we will iterate on this further in the next
sessions, and there's a lot of good work and a lot of material on this that you can find.
But you will get a hint of how this works and a bit of understanding. This session is really more about the architecture.
So you get an idea how this architecture
or this kind of processor is different
from a regular processor.
So an understanding of how you program this
or how it works internally.
The way you program it is a bit more high level.
Of course, you could program it at this low level,
but that would be super hard.
There's a lot of tooling around it, but you should get an understanding of why it works the way it works. That's my idea. Then we'll talk briefly about the design flow, and then about how to use it
in data processing. This might run over to next session, we'll see. Okay, so some of this
you already know, but this is kind of always the same reasons why we
have to do what we have to do, right? So you already know we're getting
bound by physics. We cannot keep scaling down the circuitry; it's just not economical
anymore. And we basically cannot put ever more power into the CPUs because we have increasing
leakage, meaning that the denser the CPU gets, the more power we have
to put in just to keep it running.
This is shown by these trends.
For a long time, we were just using single-core, we were just scaling down the circuitry in size, basically,
and just increasing the frequencies,
and with that would get better and better performance.
And this, to some degree, stopped in the early 2000s. So then, all of a sudden, we started increasing the number of logical cores
instead of increasing the frequency further.
What made the frequency scaling work before is called Dennard scaling.
By just keeping the same chip space
and just decreasing the circuitry,
we can keep the current constant
and use the same kind of power, but pack
everything more densely.
But at a certain point, this stops because of the leakage current.
So then there's basically the problem that we cannot decrease the size anymore because
we just have to put more power into it.
And a current CPU basically is like a heating plate, right?
If you have a stove at home,
this is basically what's going on in the CPU.
And we're just putting like cooling system on top of it
to just remove this kind of power.
So it's really like you turn on your stove at home
and then put a cooler on top, or a ventilator,
to somehow keep it at room temperature.
That's basically your CPU today in terms of power density.
And that's not super economical, of course.
So we somehow need to move somewhere else.
So one idea initially was more cores, but this also has, as I said,
some limitations because of the end of Dennard scaling. The idea of Dennard scaling is that we double
the transistor density, but if we keep the same chip area, the power consumption stays the same.
And this stops because, as I said, of the power leakage, or the current leakage.
And so this means we cannot increasingly reduce the size.
We cannot increasingly increase the density.
So one way is just add more cores.
But at a certain point, this also doesn't work,
because we cannot keep up with the power consumption.
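As a back-of-the-envelope sketch of why this worked and why it stopped (standard textbook formulas, not shown in the lecture):

```latex
% Dynamic power of switching circuitry
% (activity factor \alpha, capacitance C, supply voltage V, clock frequency f):
P_{\mathrm{dyn}} \approx \alpha \, C \, V^2 \, f

% Dennard scaling: shrink the feature size by 1/\kappa, so per transistor
%   C \to C/\kappa, \quad V \to V/\kappa, \quad f \to \kappa f,
% and the power per transistor becomes
P' \approx \alpha \, \frac{C}{\kappa} \left(\frac{V}{\kappa}\right)^2 \kappa f
   = \frac{P_{\mathrm{dyn}}}{\kappa^2}

% With \kappa^2 times more transistors in the same area, power density
% stays constant. This holds until leakage current, which does not scale
% down with the voltage, breaks the assumption.
```

The point is that only the dynamic part of the power scales down with the transistors; the leakage part does not, which is exactly the problem described above.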
So, and one way to deal with this
is basically just not use all of the cores at the same time.
So you've seen this in the M1, for example, right?
We have the efficiency cores
and we have the performance cores,
and we can do either or,
and we then basically,
we don't use everything at the same time.
And there's also like specific architectures
like ARM Big Little,
where you have also big cores and small cores,
and we're just using part of the chip space,
and by that, basically also reducing the overall power consumption.
The unused parts of the chip are called dark silicon.
Using this dark silicon, we can still improve things:
we can specialize part of the hardware,
but we're not really using the chip that efficiently.
So this basically means we have a problem,
and we need an alternative.
And the alternative is basically going
into ever more specialized devices.
So I showed you this last time already, right?
So we see we have this very flexible CPUs
and multi-core CPUs today. Then we have more specialized hardware and accelerators like GPUs
and there's other accelerators, especially for machine learning today, like TPUs. And then we
can get even more specialized. An FPGA is a kind of hardware that you can really specialize
for a certain type of application, but that you can also reconfigure. Then, at the far end,
there would be the ASIC, the application-specific integrated circuit. This means you build a chip for exactly one application. And if you're into
real high-frequency stuff, like high-frequency trading, Bitcoin mining, whatever, all the good
stuff: wherever there's a lot of money, people
will put these very tightly integrated chips in there.
Or cases where you just need high throughput networking.
So for networking, you will also build specialized circuitry just to do that fast.
But this will be only for one type of processing.
And this is also often a problem.
So if you have networking, your protocols change, you need new hardware because you cannot change it.
An FPGA is kind of a nice middle ground because it can be highly
specialized but it can also be adapted. You can also change it. This is what
we're going to talk about today. The field programmable gate array type of processors.
So here's a comparison between an ASIC and an FPGA. As we said, the ASIC is hardware, a processor, that's specifically designed for one type of application.
Of course, you need some kind of flexibility to adjust things,
but you will really tightly integrate the hardware or tightly specialize the hardware for one application.
And with that, of course, you get very high
performance. You're not sending any instructions, you
don't do all this decoding, all of the stuff that the CPU has, these
pipelines, pipeline breakers, etc. You don't have any of that, because
this is all in hardware. But the circuit is fixed and you cannot
reconfigure it.
You can really see this in the design flow,
in the diagram
at the bottom here. Let me use the laser pointer again.
So this is basically how you're designing an ASIC.
So you have like a basic design,
hardware design, right?
Real chip design, how the circuitry is.
Then you have your tape out,
which is basically you're printing this on some kind of,
I mean, traditionally you would have some kind of film,
like projector film,
where you would put this and then make an image.
Today, this will be digital.
You send it to the manufacturer,
and there it will then actually be put in silicon.
There will be tests, but after you've already done the tape-out,
so after you've produced the image for the chip,
so the layout of the circuitry, any kind of bug fixing, any kind of changes are super
expensive.
And you're basically stuck with the hardware for some time.
You really have to produce many of the chips in order for this to pay off. With an FPGA, you have hardware that's reconfigurable.
You can design your hardware, you implement it and you test it,
and you can do bug fixing and you can redesign.
You can basically go back, put a new image or a new design and layout on the chip and then run it again and test it again.
And this is why this is really also used for prototyping.
A lot of the hardware that you're buying today, some kinds of controllers for example, will initially
be built on FPGAs to figure out how they work and what good designs
look like.
How do you design this in hardware?
Once you've figured everything out, once all of the bugs are out, then you might produce
an ASIC from this.
Or you just keep the FPGA because you can later on reconfigure it again.
So we'll see some use cases for this.
So, the FPGA has the benefit of giving you free choice of the architecture.
Internally, of course, there is an architecture in an FPGA. We'll look at some details there,
at what it actually looks like internally.
What are the basic building blocks?
So there is predefined circuitry in the FPGA,
but it's built in a generic way,
so you can actually implement something else on top of it.
So you can design basic logical gates and build more advanced things out of those.
You have a very free choice of how your processing is done in the architecture.
You're doing very fine-grained pipelining,
communication, and you have distributed memory in the chip.
They're much slower than regular processors.
They're in the 100 MHz range, something like that,
not in the GHz range as you would have with CPUs.
This one, for example, runs at 12 MHz with a clock multiplier,
so it can go a bit faster, but the base frequency on this one would be 12 MHz.
But also you have a very low power consumption.
So this thing here, you can see there's no cooling, nothing.
If you have a Raspberry Pi, for example, the better ones or the modern ones
will at least have a cooler on top, maybe even a fan, because they get warm.
And if you really run a lot on a Raspberry Pi, which is about the same size, right,
also chip-wise, if you run, I don't know, some data processing
performance benchmark, something like that, then the Raspberry Pi will frequently overheat and
shut down. That won't happen with this kind of FPGA, because the power consumption is really low.
So the FPGA is really a middle ground between dedicated hardware and
a more flexible CPU that you can use for anything.
And it's really good to try out things.
And it's really good to have very specialized applications where it doesn't really make sense to get an ASIC yet, or because you want to try something out, for example.
So how is the architecture different?
From a high level to a low level.
We're going to start very high level.
So if you think about the classical von Neumann model,
which is still the mental model that we use
when we're thinking about a CPU,
it's not completely accurate anymore,
but you can see that at least the programming
is designed that way.
So in there, the memory
is physically separated from the processing unit, the CPU.
And to some degree, this is still true, right?
So the RAM is like a separate entity that's not on
the chip. You have a pipeline of instructions and the information is processed in this pipeline.
And there's a lot of logic and cycles, like a lot of processing, that just deals with getting new instructions
and decoding them and pipelining them, etc.
So if you remember how do we execute instructions on the CPU,
there's a lot of work that just deals with this.
And we really have to think about this, like how do we pipeline stuff, etc.
CPU will deal a lot with these micro operations etc.
And although a lot of this you don't have to deal with,
a lot of chip space is used for this.
So I mean we're writing C++,
this will be decoded into some assembly,
but then even the assembly will be broken down into micro-operations.
And this is done on the CPU. There is specialized hardware just for this.
And the FPGA doesn't have any of this. The FPGA doesn't deal with instructions.
Basically, in the FPGA, the design of the hardware, the layout of the circuitry, defines the function,
defines what the processor does.
You have basically flip-flop registers and RAMs that are distributed across the chip
and are wired to logic blocks. I'll show you exactly what this looks like.
And with this, it's highly parallel, right? You have many of these building blocks,
and each of those can work independently.
Essentially, every logic block can be addressed
and can work completely independently. So, all of the
information can be processed in parallel. You can also build long pipelines, right?
So there's circuitry, you can connect these logic blocks and then you can say, well,
I'm going to implement my instruction decoding on the FPGA. You can do that.
But of course, it's not going to be super efficient. So you really want to make sure that
you use this parallelism.
And the latency, like how much time it takes to process something,
really just depends on the signal propagation.
So depending on how you wire the circuits,
the flip-flops and the lookup tables will have a certain delay.
And how fast these can switch
determines how fast your program will run.
And this means more complex programs,
more complex circuitry will take a bit more time.
Easy circuitry will almost be instant, right? So it just depends on how fast you get the data into the device.
Typically, an FPGA is built into some kind of card and it has some additional logic on there.
So, even if you look at this one here, it's not just the FPGA.
So, this has the FPGA on here, but it also has lots of other controllers on there
that help you with the input and output.
There might be actual DRAM modules; there is some memory on the FPGA itself, but there might also be some additional
memory. Then you might have some flash storage, or a network interface, since a lot of
FPGAs are actually used for networking. Or you use it for some video encoding or decoding,
and then again you might already have some video and peripheral ports that you don't have to program yourself every time.
And that you don't spend chip space on.
So this will already be provided.
So there's different options on how to integrate this.
So I said like many FPGAs and especially the FPGAs that you could program here in the data lab.
These are all accelerator cards.
So then usually they will be connected through PCI Express.
But the FPGA can also sit in other places.
So one thing that we also have is disks that have FPGAs.
So you can do some encoding, decoding,
pre-processing already on the data path.
So, say for example in SATA, but of course also PCI Express, etc.
And then you have an acceleration there.
And there's even co-processors that use FPGAs or that are FPGAs.
So, you have some special processors,
for example Intel had a processor
where there is a regular CPU and then an FPGA coprocessor.
And you have UPI as a connection between the two.
So you are basically on the internal fabric of the CPU and the sockets,
and you have the connection there.
FPGAs are used in research and industry,
so I have a few examples before we go into the details
of how the FPGAs are built up.
And one prominent example is the Microsoft data centers.
So if you think about the Azure cloud,
so that's like Microsoft's cloud offering.
There, they have an FPGA in every single server for the network.
So all of the networking there is basically built on FPGAs and you can use the FPGAs to reduce data movement.
So all of the virtual network functions are basically offloaded to the FPGA. And while there are smart NICs and everything,
using an FPGA makes this very flexible for the data center.
So they can basically have all kinds of different operations.
They can reuse this for different purposes and also specialize this.
And here the FPGA, again, is managed through a PCIe interface.
So this is basically just like a network card, but the network card is actually an FPGA.
So this basically gives you all kinds of flexibility to pre-process.
I don't think that Microsoft actually exposes this,
so you cannot really change it as a customer,
but Microsoft can do a lot with this,
basically adapting their networking very flexibly.
Then there are also, as I said, FPGAs for storage,
and basically any kind of new storage device, etc., is typically prototyped
on an FPGA, because you can implement all of the protocols for the interconnects in the FPGA, and
the FPGA can be quite fast. I mean, if you implement it correctly, you can get very low latencies.
You're not going to get super high bandwidth; you're always bound by the interconnect that it's connected with.
And depending on the FPGA,
you might not be able to reach kind of memory type bandwidth.
But still, you can flexibly build new kinds of layouts. So there was the Ibex example
from ETH, so ETH Zurich, the Systems Group. They do a lot of research on
FPGAs. They also did a lot of research with Catapult, the Microsoft
FPGA interface in the network. And here they basically tried using the FPGA in between the hard drive and the workstation or the database.
And then you can basically offload some of the computation to the FPGA.
And there was even a database startup in Berlin that built special accelerators for MySQL, but they were
bought by Altera or something,
one of the FPGA vendors, I think.
That was really just low-level acceleration. If you go back to our lecture series,
you can actually find their talk.
So we had them here at some point present some details
about their architecture.
But unfortunately, they're gone now.
So I don't know actually if there's still a lab here
or not in Berlin.
And so here, for example, Ibex implemented projection,
selection, and group-by aggregation in the FPGA.
Then the database would basically offload this to the FPGA and the database can deal with the rest.
It just gets already the pre-projected data from the FPGA.
Today, you can even buy disks which already have an FPGA built in.
With those, you can basically do the same thing yourself. Samsung
sells a disk with an FPGA that you can reprogram and then use as a regular
disk, but already do some of the computation on the FPGA.
The last example that I want to show is the Intel processor. There was an Intel
Xeon processor which had an FPGA built into the processor. So there you basically have a coprocessor: on the one hand, you have the
regular Xeon CPU, and then you have an FPGA, and they're connected via UPI
and PCIe lanes.
So you can basically use the FPGA directly as a coprocessor within the same socket, essentially.
With two sockets, you would just have UPI, or in this case even QPI, in between,
but they also built one where the FPGA was directly within the chip.
And that means you can basically reprogram this and you have very tight integration into the processor.
You can offload some of the computation that would make more sense or that's more efficient on the FPGA than on the regular processor.
So say, for example, networking.
So this is something that you can directly do, even IP networking.
You can directly implement in hardware and then the CPU doesn't have to do anything anymore.
Or any kind of encoding and decoding if you're doing video processing, for example.
Or if you're doing simple database workloads, you can also offload those. The FPGA takes some time to reprogram,
and I will talk a bit more about this,
especially about designing the mapping,
so the layout on the FPGA,
but it takes reasonable amounts of time.
Especially if you already have different kinds of layouts,
just mapping and programming the FPGA is not that expensive.
Okay.
So with this, let's dive a bit deeper into the FPGA architecture, right?
So what does the chip really look like internally? The key to
reprogrammability is lookup tables.
This is actually quite simple, right?
So the idea is that you basically have logic tables.
You basically just build these lookup tables,
like you probably know from your math classes or something,
where you want, say, an AND gate.
You have a lookup table where you define all of the
binary inputs, and you define the correct single-bit output.
So if you want an AND gate, for example,
your table needs to look like this.
So if you have two zeros, the output will be zero.
Or two false, the output will be false.
If it's false and true, the output will be false.
If it's true and false, the output will be false.
And then, if it's true and true, the output will be true.
So this is how, just by having a lookup table,
we can program an AND gate.
Same with an OR gate, an XOR gate, a NOR, etc.
All of that is basically just built with these lookup tables.
You're just setting up lookup tables, or something really custom-designed.
And then you get the correct output.
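The truth-table idea above can be sketched in a few lines of Python (Python only as a model of the hardware; the helper name `make_lut2` is my own):

```python
# A 2-input LUT is just a 4-entry truth table. Which gate it implements
# depends only on the bits we load into the table.

def make_lut2(truth_bits):
    """truth_bits[i] is the output for input pair (a, b), where i = a*2 + b."""
    def lut(a, b):
        return truth_bits[a * 2 + b]
    return lut

# Same hardware, different configuration bits -> different gates:
AND = make_lut2([0, 0, 0, 1])  # output is 1 only for input (1, 1)
OR  = make_lut2([0, 1, 1, 1])
XOR = make_lut2([0, 1, 1, 0])

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", AND(a, b), "OR:", OR(a, b), "XOR:", XOR(a, b))
```

The point is that the "program" is purely the list of configuration bits; the lookup mechanism itself never changes.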
And this is built into the hardware.
So this is really one of the basic logic blocks,
is you have these lookup tables in hardware.
So you basically map such a lookup table into an SRAM.
Remember, this is one type of storage cell.
You're mapping it directly into the storage cells
and then you're using a multiplexer to basically switch the correct storage cells as an input.
A lookup table is abbreviated LUT. If you have an n-input LUT, then you need 2^n bits of SRAM to store
the lookup table.
You then basically build a tree of multiplexers to read a given configuration.
So the way this works is: here in the SRAM, we're storing the lookup table, and then we're using multiplexers.
These are basically switches, and we're switching them based
on the input. Here, this would be a four-input lookup table. For two inputs, we would only
need this subpart here, for example. I'll have another example in a bit where I use exactly
this. Then we're switching based on the input. So here I'm
writing 0, 1, 1, 0, whatever. And then, based on the input that I have, I'll look this up, right?
So I'll basically see which cells the input chooses: this input will choose the first entry of each of the two switches on the first level,
and then this one will choose between those two outputs.
So you get the same structure that you would have in the lookup table, essentially,
Let's say reversed. So if we look at this input 0, 1, 2, 3,
then this 0 would basically be this row here.
And by switching here, we can just do exactly this lookup
and our output will be the output that we have here.
So basically, if I put in 0 0 here, for example,
then I go 0 here and 0 here: I switch to this route here and to this one there, and I get
this output. So that would be this lookup, for example. And this is the basic principle of how
we build the different kinds of logic blocks in the FPGA, just
using this. Here I have an example of a four-input lookup table. This one here,
for example, has these kinds of four-input lookup tables. Modern, larger FPGAs will typically have six-input lookup tables.
So meaning you can have six inputs in there.
And based on this, each additional input basically doubles the size of the table.
So if you have five inputs, you need to double the SRAM again;
with six inputs, you need four times as much.
Then, in order to keep the output stable, because all of this works by switching:
we have a clock running through the system,
and the lookup table
works based on the input coming in, right?
So we write an input here
in one clock cycle,
and then we need to keep the signal stable for a clock cycle.
And this is what the D flip-flop does.
And this is what the D flip-flop does.
So this is basically a simple memory cell
where the info will go in.
So this is basically our input here.
Then we still have the clock.
And then we get basically the output which is just
the same input, but it will be stable.
So even if the input briefly glitches, say due to a rising current,
the output will stay stable here until the next clock cycle.
And we can also, rather than getting the same
input, we can also get the inverted input. So if this is 1, the input is 1, then
we'll have a 1 coming out here and a 0 coming out here. And then we also have a
set and a reset port, where we can basically override this. With the set port
we can say everything should be one, no matter what the input is,
and with the reset port, everything should be zero, no matter what the input is.
And then there's also, I forgot exactly what that does,
but you can put one and one, and then I think the whole logic is switched, for example.
So we need this to keep the input stable
and it gives us additional configurability
because we can just negate the output
of the lookup table here, or we can overwrite it.
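The behavior described above can be modeled in a few lines (a toy Python model of my own, not timing-accurate hardware; the class and method names are illustrative):

```python
# Toy model of a D flip-flop: the stored value only changes on a rising
# clock edge, so a glitch on D between edges never reaches Q. Set/reset
# override the stored value regardless of D.

class DFlipFlop:
    def __init__(self):
        self.q = 0
        self._prev_clk = 0

    def tick(self, d, clk, set_=0, reset=0):
        if set_:                          # force Q to 1, whatever D is
            self.q = 1
        elif reset:                       # force Q to 0, whatever D is
            self.q = 0
        elif clk and not self._prev_clk:  # rising clock edge: capture D
            self.q = d
        self._prev_clk = clk
        return self.q, 1 - self.q         # Q and the inverted output

ff = DFlipFlop()
ff.tick(d=1, clk=1)          # rising edge: Q captures 1
out = ff.tick(d=0, clk=1)    # clock stays high: the change on D is ignored
print(out)                   # (1, 0): stable output plus its negation
```

This matches the two roles mentioned in the lecture: keeping the signal stable for a clock cycle, and providing the inverted output for free.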
So with this, we can then basically build the whole cell.
So this is a mini example of a two-input lookup table, where we have
two inputs. Again, it's the same as before: in this case,
because it's only a two-input LUT, we only need four memory cells, which can each be zero or one.
Then we have two multiplexers that switch on one of the inputs,
and one multiplexer that switches on the other input.
So this here would basically be the lookup table:
these are all of the inputs that we can have, and this is the lookup
table, so this is what we actually need to store.
And then you can see, if we put this here just as we would have it in the
lookup table, we store this in the SRAM. Then if we switch, say, for example, our
input is 1, 0: in this multiplexer, we go to 1,
so that would mean the second input, for example.
In this multiplexer, we set 0.
So going through this multiplexer, we get 0,
which is exactly what we're looking up here.
So this is basically really how this works.
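The multiplexer-tree walk-through above can be sketched like this (again just a Python model; `mux2` and `lut2_read` are my own illustrative names):

```python
# Sketch of how a 2-input LUT reads its SRAM through a multiplexer tree.

def mux2(sel, in0, in1):
    """A 2:1 multiplexer: passes in0 when sel == 0, in1 when sel == 1."""
    return in1 if sel else in0

def lut2_read(sram, a, b):
    """Four SRAM cells: two first-level muxes switched by b, one final mux by a."""
    lo = mux2(b, sram[0], sram[1])  # candidate output for a == 0
    hi = mux2(b, sram[2], sram[3])  # candidate output for a == 1
    return mux2(a, lo, hi)

# Load the truth table of XOR into the SRAM cells:
sram = [0, 1, 1, 0]
print(lut2_read(sram, 1, 0))  # selects sram[2] -> 1
```

Nothing in `lut2_read` knows anything about XOR; the function is entirely determined by the four stored bits, which is exactly the reconfigurability argument from the lecture.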
This is what the internal
logic does in the FPGA. And of course, this is really simple. But building this, you have these
basic logical building blocks that you can use. And it's very flexible, right? So right now,
we're talking about two inputs. But as we said, in a big FPGA this would be six inputs.
Multiple of those will be actually connected within a logical block.
You're not going to do the programming on these two-input LUTs individually.
Of course, you could really just use these small inputs, but you will have larger groupings. So in, let's say, the current
Xilinx architecture or the Altera architecture
(Xilinx is now AMD, I think,
and Altera belongs to Intel; of course, everything got bought up
in different ways at some point,
but these are the two predominant vendors, I would say),
these groupings would be called slices or adaptive logic modules,
and I think in OpenCL they're called elementary logic units.
And there you're packing different parts together.
So it could, for example, be two lookup tables with four inputs,
plus the flip-flops, and you might additionally have some carry logic. If you're doing
some arithmetic, you basically want to carry a value over if a multiplication or an addition overflows.
Then you can carry this over to other lookup tables,
so you don't have to implement additional carry logic yourself;
it's already built in.
So you have some arithmetic carry logic,
you have the multiplexers for getting the output here,
and you have the lookup tables.
These might then be connected again into logical islands: you might have two to four
elementary logic units connected into a logical island, which is then connected to the switch
matrix. So you basically have many logical islands in here, connected
with some lookup tables and connected to some input/output and some memory blocks. And again, these are called
differently in different architectures, depending on the
vendor. I'll also show you how this works internally. So here we can see we have
these logic blocks, which are built from the lookup tables, the carry logic, and the flip-flops.
These are packed together, and then we have this switch matrix, which
is basically the networking on the chip: it connects all of these logic blocks together. So we have a set of connections on the chip in
between these logic blocks. We have hundreds to thousands of logic blocks,
and then we have this matrix, and it is again configurable. How we configure it
determines which logic block is connected
to which other logic block.
And this means we want to have multiple connections in between the logic blocks, so it's not just like what is shown here, right? It's not just a single connection; we'll have multiple connections, and then we can switch which ones are actually active and which ones are not, and how they are connected. So if we have, say, three connections here, we'll have switches or programmable links in here, and these again just work with memory cells. So we can realize all the different kinds of connections that are possible: if we have a crossing here, we can switch the wires straight through, or route left to upper, left to lower, upper to right, etc. So all of these individual
switches are possible, and for each of the connections there's a small memory cell, which again we can program at programming time. So when we write the design onto the chip, we can say, okay, I want this connection to be activated and the other connections not. It's all done by these memory cells: we basically say, okay, we switch this one on, and then this means these two logic blocks, for example, these two outputs, are connected together.
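As a toy model of what those routing memory cells do (the class and wire names are invented for illustration; real tools represent this very differently):

```python
# Hypothetical sketch: a switch point in the routing matrix. A few memory
# cells (configuration bits) decide which of the possible wire-to-wire
# connections at a crossing are active.

class SwitchPoint:
    def __init__(self):
        # Each key is a (source, destination) wire pair; the bool is the
        # configuration memory cell written at programming time.
        self.config = {}

    def program(self, src, dst, enabled=True):
        self.config[(src, dst)] = enabled

    def route(self, signals):
        """Propagate signal values along all enabled connections."""
        out = {}
        for (src, dst), enabled in self.config.items():
            if enabled and src in signals:
                out[dst] = signals[src]
        return out

sp = SwitchPoint()
sp.program("left", "upper")   # e.g. route the left wire to the upper wire
print(sp.route({"left": 1}))  # -> {'upper': 1}
```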
That's not everything on there.
So we have all the logic blocks,
and they can store a little bit of data in the flip-flops, etc.
But then, some stuff we always need. So if this is a PCI Express board, we need to communicate through PCI Express. Of course, we could implement this in the FPGA fabric, but it would already take up some space there, and we'd always have to place and route all of it. So it doesn't really make sense to put this stuff in the reconfigurable logic; instead, there is some dedicated hardware, hard logic, on the FPGA or on the accelerator board that is not implemented using the reconfigurable logic. This is then called hard IP. There can also be soft IP: some kind of pre-designed implementation, like firmware, that's provided by the manufacturer and placed on there so you can easily use it, but you don't have any influence on it. And that's basically how you communicate with the system.
Then you also have I/O blocks, so input/output. If, for example, on this one, you want to communicate with the LEDs, to show something here, then there needs to be an input/output block that will send something to the LEDs.
I'm going to switch this on in a bit, so then you can see.
And then there's also block RAM. In order to have fast little memories on there, you have small blocks of memory that you can also connect to the individual logical islands, so that you can store some intermediate state, store some input, or do some random access, basically. But this is really small, right? This is kilobytes per individual block. And then you might have some floating-point units, some digital signal processors, anything where it doesn't really make sense to build it out of the logical blocks, these elementary units, because dedicated hardware can just be much more efficient.
If there's special hardware that can do floating-point operations very efficiently, you don't really want to build this out of the general FPGA logic resources. I mean, this one, for example, doesn't have that, but a larger FPGA specialized for video processing will have something like a digital signal processor in there.
And then you might even have a complete processor on there. But the actual
layout of the chip will be something like this. So you have the logical islands, you have the
block RAM and you have some other kind of units there interconnected. And then using the switch
matrix, you can connect them however you want. And of course it makes sense, if you want to use the block RAM from logical units, to place them close to the block RAM rather than far away, because otherwise you're going to use a lot of the switch matrix just for these connections. And you might run out of switch matrix, so your program cannot be placed.
Again, something that you don't really have to do yourself: technically you could, but practically this will be done for you, by algorithms. So which logical block is placed where, there's a place-and-route optimization that's done for you. And this is actually what costs most time in programming the FPGA. Not in the sense that, of course, programming whatever you want will probably cost you the most time, but in terms of what the compiler etc. has to do, this is where all the time goes: making sure that the placement on the chip is actually good.
Okay, so I'll show you two examples again, just so you get an idea of size. This would be an older Intel card, an Intel Arria 10 GX 1150. We might have that in the data center, I don't fully remember. And here you can see these elementary logical units; you have roughly 400,000 of these. So these would be the sub-parts of the logical islands. There you have two lookup tables with six inputs and four flip-flops, plus, of course, the carry logic. And a logical island would then be 10 of these elementary logical units, so this means we have about 40,000 logical islands, roundabout. And we have 2,500 BRAM blocks with 20 kilobits each, so you can see we're in the megabyte range in terms of memory in there that's directly placed inside. We will probably also have some DRAM on the board itself, I don't have numbers for this, but inside the chip, the memory that's in there is in the megabyte range. And then I have this one here. This is what I have here, right?
So this is this thing. That's a Lattice iCE40-HX8K breakout board; the breakout board is a prototyping board. So iCE40-HX is the FPGA family, and 8K because it has close to 8,000 logical cells. Here a logical cell is one lookup table with four inputs and a flip-flop. And then we have programmable logic blocks, which are basically our logical islands again, which consist of eight logical cells. And we have 32 BRAM blocks with 4 kilobits each, so 128 kilobits in total in the whole thing.
And that's actually all we have in there.
So there's no additional memory on this.
Everything else has to go through the USB connection here.
And just as an overview, this is basically how this is set up internally. This is actually the 1K version, with 1,000 logical cells, that's what they're called here. And you can see, again, it's this matrix layout. So we have a two-dimensional layout with the programmable logic blocks, each of those with a four-input lookup table, some carry logic, a D flip-flop, and then kind of the switch. So that's the layout; the schematic looks a bit different. Here we basically have the input from this side. We have the switch here that controls the multiplexer, which selects between taking the lookup table output through the D flip-flop or routing it out directly. Here we only have the direct output, not the negated output, but we can also reset this. We have the clock, and we have carry logic here as well.
And then we can see eight of those will be in a programmable logic block. We have the memory in between. Again, we have the switch matrix, and we have some I/O banks, which on the one hand connect us to this USB driver and on the other hand have some external connectors. You can see here I could connect a lot of stuff if I wanted to, and I have the LEDs. So this is what would be connected through these input/output banks here.
So with this, do we have questions so far?
No? Then we'll do a five-minute break here and then look into how we program this in the next part, after the break.
So, let's talk about FPGA programming. So how do you actually program this?
So you saw how the internals work and that you have these individual memory cells.
Basically that will define your hardware and the circuitry.
And I mean, the very basic way to program this, if you're adventurous: you have to know exactly the layout of the chip, and then there's a small program that just tells the chip to write the memory cells as you want them. So you would basically say this memory cell is 1, this memory cell is 0, etc., exactly in the layout that the chip has. But of course, you don't want to do this, because that's going to be much too hard and very error-prone to do yourself. So you need a bit more high-level way of doing this. And there's a couple of different ways of programming these.
There's the hardware description language, and there are two prominent languages.
So one is Verilog and one is VHDL, and typically you can use both on the different FPGAs. They do more or less the same thing, it's just a different syntax. And it's this register-transfer-level abstraction: you have a low level of abstraction, but you have very full control of the hardware. You're basically defining signals and defining connections between the signals, and what to do, what to store, how, etc., but not the placement. So you have a generic circuit description, and then your tooling has to figure out whether this fits on your chip and how to best place it in order to get good timing. And because this is tricky, because it needs a lot of understanding of the circuitry and thinking in signals, there are also more high-level languages.
There's OpenCL, which is a language that you can use even to program GPUs and just regular CPUs,
but also FPGAs,
and you can also translate code that you wrote for a GPU to an FPGA.
So that's very generic.
Then there's HLS, so high-level synthesis from Vivado,
so that's vendor-specific.
That's where you also have kind of a C++-like interface
to programming the FPGA.
And they, on the one hand, are much easier to program, and they provide all of the I/O components. So they will provide libraries of components that will then also be placed in your circuit to make the programming much easier. That's of course also true for Verilog and VHDL: there are components that you can reuse, so you don't have to program all the details.
So then, well, syntactically, VHDL is related to Ada and Verilog to C, so they look similar to a basic programming language, but the semantics are different. The statements that you write are not executed sequentially, but all executed concurrently.
So a statement takes effect whenever its input values change; that's how you have to think about it. This directly maps to logical blocks in the circuitry. The statements that you're writing will be evaluated exactly when there is a new signal at the logical unit. And if you have multiple statements, they will be executed in parallel, depending on how they're connected to each other. There are no function calls. You have modules instead, and these modules will be connected, depending on how you define the modules and the logic inside them, through the switch matrix.
Depending on whether you're using VHDL, Verilog, or high-level synthesis, this code will be translated to specific hardware structures. And you really need to know the relations between the hardware constructs and the code that you're writing to get efficient code and correct designs.
I mean, there is a lot of tooling around this.
But essentially, the challenge is: you're programming your logic, and then the circuitry needs to be routed through this, and you need to make sure that the clock frequency is not too high, so that the signals can actually flow through your circuitry fast enough. So it's really thinking in signals: you can have many signals and many inputs in parallel, and then a pipeline of signals, and if this pipeline is too long, then you either have to reduce the frequency of the clock, or you have to insert some memory in between, in order to store some intermediates and then use the next clock cycle to continue.
And of course you don't have to do all of this yourself: there's tooling that will give you the timings, how the different gates in the different circuits can be connected, and then you can tweak based on that, saying, okay, I need to insert something here; this I can do with extra clock cycles, this I can do without, because I know the input will be processed within a clock cycle, no problem.
Okay, so here's a simple example, basically just in terms of translation. You don't have to do this translation, it will be done for you, but it helps to see how it's done. If you write something like a conditional statement, this is equivalent to a multiplexer in hardware: your different inputs are the two options that you have, and your selector is the switch that decides which condition is actually taken. So this is one way of thinking about this: multiplexers give you conditionals, and any signal changes on clock events are done through these flip-flops. So if you're using the clock, then you'll be using a flip-flop to keep the value stable during state changes, for one clock cycle. And you can also program this explicitly; you will see this in an example later on.
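As a rough software analogy, not the actual hardware, the multiplexer-plus-flip-flop picture can be sketched like this (names are my own illustration):

```python
# Hypothetical sketch: an if/else in HDL maps to a multiplexer, and a
# clocked assignment maps to a D flip-flop that holds its value stable
# between rising clock edges.

def mux(select, a, b):
    # The selector chooses between the two inputs, like the hardware mux.
    return a if select else b

class DFlipFlop:
    def __init__(self):
        self.q = 0          # stored (stable) output

    def clock_edge(self, d):
        self.q = d          # capture the input on the rising edge
        return self.q

ff = DFlipFlop()
ff.clock_edge(mux(1, 5, 9))  # select=1 picks the first input
print(ff.q)  # -> 5
```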
I have two examples.
Unfortunately, I couldn't fully consolidate this.
So I have a Verilog and a VHDL example.
The Verilog example on the chip and the
VHDL example here in the slides, where you can basically see that we're actually working with
these kinds of signals. So this is VHDL, and like all of the hardware stuff, people like weird nested abbreviations. VHDL is basically VHSIC Hardware Description Language, and VHSIC is Very High Speed Integrated Circuit, so in full it's the very high speed integrated circuit hardware description language.
And here you have basically three things that you're using. You have the signals: this is really the input, say if I'm writing something to the chip through USB, this will be the input that goes in, or my clock could be an input, for example. Then I can have inputs and outputs: I have the input from the USB, the input from the clock, inputs from other circuits, and I have outputs that again go to other circuits. This basically defines how things are connected. And then we have the architecture itself, which tells us how the logic, the lookup tables, will be specified. And of course, as it gets more complex, we'll have to combine multiple lookup tables. You can use arithmetic operations, things like that, and that will then be translated into these logical blocks.
So I have a simple example here, a multiplier: you have two 16-bit integers as input, multiply those to get a 32-bit output, which will go into a register, be stored there for a clock cycle, and then be moved on to whatever next processing you might want to have.
And if you want to program this in VHDL, then you need these kinds of signals. So you have the clock, the two input vectors, and the output vector. And then you have an architecture, which is really just, on the clock event, applying the multiplication and putting the result on the output. That's all we're doing, and this will then really be translated into the hardware.
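What that clocked process computes can be mimicked with a small Python simulation (the class name and masking are my own illustration, assuming unsigned 16-bit inputs):

```python
# Hypothetical simulation of the registered multiplier described above:
# two 16-bit inputs are multiplied combinationally, and the 32-bit result
# is captured in an output register on each clock edge.

class RegisteredMultiplier:
    def __init__(self):
        self.out_reg = 0                     # the 32-bit output register

    def clock_edge(self, a, b):
        assert 0 <= a < 2**16 and 0 <= b < 2**16
        self.out_reg = (a * b) & 0xFFFFFFFF  # result always fits in 32 bits
        return self.out_reg

m = RegisteredMultiplier()
print(m.clock_edge(3, 7))  # -> 21
```

The largest possible product, 65535 * 65535 = 4294836225, still fits into 32 bits, which is why the output vector is exactly twice as wide as the inputs.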
This is basically what VHDL looks like, with these basic blocks. Verilog, I'll show you a very simple example, looks quite similar.
Maybe I'll show you the example right now. So, this is a very simple counter. This is what I will flash on here in a second, and this is Verilog in this case. So here you can see we have these modules. The modules have inputs and outputs, which again correspond to the circuitry. Here, as an input, we're using the clock; the FPGA has a clock signal. Our output is the output block, which goes to the LEDs. The LEDs are up here, and you'll see this in a bit. Then we have a register, so we're using some of the flip-flops, and then, on the positive edge of the clock, let me find my mouse, where's my mouse? Here. So on the positive edge of the clock, whenever the clock goes to one, we're increasing the counter. The clock is 12 megahertz, so we're counting quite fast, and because of that we're only using the upper bits of the counter, because otherwise this would flash too fast and you wouldn't really see it. So we're using eight bits of the counter register and assigning them to the LEDs. So the input is the clock; the clock goes into the register, on every positive edge the counter is increased, and eight bits of that counter register are assigned to the LEDs.
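Here is a hypothetical simulation of that blinker in Python; the counter width and which bits drive the LEDs are assumptions, since I'm only going from the description above:

```python
# Hypothetical simulation of the Verilog blinker: a counter increments on
# every clock edge, and only some upper bits drive the eight LEDs so the
# blinking is slow enough to see. The exact bit range is an assumption.

class Blinker:
    WIDTH = 26       # assumed counter width
    LED_LSB = 18     # assumed lowest counter bit shown on the LEDs

    def __init__(self):
        self.counter = 0

    def clock_edge(self):
        self.counter = (self.counter + 1) % (1 << self.WIDTH)
        # Eight upper bits of the counter drive the LEDs.
        return (self.counter >> self.LED_LSB) & 0xFF

b = Blinker()
leds = 0
for _ in range(1 << 18):   # after 2^18 clock ticks...
    leds = b.clock_edge()
print(leds)  # -> 1: the lowest displayed LED has just switched on
```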
So this is really, really basic circuitry; it will not take much of the chip space. That's all we do, and then, I'm going to show you this. Let me find the mouse again. You can see it basically has no state right now, that's why the LEDs are not doing much. So now it's doing the synthesis, and now it's programming the chip, writing the programming on. Now the program is on, and it's just going to continue to count.
You can see that it's just counting up the bits in binary and starting over again; we're not using the whole counter. The counter is actually larger, but we're just using a few bits out of it, and that's just going to continue. Yes, 12 megahertz. So it's basically... It should be 3 seconds, right? So the whole thing looks more like 8. Sorry?
It should be 3 seconds till the counter is full, right?
So it looks more like it's... Well, we're only using a subset of the counter, bits 19 to 26, which is... I can't do the math in my head right now. So 2 to the power of 19 is, it should be 500,000 roundabout, since 2 to the power of 20 would be 1 million, right? So 500,000 divided by the 12 megahertz would be how frequently the lowest shown bit is updated. So that's 0.04 seconds, roundabout. And the lowest one would be the one that updates most often, and then the highest one, bit 26, is 1 million, 2 million, 4 million, 8 million, 16, 32 million cycles. Well, let me leave the exact calculations to Marcel, but roundabout, this should correlate with the 12 megahertz. Makes sense? So that one will be sub-second, but the other ones, you can see, are in the seconds range. Yeah, I mean, with megahertz it's roundabout right: we're in the millions, so megahertz means millions of cycles per second. So that sounds roundabout right.
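The back-of-the-envelope numbers from above can be checked quickly, again assuming a 12 MHz clock and, as a 0-indexed reading of the bit range mentioned, that counter bits 18 to 25 drive the LEDs:

```python
# Sanity check of the blink rates discussed above, assuming a 12 MHz
# clock and that counter bits 18..25 drive the eight LEDs.

CLOCK_HZ = 12_000_000

def toggle_period(bit):
    """Seconds between consecutive toggles of a given counter bit."""
    return (2 ** (bit + 1)) / CLOCK_HZ

print(round(toggle_period(18), 3))  # lowest LED bit: ~0.044 s
print(round(toggle_period(25), 1))  # highest LED bit: ~5.6 s
```

So the lowest displayed bit flickers at roughly the 0.04 s the lecture arrives at, and the top bit takes a few seconds per toggle, matching what you see on the board.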
And that should be exact. I mean, this is actually the thing: if the 12 megahertz are exact, we should be able to count exactly how frequently this is updated. And this is also how this works in the end when we're synthesizing: we really have to know how the signal propagates through there, how often this is updated, in order to then be able to predict how much we can put into one circuit.
So let me move this.
We have to...
Okay.
So, a nice little adventure. If you ever want to try this, come by; I can surely lend it to you. Okay, so let's look at the design flow in more detail.
And that's just going to be where we're going to stop today after the design flow, actually.
So, we're basically, I mean, what we have to do is we have to specify the hardware.
We have to somehow tell the chip or the compiler, etc., what exactly should be put on the hardware, how the layout should be.
And from a logical level or from a programmer level, we have to adapt our algorithm to the parallelism of the FPGA and to the design of the FPGA.
So, we do this in a hardware description language or a high-level language.
So, here I showed you Verilog.
I showed you VHDL a bit.
I actually wanted to do this with VHDL, but apparently the pipeline is not there yet.
So, this doesn't 100% work.
It's a bit more complicated.
So this is why we're using Verilog on this one.
So we're writing this code.
And then this needs to be translated.
So the code that we wrote, the code that you saw,
is not a low-level description of circuits.
So this basically needs to be further translated into logic-level representation.
So meaning into this lookup table kind of block representation.
And this is called synthesis.
So this part is actually not that expensive, because it's a simple translation to logic blocks that then somehow have to be put on the FPGA. But then the expensive part is the place and route. First we just have these logic blocks that are somehow connected, but now we actually have to place them on the FPGA itself. The FPGA has very specific logic blocks, and the FPGA has the switch matrix, and we have to make sure that we connect the switch matrix correctly and place everything correctly, so we're using the chip space efficiently. We have to make sure that the signal transmission is fast enough through the whole chip, etc. Because, of course, we can design a very long pipeline of things, but then the signal won't go through fast enough; even at 12 megahertz, even on this chip, we can design something where the signal won't make it through within one cycle. And then we would have to clock down, and again, there we need to know what clock frequency we can actually use. So this is a lot of work: these are basically NP-hard problems, and this takes a lot of time. And the larger the FPGA and the more complex your circuit, the longer this takes; this can easily go into days of processing time.
Then, this place and route, this mapping that gives us the two-dimensional arrangement of the circuits, needs to be turned into a configuration and put on the FPGA. So we're basically building a binary representation; often it's actually an ASCII representation that really mirrors the chip layout. For all of the memory cells on there, we're saying this is one, this is zero, etc.: the memory cells for the lookup tables, the memory cells for the switch matrix, and so on. All of these we can independently configure, and this needs to be written into this format and then be flashed onto the FPGA. And again, this is not super expensive; the most expensive part is the place and route, the implementation for the FPGA itself.
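Conceptually, the bitstream is just every configuration memory cell serialized in the fixed order the chip expects. A toy illustration (not any real bitstream format; the cell names are invented):

```python
# Toy illustration: the configuration is conceptually every memory cell's
# value, serialized in a deterministic order the chip expects.

config_cells = {
    "lut_0":    [0, 1, 1, 0],  # truth-table bits of one LUT
    "switch_3": [1, 0],        # routing-switch enable bits
}

def to_bitstream(cells):
    # Serialize all cells, sorted by name, into one ASCII 0/1 string.
    return "".join("".join(map(str, bits)) for _, bits in sorted(cells.items()))

print(to_bitstream(config_cells))  # -> "011010"
```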
So here, a quick example of the Xilinx workflow. We're starting with our hardware description language, Verilog for example, which we turn into a collection of netlists: basically logical blocks that could be placed on an FPGA but are not laid out yet. So we have a basic representation of the logical hardware without any decision on how to put it onto the FPGA, just how we designed it.
Then the actual implementation toward the FPGA is done. This basically means translating and combining the input netlists: initially these are individual modules, then we combine them into a generic database file, one big generic file, which is still not exactly how we would put it on the FPGA, but it's one big schematic of the implementation. Then we need to actually map this to the exact specification of the FPGA: do we have four-input lookup tables or six-input lookup tables, what are the logic blocks, etc. Again, not super expensive, but already some work to make it more hardware-specific.
And then we do the place and route, so we place these individual components and connect them through the matrix. This is, as you can imagine, quite expensive, because all of the blocks are identical, so there are many different ways of placing this and many different ways of connecting it, and we need to find one that is actually most efficient, or at least one that actually works. If we really want to fill up the FPGA, it's a lot of work to move things around and figure out where to place them. If the FPGA has many more logic blocks than we actually use, then it's easy, because there's a lot of space to move around. If we're filling it up, then it's going to be very costly.
And in modern tooling there are even ways to divide this up into submodules and then reuse these submodules and also reconfigure them. Because this is quite expensive, we can say: well, I'm using this subpart, and I'm replacing this subpart, and I have different implementations for this subpart which adhere to these timing constraints, etc. Then I can reconfigure these parts more efficiently, because doing the whole image, the whole layout for everything, as I said, can take a day or two for a larger FPGA.
Then, once we have the design, we have to generate the bitstream. This is just generating the FPGA-readable format; often this is really ASCII. And then it just needs to be programmed onto the FPGA. This is usually called flashing, and this is where you saw that the lights were kind of low: this is where the program is really written onto the FPGA, so it's reconfigured. We write all the individual memory cells, we're in an undefined state for a moment, and then, once the clock starts again and regular processing resumes, we can actually do the processing. And then the FPGA will just continuously do whatever it's programmed for.
This is Xilinx. So Xilinx is one of the big vendors; it was bought up by AMD. You can find lots of FPGA cards from Xilinx. They have lots of complicated and expensive licenses for everything. This is why I actually bought this one, because for it there's also an open-source toolchain. So there's a synthesis tool called Yosys, where you can basically do all the synthesis, and there's an open-source place-and-route tool that's called nextpnr. And then, specifically for each FPGA, you need the tools that map to exactly this hardware. So we need to know what the logic blocks are in here, what the programmable logic blocks are, where the BRAM is, etc. So, the exact layout of the model: first the basic blocks, depending on the hardware, then the exact layout of the chip, saying here I have 8,000 lookup tables, where are they, how can I connect them? For this I need hardware-specific tools, and that's in this case the IceStorm tools, of which I at least need icepack for creating the bitstream and iceprog for writing to the FPGA. So iceprog is the tool that just writes the program onto the FPGA. And Yosys is Verilog-based.
There's also GHDL, which is an open-source VHDL compiler. You can also use this for simulating the design: there's a complete simulator that compiles this into a binary, and you can run it on your regular CPU and try it out. But there's also a connector to Yosys, so you can then use the same kind of toolchain. There are of course also alternatives to this, but this seems to be the most prominent way of doing it. That connector is still prototypical, though, which is why I didn't really try it out in this case.
And this is what I showed you here today: using Yosys for the compilation, the basic translation into the netlist, then nextpnr, which integrates with the Yosys flow, for the place and route, and then the IceStorm tools for actually programming the FPGA.
Okay, questions so far? Yes?
How good is the vendor support for this open-source software?
For this one it's good, because they particularly built it for that. For the others, not much. I think they really live off these expensive licenses, and you need license servers and everything, so it's quite complicated. Even for academia, we have to go through a lengthy process to get access to licenses for the hardware. But for Yosys and nextpnr, there is some tooling for Xilinx hardware. It's experimental, so I think it cannot do everything, but at least you can try it out. We could try it out at some point to see if this works.
But of course, the regular tools are much more convenient; they have a lot of other stuff. But all of the timing stuff, etc., you can also do with this toolchain.
Okay.
No more questions? Great.
Then thanks a lot for listening. Next time I'm going to finish this up, and I'm also going to summarize the whole course, since we'll have some additional time and can do some Q&A. That's going to be on Wednesday next week. On Tuesday you're going to get the next exercise. And then the final topic will be by Marcel. Unfortunately, I'm not going to be here, but this is really hot stuff, so I really recommend going there and listening in, because, at least from what we see right now, this is where all of the accelerator connections will go in the future.