Hardware-Conscious Data Processing (ST 2024) - tele-TASK - Field Programmable Gate Arrays

Episode Date: July 3, 2024

...

Transcript
Starting point is 00:00:00 Welcome everybody. Sorry, I have a bit of back pain, so I'm not super focused today. But we'll get through this together. So today we want to talk about FPGAs. FPGAs are the last topic in this class, in this course actually. We're going to talk about this today and then also next week in the second session. I don't have any big announcements except, as you know, we'll have the invited talk and the data center tour on the last day. I'll probably also do an announcement or a poll for the data center tour so we know who's going to come.
Starting point is 00:00:58 And then we'll see how much we can show you. So this is always a bit tricky because it's a security-critical space, so we need to make sure we have somebody to show us the different areas. So I'm not allowed to enter all of the areas without a proper person. We'll figure this out. But with this, today: FPGAs. Today I'm going to mainly talk about introduction and architecture. I brought an FPGA with me; I'll show this later in action. It's not that much to see actually, but you can see this is a real FPGA. Yeah, it's fairly small. And I have the specs also in the slides so you can actually see what's going on.
Starting point is 00:01:50 But this one is nice because unlike most FPGAs that you can get, this one you can completely program open source. So this one, you don't need any weird license stuff, et cetera, which is the usual case for FPGAs. Otherwise, okay. So, with that, let me dive right into it. Sorry about this.
Starting point is 00:02:31 So, here we go. The usual recap, right? You know we're not really increasing in performance anymore. We've seen that we've basically tried to mitigate these problems of limited scalability in single processors or single threads through multiple cores. Having hit the power limit, we're also at the stage where just increasing the number of cores is not really feasible anymore. So there's a need for specialization. And so with that, because we kind of have this limitation of multi-core, because even though for a long time the power consumption would stay the same, even if we're doubling the transistor density, that's called Dennard scaling,
Starting point is 00:03:20 this hit a limit, a physical limit in 2006. And with this, we basically cannot just increase density anymore, meaning just building better chips didn't work anymore. So, I mean, if you look at current chip designs, what happens is that chips just get larger and larger, right? So you really see humongous chips right now. There are specialized vendors who build chips on a single wafer, right? Like a complete wafer as a single chip. I mean, this is really specialized hardware, but you can find this today, which seems ridiculous.
Starting point is 00:04:00 Right, so you get chips like this, this size. And I mean, for a long time everything just got smaller and smaller. Now that doesn't work anymore, so things get larger. And of course that means it takes a lot of energy. Also, just propagation through the chip takes some time. And the more energy you put in there, the more current leakage you get, meaning it also warms up the whole thing.
Starting point is 00:04:31 This means there's a certain peak frequency that a CPU can achieve, or a processor in general can achieve. You can circumvent this to some degree, or stretch this to some degree, by investing more into cooling. But that means you basically have a heating plate that you're trying to cool down all the time so that the chip itself doesn't melt. So that's what's going on right now. And you see these modern chips, the latest AMD and Intel CPUs, and also the GPUs, they use a humongous amount of energy, and this needs to be cooled. So it's not super
Starting point is 00:05:14 efficient anymore, right? We're basically just buying more performance by using more power, which was not the case in the past. In the past, we would get better performance by staying at the same level of power usage. Right now, we're just putting more power into these things. This is also why what we're seeing right now is that data centers, I mean, even though we're scaling data centers, also individual data centers use more and more power
Starting point is 00:05:45 just because these CPUs and GPUs are so power hungry. Right? I mean, as a side note, I might actually come back to this in the last lecture. You can see that all major cloud providers are not hitting their carbon efficiency targets. Right? So right now, like for a couple of years ago,
Starting point is 00:06:05 everybody said we're going to be net zero very soon. It's not happening anymore because they're just like putting more and more AI stuff in their data centers and doubling their power usage. So what can we do, right? So, I mean, this is happening anyway. I mean, anyway, let's hope it's not happening anyway. We're trying to be better here.
Starting point is 00:06:29 We're trying to be more efficient. But there's even more we can do with hardware. So we don't necessarily have to make the regular off-the-shelf chip more performant just by putting more chip space and more power into it. But we can also try to be more efficient by specializing. One idea is basically some kind of dark silicon. The idea is that we have different specialized architectures on the same chip and
Starting point is 00:07:06 just utilize some of it for certain cases. And this we've seen to some degree already if you look at the ARM chip that's in my laptop, for example. So there you have different kinds of cores, and some are used for one case and some are used for another case, right? So we have the performance and the efficiency cores. And this is an architecture that has been, let's say, pioneered by ARM. So with the ARM big.LITTLE design, you have a big core if you need a lot of performance, and a little core if you rather want to be power efficient, and you're only using the one that
Starting point is 00:07:52 you can um that you have for a certain uh or using the one for your application so you can basically specialize based on the application. Also, we're like in regular chips, we see we've put, or we're putting more, the vendors put more and more different specialized units and not everything can run all the time at the same time. Right, so, and this is what we call dark silicon then. So we're specializing, but we're not using everything all the time.
Starting point is 00:08:23 With this, we're basically paying some extra chip space, but we might be more efficient for certain tasks by just having specialized areas on the chip. So that's one way. But we can go even further, right? So you've seen this, where we have the very flexible CPU on the one end. So, let me see. So, the CPU on the one end that basically can run anything.
Starting point is 00:08:56 So, single thread CPU, very generic for all kinds of programs. Multi-core CPU is not as generic anymore because you already need to specialize, need to have a parallel program to run on this. If you have something that's completely sequential in its base, right? So the problem is really sequential, multi-core CPU doesn't help you anymore.
Starting point is 00:09:25 There are actually very few problems like this. So in most cases, you can do something with multi-core, but it's not as flexible anymore. We go one step further: we saw the GPU, right? So this is a super parallel SIMD-style architecture for even more specialized kinds of number-crunching problems. And we've seen you can do data processing on this in a highly parallel fashion.
Starting point is 00:09:53 As soon as you're walking through a lot of data, this actually makes sense, right? So for all kinds of analytical workloads with good interconnects, you can get a good performance with this. And now we're going to look even further to the right. So more specialized hardware. And this is basically whenever we have a very specific task and we can specialize the hardware even further. So a GPU is still a general purpose processor.
Starting point is 00:10:23 So if we have something that's highly data parallel, highly parallel in computation, then that works well. If we can specialize further, say we have some protocol that's compute intensive, networking for example, or some cryptographic protocol or something like this, we can go further and use special hardware like an FPGA,
Starting point is 00:10:56 so Field Programmable Gate Area, or we can even use an ASIC, so a specialized circuitry that's just built for this. ASICs you have everywhere in your CPU, right? So everything that's basically built in hardware that's not configurable processing units, these are ASICs. So NICs, for example, to some degree are ASICs. They're not reconfigurable. They can just process network packets. This is specialized circuitry. But you can do this for all kinds of stuff. The big thing what these are built for, of course, these days is AI and cryptocurrency.
Starting point is 00:11:36 Basically, we can specialize hardware just for AI inference, for example, or for cryptocurrency mining. Because this, I mean, it uses so much power, and it's also super useless, of course, unless you want to get rich in weird schemes. But otherwise, this is basically the only way to somehow still make money with this, by specializing the hardware for it. And today, I mean, we will cover ASICs to some degree, but I want to talk about FPGAs in a bit more detail. And for that, as I already showed earlier, I also brought a small FPGA here today, and you'll see some lights here that do something. So if you want to see this, maybe I'll show it in the break or something. You can come closer and touch it, whatever you want. I put it into this frame so you can actually also touch it without breaking it.
Starting point is 00:12:35 Okay, so specialized hardware. So today we're going to talk about very specialized hardware. So the GPU is way too generic for us today. There's two things, two levels, let's say. We have the application-specific integrated circuit. That's a broad term for basically all kinds of hardware or processors that are built for specific applications. So this means we have circuitry that implements protocols in hardware, for example, implements certain algorithms, et cetera. And then we have the field programmable gate arrays,
Starting point is 00:13:19 which are very similar in a way. So they function in a very similar way as an ASIC, but you can reprogram them. So they have logical blocks on there, so this thing here has lots of individual logical blocks with a very specific structure. We'll go through this today. That basically allows you to implement any kind of algorithm on this, any kind of algorithm, any kind of program on this, but in hardware. Different than what we were used to. So in the CPU, you remember you have this different kind of functional units that do calculations, for example, that do certain logical operations, that do load and store
Starting point is 00:14:02 and things like that. This here does not have this. I mean, there is also some circuitry for this, we'll go through this again, but the general processor on here is completely flexible. So it basically doesn't have any addition or something like this; you have to program this yourself. There are some tools around it. We'll go through this, but it's completely flexible. If you don't want to do addition on this, there's no functionality to do addition on this. Okay. So in terms of
Starting point is 00:14:40 performance, if we compare the two, an ASIC usually has much higher performance and speed than an FPGA because it's really catered exactly to an application. It's super efficient because for the application that it's built, it's built in hardware. And it's in a different way, right? So while the CPU, it's a generic thing, right? It needs to go through the program. It basically spends a lot of time just decoding the program, loading the program. For an ASIC, that's not the case.
Starting point is 00:15:16 The ASIC has the program built in. It's hard-coded in the circuitry. So it's only there for this. Of course, it might have some additional logic where you can reconfigure some stuff, but in general, everything is in the hardware. So it doesn't need all the program counter, et cetera, stuff,
Starting point is 00:15:35 and load and store instructions, et cetera. That's all in hardware. But it's a fixed circuit, right? So you cannot change anything at runtime. If you have an error in your hardware, well, there's an error in your hardware, right? And you might be able to work around with some additional fixes in firmware
Starting point is 00:15:59 or some whatever patches, but the circuitry itself is fixed. So you cannot be configured. The FPGA can be reconfigured. So the FPGA has generic logical blocks that you can change, not necessarily at runtime, to some degree at runtime. There's some additional stuff out there for this,
Starting point is 00:16:21 but in general at, let's say, compile time. So you can think about this in the same way. And because of that, of course, the ASIC is much slower to market, right? So the ASIC, you need to basically build a new circuit, and then you put it to the market, and then you can use it while the FPGA is generic, right? So there's different kind of FPGAs with, you will see there's I mean if you go to the websites of the vendors there's many different versions it's generally it's always the same kind
Starting point is 00:16:52 of FPGA itself but then there's a lot of of additional stuff on the board that helps you for certain things and there's like the the design flow is is kind of similar for an FPGA and an ASIC with the differences where does this happen, right? So for an ASIC and often, maybe taking a step back, the FPGA is a prototype for an ASIC. So whenever you're working with companies that build hardware, a lot of the stuff that they do, they will basically prototype this on FPGA. So we're right now working with some CXL stuff, as you heard, and the CXL prototypes, these are FPGAs that implement in their reconfigurable
Starting point is 00:17:45 hardware the CXL protocols. are FPGAs that implement in their reconfigurable hardware, the CXL protocols. So for that kind of stuff, it's also, but you can also use it directly. I mean, this one is a toy, right? But in general, you can use it directly in your data center, whatever, to do something. Okay, so the design flow or workflow, how this works for an ASIC, for example,
Starting point is 00:18:10 you start with a design. So you have a circuit design. So actually, what kind of circuits are there, what kind of flip-flops, et cetera, and gates. Then you have a tape-out, which is basically, I mean, this is the way it works in an abstract way and practically as well, but of course, because it's so fine, there's a lot of physics and a lot of technology behind it. You basically project an image of the circuits onto the die, and then it's basically somehow built on the die. But the design itself is more or less an image; it's a film that's then projected at the rollout.
Starting point is 00:18:57 so i have a photo mask this is what the tape out does right so you have a photo mask that then is used for chip production and i mean classically if you see old pictures from the 80s or something, people would stand in front of large desks where they can see the different layout. And this is also, if I show you these images of dyes, it's basically what is projected on the dye. And there, I don't know exactly the technical way how this is done, probably laser and whatever.
Starting point is 00:19:29 It is then built into the chip itself. And, I mean, at the design stage it's fine, right? There you can iterate quickly. But as soon as you build this photo mask and put it to manufacturing, then fixing anything is super expensive. Because this is basically where you build the chips. You will always test the chips, because for a lot of the chips,
Starting point is 00:19:58 just because of irregularities in the die, et cetera, there might be some problems. But if it's in the design I mean, this is also why there's always some some basically some arrow bounds on the chip so you can Capture some stuff usually because there will for sure I mean like in software, right? There will be some bugs here and there so you build build some redundancy, some error checking, so you can fix this.
Starting point is 00:20:29 But after manufacture, basically the chip is there. You cannot change anything on the chip anymore. So anything after that needs to be fixed in software. And if you have a circuit that's completely fixed, there's not much you can do. In the FPGA design flow, there you basically have a design and this is also circuitry. So, on here, I have a chip layout which has different kind of gates etc. and you will see this, as I said, logical blocks. For this, I build a design that's built into or programmed onto the chip. And once it's programmed, it can be implemented onto the chip. So it's streamed onto the chip or flashed onto the chip.
Starting point is 00:21:14 And then I can test it, I can use it, but I can also fix it again. So the design and implementation. So the design is basically you programming this on a high level, and then there's a step that basically projects this again onto the chip itself. And this actually takes a long time. So this might be hours to days of compilation, but only that, right?
Starting point is 00:21:43 It's not like a complete process where then going back where I have some piece of more or less hardware where I have to go back and redo everything. So here I just have to compile basically again. And then the implementation, that's actually fairly fast. Flashing this on here doesn't take much time. And so then bug fixing is not a problem. So it takes some time, especially the larger the chip,
Starting point is 00:22:11 the more expensive, and the larger the program that you want to flash on, the more time this takes, just because it's a complex process. But it's not months or years. It's just a few hours, essentially. Okay, so more details. So it's called field programmable gate array: you have free choice of architecture, right? So, I mean, the chip itself has an architecture,
Starting point is 00:22:42 but the way you use this architecture is completely flexible. And hypothetically, you could actually go and take a huge map, as people did in the past; for this one, it would probably still be possible. You draw a huge map where you say, oh, I want this to be connected here and this there. So you have connections, you have logical units, and you have lookup tables, and you could basically map this out completely, right? In practice, you don't want to do this because modern chips are so large that this is not really feasible. So you program this and then it's translated into an architecture.
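To make the lookup-table idea a bit more concrete: a logic block is essentially a small lookup table (LUT) whose contents get flashed. As a rough software analogy (plain Python, not vendor tooling; everything here is made up for illustration), a 1-bit full adder is just two 3-input LUTs, and chaining them gives you a multi-bit adder, which only exists because we configured it:

```python
# Toy illustration (not vendor tooling): an FPGA logic block is roughly a
# small lookup table (LUT). Any n-input boolean function is just 2^n stored
# bits. Here a 1-bit full adder is expressed as two 3-input LUTs, then
# chained into a ripple-carry adder -- "addition" exists only because we
# configured it.

def make_lut(truth_table):
    """truth_table maps input bit-tuples to one output bit."""
    return lambda *bits: truth_table[bits]

# Enumerate all 8 input combinations once, like flashing the LUT contents.
sum_lut = make_lut({(a, b, c): a ^ b ^ c
                    for a in (0, 1) for b in (0, 1) for c in (0, 1)})
carry_lut = make_lut({(a, b, c): (a & b) | (a & c) | (b & c)
                      for a in (0, 1) for b in (0, 1) for c in (0, 1)})

def ripple_add(xs, ys):
    """Add two little-endian bit lists using only the two LUTs above."""
    carry, out = 0, []
    for a, b in zip(xs, ys):
        out.append(sum_lut(a, b, carry))
        carry = carry_lut(a, b, carry)
    return out + [carry]

# 5 + 3 = 8: bits are little-endian, so 5 = [1,0,1,0] and 3 = [1,1,0,0]
print(ripple_add([1, 0, 1, 0], [1, 1, 0, 0]))  # [0, 0, 0, 1, 0] == 8
```

On a real FPGA the toolchain does exactly this kind of mapping for you, just onto physical LUTs and routing instead of Python dictionaries.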
Starting point is 00:23:27 But you have some influence on this. You have fine-grained pipeline communication and distributed memory. So typically the memories in there is all distributed across the chip. Again, we will see this to some degree. So you have fast access and you basically have a slower frequency. So these are in the hundreds of megahertz range typically, not gigahertz range. Right now, for very modern chips, you will be in the 5 GHz range or something like this for CPUs or GPUs.
Starting point is 00:24:10 This one has 12 MHz, for example. So this is less than my first PC. And the big ones will be in the, as I said, 600 MHz range, something like this. So that sounds slow, but you can do a lot in a single clock cycle. So this is the difference. Rather than having one small instruction, or say, if you're really good,
Starting point is 00:24:39 you get in a single core, you get maybe three instructions per cycle or whatever. Here, you can do a complete sorting or whatever you could do in a single instruction. You can build everything completely parallel on there. You can have long pipelines, just depending on how fast the signal propagates during a single cycle or single signal. And because of that, or
Starting point is 00:25:06 in general, it's also fairly power efficient, also because of the low megahertz or load frequency. So usually these have under 25 watt power consumption. I don't remember what this has, but this is powered through USB, so it doesn't really take a lot. So in general, the FPGA is kind of the middle ground in between a CPU and an ASIC. And this is also what it's, what it's typically used for, right? Either in places where I want to be efficient, but don't have
Starting point is 00:25:44 to change the code a lot. So I want to be able to upgrade something. So networking, for example. So smart networking. I want to be very efficient. This has to be fast. But it doesn't have to be changed a lot. But every now and then, the protocol might have to be updated somehow.
Starting point is 00:26:01 So then it makes sense to have an FPGA, because if I have an ASIC I cannot really update anything or as I said for prototyping right so whenever I want to try something with new hardware I want to maybe have an ASIC at some point well then for debugging this and getting the timings etc right getting an architecture that works well, I can start with an FPGA and just basically iterate on the FPGA until I'm happy. And then I will build it in an ASIC
Starting point is 00:26:32 and the ASIC will be even more efficient. There. Okay. So just as a comparison, again, I've alluded to this to some degree already so far, but here is a comparison to the classical von Neumann architecture. So the von Neumann architecture is the classical
Starting point is 00:26:52 CPU design, where I have memory separated from the CPU. And I have this kind of pipeline, how the information is processed. So I have individual steps, how I have this kind of pipeline, how the information is processed. So I have individual steps, how I have a program counter, I have individual instructions, and they're processed now one by one. We know this, right? There's some parallelism in here today. But in a very classical way, this would be completely sequential. And a lot of the logic and the cycles and the chip space is invested just in fetching and decoding instructions.
Starting point is 00:27:32 So I have to basically load the code, I have to decode the instructions into micro-operations, I have to queue these micro-operations. Modern CPUs will reorder these in order to not wait for the memory all the time, because the memory is slow, because it's outside of the chip. In the FPGA, there are no instructions. If you want to have instructions, you have to program your instructions yourself. So you have to basically build all this yourself, because the FPGA itself doesn't have anything like that. The FPGA just has circuitry. It's basically just electronics.
Starting point is 00:28:11 It's just circuits. There's no logic on there that would give you, oh, now I'm loading something else or something like this. But I'm building gates and circuits that just connect to each other and do something. There's a frequency, so I can basically set some frequency and can do something when the signal changes, but it's really, I have a signal change, now what do I do with the signal change, essentially. Flip-flop registers, and they have blocks of RAM, very small RAM, but it's close within the chip,
Starting point is 00:28:48 it's close to the logical blocks. It's across, distributed across, so I have very tight connections. I can directly read within the same cycle, I can basically read, just like from registers, from the RAM across the chip, and I wire this to the logic blocks. So I'm not basically have logical operations to find out what I'm gonna do next, but it's hardwired, not hardwired. It's flashed on there, so it's wired together. And all information is generally processed in parallel.
Starting point is 00:29:26 And the speed or the latency of the chip really just depends on the signal propagation time. And this is also like depending on the circuitry that I'm building on here, the compiler will scale the frequency of the chip. It will basically say, okay, this I can do with this frequency. This was maybe too large. The timing won't work out, right?
Starting point is 00:29:48 So I've built this large network of things in order for the signal to propagate, I have to scale down the frequency. So this is, and this is optimization step. Usually this is an accelerator card. So this is how you usually buy this. If you want to buy an FPGA that's connected somehow to the host system and it also has some local memory, some local I.O. resources can come with DRAM modules, flash storage, network interface, video,
Starting point is 00:30:26 peripheral ports, anything you can think of, right? So, and this is where I say, okay, it gets kind of wild. If you want to buy this, you find like lots and lots of different boards where you say, okay, this is more for networking, this is more for video processing, this is more for AI, and depending on this, this might even have an additional CPU on there, this might even have an additional CPU on there, this might have flash, etc., all kinds of stuff, but typically it's a PCI Express card, so most of them that you can buy. We'll also see other stuff, but it looks something like this, rarely something like this, but of course you also find small stuff like this. But of course you also find small stuff like this. And there's other
Starting point is 00:31:05 integration options as well. As I said, the typical way of connecting this is PCI Express. And as we've heard, probably soon CXL. So essentially having something on the side here through PCI Express, where then you talk to the CPU, or the CPU basically pushes some data to the FPGA. The FPGA does its thing, and then we go on. There's also other integration options. Of course, it could have networking, so it could just be network attached, like anything you can think of, but some specialized things are, it could be within the data path on the disk. So this is something that exists
Starting point is 00:31:52 that you actually have a hard drive with an FPGA on there. So you can do some pre-processing. You can do some additional operations already on the disk or that existed for sometimes. There might be a new version sometime soon with the Excel, et cetera. I think some things were announced. Yesterday, I've checked again.
Starting point is 00:32:15 I couldn't find any currently available CPU. But there have been CPUs where it's directly attached to the CPU. So you have basically an FPGA coprocessor that then would be connected through UPI, through Infinity Fabric, Crossbus, whatever you have, depending on your architecture. Some kind of interconnect such that you can directly basically talk to the same memory you can directly communicate to the fpga with a very high throughput and very low latency so for the fpga folks this was kind of the dream right now i didn't find any currently available version might come up soon again or there was an announcement at some point, I think from AMD that this would happen again,
Starting point is 00:33:07 but I haven't found it yet. So this is, and it's actually used in industry, and in research of course, I mean research, whenever there's new hardware, people go right in, even if it's just announced and people start thinking, okay, how could we use this? But this is also used in industry, right? So this is, I mean, I said in prototyping,
Starting point is 00:33:30 but also in production. So the Microsoft Catapult project uses FPGAs for the data centers. And this means, I haven't checked lately, but at least for some time, I think still, but I would have to verify every server in the data centers. So in Azure, cloud basically has an FPGA for their networking. So they basically every, and you can see this here, I mean, this is an example where they have an FPGA to basically have this like all the virtual network functions
Starting point is 00:34:09 reconfigurable in the cloud and so this was a like a major breakthrough for FPGA research because all of a sudden like there was like a real world example where you have lots and lots of FPGAs everywhere. Right. So meaning all of the servers have an FPGA. This means, okay, we can actually think about how would we use this on the data path. So if we're getting some information from somewhere or we're sending something to the network, what if I do some pre-filtering there? What if I do some pre-aggregation or more complex interactions with the data before I'm actually sending this? I mean, also compression, things like that would make sense here. And here the FPGA would be PCI Express. And you can do like virtual network functions on this.
Starting point is 00:35:03 You can do a reduction of data movement. And of course, you can look at ETH research. There's lots and lots of things that they thought about that you can do when you have this kind of architecture. Because they have a deployment at ETH where they use this, like with Microsoft together, they tried this out in various ways. Another example was EBEX, also a research project from ETH as far as I remember, where they used this as a coprocessor. So where the FPGA is basically directly attached to the SSD
Starting point is 00:35:50 and then used in MySQL as a custom storage engine. And for this actually, there was also for some time in Berlin, a startup that built a specific accelerator with FPGA for MySQL. And I forgot the name of the startup, but we had a presentation here at some point. So if you go back to the lecture series, I think Practical Data Engineering 2019 or something like that.
Starting point is 00:36:17 There they did a presentation on how this worked, but then they were bought by somebody I don't remember exactly. And they stopped building this, but just continued FPGA stuff, unfortunately. So it's not a database startup anymore. It doesn't exist as is anymore, but for some time there was a database FPGA accelerator here in Berlin, which I thought was quite cool.
Starting point is 00:36:52 Yeah, so this is basically here, if you have this in your storage, and there are also disks today, right, SSDs that basically have an FPGA on them, where then you can basically say, okay, I have some logic on my FPGA close to the SSD or on the SSD on the data path. So I can basically say, okay, SSD, if you read this and this, please do that. So again, a simple idea would be compression. Rather than just storing the data as is, I go through the FPGA in order to have everything always compressed. That might give me a bit more space on the SSD. And SSDs of course also have additional tooling for that, but I could also do other things
Starting point is 00:37:31 like encryption, like filtering already. So I'm reading the data and I only get back what's actually required for me. Or I could have, rather than having a block-based interface, I could have more like an object-based interface and say, I'm look, do a look up here. And then the FNPGA does the translation for me, checks if the, which kind of blocks are here and for my object. And I return this, right? And here in eBigs, for example,, they did projection, selection, and aggregation on the FPGA that was on the data path, so where the SSD was basically connected to the FPGA. And I don't remember, I don't have a date for this, but I think it's a Samsung disk. I think this is two years old or three years old where you basically have an FPGA in there. So you could basically build what the ETH guys did back then. You can just build it in there. You don't have to fiddle around with FPGAs yourself.
Starting point is 00:38:46 And this is basically what I also already said. So Intel for some time had FPGAs as coprocessors in two ways. There was an FPGA coprocessor on a separate socket, connected through QPI. From QPI, you can already tell it's a bit older: you have a direct connection through QPI to the other socket, which has the FPGA. So one socket is the regular CPU.
Starting point is 00:39:22 The other socket is the FPGA that you can then reconfigure, and they can quickly communicate with each other. And then there was this Intel processor. You can see there's a picture, right? So this is basically an Intel processor where one half is an Arria 10 FPGA and the other half is a Xeon processor. You don't see it from the outside, but underneath it's actually two dies in one package: one is the FPGA, one is the Xeon, with tight integration where you have UPI in between, so a very fast interconnect between the two
Starting point is 00:40:15 parts of the processor. Logically, it's basically one processor, but one part is configurable and one is the general-purpose CPU. And with this, you have a very nice tight integration, a very fast interconnect, and you can just access your memory like a NUMA node, rather than going through PCI Express, which you would have to do whenever you have a PCI Express card. And today, again with CXL on the horizon,
Starting point is 00:40:48 this might not be so much more expensive going through PCI Express, but thinking about this one where you would have PCI Express 3, this UPI connection, of course, is way more performant. You get way higher bandwidth and way lower latency than going through PCI Express. With this, you can see that there were some experiments for some workloads. I think it was some basic OpenCL computation. It's more like a graphics program. It's not, this I think was the only one
Starting point is 00:41:29 where I could find some numbers for. So it's a highly parallel workload, not that much communication, but for this you can get fairly nice speed ups doing your, like putting this on the FPGA and having this direct connection. Okay, so with this, I would say let's do a quick break here. So, you've got the very rough, raw overview on a high level and then I would try to go a step deeper and actually tell you what's inside there, what's in the chip itself
Starting point is 00:42:14 and how does it logically work and how is it implemented. Questions so far? Very good. Then let's do a four-minute break. You've seen the hardware to some extent, and I'll show you some schematics of the hardware again. So I pulled out the schematics from the manual,
Starting point is 00:42:44 where the details and the hardware characteristics are described in a bit more detail. But before we go there, back to this little thing here: let's go through this in a general way. How does this work in general, from the bottom up, from the very low-level logic blocks to the larger structure? So we're going to start very small. And the key to reprogrammability in the hardware of an FPGA
Starting point is 00:43:16 are so-called lookup tables. And this is basically something you should know; I learned this in Technische Informatik way back when. It's basically just a logical truth table, right? So take an AND or an OR: you can encode it through a lookup table, where you basically say, if my input, like in an AND case,
Starting point is 00:43:50 if my input is 0, 0, then my output is 0. If my input is 0, 1, my output is 0. If my input is 1, 0, my output is 0. If my input is 1, 1, my output is 1. So this is a lookup table. So I can basically see what is the lookup, what is the input. For that, I have a certain output. And this is what we do on the FPGA. That's basically the major thing on the FPGA, is having these lookup tables where I can say, given a certain input, what will be the output?
Starting point is 00:44:26 So I really have small memory locations on there where I say, okay, I'm encoding the lookup table, and based on this lookup table, based on the input, please produce this or that output. The same, of course, for an OR, or a NOR, or a NAND, or whatever. And you can encode all other kinds of lookups as well. So it doesn't have to be one of the logical gates that you're used to.
Starting point is 00:45:02 But you could have something that's just like everything zero right whatever we're putting as an input we get zero out or whatever we just like we just want the first one to have an output right so we basically we have a gate where we're only looking at the first input this is tough let's do it like this so we're only looking at the first input. This is tough. Let's do it like this. So we're only looking at the first input, and the second doesn't matter. So what would the lookup table for this look like? If we're only interested in the first input,
Starting point is 00:45:38 and for the second we don't care. So I mean the lookup table, the input table, what would it look like? Yes? This part? Yeah. Would be the same. Stay the same. Genau, yes. Output would just be 0011.
Starting point is 00:45:58 Exactly. So, the output would be 0011. So, it's very simple. And we're basically just really encoding what our logical output is. So we just have a lookup table, very simple. And this is built in hardware. So this is what this looks like in hardware. So we have SRAM blocks. And in this case we have a 4-input lookup table, so it's a bit more already. Typically, this is 6-input. On this one, it's actually 4-input, right? So, this is a small piece of hardware.
Starting point is 00:46:35 This would actually look something like this in hardware, right? So, we have basically 4 inputs. Then, we have 2 to the power of n, so 2 to the power of 4 storage locations, because we basically need to encode all combinations that we want to have. So in our case where we have 2 inputs, 2 to the power of 2, 4 inputs, 4 storage locations basically. For four inputs, 16 storage locations where we encode the lookup table.
Starting point is 00:47:14 So here we basically have our lookup table for four would be 16 bits. And here we just basically put these inputs. We basically have to just encode this here. I'm going to show you this in smaller examples in a few slides. And then in order to process the input, we have a tree of multiplexers to read basically the input and the configuration. So this is basically the first input
Starting point is 00:47:48 which will go to all of these multiplexers and based on this input we basically produce the first output. Then we have the next multiplexer which basically combines this with this input. Again, combines until we have the final output that combined like the lookup table with all of the four inputs and we get the final logical output for all of the combinations. So this will basically, this input will activate every other input here.
Starting point is 00:48:25 This one will then activate every other on this level. So basically, we're always just looking up through these multipliquers. We're looking up a specific position in this input or in the lookup table. The input basically activates certain positions. So this one will, depending if it's 0 or 1, will depend either every first or every second bit will be activated. This one will then basically say either this one or this one, and if it's like the first, if this is 0,
Starting point is 00:49:04 it will basically activate this one, this one, this one, etc. If this one is one, it will activate only this one, so we'll get this one through, and this one through, right? And this one again will basically out of those select, and in the end we're looking up exactly one position in here and our output will be 0 or 1. So just a logical lookup table. And depending on the kind of logic that we want to implement we can then basically program so all we need to do in this for a single lookup table is just basically
Starting point is 00:49:43 say what will be the input here, right? So what will this be the bit mask in here? And this is what we program. Of course, that's not all, right? So with just this, I mean, the signal comes in here and this signal goes out. So this needs to go somewhere. So we basically have to combine this with other lookup tables but for a single lookup table that's basically what we do and this can give us all kinds of logical
Starting point is 00:50:11 interactions or logical circuits within or in order to combine these then we we have D-flip-flops, which basically keep the input signal stable at the output for a single clock cycle. So this is also a little circuit, and there is a bit more to that even. So technically we could have just a a single bit going in and we put this bit on the output so we're keeping the state for a single clock cycle so we're not losing the state. But this can do a bit more so it can also output the reverse signal so if we're basically negating the input. We can also reset this through set and reset ports. So we can also say, so this is basically
Starting point is 00:51:11 we have the clock cycle coming in. We have the input coming in here. Then on the output, we will have the same output. On the negative output, we will have the negated output. Here, we can basically say, okay, let's keep this, or let's reset this. And with the clock cycle, basically, it will be stable. And this will then give us the final output. So, what, like a mini example, again, here, in the, like, this is basically the two input lookup table so this is a bit easier
Starting point is 00:51:49 to see than what i did earlier so in general we have the sram here and then we have the lookup table or we have the the multiplexers so the sram each of this block each of these bits can be 0 or 1. And then here, what would this be? So let me think about it. So does anybody directly see what the logical operation is that we have here? It's basically a negated OR, right? NOR. So it's either. So it's not an OR, because we're basically, if input one or two is yes, then
Starting point is 00:52:50 it's zero. But if both are one or zero, it's one. And we encode this basically just, or we just use basically the output that we want to have for the lookup. This is what we encode in the SRAM, right? So we directly map this on here. If X1, or if, let's start with X2, right? Well, let's start with X1.
Starting point is 00:53:20 So if X1, for example, is 0 and x2 is 1, so if x1 is 0, we're going to go up here. x2 is 1, that will basically go here. Then our output is as expected is 0. If we have 1, 1, for example, x1 will go to the second output here. x2 will go to the second output. So we'll have 1 as an output. So that's how this works. And this is how we implement these logical blocks.
Starting point is 00:53:53 Now, we basically combine these into elementary logical units. So I mean, this is basically we have these mini lookup tables. We have already know there is some flip-flop to keep stuff stable. And these are now combined typically to something where we have multiple lookup tables, four or six input lots, lookup tables in an elementary logic unit, where then we might have some carry forward logic. So this is basically, if we do some addition, for example, so like the stuff that I showed you here, right?
Starting point is 00:54:37 So I need to carry forward the results here, then I can basically have a carry out logic. I can move this to the next lookup table. So this is basically carrying in. Then I know, OK, I have an overflow, for example, in my calculation. So this is a bit easier than doing some routing that would go through the next one.
Starting point is 00:54:58 So this makes this kind of operation faster. Then I have my dflipflop which keeps the stable and I have a multiplexer again where I can say oh I want but I don't want to have this like through the clock frequency but I want to just directly use this right so I directly multiplexes into the next operation and as long as the signal is fast enough I don't need to use the clock cycle that's as long as the signal is fast enough, I don't need to use the clock cycle. So as long as the signal propagates fast enough to the circuitry, I can just stitch them together and directly use the outputs from the inputs. And this is basically where then optimization comes in. And this is what you would see on this. As far as I remember, this is what is on this FPGA.
Starting point is 00:55:51 So this is the setup here. And this can be called elementary logic unix or ALUs, or it could be called slice and silencs, or in adaptive logical modules, ALMs in Altera. So depending on what hardware you buy, you will find different names, just like in the GPUs, right? So what this exactly is called, but it always looks somewhat the same. And so you have some storage in here essentially where you can, for a clock cycle or for some
Starting point is 00:56:24 time store the output, but you don't necessarily need to. You can also just directly work, continue with your work if you want to. So there's again some configurability in here. And then these groups of ELUs will be combined to logical islands. So in this case, for example, you have two logical, elementary logical units, which then are combined with a switch matrix. These are, we'll see for this one, but this is usually, again, this is like yet another level of organization
Starting point is 00:57:03 where you combine multiple of the elementary logical units to these logical islands and then you have some switch interface. So, within this logical island, everything is basically hardwired. So, you have all these individual connections in here. These are basically connected to the next one, the carry out logic is connected to the next one. So we have the carry out logic that goes through, we have and then these are directly connected to some kind of switch matrix. So you can see that there's basically some kind of interconnect fabric within the chip. This again is configurable. This basically says within here I can decide through programming which is connected to which. I have a limited number of connections, but these connections I can
Starting point is 00:58:05 configure. So, I can basically connect this logical island to this one, or I could connect it to this one, or to this one, but for this I will need these connections. So, these connections are basically switched on these intersections, and they can either be here, or I can also combine multiple to each other. We'll see this on the next slide, basically. And on the outside, we might have some external connections. So some kind of memory connections, some kind of I.O. connections, things like that. And an FPGA has hundreds to thousands of logical blocks. and then in between there's this configurable connectivity between the logic blocks. This basically works in a way that on these intersections you have a programmable link. So this looks something like this here. You basically have a small programmable switch
Starting point is 00:59:07 with a memory cell where you can activate individual of these. You can basically say, I want this to be connected like this, or I want this to be connected like this, or like this, or like this. And you can basically activate also. The signal travels in all directions,
Starting point is 00:59:22 or it just makes a turn, or in three ways. So this is basically all combinations that you can think of. You can activate and for each of those you have like a small memory cell which then will activate or deactivate this. So it's really like you have an S-Fram cell that basically depending on the state of the cell this is activated or deactivated. And this is at programming time.
Starting point is 00:59:48 So when I flush this, then this switch matrix will be set up. And you will have multiple of these connections. Depending on what you want, you can connect one logical island to multiple other logical islands or not, depending on what you need in there. Besides these logical islands, this is the major part. So, these logical islands and the switch matrix, this is basically how you bring your program onto the FPGA, your circuitry.
Starting point is 01:00:23 And I hope you can understand it's really like circuitry. Although you program this, you write a code for this, what happens in the end, it's circuits that are switched. Basically, we reconfigure the hardware in a way that the hardware basically transfers the signal in a certain way through the processor here. Additionally, there's some hard logic on there. And I'll go through more in a bit, but let's first look at what this would look like in here. So we have the logical islands in here.
Starting point is 01:01:05 These are these little blocks with the elementary logical units in there with the individual lookup tables in there and the flip-flops, right? So the gets finer and finer. And in order to keep this manageable, we basically block this. But in between them, we have some block RAM.
Starting point is 01:01:25 So, and this is basically small on-chip random access memory in the order of kilobytes. So this is not a lot, right? So this whole chip we'll see has a few kilobytes of of BRAM in there. So it's not megabytes, it's not gigabytes, but this is local memory that's super fast to access. It's distributed across the chip, so you don't have to go through the end. I mean, we see this here, right? So this might be connected somewhere, but let's say this is the whole chip here, the whole big thing, it's not just on the borders, but it's really somewhere in between. So you have fast connection. You can directly switch this. Otherwise also you will need a lot of connections on the switch matrix, right? So because if you want to read something in a logical
Starting point is 01:02:18 unit or logical island from a certain BRAM block, you need a connection there. So you need to basically configure your switch matrix to connect there. And if this is all on one end of the chip, but you need it on the other end of the chip, then you need to basically create the switch matrix in a way that this actually connects. So this will on the one hand be long and it will take a lot of space, a lot of your switch matrix. So it makes sense to have this distributed across. And you might have some other units. So other signal processing units depending on what you want and what you need in there.
Starting point is 01:02:59 So some floating point units, for example. So this is stuff where it actually makes sense to build specialized circuitry, because building a floating point unit out of the logical blocks takes a lot of space. So, building this in an ASICs fashion is actually much smaller and thus much more efficient. So, if you know already, well, I will need some floating point operations somewhere, then it makes sense to have a small hard IP block in there that you can directly use. And then you're just going to connect this to your Logical Plus. And this is called hard IP because it's intellectual property.
Starting point is 01:03:38 This is basically what the chip vendor won't tell you about what's exactly in there, but it just gives you a configuration how to use this. And there might also be soft IP, which is basically something you don't really know how it works. It's built in the compiler. The compiler will automatically build certain blocks on the FPGA for you, but you cannot really tell that much about it. And some of the hard IP is directly on the FPGA itself.
Starting point is 01:04:11 Some hard IP is outside. So I already showed you something earlier where we said, okay, there's additional processes on there. Even here, we have some additional chips on there that do some of the USB, for example, right? The USB connection doesn't really make sense to allocate space on the FPGA for this, when you know anything has to go through there anyway. So, I'll just put something in here to do this signal processing here.
Starting point is 01:04:40 But at the same time, some of this might even end up on the FPGA itself. So, let's look at yet another example. So, this is a fairly recent type of FPGA, so in this case, an Intel. I mean, the two major vendors are Intel and AMD these days. What a surprise. The one are the Xilinx and the other are the Altera. So Intel is Altera. Used to be separate, but they bought them up. And AMD is Xilinx. If I'm not switching it up now, I should be right. And here you can see see the size that you would have in 400,000 ELMs, which is like ELUs, but in Intel speak with each two lookup tables or two logical islands basically. No, the ELMs... So the logical... No, it's...
Starting point is 01:05:46 It's... ELUs, it's 400,000 ELUs with two lookup tables with six inputs each, and then four flip-flops in there to keep the state. And these are then built into... They call it labs, but basically
Starting point is 01:06:02 logical islands, where each of the logical islands consists of 10 ALMs. So essentially we have 40,000 logical islands on this. And then 2,000 BRAM blocks with 20 kilobytes each. So, what do we have? 40 MB of VRAM on this chip directly. Or 50 MB on the chip directly inside. But of course, on your board itself, you can already see again, so the board itself might have some additional memory chips in here, has some additional stuff to do the networking and the PCI Express.
Starting point is 01:06:50 So you can see this is networking interfaces, and here you have the PCI Express interface. So now let's look at this thing here. So this is what I brought to you here. This has 7,000 logical cells, where each logical cell has one lookup table with four inputs. So it is significantly smaller, as you can see. And the flip-flop, each programmable logical block, so in this again an ELU or a logical island in this case, consists of eight logical cells. So, eight of these lookup tables basically combined, and then we have 32 BRAM blocks with 4 kilobytes each, so 128 kilobit in total.
Starting point is 01:07:48 So it's significantly smaller, but you can also see by the chip space, this is significantly smaller. So if we look further in there, so this is basically how this is set up, and this is only the chip itself. You can see that there's basically these, what did we call them, programmable logical blocks
Starting point is 01:08:12 are distributed on the chip, like fairly square layout. And then we have these VRAM blocks in there. We have something to give the clock, right? So the phase lock loop. We have a non-volatile configuration memory. So something that basically makes sure that the configuration is stored on this chip. And then each programmable logical block consists of these eight input lookup
Starting point is 01:08:48 tables as you can see so this is basically again this more or less the same as we saw earlier just in a different like rotated basically we have our lookup table which stores the information. We have four inputs to the Lookup table. We have some carry logic. We have then a flip-flop to store the output. We have the clock that goes to the flip-flop. We have the enable and reset setup that would go through the whole logical block. Right, so the whole programmable logical block basically just goes through the whole logical block. So the whole programmable logical block basically just goes through the clock,
Starting point is 01:09:28 the set and reset things, and then we have the output. So basically four inputs to each logical block and then eight outputs for the whole programmable logical block. So we can basically, in one programmable logical block, So we can basically, in one logical block, we can process a single byte, essentially.
Starting point is 01:09:51 So an output would be a single byte here. As input, we can have up to four bytes, depending on how we switch this, the output would be a single byte here. Okay, so and with this, the output would be a single byte here. Okay, so with this, I would say, I mean, I showed you more or less already, but I'm going to show you real quick as basically the last step for today, what this looks like. Right, so I have a very simple program only.
Starting point is 01:10:22 Let me clear this and show you. And so this is the counter that I already demonstrated earlier. So this is basically, this is in Verilog. So we'll talk about VHDL next time. So there's two major programming languages if you want to go close to the hardware. There are also higher level programming languages where you write C-like code with it. But this would be more of the signal processing type. This will still be translated to this actual layout on the chip. But this is where you say, okay, I have basically input signals, I have some logic that happens with the input signals and I have output signals. And here you can see what this basically does is we have the inputs, or we're assigning
Starting point is 01:11:19 the outputs. So I have a register, which I'm assigning. Let's start from the top. So I have my module, which is basically what I'm defining. I have an input and an output. So input is basically the clock here, the output are my LEDs here, and I have a register that I'm using as a counter, so this is basically something
Starting point is 01:11:54 where I'm storing my counter. I'm assigning the output that I want to the counter, to a portion to the counter, to a portion of the counter. So my counter is 32 bit, but I'm only assigning, because I only have eight LEDs, I'm only assigning a subset of these. And that is from position 20 to 20,
Starting point is 01:12:24 eight, 27, so 8 bit anyway. So basically in order to have all of them, like all of the LEDs assigned one of the counters, the counter will just basically, it's individual bits, so the counter is 32 bits, it will just count the clock cycles in every clock cycle, increment the counter by one. As soon as my position 20 changes, the LED state here will change. If position 21 changes, the next LED will change, etc. So this means
Starting point is 01:13:07 we have a clock speed of 12 MHz, as I said. And then basically whenever divided by 2 to the power of 20, so 1 24th, I think yesterday of a second I calculated 1 24th of a second. This first LED will basically change to a 12th, a 6th, a 3rd What, a 1.5, is it, or something like this. So there we are at the second range, something like this. No, it doesn't work like this.
Starting point is 01:13:52 I'm getting confused here now. But anyway, around this here, we will see, right, a change round about every second. And then this goes even slower. And if it overruns, it will just start from the beginning again. So not much more than that. And all that I need to do here, basically, is then flash this. So, I mean, what I need to do, for this I have a makefile,
Starting point is 01:14:23 is basically first I compile this program. So what happens then is this gets translated to a layout for the chip. So this gets translated to switching on and off or assigning zeros and ones to these lookup tables and switching the switch matrix. So one thing that will happen is basically something gets connected to the LED registers. So where I basically put information where the LEDs are switched on and switched off
Starting point is 01:14:55 and the clock will be connected to some of these logical blocks and then with every logical input, I will basically increment there. So I'll build a small adder, and then we'll see this next time. I think I have an example how this can be done. And then through the carry forward logic, this will basically carry on through the next logical blocks
Starting point is 01:15:23 until I basically have my results in the registers. And now I can also compile this here. Let me see, do I have focus still? I don't see my mouse here. Okay. So I have a make file for this, as I said. So right now it's basically erasing.
Starting point is 01:16:00 And so it's, I mean, this is so small. So the compilation doesn't really take much time. It's erasing, then flashing, and programming, and then you can see it already starts. So then basically it doesn't have to do anything from here. The FPGA does everything by itself. So basically now it just continuously counts the clock cycles. So this is basically all it does. And we can see the count of the clock cycles from position 20 on.
Starting point is 01:16:29 And like in 20 bits, like if you have it at 32 bits, we're basically continuing here, and you can see this is fairly fast, and here, as I said, so around here we're on a second range on the bot. And I mean, from here on, it doesn't do what it needs to do anymore. This is programmed now, so it won't change anything. So this also staples the SRAM. So now if I put it in here again, it will just continue from where it should also continue from where it was.
Starting point is 01:17:03 So no, it doesn't. This one loses. It's probably RAM. So the RAM loses, but it then continues to calculate from there. Good. Questions to the architecture? No? So this is, I mean it's slightly different, but you get like it's very efficient implementation and you get like a hardware-like, circuit-like implementation. This is also how an ASIC would basically be programmed in
Starting point is 01:17:44 the end. But with the difference that rather than being able to flash this, the ASIC would then be in hardware. Okay. So if there are no questions, we'll continue next week with more details on the programming, so how you actually write these programs to some degree. So I'm not an expert, so I cannot really give you all the details,
Starting point is 01:18:11 but also the design process and how this is flashed on there with a bit more details. With that, thank you very much and hope to see you next week. Thanks.
