Hardware-Conscious Data Processing (ST 2023) - tele-TASK - Summary

Episode Date: July 20, 2023

...

Transcript
Starting point is 00:00:00 Welcome everybody to our next session. Today we're going to do the summary after I finish up FPGA, but I'll start as usual with some announcements. So in our seminar with TU Darmstadt, today we're going to have Çağatay Demiralp. He used to be at Sigma Computing and he'll talk about integrating semantics into data systems. Now he's at MIT. So that could be interesting if you're into data systems; it's not super related to hardware, though. But still, I invite you to join. The Zoom link should also be in Moodle.
Starting point is 00:00:45 Then the other thing: this is gonna be my last lecture here, right? So next week we'll have CXL and then the data center tour. So this is why I kind of need to wrap up today. Then, in the winter semester, we will have a hardware seminar. It's not in the lecture plan yet,
Starting point is 00:01:09 but we'll announce it soon. And this will be exclusively for people who attended, let's say who passed, this lecture. So the idea is that we want to give you something that further engages with all the hardware topics. So if you liked that, there will be more in the winter term, and Marcel will mostly be managing it. We already had this
Starting point is 00:01:37 discussion with one or two of you: if you say this is something that I'm really interested in, then we'll make sure that it is also in the seminar. Say you really like GPUs, for example; we'll make sure there is some GPU content in the seminar. And then we'll have Big Data Systems in the winter term. And we're right now setting up something called the Big Data Lab that will be exclusively for people who have attended or are attending Big Data Systems.
Starting point is 00:02:13 And here the idea is that we give you some hands-on experience with big data systems. So basically doing more of what we're telling you about in Big Data Systems: this is how it works, this is how you set up the system, or what the architecture of the system is. In the Big Data Lab you will learn how to set it up and how to use it. So that's kind of an extension to that. But this is still in the making. So, back to today: we're almost all the way at the end, we're in FPGA. There
Starting point is 00:02:58 will be CXL next week by Marcel. But let me show you my last overview slide; I didn't fix the dates here. So we're at the 19th right now, right? We're going to finish up FPGA. This is only a few slides, five slides or so. And then, because of that and because I'm not here next week, I'm moving the summary here. So I have a summary slide deck where I basically just pasted in all the overviews, and I'll just quickly run through it. And then the idea is if you say,
Starting point is 00:03:42 oh, this is something that I have some questions on or that I didn't fully understand or where I would say also, like, please have more of this or less of this in the next semester or next time you do this, then I'm more than happy to hear this. So I can basically adjust. So that's the idea. And also just to give you an overview again of all the nice things that you learned during this lecture. Exactly. So this is what's going to happen after the FPGA summary, but now we're going to go to the FPGA summary first. And so last time we discussed all the details on how an FPGA works, right?
Starting point is 00:04:37 And I showed you an FPGA. I didn't bring it today. But I showed you some details. And today I just want to give you like a bit of a glimpse on how we can use this in data processing. What are the ways you could use this in a database, for example. As I said, this is going to be brief, but maybe you can take something from it or you can read up on more details in the referenced literature. So, coming back, right? So, as you remember, right, we're not really writing instructions or something, or a program for an FPGA,
Starting point is 00:05:24 but we're setting the layout of the FPGA. So, if we're programming the FPGA, we're basically giving a schematic for the circuitry. I mean, this is not what we're doing today. We're not really saying, okay, please plug this here and this here, but the compiler will do this for us. So in the synthesis, whatever we're writing as low-level code will be translated to something like this.
Starting point is 00:05:52 And this means, well, basically we have an input to the FPGA and then we have all the circuitry. And depending on how we basically lay out the circuitry in there, we can have a very high degree of parallelism. And this is good for data processing, right? So we can basically process a lot of data in parallel in there. And this is also one of the major ways how we can actually get performance out of an FPGA.
Starting point is 00:06:25 On the other hand, I mean, the other way is basically we can do multiple things, or we can have pipelines of circuitry that are dealt with in a single clock cycle. So depending on how complex our task is and depending on how fast the signal travels through our circuitry, we can do arbitrarily many things within a single clock cycle. And then we can again repeat the same architecture, the same circuitry, multiple times on the FPGA, depending on how much space we have on the FPGA. So the FPGA that I showed you last time had 8,000 lookup tables, so we cannot do much on this one.
Starting point is 00:07:21 But say larger ones will have in the hundreds of thousands of lookup tables. So there you actually have some space, right? So you can actually lay out something there and can have multiple, like whatever we would do on the small FPGA, we can place many times on the larger FPGA. And that gives us this kind of data parallelism or an opportunity for this data parallelism. So we can apply the same tasks or the same task to all items in a dataset. And of course, again, we have to kind of balance
Starting point is 00:07:57 how fast can we actually get the data into the system, into the FPGA. So whatever we can get in there, then we can kind of make sure that we can process this in parallel if we have enough space, if the task is not too complex. But in general, like the chip area, like different parts of the chip
Starting point is 00:08:21 can operate completely independently. And we can use this to just replicate stuff. So, say we do a filter, we can have the same filter many times and we can have multiple items coming into the system and being filtered completely in parallel. This is of course also true for any kind of map function, right? So if you remember your MapReduce, then anything that I would do in a map function, that's embarrassingly parallel, so it's very easy to parallelize. It's similar to what we do in SIMD, any kind of vector operation. So this basically gives us a way of using the FPGA efficiently. Let me maybe get a pointer. So it's basically, say we have one filter here, then we can have many, or one map function, something like that.
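To make this a bit more concrete, here is a minimal C++ sketch of such a replicated filter, written in the style that a high-level synthesis tool could unroll into parallel comparator lanes. The UNROLL pragma is a Vitis-HLS-style assumption and the LANES value is made up for illustration; compiled as plain C++, the pragma is simply ignored and this is just an ordinary loop.

#include <array>
#include <cstdint>

// One "filter lane": the comparison that synthesis would turn into a small
// comparator circuit. The condition (< 1000) is hard-coded, like the simple
// hard-coded filter discussed in this lecture.
inline bool filter_lane(int32_t value) {
    return value < 1000;
}

// Filter a whole batch per call. With the (assumed, Vitis-HLS-style) UNROLL
// pragma, a synthesis tool would replicate the comparator LANES times, so all
// elements of the batch are filtered in the same clock cycle.
constexpr int LANES = 8;

void filter_batch(const std::array<int32_t, LANES>& in,
                  std::array<bool, LANES>& match) {
    for (int i = 0; i < LANES; ++i) {
#pragma HLS UNROLL
        match[i] = filter_lane(in[i]);
    }
}

The same pattern applies to any map-style function: replicate the lane, feed it a vector of inputs, and let the tool wire up one copy of the circuitry per element.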
Starting point is 00:09:25 We can have many of those. We'll have something that needs to distribute the data across these multiple replicas, or we just have basically something like vectors coming into the FPGA, and we're operating on these vectors. And of course, we can also build circuitry for any kind of reduction function.
Starting point is 00:09:46 That would basically be something like this collection step here. So with this, we can get speed and performance out of the FPGA. And that's basically one way of having this data parallelism. But then, of course, we can also have pipeline parallelism. That means, and I already alluded to this a bit earlier, if we have a certain task, we can break it down into subtasks and then have different components in
Starting point is 00:10:30 between, or different components on the FPGA for these subtasks, that are then connected. And this communication, this connection, is very efficient because, as you remember from last time, there's basically this matrix, a connection grid on the FPGA, with switches in between, which will be hardwired during programming. And then we can basically have these pipelines generated already on the FPGA.
Starting point is 00:11:03 And either we do this on a clock cycle, meaning that we would have something like a register in between these individual pipeline steps, or, if the subtasks are still small enough, i.e. the signal propagation is fast enough, this can be done within a single clock cycle. But then we don't get pipeline parallelism, so forget about that.
Starting point is 00:11:35 So as soon as we're using a clock signal, we'll have these individual pipeline steps, and then we'll use some kind of registers in between to take the output of one computation and move it to the next computation. So remember, if we build this in a very simple way, these would be the flip-flop registers that we have in there to keep just the output
Starting point is 00:12:02 of one lookup table result, for example. If we want larger data, then we would have to allocate more registers for this, or we could even use some of the BRAM, the block RAM, in between the stages to write something into if we have larger outputs. And by this we then have these multiple stages, and you remember pipeline parallelism: this basically means we can get a higher throughput. So we don't get better latency, but we can have these multiple stages working at the same time; we don't have to wait until everything finishes. So here is the problem, and that's actually a problem on the FPGA: if we have a long pipeline and we're doing nothing about it,
Starting point is 00:12:54 this means the signal has to go through. Let's look at this. So if it's a long pipeline, we already broke it down into subcomponents. But the signal has to go through all of this, all the way until the end. And that means this will take some time. And if we're unlucky, as soon as this is complex enough, this will take so much time that we cannot do it at the highest frequency that is available on the FPGA. So that means the synthesis will basically see this,
Starting point is 00:13:26 right, it will see that the signal will not progress fast enough through this pipeline, so it needs to tune down the clock. So basically I'm going down from, I don't know, 13 megahertz to 7 megahertz, or from 300 MHz to 200 MHz on a modern, larger FPGA. But using these pipeline registers, I can break everything down again into these smaller bits, where the signal will be fast enough to progress through, so then I don't have this problem anymore.
Starting point is 00:14:02 The difference is, in this case here, it will actually take one cycle to process through this, right? So this will really only take one cycle, but the cycle will be slower. In this case, I will need three cycles to go through this, but the cycles might actually be faster still. And depending on what else I have in my program, right? So I might have multiple different things going on.
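To make that trade-off concrete with a small worked example (the numbers are purely illustrative, not from the lecture): suppose the unpipelined circuit only closes timing at 100 MHz, while splitting it into three registered stages lets it run at 300 MHz.

\[ \text{unpipelined at } 100\,\mathrm{MHz}: \quad \text{latency} = 1 \times 10\,\mathrm{ns} = 10\,\mathrm{ns}, \qquad \text{throughput} \approx 100\,\mathrm{M\ results/s} \]
\[ \text{three stages at } 300\,\mathrm{MHz}: \quad \text{latency} = 3 \times 3.3\,\mathrm{ns} \approx 10\,\mathrm{ns}, \qquad \text{throughput} \approx 300\,\mathrm{M\ results/s} \]

So the latency per item stays roughly the same, but once the pipeline is filled, a result can leave it every (shorter) cycle, which is exactly the higher throughput described above.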
Starting point is 00:14:30 I can still be faster with this, or more efficient, than if I always have to wait for the longest pipeline to progress, even if other stuff could actually progress in parallel. Okay, and the final thing we also know is task parallelism. This is similar to data parallelism, with the difference that we're doing different stuff in parallel. Rather than doing the same stuff on different data,
Starting point is 00:15:00 we're actually doing different things in parallel. And that can be almost anything. One thing that we'll have to deal with at some point, if we have complex programs on the FPGA, is some kind of protocol, right? Meaning I want to reuse some of the circuitry for different things, say for example my filter. If I have a simple filter, I will hard-code the filter condition onto the FPGA. That means I can only filter things that are, say, less than a thousand, and that will be hard-coded into the FPGA. I cannot change anything.
Starting point is 00:15:45 If I want to change this, I need some kind of protocol. Either I say these input bits will always be my filter condition and I always send the filter condition with every individual input; then I don't need a protocol, I don't need to do anything, but I'm wasting a lot of the input bandwidth just for these filter conditions all the time. Alternatively, I have to store the filter condition somewhere, and if I want
Starting point is 00:16:13 to update it, I need to make the circuitry aware that this is an update to the filter condition, so I basically need to reroute stuff. And for these kinds of operations, I can have separate parallel tasks, for example. But this is something that I need to be aware of, where I have subroutines, sub-circuitry, in parallel for different things. It might be just management stuff, or it might actually be completely parallel but different operations. So say I do a join here and a selection there on a different table; I can do this in parallel on different parts of the chip.
Starting point is 00:17:00 Or having different regular expressions in parallel, for example. So this is also something. So one thing that we've seen before already in SIMD, and it's basically the same idea if we do it on the FPGA, is sorting. The idea is that in order to do sorting efficiently on an FPGA, we will also use sorting networks. Of course, we cannot sort arbitrarily large datasets using a sorting network, because the sorting network grows quickly (super-linearly) with the number of elements we need to sort. That's why we will keep it small. But we can sort subparts and then we can have merging networks again.
Starting point is 00:17:57 I mean, we remember this from the SIMD lecture, I hope. The idea is that using these compare-and-swap elements, we build this hardware sorting network. And you basically saw this already, right? So this is what such a sorting network looks like. It basically is just a series of comparisons, which in the end results in a sorted sub-dataset. And again, in order to use this efficiently,
Starting point is 00:18:27 or in order for this to work, we will need some buffers, unless our frequency is so low that we can actually pass through the whole thing in one step. But usually we won't, so that means we'll have some buffers in between; again, we'll have individual pipeline stages. But you can see these parallel comparisons: they can actually run completely in parallel, we don't
Starting point is 00:18:53 need to buffer in between them, because they don't need to be sequential. So that means, in this case, for this sorting network of eight numbers, we'll need six pipeline stages. And then we could use these bitonic merge networks again in order to merge them into larger subsets and finally completely sort this.
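Here is a small software model of that 8-element network, as an illustration rather than actual FPGA code: each (k, j) round in the loop nest below corresponds to one pipeline stage of independent compare-and-swap elements, and for 8 inputs there are exactly the six stages mentioned above.

#include <algorithm>
#include <array>
#include <cstdio>

constexpr int N = 8;

// Software model of an 8-element bitonic sorting network. Each (k, j) round
// is one pipeline stage on the FPGA: all compare-and-swap operations inside
// a round are independent and would run fully in parallel, with registers
// between rounds giving the six stages for eight inputs.
void bitonic_sort(std::array<int, N>& a) {
    for (int k = 2; k <= N; k <<= 1) {          // length of the bitonic runs
        for (int j = k >> 1; j > 0; j >>= 1) {  // compare distance: one stage
            for (int i = 0; i < N; ++i) {       // the parallel comparators
                int partner = i ^ j;
                if (partner > i) {
                    bool ascending = ((i & k) == 0);
                    if (ascending ? a[i] > a[partner] : a[i] < a[partner])
                        std::swap(a[i], a[partner]);
                }
            }
        }
    }
}

int main() {
    std::array<int, N> a = {7, 3, 6, 1, 8, 2, 5, 4};
    bitonic_sort(a);
    for (int v : a) std::printf("%d ", v);      // prints: 1 2 3 4 5 6 7 8
    std::printf("\n");
}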
Starting point is 00:19:47 A similar thing which has been used in the past, another idea of what you can do, is database aggregations and/or restrictions. A restriction would be a filter again, and here one thing you can additionally do is use partial reconfiguration. Modern FPGAs have the option that you can say: please reconfigure only this subset of the FPGA. So not flash the whole thing, but a smaller part of the configuration, a block of the FPGA. And that means, even while we're processing data, we can say, okay, now I need some different kind of restriction,
Starting point is 00:20:34 I need some different kind of aggregation, and I'll place different circuitry for this there. This is beneficial because we don't need to replicate all of the circuitry. If we want to support all kinds of aggregations and all kinds of restrictions on the FPGA, this would basically mean we need to have everything already on the FPGA once we start the program. So think about complex SQL queries: if we want to fully support them, and we don't want to flash everything anew, and we don't want to input all of the,
Starting point is 00:21:17 or have everything come along as input, then we need to have all of the circuitry in there, and then basically just reroute based on which configuration we need. Alternatively, we can reconfigure some subpart. That would, for example, also work for the filter. For a filter it doesn't really make much sense, because that's something I can fix with just a small bit of memory, but just as an idea: say I have the filter condition stored somewhere in my SRAM; then, rather than reconfiguring it through some kind of protocol, I could also just rewrite
Starting point is 00:22:02 this part, if it's in a separate block and the block granularity fits. Then I can partially reconfigure this on the FPGA. So you can see this here, and then you can integrate this into the host system. You can say, okay, I'm going to have some part reconfigured for this query; depending on the type of query that I want to execute, I'll change some subpart here. And then of course I can integrate this with the host system. And this is also what people usually do. So in order to use the FPGA, they're not going to execute
Starting point is 00:22:49 the full database management system on the FPGA but just simple or some basic operations. Also, I told you there was a startup in Berlin that did the same thing, right? So they basically had MySQL and certain operations would then be pushed to the FPGA and the FPGA would basically deal with this. And the main data would actually still be on the database, but some kind of restrictions, for example, some aggregations would be executed on the FPGA.
Starting point is 00:23:26 And there's lots of work in this area. You can check the book that I've linked on one of the first slides, or, for example, look at the paper that I've referenced here. And another idea of how you can use this would, for example, be a hash table. Here you can basically build a fully pipelined hash table. So the idea is that you can compare many elements in parallel.
Starting point is 00:24:04 So that's basically what we're doing. You build a concurrent mechanism to handle hash collisions. You basically have your key data coming in, you have a buffer, and you compare against all of the different hashes in parallel and output which ones are actually matching and which ones are not. So the main idea is that you do all of these comparisons completely in parallel. And the main data will still be stored in DRAM, but just the comparisons will basically be done within the FPGA.
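As a rough C++ illustration of that parallel-comparison idea (my own sketch, not the design from the paper): a bucket holds a fixed number of slots, and a probe compares the key against all slots at once. On the FPGA a synthesis tool would turn this small loop into parallel comparators; on a CPU it is just a short scan. The bucket layout and the SLOTS value are assumptions for illustration.

#include <array>
#include <cstdint>
#include <optional>

constexpr int SLOTS = 8;

// One hash bucket with a fixed number of slots, so all key comparisons of a
// probe can happen in parallel (unrolled into SLOTS comparators on the FPGA).
struct Bucket {
    std::array<uint64_t, SLOTS> keys{};
    std::array<uint64_t, SLOTS> values{};
    std::array<bool, SLOTS>     valid{};
};

// Probe: compare the key against every slot; at most one slot matches.
std::optional<uint64_t> probe(const Bucket& b, uint64_t key) {
    std::optional<uint64_t> result;
    for (int i = 0; i < SLOTS; ++i) {
        if (b.valid[i] && b.keys[i] == key)
            result = b.values[i];
    }
    return result;
}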
Starting point is 00:24:49 Again, for the details, I'll point you to the paper. This is actually by Zsolt István, who's one of our collaborators at TU Darmstadt right now. He was doing this while he was at ETH Zurich, but now he's at TU Darmstadt. Okay, so finally, a word on performance: the clock frequency is much lower than on CPUs or GPUs; it's in the hundreds of megahertz typically rather than gigahertz. So it's not much, much slower, but it is definitely slower.
Starting point is 00:25:39 But in order to still be faster, or be reasonably fast, while using this, you really need to use the parallelism. And at the same time, because they're slower and also not as large, and because of the way they're built and the lower frequency, they don't consume as much energy. That's actually one of the main drivers: you can be more efficient in terms of energy, and you can also be more efficient because of the
Starting point is 00:26:13 sheer parallelism that you have. So depending on the complexity of the program that you put on the FPGA, you can fully utilize the whole chip space in the execution. Of course, this only makes sense if you're dealing with some kind of specialized task and the circuitry is actually used all the time. In a CPU, you have all these instructions and all of the chip space in order to be able to process anything that you throw at it. On the FPGA, you really want to specialize for some subtasks, because then you can actually use the chip space.
Starting point is 00:26:55 And this means, of course you want to use it efficiently, and of course you could also do lots of stuff on the FPGA which doesn't make sense. But thinking about efficient compute: the more you can use the chip space through parallel units, etc., the more you will actually benefit from using the FPGA. However, the problem is that the compile flow is very slow. So remember, synthesizing and then mapping and routing your program, that takes a long time.
Starting point is 00:27:29 This can take up to a day or even more than a day. And that means the task that you're dealing with needs to fit this constraint. So you cannot say: I have a new ad hoc query in my database that maybe needs to be answered in a few milliseconds. That doesn't make sense, right? I cannot put this in there. I need to somehow be able to tolerate this reconfiguration time. Or I have to be able to pre-compile this,
Starting point is 00:28:00 because the mapping, the programming of the FPGA, that's not that slow. It's still not in milliseconds, but it's faster. You can also partially reconfigure. So we had this today. That will again be faster because you don't have to send as much data to the FPGA to reconfigure it. Still, it will take some time.
Starting point is 00:28:25 And through these reconfigurations, you can create hardware with some modifiable behavior at runtime. Okay. So with that, I'm wrapping up the FPGA. We walked through the architecture, how to program it, and I put, I think, most emphasis on how it internally works, so you get an idea of how to get some performance out of it, I hope.
Starting point is 00:28:59 And we talked about the design flow and today very briefly about how this can be used in data processing. And there I invite you to read some of the papers. If you attend Marcel's seminar, then you might actually come up with some additional ideas how to use it efficiently, if you want. With that, are there questions to this part? No questions? Then, very good. Then we're going to switch to the summary today.
Starting point is 00:29:39 And as I said, this is really just a summary. So I basically just copied in all of the summaries or overviews of the lectures. We started with an overall info session and the introduction to computer hardware and the lecture. And then I gave you an introduction to, or a recap of, database management systems. And then we started with performance analysis. And this is basically something that I
Starting point is 00:30:14 have in many of my lectures because I think it's really useful. And maybe it helped you to some degree with the tasks. Basically: get a basic understanding of performance, and think about what the hardware can do in terms of bandwidth, throughput, and latency. What are the numbers that the vendors tell you, how does this relate to what you're seeing, and what would you expect to see? Because then you can see if your program actually does well or not, or if you have a complete error in your thinking on how to use this. We talked a bit about measurements and
Starting point is 00:30:57 benchmarks and fair benchmarking. This is relevant if you ever want to do a research paper, or a master's thesis with some kind of research aspect; then it's good to know how to properly evaluate stuff and how to benchmark stuff. Then we basically started with the real stuff: CPU and caching. I gave you an overall overview of the CPU architecture, the buses, the memory hierarchy, etc. And then we talked about memory accesses. So how do they work?
Starting point is 00:31:34 So remember, there are these multiple levels of caches and virtual addressing. And we want to make sure that we keep data in the caches as much as possible; not only data, but also instructions. Of course, it's not always possible, but we want to be cache efficient. And based on this, we can change our data layout, our data alignment, but also instruction alignment, and make sure stuff works well.
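As one tiny, generic example of what changing the data layout and alignment can mean (my illustration, not code from the lecture): keep the attribute an operator scans contiguous so every fetched cache line is fully used, and align hot per-thread counters to cache-line boundaries so two threads never invalidate each other's line.

#include <cstddef>
#include <cstdint>
#include <vector>

// Columnar layout: a scan over one attribute touches contiguous memory,
// so every fetched cache line is fully used.
struct OrdersColumnar {
    std::vector<int32_t> quantity;      // scanned by the filter below
    std::vector<int64_t> customer_id;   // untouched by this query
};

// Pad per-thread counters to a 64-byte cache line (the usual x86 line size)
// so concurrent updates do not cause false sharing.
struct alignas(64) PaddedCounter {
    uint64_t value = 0;
};

std::size_t count_small_orders(const OrdersColumnar& o, int32_t limit) {
    std::size_t hits = 0;
    for (int32_t q : o.quantity)        // sequential, prefetch-friendly access
        if (q < limit) ++hits;
    return hits;
}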
Starting point is 00:32:10 With this kind of knowledge and this in mind, we also talked about the instruction execution, so how your program is actually broken down into micro operations, which are then executed on the CPU. And you remember, hopefully, that today, even a single core is kind of a parallel system already, because you have multiple functional units that will be executing in parallel.
Starting point is 00:32:37 Not all of the functional units will be busy all the time, and this is, again, if you think about it, the difference to how we want to use an FPGA. Because on the CPU, we have a lot of functional units that will only be used a certain amount of the time. So if your program is not vectorized, you will never use the SIMD units. I mean, the compiler will try to do it for you.
Starting point is 00:33:04 But let's say there are certain amounts of chip space that you're not using. On the FPGA, you will try to use all of it as much as possible. We talked about hazards, so when the pipelining does not work, basically because of the way we access data in the caches and how we execute branches. We looked at a couple of different pipeline architectures. Not a couple, we looked at two: basically x86 and the ARM M1 architecture, in a bit more detailed overview.
Starting point is 00:33:47 Then we put more emphasis on SIMD: how SIMD programming and these vector units on the CPU work. You actually also had to implement a task on that. So how they work, but also how we program them and how we can use this in the database. Then we shifted more towards... and of course, if you have questions so far, let me know. If there's something that you remember
Starting point is 00:34:24 that was kind of interesting, something where you say, OK, this I would like to look into again, let me know. I mean, the slides are all there. There's no exam, so I know you're kind of relaxed about these topics right now, which is fine. That's also the idea; it's really more
Starting point is 00:34:47 for the fun of it, right? I mean, the programming I think is super useful; you will be very happy later on that you can do this kind of stuff, because you get nice jobs, etc. But it's also good to have kind of an overview of how this works, because then you can use your computer much more efficiently than if you don't. So in the execution models, we talked about how database queries are executed and how we can do this more efficiently with respect to the hardware. The classical iterator model is really OK-ish, really nice in terms of an abstraction, but it's not really fast if you have a modern CPU and modern, large memory. Then the materialization model, where you basically just
Starting point is 00:35:43 process operator by operator, and then the more hardware-geared execution models of vectorization and code generation. Vectorization meaning using a batch of data items at a time, in order to be efficient in terms of memory and caches, but also to use vectorization. And then code generation, in order not to interpret the code all the time, not to have all these function calls, but to really generate small kernels that execute very fast
Starting point is 00:36:22 on the hardware. And this is an ongoing discussion, basically, and ongoing research: what is better for which cases. I mean, code generation always takes some time; vectorization has more of these function calls, etc. So there are always new kinds of architectures popping up, new ideas for how people try to do this.
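A very small C++ sketch of that difference, with made-up names and a made-up batch size: the iterator model pays one virtual next() call per tuple, while the vectorized model amortizes that call over a whole batch and leaves a tight, cache-friendly inner loop.

#include <cstdint>
#include <vector>

// Tuple-at-a-time (classical iterator / Volcano model):
// one virtual call per produced tuple.
struct Operator {
    virtual ~Operator() = default;
    virtual bool next(int32_t& out) = 0;           // one value per call
};

// Vector-at-a-time: one virtual call per batch of values.
constexpr std::size_t BATCH = 1024;

struct VectorOperator {
    virtual ~VectorOperator() = default;
    virtual std::size_t next_batch(std::vector<int32_t>& out) = 0;  // fills up to BATCH values
};

// A filter kernel in the vectorized style: the per-call overhead is paid once
// per batch instead of once per tuple, and the loop body stays tight.
std::size_t filter_chunk(const std::vector<int32_t>& in,
                         std::vector<int32_t>& out, int32_t limit) {
    out.clear();
    for (int32_t v : in)
        if (v < limit) out.push_back(v);
    return out.size();
}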
Starting point is 00:36:50 Then we had data structures. We talked about hashing, trees, and tries. You all implemented the ART, which I still think is quite a nice data structure, not necessarily taught in data structure courses, nor are tries in general. So it's good knowledge to have. You'll have an additional opportunity there. You already heard yesterday about the skip list. This is also an interesting data structure for certain cases, and it is increasingly used in database research.
Starting point is 00:37:33 So I think it's also good to know. So I mean, know your data structures. We always try to give you kind of an overview of the most relevant ones, but then there are many for special cases. And so it's good to also have a deep understanding of how they work. And I think implementing is always the best way to get a good understanding there. Then we had this new profiling lecture that Lawrence showed you,
Starting point is 00:38:01 on how to do profiling, to get a bit more detail out of your code and how it runs. So I hope that helped, with the hands-on session and the tools. If you were not fast enough, like me, all the details are still in the video; you can check how to do this and try it out. And again, I really recommend trying this stuff out. I'm assuming all of you have done this for the tasks;
Starting point is 00:38:34 if you haven't, try it out. It's always a pain initially, like anything that you have to learn. So initially it kind of takes a very long time, there's a certain learning curve, but then once you have this knowledge, once you have the skill, things will go much smoother. You can do much more than before. Then we talked about parallelism,
Starting point is 00:39:00 first multi-core parallelism and how this applies to database queries in terms of inter- and intra-query parallelism: inter-query meaning doing multiple queries in parallel, intra-query meaning splitting up a single query into multiple parallel parts. And then we talked about joins as one instance of intra-query parallelism on an operator level. So how can we parallelize a join? We had three different examples. If we're talking about parallelism, we also have to think about concurrency and synchronization, because we have
Starting point is 00:39:46 data structures that will be used from multiple threads in parallel. So we talked about the cache coherency protocol. We, or rather Marcel, will talk about this again next week when he's talking about CXL, because cache coherency is one of the core features of CXL. So that's going to be interesting. And then we talked about synchronization, so different kinds of latches, which are basically data structure locks
Starting point is 00:40:19 and other kinds of locks, or different strategies of locking in data structures. Coming from multiple cores, we then talked about NUMA, non-uniform memory access. As soon as we have many cores or multiple sockets, today even just a single socket with many cores, we'll have different memory regions which have different access speeds. And this is something that we have to think about
Starting point is 00:40:57 if we're programming, at least if we wanna be efficient, because local access is much faster than remote access, even in a single server. And just being aware of that will give us a good performance improvement. So that's why we basically looked at what this means in terms of architecture and then in terms of programming
Starting point is 00:41:22 for different kinds of operators: how to structure the data, align the data and the program in order to be efficient. Then Lawrence showed you persistent memory. This is basically a new technology, unfortunately already dead again, at least for some time, and we talked about how we can use it. It's a different type of RAM.
Starting point is 00:41:48 We were basically debating whether we should still keep this in the lecture or not. But on the one hand, it's still used in database systems; Oracle, for example, uses persistent memory in their system. So it's not worthless information, because people are using this. There's also still a lot of research, and we still have the servers.
Starting point is 00:42:14 And, for example, with CXL there will be new persistent memory alternatives. There already are: there are still battery-backed modules, meaning you have DRAM with a small battery which keeps the memory stable and basically persistent. And, as I said, with CXL there will also be new persistent memory ideas or persistent memory configurations. Not necessarily exactly like Optane, but with similar features, if not the exact same characteristics.
Starting point is 00:42:58 This means, of course, it's not going to be exactly the same, but in general we can do the same things with that as we did with persistent memory. Then we talked about storage, a general overview, and then a lot about NVMe, so how to use SSDs efficiently. With this, we also talked about PCI Express: how does this actually connect, and how can we then use the SSDs efficiently. Then, networking.
Starting point is 00:43:33 So, first general parallelization across multiple nodes. As soon as we're going out of a node, unless we have CXL, we'll have to deal with networking. This means we'll either have Ethernet or InfiniBand, and then we can have different modes of how to connect, again either using IP or RDMA. And I talked a lot about RDMA to kind of give you this alternative
Starting point is 00:44:07 on how we're accessing another server. I mean, there's lots and lots of stuff about networking that you can do. So how you use the networking, so basic socket communication, but also message passing, et cetera. Then there's different levels again. And you know, we have a networking professor here
Starting point is 00:44:30 who will be able to teach you much more about this. But for us, it's important to know that there are also very low-level ways of using networking, namely remote direct memory access, which will give us very low latencies and very high bandwidth if we're doing it right. And this, again, we can use in the database; the canonical example here again was a join. Then we had two lectures on GPUs, where Ilin discussed the architecture and memory hierarchy. So the basics: how does a GPU work internally, what is the difference to a regular
Starting point is 00:45:17 CPU. And then in the second part, how to program it, how the execution works. And then in the second lecture, also how to use it in a database and what happens if we have multiple GPUs, which we will have on modern servers that feature GPUs typically. So how are they connected to the main memory, to the CPU, and how are they connected with each other? And again, how can we use this efficiently,
Starting point is 00:45:49 say, for example, for sorting. Or, and we didn't cover this, another idea would be joins again, where we have a lot of data movement. And we saw that they are actually, not efficient, but fast, yet usually limited by the interconnect. So the GPU is so fast in data processing
Starting point is 00:46:11 that if we're just doing a simple operation like a join or a sort, then the interconnect is the main bottleneck for us again. Then we finished up FPGA today. So here, what is important for me is that you understand the difference to a regular CPU or a GPU in terms of general execution. And I hope you understand that this is not instruction execution,
Starting point is 00:46:42 but it's really mapping circuitry. I mean, again, of course all the circuitry is already there, but we're using memory cells to create small changes in the hardware, to map regular circuitry onto this reconfigurable circuitry. So we talked a lot about the architecture, a bit about the programming and the design flow, and today a bit about data processing on there. Next time you will have an intro to CXL. This is now finalized, so I just took this from Marcel. You will get an overview of Compute Express Link
Starting point is 00:47:34 and the different modes; basically, the different protocols in there. I'm always getting confused, so that's why I'm putting an emphasis here: there are different versions of the protocol which support different modes, and then there are different devices which again support different parts of the protocols. So there's kind of a matrix problem, and Marcel will tell you all about it. Unfortunately I won't be here; I would actually listen to it as well if I were. And the main idea here is that it features cache coherency across different kinds of devices, which is the big thing, right? So far you only have cache
Starting point is 00:48:21 coherency across multiple sockets, but not across multiple nodes, and usually not across an accelerator and the CPU. So say, for example, between your GPU and the CPU there's no cache coherence, so you have to deal with this yourself. And through CXL, you get this. Plus, again, low-latency and high-bandwidth transfers, which are basically tied to PCI Express. So this is a protocol, or a specification, on top of PCI Express. Okay, and with that I'm actually through. So the question for me right now is,
Starting point is 00:49:13 what did you like and what could we do better? And I know it's hard to start, so we're just going to do it row by row. So which was your favorite topic, most interesting stuff? I think the topic I liked the most was the SIMD processing topic because I've heard of it before but I never knew what it really was about. Especially that we had an exercise on that. Okay. I really liked that.
Starting point is 00:50:03 And because I took a lot of lectures by you already, a few things in the beginning were a bit repetitive. Repetitive, okay. But it was fine. For whom was this helpful, the database recap? OK, good. That's good. Then, yeah, what did you like? Personally, I enjoyed the lectures the most where we combined theoretical ideas or concepts with practical stuff. So for example, the profiling session,
Starting point is 00:50:43 I really enjoyed a lot. And basically whenever the terminal was pulled up and some code was shown. Okay, you enjoyed that? Okay, I actually wanted to do more. This is just always a lot of work in preparation, and it takes a lot of time during the lecture. But that's good to know. I think I liked the profiling session the most because it showed how to squeeze
Starting point is 00:51:11 the last bit of efficiency out of your code. This was really cool, but I think I like most of the topics equally, so it's hard to mention a single topic. OK. That's good. OK, but the profiling was helpful. This was actually new this semester,
Starting point is 00:51:27 so we basically sneaked this in. Okay, that's good to know. Okay. Well, I liked the SIMD lecture. It taught me a new way to approach problems and to bring them into the classroom.
Starting point is 00:51:44 And I also liked the data structures, because they are probably the ones that I will use most often. Yeah, okay, fair. Yeah, I can also say that I liked the C++. That was very new to me. It's kind of cool.
Starting point is 00:52:08 Also, everything about concurrency, especially the GPU stuff, it's just very close to my heart. Okay, good. Good to know. Okay, thank you. I think I liked overall getting into different topics, getting an overview of everything, but also that you can do something. And especially the memory access and NUMA
Starting point is 00:52:31 . OK, cool. So that's great to hear. So there's something for everybody and not everybody saying, well, just do a SIMD lecture and nothing else, right? Yes. OK, cool. Well, thanks a lot. So this is it from me.
Starting point is 00:52:51 If you still have other feedback, of course, there's EVAP, so feel free to put everything in there that you have. We're always happy about the feedback, and we're really trying to incorporate this and improve the lecture. I actually took a lot more time than I thought. So if you remember, initially, I had a lot of Q&A sessions in there. I think we almost had none of those, which is fine, right?
Starting point is 00:53:15 So I'm quite happy to basically present this. And if I know we could be a bit more practical here and there, then I'll try to do this as well. Yeah, so from my end: if you have other stuff, of course feel free to reach out. If this stuff is interesting to you and you don't want to attend the seminar, but still want to do something in that area, let me know; reach out to Marcel and me. There are always projects that we can do in this space. I mean, of course we're always trying to come up with interesting stuff for you, but if you say, I have this great idea that would be something I would like to work on, or this
Starting point is 00:54:02 would be a topic that I would be extremely interested in. I can, of course, not promise we'll do this next semester, but we can try to do stuff in the future and see. And the more I know what is good or helpful for you, the more we try to improve this and integrate it into the lectures or into the curriculum in general. Okay, so any other questions or feedback ideas? Well, next time you will see CXL,
Starting point is 00:54:40 Marcel will do this with you. And then, if nothing goes wrong, I mean, right now there's a bit of a hiccup, but on Wednesday next week we should have the data center tour. I'm pretty sure this will work out. If it doesn't work out, then you will get information. But I'm very sure this will work out.
Starting point is 00:55:07 And that means, rather than coming here for the lecture, you will just directly meet behind the building. There's the entrance to the data center, this big door on the back of the building. And Marcel will basically do this with you. So we'll... actually, I'm personally not there, but Lawrence will be there. OK,
Starting point is 00:55:28 good to know. So not myself; look out for Lawrence. Lawrence will do it with you and will basically try to show you as much as possible of the data center, for sure the servers, but maybe also some of the cooling. You will see, it's quite interesting, because the data center itself, the room, is large, but not that large. But then there's a lot of infrastructure around it that's required to basically keep this working, which I find interesting to see, especially the first time. Do we have a Q&A or something for the data center?
Starting point is 00:56:12 Or... we probably don't need it. There's enough of you. We have to ask; the last two sessions, there were around 10 people. Yeah, but the thing is: be there on time. We'll probably start five minutes past or something. But then once we're inside, we'll probably not hear you if you're outside looking for us.
Starting point is 00:56:33 Then you missed your chance, basically. I'm trying to do this regularly, but not much more than once a semester. But it's interesting. With that, thank you very much. Enjoy the rest of the semester, and thanks a lot for attending the lectures.
