Hardware-Conscious Data Processing (ST 2023) - tele-TASK - Summary
Episode Date: July 20, 2023...
Transcript
Welcome everybody to our next session. Today we're going to do the summary
after I finish up FPGA, but I'll start as usual with some announcements.
So in our seminar with TU Darmstadt, today we're going to have Çağatay Demiralp.
He used to be a CTO of Sigma Computing and he'll talk about integrating semantics into data systems.
Now he's at MIT.
So that could be interesting if you're into
data systems, not super related to hardware though.
But still I invite you to join. The Zoom link should also be in Moodle.
Then the other thing,
this is gonna be my last lecture here, right?
So next week we'll have CXL
and then the data center tour.
So this is why I kind of need to wrap up today.
So in the winter semester,
we will have a hardware seminar.
It's not in the lecture plan yet,
but we'll announce it soon.
And this will be exclusively for people
who attended this lecture or who passed,
let's say who passed this lecture.
So the idea is that we want to give you something
that further engages with all the
hardware things. So if you liked that, then there will be more in the winter
term, and Marcel will mostly be managing it. But, I mean, we already had this
discussion with one or two of you: if you say this is something that I'm
really interested in, then we'll make sure that this is also in the seminar.
So you say, I really like GPU, for example.
We'll make sure there is some GPU in there in the seminar.
And then we'll have big data systems in the winter term.
And we're right now setting up something called Big Data Lab
that will be exclusively for people
who have attended or are attending big data systems.
And here the idea is that we give you some hands-on experience
on big data systems.
So basically you do more of what we're telling you in big data systems. This is
how it works, this is how you set up the system, or what the
architecture of the system is. In the Big Data Lab you will learn how to set it up
and how to use it. So that's kind of an extension to that. But this is in the making. So I noticed I basically... whoops...
did this. So, I mean, we're almost all the way at the end; we're in FPGA. There
will be CXL next week by Marcel. But let me show you at least my last overview slide.
And I didn't fix this here, but we're at the 19th right now, right?
We're going to finish up FPGA. This is only a few slides, five slides or so. And then because of that and because I'm not here next week,
I'm moving the summary here.
So I have a summary slide deck
where I basically just pasted in all the overviews
and I'll just quickly run through it.
And then the idea is if you say,
oh, this is something that I have some questions on, or that I didn't fully understand, or where I would say, please have more of this or less of this next semester or the next time you do this, then I'm more than happy to hear it.
So I can basically adjust.
So that's the idea. And also just to give you an overview again
of all the nice things that you learned during this lecture.
Exactly.
So this is what's going to happen after the FPGA part,
but now we're going to go to the FPGA summary first.
And so last time we discussed all the details on how an FPGA works, right?
And I showed you an FPGA. I didn't bring it today.
But I showed you some details.
And today I just want to give you like a bit of a glimpse on how we can use this in data processing.
What are the ways you could use this in a database, for example.
As I said, this is going to be brief, but maybe you can take something from it
or you can read up on more details in the referenced literature.
So, coming back, right?
So, as you remember, right, we're not really writing instructions or something, or a program for an FPGA,
but we're setting the layout of the FPGA.
So, if we're programming the FPGA,
we're basically giving a schematic for the circuitry.
I mean, this is not what we're doing today.
We're not really saying, okay, please plug this here and this here,
but the compiler will do this for us.
So in the synthesis, whatever we're writing as low-level code
will be translated to something like this.
And this means, well, basically we have an input to the FPGA
and then we have all the circuitry.
And depending on how we basically lay out the circuitry in there,
we can have a very high degree of parallelism.
And this is good for data processing, right?
So we can basically process a lot of data in parallel in there.
And this is also one of the major ways how we can actually get performance out of an
FPGA.
On the other hand, I mean, the other way is basically we can do multiple things or we
can have pipelines of circuitry that is dealt with in a single clock cycle.
So depending on how complex our task is and depending how fast the signal travels through our circuitry,
we can do arbitrarily many things within a single clock cycle.
And then we can again repeat the same architecture, the same circuitry multiple times on the FPGA,
depending on how much space we have on the FPGA.
So the FPGA that I showed you last time had 8,000 lookup tables,
so we cannot do much on this one.
But say larger ones will have in the hundreds of thousands of lookup tables.
So there you actually have some space, right? So you can actually lay out something there
and can have multiple, like whatever we would do on the small FPGA, we can place many times on the
larger FPGA. And that gives us this kind of data parallelism
or an opportunity for this data parallelism.
So we can apply the same tasks or the same task
to all items in a dataset.
And of course, again, we have to kind of balance
how fast can we actually get the data into the system,
into the FPGA.
So whatever we can get in there,
then we can kind of make sure
that we can process this in parallel
if we have enough space, if the task is not too complex.
But in general, like the chip area,
like different parts of the chip
can operate completely independently.
And for this, we can use this to just replicate
stuff. So, say we do a filter, we can have the same filter many times and we can have
multiple items coming in into the system and being filtered completely in parallel. This
is of course also true for any kind of map function, right? So if you remember your MapReduce, then anything that I would do in a map function, that's embarrassingly
parallel, so it's very easy to parallelize. It's similar to what we do in SIMD with any kind of vector operations. So this basically gives us a way of using the FPGA efficiently.
Let me maybe get a pointer. So it's basically, say we have one filter here, then we can have many,
or one map function, something like that.
We can have many of those.
We'll have something that needs to distribute the data
across these multiple replicas,
or we just have basically something like vectors
coming in into the FPGA,
and we're operating on these vectors.
And of course, we can also build circuitry
for any kind of reduction function.
That would basically
be something like this collection here.
So with this, we can get speed and performance out
of the FPGA.
And that's basically one way of having this data parallelism.
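To make this concrete, here is a small software sketch (an illustration I'm adding, not FPGA code and not from the slides; the constants are made up): think of each loop iteration as one replicated filter unit, and of the array as the batch of values arriving at the FPGA in one cycle.

    #include <array>
    #include <cstdint>
    #include <iostream>

    // Hypothetical illustration: 8 replicated "filter units", each deciding
    // independently whether its input value passes the predicate (< 1000).
    // On an FPGA all 8 comparisons would happen in the same clock cycle.
    constexpr int kUnits = 8;

    std::array<bool, kUnits> filter_batch(const std::array<uint32_t, kUnits>& in) {
        std::array<bool, kUnits> out{};
        for (int i = 0; i < kUnits; ++i) {   // conceptually parallel, not sequential
            out[i] = in[i] < 1000;           // one hard-wired comparator per unit
        }
        return out;
    }

    int main() {
        std::array<uint32_t, kUnits> batch = {5, 2000, 999, 1000, 42, 1234, 7, 0};
        auto mask = filter_batch(batch);
        for (int i = 0; i < kUnits; ++i)
            std::cout << batch[i] << (mask[i] ? " passes\n" : " filtered\n");
    }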
But then we, of course, can also have pipeline parallelism.
That means, and I already alluded to this a bit earlier,
that if we have a certain task, we can break it down into subtasks and then have different components
on the FPGA for these subtasks, which are then
connected. And this communication or this connection is very efficient
because, as you remember from last time, there's basically this matrix,
a connection grid on the FPGA,
with switches in between,
which will be hardwired during programming.
And then we can basically have these pipelines generated
already on the FPGA.
And either we basically do this on a clock cycle,
meaning that we would have something like a register
in between these individual pipelines or pipeline steps,
or if the tasks, the subtasks are small enough still,
like the signal processing is fast enough,
this can basically be done within a single clock cycle.
But then we don't get pipeline parallelism,
so forget about that.
So as soon as we're basically using a clock signal,
we'll have these individual pipeline steps
and then we'll use some kind of registers in between
to take the output of the one computation
and move it to the next computation.
So remember, if we built like in a very simple way,
this would be like the flip-flop registers
that we have in there to keep just the output
of one lookup table result, for example. If
we want larger data, then we would have to allocate more registers for this,
or we could even use some of the BRAM, the block
RAM, in between the registers to write something in there if we have larger outputs.
And by this we then have these multiple stages. And you remember pipeline parallelism: this basically means we can get a higher throughput. So we don't get better
latency, but we can have these multiple stages in flight. We don't have to wait until everything finishes.
So here, the problem, and that's actually a problem on the FPGA.
If we have a long pipeline and we're doing nothing about it,
this means the signal has to go through.
Let's look at this.
So if it's a long pipeline, we already broke it down into subcomponents.
But the signal has to go through all of this,
all the way until the end. And that means this will take some time.
And if we're unlucky, and as soon as this is complex enough, this will take so much time that we cannot do it
in the highest frequency that is available on the FPGA.
So that means the synthesis will basically see this,
right? It will see, well, the signal will not
propagate fast enough through this pipeline,
so I need to tune down the clock frequency.
So basically I'm going down from, I don't know,
13 megahertz to seven megahertz,
or from 300 megahertz to 200 MHz, like on a modern, larger FPGA.
But using these pipeline registers, I can break down everything again into these smaller bits,
where the signal will be fast enough to progress through, so then I don't have this problem anymore.
The difference is, here in this case,
it will actually take one cycle to process through this, right?
So this will really only take one cycle,
but the cycle will be slower.
In this case, I will need three cycles to go through this,
but the cycles might actually be faster still.
And depending on what else I have in my program, right?
So I might have other multiple different kinds of things going on.
I should still be faster or can still be faster in this,
can be more efficient than if I always have to wait for the longest pipeline to
progress, even if other stuff could actually progress in
parallel.
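As a rough software analogy of such a pipeline (an illustrative sketch, not FPGA code; the three subtasks and the valid bits are made up), each loop iteration plays the role of one clock cycle, the two variables play the role of pipeline registers, and once the pipeline is full you get one result per cycle even though each value has three cycles of latency.

    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    int main() {
        std::vector<uint32_t> input = {1, 2, 3, 4, 5, 6};
        std::vector<uint32_t> output;

        uint32_t reg1 = 0, reg2 = 0;      // "pipeline registers" between stages
        bool v1 = false, v2 = false;      // valid bits travelling with the data

        for (std::size_t cycle = 0; cycle < input.size() + 2; ++cycle) {
            // Stage 3: final subtask, consumes reg2 (the value written last cycle)
            if (v2) output.push_back(reg2 - 1);
            // Stage 2: middle subtask, consumes reg1, writes reg2
            reg2 = reg1 * 2; v2 = v1;
            // Stage 1: first subtask, consumes the next input, writes reg1
            if (cycle < input.size()) { reg1 = input[cycle] + 10; v1 = true; }
            else v1 = false;
        }
        for (uint32_t x : output) std::cout << x << " ";  // (1+10)*2-1 = 21, then 23, 25, ...
        std::cout << "\n";
    }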
Okay, and the final thing we also know is task parallelism.
This is similar to the data parallelism,
with the difference that we're doing different stuff in parallel:
rather than doing the same stuff on different data,
we're actually doing different things in parallel.
And that can be kind of anything.
I mean, one thing that if we have complex programs on the FPGA, what we'll have to deal
with at some point is some kind of protocols, right?
Meaning I want to reuse some of the circuitry for different things, say for example my filter.
If I have a simple filter, I will hard-code the filter condition onto the FPGA.
That means I can only filter things that are, say, less than a thousand,
and that will be hard-coded into the FPGA. I cannot change anything.
If I want to change this, I need some kind of protocol to make the...
Either I say these input bits will always be my filter
and I'm always going to send the input bits
or always send the filter condition with every individual input.
Then I don't need a protocol.
I don't need to do anything, but I'm wasting a lot of the input bandwidth
just for these filter conditions all the time.
Alternatively, I basically have to somewhere store the filter condition, and if I want
to update, I need to make the circuitry aware of this, that this is an update to the filter
condition, so I need to basically reroute stuff. And doing these kind of operations, I can have separate parallel tasks, for example.
But this is basically something that I need to be aware of, and where I have subroutines,
subcircuitry in parallel for different things.
So it might be just management stuff, it might actually be completely parallel,
but different operations.
So say I do a join here, I do a selection there on a different table,
I can do this in parallel on different parts of the chip.
Or having different regular expressions in parallel, for example.
So this is also something.
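Coming back to the filter-condition example from a moment ago, here is a small sketch of the two options (purely illustrative, the struct names are made up): either every input word carries the condition, which costs input bandwidth, or the condition lives in a small piece of state and is only changed through an explicit update message, which is the protocol overhead mentioned above.

    #include <cstdint>
    #include <iostream>

    // Option (a): every input carries the filter bound -> wide input, wasted bandwidth.
    struct WideInput { uint32_t value; uint32_t bound; };
    bool filter_wide(const WideInput& in) { return in.value < in.bound; }

    // Option (b): the bound is stored once (think: a small register/SRAM on the chip)
    // and only changed through an explicit "update" message.
    struct Command { bool is_update; uint32_t payload; };
    class StatefulFilter {
        uint32_t bound_ = 1000;          // current filter condition
    public:
        bool process(const Command& c) {
            if (c.is_update) { bound_ = c.payload; return false; } // reconfigure
            return c.payload < bound_;                             // normal data
        }
    };

    int main() {
        std::cout << filter_wide({42, 1000}) << "\n";   // option (a): 1
        StatefulFilter f;                               // option (b)
        std::cout << f.process({false, 42}) << "\n";    // data: 42 < 1000 -> 1
        f.process({true, 10});                          // protocol: set bound to 10
        std::cout << f.process({false, 42}) << "\n";    // data: 42 < 10 -> 0
    }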
So one thing that we've seen before already in SIMD,
and it's basically the same idea if we do it on the FPGA, is sorting.
So the idea is that in order to do sorting efficiently on an FPGA, we will also use sort networks.
Of course, we cannot sort arbitrarily large datasets using a sort network, because the
sort network grows quickly with the number of elements we need to sort (a bitonic network of n inputs needs on the order of n log² n comparators). That's why we will keep it small.
But we can sort subparts and then we can have merging networks again.
I mean, we remember this from the SIMD lecture, I hope.
The idea is that using these compare and swap elements,
we built this hardware sorting network.
And you basically saw this already, right?
So this is what such a sorting network looks like.
And it basically just is a series of comparisons,
which in the end results into a sorted sub-dataset.
And again, in order to use this efficiently,
or in order for this to work,
we will need some buffers.
Unless our frequency is so low that we can actually
pass through the whole thing in one step.
But usually we won't, so
that means we'll have some buffers in between.
Again, we'll have individual buffers in between, like individual pipeline stages.
But you can see these parallel comparisons: they can actually run completely in parallel. We don't
need to buffer in between those, because they don't need to be sequential, right? So that
means, in this case, for example, for this sorting network of eight numbers, we'll need six pipeline stages.
And then we could use these
bitonic merge networks again
in order to merge them into larger subsets
and finally completely sort this.
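As a software illustration of such a network (this is Batcher's odd-even merge construction for eight inputs, written as plain C++ rather than circuitry, and it may differ in detail from the network on the slides): each stage corresponds to one pipeline step, and all compare-and-swap elements within a stage are independent and could run in parallel.

    #include <array>
    #include <iostream>
    #include <utility>
    #include <vector>

    // Each inner vector is one pipeline stage of compare-and-swap elements.
    // The comparisons inside a stage touch disjoint positions, so on the chip
    // they would all fire in the same clock cycle. Six stages for eight inputs.
    const std::vector<std::vector<std::pair<int, int>>> kStages = {
        {{0,1},{2,3},{4,5},{6,7}},   // stage 1: sort adjacent pairs
        {{0,2},{1,3},{4,6},{5,7}},   // stage 2
        {{1,2},{5,6}},               // stage 3: both halves now sorted
        {{0,4},{1,5},{2,6},{3,7}},   // stage 4: start merging the halves
        {{2,4},{3,5}},               // stage 5
        {{1,2},{3,4},{5,6}},         // stage 6
    };

    int main() {
        std::array<int, 8> v = {42, 7, 99, 1, 56, 3, 18, 27};
        for (const auto& stage : kStages)            // one stage per "clock cycle"
            for (auto [i, j] : stage)                // parallel compare-and-swap
                if (v[i] > v[j]) std::swap(v[i], v[j]);
        for (int x : v) std::cout << x << " ";       // 1 3 7 18 27 42 56 99
        std::cout << "\n";
    }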
A similar thing, which has been used in the past,
or one idea of what you can do, is
database aggregations and/or restrictions. So restrictions would be a
filter again. And here, one idea of what you can additionally do is you can use
partial reconfiguration. So modern FPGAs have this option that you can say, please reconfigure only
this subset of the FPGA.
So not flash the whole thing, but a partial, smaller part of the configuration,
like a block of the FPGA. And that means we can actually,
even while we're processing data,
we can say, okay, now I need some different kind of restriction,
I need some different kind of aggregation,
and I'll place different circuitry for this there.
This is beneficial because we don't need to replicate all of the
circuitry. So if we want to do all kinds of aggregations, for example, on the FPGA, or
all kinds of restrictions on the FPGA, then this would basically mean we need to have everything
already on the FPGA once we start the program.
So think about complex SQL queries: if we want to fully support
everything, and we don't want to flash everything anew, and we don't want to
have everything coming in as an input, then we need to have all of the circuitry in there. And then basically just reroute based on which kind of configuration we need.
Alternatively, we can reconfigure some subpart.
So that would, for example, also work for the filter, right?
So rather than saying I want new circuitry for a new filter condition,
which doesn't really make sense, because that's something I can just fix with a small bit of memory.
But just as an idea, for example, so I would have the filter
somewhere stored in my SRAM, then rather than reconfiguring
this through some kind of protocol, I could also just rewrite
this part if it's in a separate block and the block granularity fits this.
Then I can basically have this or partially reconfigure this on the FPGA.
So you can see this here and then you can integrate this into the host system.
So you can say, okay, I'm going to have some part reconfigured for this query,
some other part, like depending on the type of query that I want to execute,
I'll change some subpart here.
And then, of course, I can integrate this with the host system.
And this is also what people usually do. So in order to use the FPGA, they're not going to execute
the full database management system on the FPGA
but just simple or some basic operations.
Also, I told you there was a startup in Berlin
that did the same thing, right?
So they basically had MySQL and certain operations would then be pushed to the FPGA and the FPGA
would basically deal with this.
And the main data would actually still be on the database, but some kind of restrictions,
for example, some aggregations would be executed on the FPGA.
And there's lots of work in this area.
You can basically check the book that I've linked in on one of the first slides, or say,
for example, look at the paper that I've referenced here.
And another idea for how you can use this would, for example, be a hash table.
So, and here you can basically build
a fully pipelined hash table.
So the idea is that you can compare
many elements in parallel.
So you build a concurrent mechanism to handle hash collisions. So you basically have your key data coming in,
and you have a buffer,
and you compare to all of the different hashes in parallel
and output which ones are actually
matching which ones are not. So the main idea is that you do all of these comparisons completely
in parallel. And yeah, so I mean the main data still will be stored in DRAM, but just the comparisons will basically be done within the FPGA.
Again, for the details, I'll point you to the paper. This is actually by
Zsolt István, who's one of our collaborators at TU Darmstadt right now. He used to do this,
he was doing this while he was at ETH Zurich, but now he's at TU Darmstadt.
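The core idea can be sketched in a few lines of software (a simplified illustration, not the design from the referenced paper): a probe key is compared against all slots of a bucket at once, which on the FPGA would be parallel comparators working in the same cycle.

    #include <array>
    #include <cstdint>
    #include <iostream>
    #include <optional>

    // One hash bucket holds several keys; a probe checks every slot. On the
    // FPGA these would be parallel comparators, in C++ the compiler can at
    // best vectorize the loop.
    constexpr int kSlots = 8;

    struct Bucket {
        std::array<uint64_t, kSlots> keys{};
        std::array<uint64_t, kSlots> values{};
        std::array<bool, kSlots> occupied{};
    };

    std::optional<uint64_t> probe(const Bucket& b, uint64_t key) {
        for (int i = 0; i < kSlots; ++i)            // conceptually parallel compares
            if (b.occupied[i] && b.keys[i] == key)
                return b.values[i];
        return std::nullopt;
    }

    int main() {
        Bucket b;
        b.keys[3] = 42; b.values[3] = 4200; b.occupied[3] = true;
        if (auto v = probe(b, 42)) std::cout << "hit: " << *v << "\n";
        if (!probe(b, 7)) std::cout << "miss\n";
    }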
Okay, so finally: the FPGA is clocked much lower than CPUs or GPUs,
so it's in the hundreds of megahertz typically
rather than gigahertz.
So it's, I mean, it's not much, much slower,
but it is definitely slower.
But in order to still be faster, or reasonably fast,
while using this,
you really need to use the parallelism.
And at the same time, because they're kind of slower and also not as large,
they don't consume as much power, also because of the way they're built
and because of the lower frequency.
So that's actually one of the main drivers:
you can be more efficient in terms of energy, and you can also be more efficient because of the
sheer parallelism that you have. So depending on the complexity of the program that you put on
the FPGA, you can fully utilize the whole chip space in the execution.
Of course, this only makes sense if you're dealing with some kind of specified task
and the circuitry is actually used all the time.
So, in the CPU, you have all these instructions and all of the chip space,
in order to be able to process anything that you throw at it.
And the FPGA, you really want to specialize for some subtasks, because then you can actually
use the chip space.
And this means, I mean, of course you want to use it efficiently,
and of course you could also do lots of stuff on the FPGA which doesn't make sense.
But thinking about efficient compute:
the more you can use the chip space through parallel units, etc.,
the more you will actually benefit from using the FPGA.
However, the problem is that the compile flow is very slow.
So remember, synthesizing and then mapping, placing, and routing
your program, that takes a long time.
This can take up to a day or even more than a day.
And that means the task that you're dealing with
needs to fit this constraint.
So you cannot say I'm having a new ad hoc query
in my database that I need to be answered in a few milliseconds maybe.
That doesn't make sense, right? I cannot put this in there.
I need to somehow be able to tolerate this reconfiguration.
Or I have to be able to pre-compile this
because the mapping, the programming of the FPGA,
that's not that slow.
It's still not in milliseconds, but it's faster.
You can also partially reconfigure.
So we had this today.
That will again be faster because you don't have to send as much data to the FPGA
to reconfigure it.
Still, it will take some time.
And through these reconfigurations,
you can create hardware with some modifiable behavior
at runtime.
Okay.
So with that, I'm wrapping up the FPGA.
So we walked through the architecture, how to program it, and especially I put, I think,
most emphasis on how it internally works.
So you get an idea how to get some performance out of it, I hope.
And we talked about the design flow and today very briefly about how this can be used in
data processing.
And there I invite you to read some of the papers.
If you attend Marcel's seminar, then you might actually come up with some additional ideas
how to use it efficiently, if you want.
With that, are there questions to this part?
No questions? Then, very good.
Then we're going to switch to the summary today.
And as I said, this is really just a summary.
So I basically just copied in all of the summaries
or the overviews of the lectures. So we started with an overall intro
and the introduction of computer hardware and the lecture.
And then I gave you an introduction to,
or a recap of database management systems.
And then we started with performance analysis.
And this is basically something that I
have in many of my lectures because I
think it's really useful.
And maybe it helped you to some degree with the tasks.
Basically, get a basic understanding of performance:
think about what the hardware can do in terms of bandwidth, throughput, and latency.
What are the numbers that the vendors tell you, how does this relate to what you're seeing, and
what would you expect to see? Because then you can see if your program actually does well or not, or if you have a complete error in
your thinking about how to use this. We talked a bit about measurements and
benchmarks and fair benchmarking. So this is something relevant if you ever want
to do a research paper, or if you want to do a master thesis with some kind of research aspects,
then it's good to know how to properly evaluate stuff and how to benchmark stuff.
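A tiny worked example of this kind of back-of-the-envelope reasoning (the numbers are assumptions for illustration, not measurements):

    #include <cstdio>

    // Assumed scenario: a table of 1 billion 4-byte integers scanned from
    // memory with ~20 GB/s effective single-core bandwidth.
    int main() {
        const double rows = 1e9;
        const double bytes_per_row = 4.0;
        const double bandwidth_gbs = 20.0;               // assumed effective GB/s
        const double seconds = rows * bytes_per_row / (bandwidth_gbs * 1e9);
        std::printf("expected scan time: ~%.2f s\n", seconds);  // ~0.20 s
        // If your measured scan takes 2 s instead, that is a factor of 10 off the
        // hardware limit -- a sign to profile caches, layout, or the code itself.
    }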
Then we basically started with the real stuff.
So CPU and caching gave you an overall overview of the CPU architecture,
the buses, the memory hierarchy, etc.
And then we talked about memory accesses.
So how do they work?
So remember, there's these multiple level of caches and virtual addressing.
And we want to make sure that we keep data in the caches as much as possible.
But not only data, but also instructions.
So we want to keep them there.
Of course, it's not always possible, but we want to be cache efficient.
And then basically based on this, we can change our data layout.
We can change data alignment, but also instruction alignment,
and make sure stuff works well.
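A minimal sketch of the layout point (illustrative, the struct and sizes are made up): summing one attribute over a wide row layout drags a whole cache line per tuple through the caches, whereas a columnar layout uses every byte it loads.

    #include <cstdint>
    #include <iostream>
    #include <vector>

    struct RowStore {                                   // one wide struct per tuple
        int64_t id; int64_t price; char padding[48];    // stands in for other attributes
    };

    int64_t sum_rows(const std::vector<RowStore>& t) {
        int64_t s = 0;
        for (const auto& r : t) s += r.price;   // loads 64 bytes per 8 useful bytes
        return s;
    }

    int64_t sum_column(const std::vector<int64_t>& price) {
        int64_t s = 0;
        for (int64_t p : price) s += p;          // every loaded byte is used
        return s;
    }

    int main() {
        std::vector<RowStore> rows(1'000'000, RowStore{1, 2, {}});
        std::vector<int64_t> prices(1'000'000, 2);
        std::cout << sum_rows(rows) << " " << sum_column(prices) << "\n";  // 2000000 2000000
    }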
With this kind of knowledge and this in mind,
we also talked about the instruction execution,
so how your program is actually broken down
into micro operations, which are then executed on the CPU.
And you remember, hopefully, that today,
even a single core is kind of a parallel system already,
because you have multiple functional units
that will be executing in parallel.
Not all of the functional units will be busy all the time,
and this is, again, if you think about it,
the difference to how we want to use an FPGA.
Because on the CPU, we have a lot of functional units
that will only be used a certain amount of the time.
So if your program is not vectorized,
you will never use the SIMD.
I mean, the compiler will try to do it for you.
But say there's certain amounts of chip space
that you're not using.
In the FPGA, you will try to use all of it as much as possible.
We talked about hazards, so when the pipelining does not work well,
because of the way we access data in the caches and how we are executing branches.
We looked at a couple of different pipeline architectures today.
Not a couple, we looked at two.
Basically we looked at x86 and the ARM M1 architecture in a bit more detailed overview.
Then we put more emphasis on SIMD.
So how does SIMD programming, so these vector units on the CPU work.
You actually also had to implement a task on that.
And then, so how they work, but also how we program them
and how we can use this in the database.
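As a small reminder of what such SIMD code looks like (my own example, not one of the lecture tasks; it assumes an AVX2-capable x86 CPU and compiling with -mavx2): eight 32-bit values are compared against a filter bound with a single instruction.

    #include <cstddef>
    #include <cstdint>
    #include <immintrin.h>   // AVX2 intrinsics, compile with -mavx2
    #include <iostream>
    #include <vector>

    // Count how many values are strictly below a bound, 8 lanes at a time.
    size_t count_less_than(const std::vector<int32_t>& data, int32_t bound) {
        const __m256i vbound = _mm256_set1_epi32(bound);
        size_t count = 0, i = 0;
        for (; i + 8 <= data.size(); i += 8) {
            __m256i v  = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(&data[i]));
            __m256i lt = _mm256_cmpgt_epi32(vbound, v);        // bound > v, per lane
            int mask = _mm256_movemask_ps(_mm256_castsi256_ps(lt));
            count += __builtin_popcount(mask);                  // matching lanes
        }
        for (; i < data.size(); ++i)                            // scalar tail
            count += (data[i] < bound);
        return count;
    }

    int main() {
        std::vector<int32_t> data = {5, 2000, 999, 1000, 42, 1234, 7, 0, 3, 1500};
        std::cout << count_less_than(data, 1000) << "\n";       // prints 6
    }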
Then we've shifted more towards, and of course,
if you have questions so far, let me know.
If there's something that you remember
that was kind of interesting, something where you say,
OK, this I would like to look into again, let me know.
I mean, the slides are all there.
There's no exam, so I know you're
kind of relaxed about these topics right now, which
is fine.
That's also the idea.
It's really more
for the fun of it, right? So, I mean, the programming I think is super useful; you will
be very happy later on that you can do this kind of stuff, because you get nice jobs, etc.
But it is also good to know, to have kind of an overview of how this works, because then you can use your computer much more efficiently than if you don't.
So in the execution models, we talked about how database queries are executed and how we can do this more efficiently towards hardware. So the classical iterator model, which is really OK-ish,
which is really nice in terms of an abstraction,
but it's not really fast if you have a modern CPU
and modern and large memory.
Then the materialization model, where you basically just
process operator by operator, and then
the more hardware-geared execution models of vectorization and code generation.
So vectorization meaning using multiple data items, a batch of data items at a time, in order to be efficient in terms of memory,
in terms of caches, but also to use SIMD vectorization.
And then code generation in order
not to interpret the code all the time,
not to have all these function calls,
but really generate small kernels that execute very fast
on the hardware.
And this is an ongoing discussion, basically,
and ongoing research,
what is good and better for certain cases.
So, I mean, code generation always takes some time.
Vectorization has more of these function calls, etc.
So there's always new kind of architectures popping up,
new kind of ideas how people try to do this.
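A very condensed contrast of the two extremes (illustrative only, much simpler than a real engine): the iterator model pays one virtual call per tuple, while a vectorized model processes a batch of values per call.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    struct Iterator {                          // classic Volcano-style operator
        virtual ~Iterator() = default;
        virtual bool next(int64_t& out) = 0;   // one tuple per call
    };

    struct Scan : Iterator {
        const std::vector<int64_t>& data; size_t pos = 0;
        explicit Scan(const std::vector<int64_t>& d) : data(d) {}
        bool next(int64_t& out) override {
            if (pos == data.size()) return false;
            out = data[pos++]; return true;
        }
    };

    // Vectorized: produce the next batch of up to 1024 values at once.
    size_t next_batch(const std::vector<int64_t>& data, size_t pos,
                      std::vector<int64_t>& batch) {
        batch.assign(data.begin() + pos,
                     data.begin() + std::min(pos + 1024, data.size()));
        return pos + batch.size();
    }

    int main() {
        std::vector<int64_t> data(10000, 1);
        int64_t v; int64_t sum1 = 0; Scan s(data);
        while (s.next(v)) sum1 += v;                         // 10000 virtual calls

        int64_t sum2 = 0; std::vector<int64_t> batch; size_t pos = 0;
        while (pos < data.size()) {                          // only 10 calls
            pos = next_batch(data, pos, batch);
            for (int64_t x : batch) sum2 += x;
        }
        std::cout << sum1 << " " << sum2 << "\n";            // 10000 10000
    }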
Then we have data structures.
So we talked about hashing, trees, and tries.
You all implemented the ART,
which I still think is quite a nice data structure,
not necessarily taught in data structure courses, nor are tries in general.
So it's a good kind of knowledge to have. You'll have an additional opportunity there:
you already heard yesterday about the skip list, right? So this is also an interesting data structure for certain cases
and increasingly frequently used in database research.
So I think it's also good to know.
So I mean, know your data structures.
We always try to give you kind of an overview
of the most relevant ones, but then there
is many for special cases.
And so it's good to have a deep understanding also of how they work.
And I think implementing is always the best way to get a good understanding there.
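Since the skip list came up: here is a minimal sketch of just the search path (illustrative, with a hand-built two-level list rather than a properly randomized one): the search starts at the top level, walks right as far as possible, then drops down a level.

    #include <cstdint>
    #include <iostream>
    #include <vector>

    // Each node carries forward pointers on several levels; with randomized
    // heights, lookups take O(log n) steps on average.
    struct Node {
        int64_t key;
        std::vector<Node*> next;      // next[l] = successor on level l
        Node(int64_t k, int levels) : key(k), next(levels, nullptr) {}
    };

    Node* find(Node* head, int64_t key) {
        Node* cur = head;
        for (int level = static_cast<int>(head->next.size()) - 1; level >= 0; --level)
            while (cur->next[level] && cur->next[level]->key < key)
                cur = cur->next[level];               // walk right on this level
        Node* cand = cur->next[0];                    // candidate on the bottom level
        return (cand && cand->key == key) ? cand : nullptr;
    }

    int main() {
        // Hand-built tiny list: head -> 3 -> 7 -> 9 on level 0, head -> 7 on level 1.
        Node head(INT64_MIN, 2), a(3, 1), b(7, 2), c(9, 1);
        head.next[0] = &a; a.next[0] = &b; b.next[0] = &c;
        head.next[1] = &b;
        std::cout << (find(&head, 7) != nullptr) << " "
                  << (find(&head, 8) != nullptr) << "\n";   // 1 0
    }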
Then we had this new profiling lecture that Lawrence showed you,
where he showed how to do profiling and get a bit more detail out of your code and how
it runs.
So, I hope that helped with some hands-on session or tools.
If you were not fast enough like me, there are basically still all the details in the
video.
You can check how to do this and try it out.
And again, I really recommend to try this stuff.
I'm assuming all of you have done this for the tasks.
If you haven't, try it out.
It's always a pain initially, like anything that you have to learn.
So initially it kind of takes a very long time.
There's a certain learning curve,
but then once you have this knowledge,
when you have the skill, things will go much smoother.
You can do much more than before.
Then we talked about parallelism,
first multi-core parallelism,
how this applies to database queries in terms of inter and intra-query parallelism,
so inter in between, doing multiple queries in parallel, intra-queries, splitting up a single query into multiple parallel parts.
And then we talked about joins as one instance of intra-query parallelism on an operator level.
So how can we parallelize a join?
And we had three different examples.
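A minimal sketch of intra-operator parallelism (not one of the three join examples from the lecture, just a generic illustration in the spirit of morsel-wise work division): worker threads grab the next chunk of the input through an atomic counter, so the work stays balanced even if some threads are slower. Compile with -pthread.

    #include <algorithm>
    #include <atomic>
    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    #include <thread>
    #include <vector>

    int main() {
        std::vector<int64_t> data(1'000'000, 1);
        const size_t kMorsel = 10'000;
        std::atomic<size_t> next_morsel{0};
        const unsigned num_workers = std::max(1u, std::thread::hardware_concurrency());
        std::vector<int64_t> partial(num_workers, 0);   // per-worker sums (ignoring false sharing for brevity)

        std::vector<std::thread> workers;
        for (unsigned w = 0; w < num_workers; ++w) {
            workers.emplace_back([&, w] {
                while (true) {
                    size_t begin = next_morsel.fetch_add(kMorsel);   // grab next chunk
                    if (begin >= data.size()) break;
                    size_t end = std::min(begin + kMorsel, data.size());
                    for (size_t i = begin; i < end; ++i) partial[w] += data[i];
                }
            });
        }
        for (auto& t : workers) t.join();

        int64_t sum = 0;
        for (int64_t p : partial) sum += p;
        std::cout << sum << "\n";   // 1000000
    }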
If we're talking about parallelism, we also have to think about concurrency
and synchronization because we have
data structures that will be used from multiple threads in parallel. So we talked about
the cache coherency protocol. We will talk about this again, or not we, Marcel will talk about
this again next week when he's talking about CXL, because this is one of the core features of CXL,
is this cache coherency.
So that's going to be interesting.
And then we talked about synchronization,
so different kind of latches,
which are basically data structure locks
and other kinds of locks, or different strategies of locking in data structures.
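The simplest possible latch looks roughly like this (a plain test-and-set spinlock as an illustration; real systems use more refined variants with backoff or optimistic reads). Compile with -pthread.

    #include <atomic>
    #include <iostream>
    #include <thread>
    #include <vector>

    class SpinLatch {
        std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
    public:
        void lock()   { while (flag_.test_and_set(std::memory_order_acquire)) { /* spin */ } }
        void unlock() { flag_.clear(std::memory_order_release); }
    };

    int main() {
        SpinLatch latch;
        long counter = 0;                       // shared state protected by the latch
        std::vector<std::thread> threads;
        for (int t = 0; t < 4; ++t)
            threads.emplace_back([&] {
                for (int i = 0; i < 100000; ++i) {
                    latch.lock();
                    ++counter;                  // critical section
                    latch.unlock();
                }
            });
        for (auto& th : threads) th.join();
        std::cout << counter << "\n";           // 400000
    }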
Coming from multiple cores, we talked then about NUMA,
so non-uniform memory access,
so as soon as we have many cores or multiple sockets,
today even just a single socket, but with many cores,
we'll have different kind of memory regions
which have different kind of access speeds.
And this is something that we have to think about
if we're programming, at least if we wanna be efficient,
because local access is much faster than remote access,
even in a single server.
And just being aware of that will give us
a good performance improvement.
So that's why we basically looked
at what this means in terms of architecture
and then in terms of programming
for different kind of operators.
So how to structure the data,
align the data and the program in order to be efficient.
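One practical consequence is the first-touch rule on Linux, sketched below (simplified; a real setup would also pin each thread to a core on the target socket, for example with pthread_setaffinity_np or libnuma): physical pages end up on the NUMA node of the thread that first writes them, so each worker should initialize the partition it will later process. Compile with -pthread.

    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    #include <memory>
    #include <thread>
    #include <vector>

    int main() {
        const size_t n = 64ull * 1024 * 1024;                  // 64M ints, ~256 MB
        std::unique_ptr<int32_t[]> data(new int32_t[n]);       // pages not yet touched
        const unsigned workers = 4;
        const size_t part = n / workers;

        std::vector<std::thread> threads;
        for (unsigned w = 0; w < workers; ++w)
            threads.emplace_back([&, w] {
                // First touch: this write decides where the pages physically live.
                for (size_t i = w * part; i < (w + 1) * part; ++i) data[i] = 1;
            });
        for (auto& t : threads) t.join();

        // Later processing should reuse the same partitioning, so each thread
        // mostly reads memory that is local to its socket.
        int64_t sum = 0;
        for (size_t i = 0; i < n; ++i) sum += data[i];
        std::cout << sum << "\n";
    }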
Then Lawrence showed you persistent memory.
This is basically a new technology,
unfortunately dead already again, at least for some time,
where we talked about how we can use it.
So it's a different type of RAM.
However, we were basically
debating if we should still keep this in the lecture or not.
But on the one hand, it's still used in database systems.
So Oracle, for example, uses persistent memory
in their system.
So it's not worthless information because people are using this.
There's also still a lot of research.
We still have the servers.
And even, for example, again, with CXL,
there will be new persistent memory alternatives.
There already is.
There's still battery-backed modules, meaning you
have DRAM where there's a small battery, which keeps the memory stable and basically persistent.
And, as I said, with CXL there will also be new persistent memory ideas or persistent memory configurations.
Not necessarily exactly like Optane, not necessarily with the same characteristics,
but with similar features.
This means, of course, it's not going to be exactly the same, but in general, we can do
the same things with that as we did with persistent memory.
Then we talked about storage, so general overview,
and then a lot about NVMe, so how to use SSDs efficiently.
With this, we also talked about PCI Express in there.
How does this actually connect
and how then can we use the SSDs efficiently.
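As a small example of what using the SSD efficiently often starts with (an illustration with a placeholder file name and minimal error handling, assuming Linux, g++, and a 4096-byte logical block size): reading with O_DIRECT bypasses the page cache, which is usually what you want when the database manages its own buffer pool, but it requires aligned buffers, offsets, and sizes.

    #include <cstddef>
    #include <cstdio>
    #include <cstdlib>
    #include <fcntl.h>
    #include <iostream>
    #include <unistd.h>

    // Compile with g++ on Linux (O_DIRECT needs _GNU_SOURCE, which g++ defines).
    int main() {
        const char* path = "/tmp/testfile";          // placeholder path
        const size_t block = 4096;                   // assumed logical block size

        int fd = open(path, O_RDONLY | O_DIRECT);    // bypass the page cache
        if (fd < 0) { perror("open"); return 1; }

        void* buf = nullptr;
        if (posix_memalign(&buf, block, block) != 0) return 1;   // aligned buffer

        ssize_t n = pread(fd, buf, block, 0);        // read one aligned block at offset 0
        if (n < 0) perror("pread");
        else std::cout << "read " << n << " bytes\n";

        free(buf);
        close(fd);
    }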
Then, networking.
So, first general parallelization across multiple nodes.
So, as soon as we're going out of a node,
unless we're using CXL, we'll have to deal with networking.
So this means we'll either have Ethernet or InfiniBand,
and then we can have different modes how to connect.
So again, either using IP or RDMA.
And I talked a lot about RDMA
to kind of give you this alternative
on how we're accessing another server.
I mean, there's lots and lots of stuff
about networking that you can do.
So how you use the networking,
so basic socket communication,
but also message passing, et cetera.
Then there's different levels again.
And you know, we have a networking professor here
who will be able to teach you much more about this.
But for us, it's important to know that there's also very low-level ways
of using networking, which is remote direct memory access,
which will give us very low
latencies and very high bandwidth if you're doing it right. And this again we
can use in the database, say for example, I mean the canonical example again, here
was a join. Then we had two lectures on GPU, where Ilin discussed the architecture and memory hierarchy.
So this kind of basic, how does a GPU work internally, what is the difference to a regular
CPU.
And then in the second part, how to program it, how the execution works.
And then in the second lecture, also how to use it in a database
and what happens if we have multiple GPUs, which
we will have on modern servers that feature GPUs typically.
So how are they connected to the main memory, to the CPU,
and how are they connected with each other?
And again, how can we use this efficiently,
say for example, for sorting,
or, and we didn't cover this,
but another idea would be joins again,
where we have a lot of data movement.
And we saw that they are actually efficient,
or not efficient, but fast,
but usually limited by the interconnect.
So the GPU is so fast at data processing
that if we're just doing a simple operation
like a join or a sort, then the interconnect
is the main bottleneck for us again.
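A quick back-of-the-envelope illustration of that bottleneck (the bandwidth numbers are assumptions for illustration, not measurements of any particular card):

    #include <cstdio>

    // Shipping 100 GB over ~25 GB/s PCIe takes far longer than scanning the
    // same data at ~1 TB/s GPU memory bandwidth.
    int main() {
        const double data_gb = 100.0;
        const double pcie_gbs = 25.0;        // assumed effective PCIe bandwidth
        const double gpu_mem_gbs = 1000.0;   // assumed GPU memory bandwidth
        std::printf("transfer: %.1f s, on-GPU scan: %.1f s\n",
                    data_gb / pcie_gbs, data_gb / gpu_mem_gbs);  // 4.0 s vs 0.1 s
    }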
Then we finished up FPGA today.
So here, for me, important is that you understand
what is the difference to a regular CPU or a GPU
in terms of general execution.
And I hope you understand that this is not like instruction execution,
but it's really mapping circuitry through this.
I mean, again, of course, the circuitry, everything's already there.
But we're using kind of memory cells to create like small changes in the hardware
to map regular circuitry to this reconfigurable circuitry.
So we talked a lot about the architecture, talked a bit about the programming and the design flow and today we talked a bit about the data processing on there.
Next time you will have an intro to CXL.
This is now finalized, so I just took this from Marcel.
You will get an overview of the Compute Express Link
and the different modes.
Basically, what are different protocols in there.
I'm always getting confused, so that's why I'm kind of putting an emphasis here.
There's different versions of the protocol which support different modes, and then there's
different devices which again support different parts of the protocols. So there's kind of a
matrix problem, and Marcel will tell you all about it. Unfortunately I'm not here; I would actually listen to it as well if I were.
And the main idea here is that it features cache coherency across different
kind of devices, which is this big thing, right? So far you only have cache
coherency across multiple sockets, but not across multiple nodes, usually not across an accelerator and the CPU.
So say, for example, your GPU and the CPU, there's no cache coherence,
so you have to deal with this yourself. And through CXL, you get this.
And plus, again, low latency transfers and high bandwidth transfers,
which are basically tied to PCI Express.
So this is a protocol on top of, or a specification on top of PCI Express.
Okay, and with that I'm actually through.
So the questions for me right now is,
what did you like and what could we do better?
And I know it's hard to start, so we're just going to do it row by row.
So which was your favorite topic, most interesting stuff?
I think the topic I liked the most was the SIMD processing topic because I've heard of
it before but I never knew what it really was about.
Especially that we had an exercise on that.
Okay.
I really liked that.
And because I took a lot of lectures by you already, a few things in the beginning were a bit repetitive. Repetitive, okay.
But it was fine. For whom was this helpful? The database recap? OK, good.
That's good.
Then, yeah, what did you like?
Personally, I enjoyed the lectures the most,
where we combined theoretical ideas or concepts
with practical stuff.
So for example, the profiling session,
I really enjoyed a lot.
And technically, whenever the terminal was pulled up and some code was shown.
Okay, you enjoyed that? Okay, I actually wanted to do more.
This is just always a lot of work in preparation.
It takes a lot of time during the lecture.
But that's good to know.
I think I liked the profiling session the most
because it showed how to squeeze
the last bit of efficiency out of your code.
This was really cool, but I think
I like most of the topics equally,
so it's hard to mention a single topic.
OK.
That's good.
OK, but the profiling was helpful.
This was actually new this semester,
so we basically sneaked this in.
Okay, that's good to know.
Okay.
Well, I liked the SIMD lecture.
Okay.
It taught me a new way to approach problems
and to bring them into the classroom.
And I also liked the data structures,
because they are probably the ones
that I will use most often.
Yeah, okay, fair.
Yeah, I can also say
that I like the C++.
That was very new to me.
It's kind of cool.
Also, everything about concurrency, especially GPU stuff,
it's just very close to the heart.
Okay, good. Good to know.
Okay, thank you.
I think I like overall getting into different topics,
getting an overview of everything,
but also that you can do something.
And especially the memory access and NUMA.
OK, cool.
So that's great to hear.
So there's something for everybody and not everybody
saying, well, just do a SIMD lecture and nothing else, right? Yes.
OK, cool.
Well, thanks a lot.
So this is it from me.
If you still have other feedback, of course,
there's EVAP, so feel free to put everything in there
that you have.
We're always happy about the feedback,
and we're really trying to incorporate this and improve
the lecture. I actually took a lot more time than I thought.
So if you remember, initially, I had a lot of Q&A sessions in there.
I think we almost had none of those, which is fine, right?
So I'm quite happy to basically present this.
And if I know there's more, like we could be a bit more practical here and there,
then I'll try to do this as well
Yeah, so from my end also: if you have other stuff, of course, feel free to reach out. If this stuff is interesting to you and you don't want to attend the seminar but still want to do something
in that area, let me know, reach out to Marcel and me. There's always projects that we
can do in this space, right? So, I mean, of course we're always trying to
come up with interesting stuff for you, but if you say, I have
this great idea that would be something that I would like to work on, or this
would be a topic that I would be extremely interested in.
I can, of course, not promise we'll do this next semester,
but we can kind of try to do stuff in the future and see it.
And the more I know this is good for you or this is helpful,
the more we try to also improve this and integrate it
into the lectures or in the curriculum in general.
Okay, so other questions or feedback ideas?
Well, so the next time you will see CXL,
Marcel will do this with you.
And then if nothing goes wrong,
I mean, right now there's a bit of a hiccup,
but there should be on Wednesday next week,
we'll have the data center tour.
I'm pretty sure this will work out.
So, if it doesn't work out, then you will get information.
But I'm very sure this will work out.
And that means you will then, rather than coming here
for the lecture, you will just directly meet
behind the building.
So there's the entrance to the data center.
This is like this big door on the back of the building.
And Marcel will basically do this with you.
So we'll... Actually, I'm personally not there, but Lawrence will be there.
OK, good to know. So look out for Lawrence; Lawrence will do it with you
and will basically try to show you as much as possible of the data center, for
sure the servers, but maybe also some of the cooling.
So, as you will see, it's quite interesting, because the data center itself, the room, is large, but not that large.
But then there's a lot of infrastructure around it that's required to basically keep this
working, which I find interesting to see, especially for the first time.
Do we have a Q&A or something for the data center? Or like a sign-up?
We probably don't need it.
There's enough of you.
We did this the last two sessions,
and there were around 10 people.
Yeah, but the thing is, be there on time.
We'll probably start five minutes past or something.
But then once we're inside, we'll probably not hear you
if you're outside and looking for us.
Then you missed your chance, basically.
I'm trying to do this regularly, but not much more
than once a semester.
But it's interesting.
With that, thank you very much.
Enjoy the rest of the semester, and thanks
a lot for attending the lectures.