Hardware-Conscious Data Processing (ST 2024) - tele-TASK - Introduction
Episode Date: April 9, 2024...
Transcript
Welcome, everybody.
We're going to start with Hardware-Conscious Data Processing, summer term 2024. So who's new at HPI this
summer term? Nobody?
Okay. So I see a couple of familiar
faces. I don't think I have seen
everybody so far, but I also have a very bad face and name memory, so feel free to remind me,
and I'm already sorry if I didn't recognize you right away, but it will get better over the course
of the semester. So this is on, yeah, hardware-conscious data processing: a course on how we can use hardware most
efficiently for data processing, so most notably database management tasks, so
how can you do joins, store data, et cetera.
And we're going to go down through the different levels of hardware, so it's also a
bit of a hardware architecture course. And first a heads-up,
you probably already know this: this will be recorded and will be available on tele-TASK.
If we have online sessions, we don't record the videos of online sessions, so feel free to show
your video, although I'm not really planning on doing any online sessions. So we're going to try to do everything here.
But if for some reason, some exception, some random rare exception,
you cannot attend the class in person, you can also always watch the video.
And this means we have everything online.
So the slides will be online, the videos will be online, we have Q&A forums, etc., online.
So there is no need to take any videos or screenshots or whatever and distribute them, because everything is publicly available anyway.
What I want to do today is give you a bit of an overview of our group, our curriculum, so let's say
broaden the scope so you can see what else is there, what we're doing, then give
you details on the course organization and do a bit of a motivation also today.
So well, everything you need to figure out if this course is for you and you want to stay with us, which I really hope.
Okay, so who am I? I'm Tilman Rabl, professor for data engineering systems here at HPI
since 2015. With that, I'm one of the older ones by now. And before that I was at TU Berlin, the University of Toronto, and the University of Passau.
And besides being a professor for data engineering systems, I'm also ombudsperson here at HPI.
So whenever there's a problem with scientific integrity in whatever form,
you can come to me, or at least ask me how this can be dealt with or whom to ask.
And I'm director of the HPI data center, which is kind of nice because we have direct access to all
of the hardware that we're going to present here. And for those of you who don't know, the data
center is basically, you cannot see it, but I can see it, just the second part of this building.
And with that, if all goes well, we'll also do one slot where you can actually go there and
see the hardware for yourself, although this is always a bit of a struggle, to be honest,
because there are many regulations to keep you safe when you see the hardware,
but we're going to really try to make that work. Okay, but of course, I'm not alone, right? So, this is my group. Two of the people here, or three, actually, you will have more
contact with. So, this is not really working well. This is Marcel. Marcel, stand up.
Marcel will help with the lectures,
and especially with the labs.
And Florian as well.
Florian, please stand up.
And thank you.
And so these two guys, you will see
whenever you have to program something, which
you will have to do.
And then we also have Martin, who is still in Irvine
and will come back next week.
And you will have full exposure to Martin next week
because I'm traveling next week.
So and he will do parts of the lectures whenever I'm not around.
And all three of them have great knowledge in database systems and programming.
So they will be able to help you a lot along the course in many ways, often even more concretely and better than I will be able to.
OK, so what do we do? What is data engineering systems?
It's database systems, in essence. So we do look at database systems, meaning: how do we build database systems? How do we optimize database systems? How do we run database systems on modern hardware? These acronyms here are conferences that we typically publish at. So if you want to read some of our papers, you can go to VLDB, SIGMOD, or ICDE. I don't like ICDE that much.
So go to VLDB and SIGMOD.
Those are the best conferences for database research.
We also do a lot of stream processing and real-time analytics.
That's something that I brought from TU Berlin because they're super strong in that direction.
Of course, machine learning is important, but I'm not a machine learning guy, so I'm
really interested in how can we look at this from a system perspective.
So there's many tasks that require data management, that require being efficient on hardware,
and that's where I'm getting interested.
So we do research in that.
And then I've always been involved with benchmarking.
So this is something that we always also do.
So we check how fast the hardware is, how
fast systems are, we compare them, and we build benchmarks.
So in the end, our research approach often looks something like this.
We start from a certain application scenario and build benchmarks, so
we build basically frameworks, tooling, and guidelines on how to benchmark applications
and systems, and then we do the actual benchmarking. So that's basically running
your things and checking the times. Based on the application, we see if systems are good enough or which system might be best.
And then we might extend existing systems.
We might exploit new hardware,
if existing systems can really use the hardware, or we're just building completely new systems.
And if any of this is interesting to you, or during the course
you figure out, well, I finally found what I was looking for, I want to work on
this, then feel free to reach out to us and we'll help you with doing some work
with us in the form of a project, a student assistant position, or a thesis. And this course is part of a greater curriculum that we do over winter and summer
terms, and not everything is done by us. So in the bachelor's there's Database Systems I and II,
which are mainly taught by Felix Naumann, but every now and then I jump in there as well. Then we do
a lecture series. I'm going to give you a bit more details on that in a bit.
And then in the summer term,
we have hardware conscious data processing
as our big lecture.
In the winter term, we have big data systems
as our big lecture.
And then besides that, we have seminars on hardware,
typically in the winter,
machine learning systems seminars every semester, basically.
And new in the winter term, we had the Big Data Lab,
which is kind of a practical version of big data systems.
And so, I mean, we're kind of trying to give you
all the details on everything you need to know
about database systems and data engineering systems.
This semester, we have hardware-conscious data processing,
you already noticed.
We have a lecture series on HPI research.
So this is something that I'm trying this semester
for the first time.
So since we've been growing quite a bit,
I thought it kind of might be interesting also for you
to see what kind of research there actually is at HPI.
It's also interesting for me,
but it might also be interesting for you.
So there's a lecture series where every week
we have another HPI professor presenting the work
of the group in a lecture.
So not like a pure research talk,
but kind of an overview of what is the research about,
like something specific and a bit of an overview of the group.
So that's going to be Tuesdays. Today will be the opening.
So again, just course logistics, etc.
And then from next week on, every week there will be a professor.
And this is for bachelors and masters,
so you are all masters,
for three credit points, and we have other lecture series and other lectures and courses where
you can get another three credit points. Then machine learning systems will start tomorrow
at 1:30, and I don't know the exact room. I think we're trying to do it in F downstairs,
but that's something you can find on the website. And besides that, we have bachelor projects,
and we have master projects every now and then; this term we don't have one, but there's a bachelor
project on the climate footprint of the data center, also something that's near and dear to my heart.
So how can we make stuff more efficient also climate-wise, not just performance-wise.
And we're always happy to guide you in some kind of projects.
And one thing that might be interesting to you is the SIGMOD programming contest.
The timeline is tight already, so it has already started. But there's still some time. So if you feel you don't have enough to do right now at the start of the semester, you
might still sign up for this and build a hybrid index
for vector queries, doing some kNN search, basically, on vector data sets.
So stuff like this comes up
every now and then.
The SIGMOD programming contest is kind of nice,
because it usually has a database management aspect
or a database system aspect like this one.
Every now and then it's not, and then I'm not promoting it as much. But if you participate in
this and you are among the finalists, then we will sponsor your trip to SIGMOD,
and this year that would be in Santiago de Chile. So that's also maybe a reason
to participate in these things. Next year would also be fun, right? It will be in
Berlin, so it's cheap for us to send you there. But I don't know what's
going to happen next year. Okay, so if you're interested in
this, reach out, check out the website. I actually have the website open somewhere
to show you. So, if I find my mouse... So this is basically... You can find all the
details here, and there's prize money, etc. So it should be fun. Okay, with that, if
you want to do a research project, just a quick recipe for you.
Also, this is kind of like more of the meta level information.
Whenever you do a research project, it might be a good idea to follow this kind of seven-step recipe,
because it helps to keep you motivated and focused, and not end up in a dead end somewhere along the path. So if you're doing a
project which has something to do with performance in one way or a measurable outcome, then this is
a good approach to start this. Also if you want to write a thesis. Even if you don't want to write it
with us, I'm quite happy to give you some guidance in order not to fail.
So this is always good.
So of course, you need to do literature research.
So you start, depending on your previous knowledge:
if you're already an expert in a field,
you should know the literature.
Otherwise, go out there, check the most relevant conferences,
check Google Scholar, and Google in general, of course,
to find
what's already there. Then you identify a research problem. So usually it
shouldn't be just an idea. Many people start with an idea and then later on
figure out it actually doesn't work, or it's a neat idea but it doesn't help
you in any way. So try to find a problem that you can solve. Often we can find a problem based on an idea
just by framing it differently.
So rather than saying, oh, I can use, I don't know,
this hardware for something,
I can frame the problem in the sense,
what would be the performance if I use this hardware?
So then I have a research problem.
I can actually identify the problem.
If I think, oh, using this hardware will be great,
and now I'm trying it, then later I find out, oh, it
doesn't really work.
It's not really faster because there was no problem.
It was just a neat idea.
Then my result is negative.
If instead you are figuring out whether it works
at all, or what the performance is, you can always have a positive outcome. So the framing
is different, and your motivation will be different if you do this.
If you identify a research problem, you can describe a novel solution, and then you perform
back-of-the-envelope calculations. So this is just a very rough estimate of what the performance will be.
So will this be good or not?
Typically, you can do this in a day or even less.
You can figure out what the performance of your setup is.
So even if you try to do a new startup here at HPI with the E-School or something,
and you come up with some microservice,
application server, something-something architecture, you can do a
quick back-of-the-envelope calculation of whether this will be fast enough or not. So
just by basically knowing some basic numbers of how fast each individual
service can respond and what the latencies in between are,
you can figure out what the performance should be in the worst case and in the best case.
And with this, you can figure out if your idea is good or if it's stupid. This is also called a
bullshit filter. If you don't do this, you build
everything, you do experiments, you figure out everything's way too slow, and then you work your way backwards,
and you already have so much invested that you cannot really change everything anymore.
Doing these simple calculations from the get-go often helps a lot. And then, of course, we have to implement, we
perform the real experiments and then you write up the report.
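To make the back-of-the-envelope step concrete, such an estimate can literally be a few lines of code. This is only an illustrative sketch: the helper name `estimate_latency_us` and all the numbers below are made-up assumptions, not anything from the course material.

```cpp
#include <vector>

// Back-of-the-envelope model: a request calls n services sequentially,
// each call costing one network round trip plus that service's own work.
// Every number fed into this is an assumption, not a measurement.
double estimate_latency_us(double rtt_us, const std::vector<double>& service_us) {
    double total = 0.0;
    for (double s : service_us)
        total += rtt_us + s;  // sequential calls simply add up
    return total;
}
```

With an assumed 500 µs round trip and three services taking 200, 1000, and 300 µs, the estimate is 3000 µs, so a 5 ms latency budget would pass the filter and a 1 ms budget would not, all before writing any real system code.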
If you want to publish this, you have to endure a couple of revision cycles,
even in your thesis, right?
Your thesis is not a single shot where you write down the whole text;
typically, even if it's just you yourself revising the thesis,
you should go through many cycles.
This is the only way to improve the text.
Nobody can write a perfect text from the get-go.
Just write it down.
You really have to do this incrementally and revise, revise, revise.
Okay, questions so far?
Perfect.
Then we're going to get right into the course logistics.
So, course hours and structure.
The lecture and labs are here, Tuesdays and Wednesdays.
So, you found this, so you know this.
This is probably mainly for people online or watching the video.
The labs will typically happen on Tuesdays every now and then; there are
not that many of them, but they will be in the same slot. These will mostly not be
recorded, just because we're going to present you the solutions and we want to reuse some
of the tasks, so it makes sense for us not to record them. But if we don't record, we typically provide a Zoom link
if people cannot join.
We might do this proactively or reactively.
So I don't fully know yet.
So meaning, and this is always the case, right?
If there's something where you need some help,
you cannot join, something is missing on the website or something, just reach out.
So we're trying to always accommodate everybody.
If you're sick and cannot perform the task in time, let us know in advance.
So it doesn't make sense to send us an email after the deadline.
"Well, unfortunately, I had to take a vacation during the last couple of
weeks, couldn't write an email."
These are typically the things where we cannot really help.
But if you come and say, something happened, I cannot finish, I have to deal with something else, can we move the deadline?
Then we're always trying to accommodate that, if possible, if the reason is plausible. And we're always trying to have everything here in person.
There might be a case where we switch to an online version with videos, so if there is another pandemic,
we'll go fully online. I hope not. Otherwise, I'm hoping we'll have everything here.
We always have all of the details in Moodle. So, I hope all of you already found our
Moodle page. So, let me show you this again. So, this is the Moodle page, right? Number 740, the course.
And in here, we put all the slides.
We have all the programming task details,
the grading, prerequisites, policies, discussion forums.
And you can see already there's lots of stuff that's still hidden.
We're updating the schedule. I hope we did update the schedule?
Perfect, we updated the schedule. We have an example coding task, I'll also come to
this. We have the slides for the day, for example, you can see,
and everything else is already here as well.
Okay, I'm always available by email, so you can always send me an email if you have questions.
Things are hectic right now; I get lots and lots of emails because of the semester
start and some other conferences that I'm involved in. So every now and then, emails
get stuck in my stack, somewhere lower down.
That means feel free, if something is urgent,
feel free to remind me.
If I see an email is urgent, I try to answer right away.
If you don't get an answer within three days,
it probably went down the stack a bit,
and then it might take a week or two until you get a reply.
If stuff gets lost or something,
then you can always also send me a second email.
I'm almost never upset about this.
And I also have an open door policy.
So whenever my door is open, which is usually the case,
you can just come in and ask your question, right?
This is something we can answer quickly. I'm in building F, on the second floor.
Just come in, ask your question,
and we'll solve this right away.
Unless I'm in a meeting which is super secret
or I'm in a call or something,
then I might tell you, well, give me five minutes
or give me 10 minutes or something
or come back at that time.
And again, we'll solve it.
If this is something where we need more time, half an hour,
well, let's say starting from 15 minutes onwards,
then let's book an appointment, right?
And this is always something that we can do.
Usually I can schedule smaller appointments fairly flexibly.
If it's something long, like an hour, it will need some time. Okay, good. So,
course contents: we have quite a bit of stuff to cover in the course. This is
continuously evolving; this is the third iteration of this course, and every now
and then I'm trying to bring in some new stuff, change some software, kick out some old stuff.
So basically,
we're working our way through, and you can also see this in the timeline later.
So, after going through the introduction and some performance analysis,
we'll start working our way through the CPU.
So this is basically the CPU architecture with the individual cores, multiple cores,
different levels of caches, and then the DRAM.
This is the stuff that will take us some time: first going through this and thinking
about the implications this architecture has for data processing. Then we'll walk out of the core onto multiple cores,
and then out of the multiple cores into the peripherals.
So basically on disk, network, GPUs,
and then later other kinds of accelerators like FPGAs, right?
And in between there are some other things,
some goodies that we sneak in:
something like CXL, Compute Express Link,
a new interconnect standard on top of PCI Express,
a profiling session where we help you
how to profile your code
and really get an understanding of the performance
of your code, so where does time go in there,
and things like that.
So that's the rough outline.
And the learning goals are a couple of different things
all packed into one, because I think you learn a couple of different things by doing this.
So the one thing, the major overarching motivation
for this, is efficient data
processing on current CPU architectures and current memory architectures, in parallel, distributed, and
on accelerators. So this is basically the aim of the course. This is why we're doing this, so you
know how to do this. And this actually makes sense, because if you can do that, you will find a nice job in industry,
or you can do great research in academia, whatever.
So with this kind of knowledge,
you have a deep understanding of processors,
of data management and data processing,
and a lot of people will be happy to work with you.
You will have to experience this yourself. So this is something that's really important for me:
you have to program yourself.
The major part of your work in the course is programming. So it's actually the biggest part of the things that you have to do,
except for listening to me, which might also be a heavy task every now and then.
But the thing where you actively have to do something,
or more actively, is the programming.
And I think that's really important.
If you go through this, my experience with
the people who went through the course is that they're really good to work with, right? So
people who understand architectures, who understand data structures on a low level and can make them
performant, have an easy time doing their thesis, with us
and with other groups as well.
And as a side thing, what you also learn is computer architecture,
because we have to understand computer architecture
in order to make our programs efficient.
So mainly CPU and memory,
so this will take us a lot of time,
but then also accelerators, which is also important
if, God forbid, you go into AI
and want to do something there, right?
So then this is also good to know,
just to know how the data has to flow, et cetera,
and how the architecture and the hardware works.
And then, of course, efficient data processing,
so query processing and how to utilize the hardware,
and efficient programming.
So this is not so much what I will teach you,
but this is what you will learn in the labs and where Marcel and Florian will help you
to make your programs run fast.
And, I mean, there's also a lot of self-teaching:
how to make your programs
hardware-conscious and fast.
And this is not super hard, it's just something
that you have to practice, and we'll give you some space
and time to do so.
So with this kind of overview of the course,
so we're in the introduction,
we're gonna do performance analysis.
Next week I'm not here, so Martin will do database basics,
kind of a recap for those of you
who haven't heard database systems 2.
And CPU basics then. So this is the first step on, or the first overview of, CPU architecture.
This will take us probably two sessions, depending on how fast Martin is; I will continue
where he left off. Then we'll talk about instruction execution on a CPU. So how do
the instructions actually get encoded and decoded and then executed on a core, on the multiple functional units?
We'll then take a bit of a deeper dive into one particular unit on the CPU, which is the SIMD units.
So the vector processing.
Then, unfortunately, on the 1st of May, there is no class because it's a holiday.
We'll also enjoy that.
Then you will have the first task, and that one will be on SIMD.
Then another SIMD class.
Then we'll talk about the execution model.
So how do we execute queries efficiently?
So there are different ways we can do this.
There are also different, let's say,
religions on how we can do this, and people fighting against each other over what would be the best.
So we'll talk about this. Then data structures, how to make them efficient; the profiling that
I promised, Marcel will tell you how to do that; then multi-core execution; the second task, which I wrote as "to be discussed" or "to be determined",
but we're already pretty sure
that this will be query execution.
So query compilation, actually,
so one specific kind of execution model.
And then we're slowly walking out.
So you can see this is all basically on a single CPU, or at least on a single CPU with multiple cores;
yeah, we're still on a single CPU.
Then we're working our way out of the single CPU, going to multiple CPUs, multiple sockets,
then out of the sockets to the storage.
While we're at storage, we're looking at the PCI Express bus.
So how is the storage connected to the CPU?
And there is a new standard for that,
which is Compute Express Link, where everybody's super excited,
but there's no good hardware yet.
So we have some prototypes, but nothing productive yet.
Then you'll learn how to program a buffer manager, then we talk about
networking, and then finally we're going completely out of the single
server, beyond the core CPU architecture, into GPUs, more networking,
and FPGAs, and we're going to try to have an industry speaker, so somebody who's
working on data processing on hardware in industry, and then we have a summary
and a data center tour. And as you can see, there's no exam, so everything will be determined
by the programming exercises. Okay, so with that, that's already kind of the grading in a nutshell. It's
very easy, at least from a high-level point of view. We have four graded programming tasks.
Each of them is 25% of the final grade, and you must pass all. So there's basic tests,
there's advanced tests, and then there's basic performance and there's advanced performance.
And you basically have to pass all the basic tests, which are known, right?
And the advanced tests are hidden.
So you know all the basic tests.
We have some advanced tests.
They're not super hard or something, not super unexpected.
But basically, we don't want you to hard code all of the tests.
So that's why we have some advanced tests that
are not public.
And if you pass all the basic tests and some advanced tests,
you can already pass the course.
Then it depends on whether your performance is good enough, so whether it's fast. We
have a baseline solution and we have an advanced solution. If you match the baseline performance and
pass some of the basic tests, you can pass the course. If you pass the basic and the advanced
tests and some of the baseline performance, then you will basically get into
better and better ranks. And we also have a leaderboard,
and if you are leading (I don't remember the exact numbers; Marcel will go through the details
in the task description), there are also some bonus points for the fastest solutions,
depending on the number of participants, essentially, in the
end. We'll basically see if you are among the fastest solutions. So this is not required
for a perfect grade, but it basically makes it easier. So if you have one of the
solutions that's super fast, you can get some extra points. Otherwise, you just need good solutions,
and you can also get a perfect grade.
In order to make sure that you've done the stuff yourself,
we'll have an individual presentation,
meaning each person will present one of the tasks to us.
So everybody will basically get a personal slot where we'll just discuss
the solution. We'll also check all of the solutions for plagiarism. So basically,
we check not only with a syntax check, but also with a basic program structure checker,
in order to make sure that you haven't just copied somebody else's solution.
And there's a couple of more details.
I think I have some on my slides, but you will also get some more details on how to behave
during the programming.
So this makes it kind of easier towards the end, because you don't have to, like, study for an exam.
You just have to continuously work on this.
There's a question.
When the tests are run, when we push to GitHub or something,
will we see the points that we get on the hidden tests, or will we only see them at the end?
So we have everything automated,
so when you push a commit,
then the CI will execute the base test,
and when the base test is passed,
then the advanced test will be executed.
And if you pass this advanced test stage,
then you get the points for this stage of advanced tests.
And if some tests fail,
we also provide you with some additional
information about what might be wrong in your code. In the past, this worked out quite well.
But if you're struggling a lot, then also please reach out to us. Then we might reveal some more
information. But we think with the current information that you get from failing tests,
you can figure out what's wrong and how to improve the code.
Yeah.
So we have everything automated.
It's also using GitHub a lot.
And I mean, you're not supposed to just trial-and-error,
to fail and just repeat, repeat, repeat until you
find the right result or figure out what our test
cases might be.
But if you're struggling, again, feel free to reach out.
There was another question at the top.
Did you have the same question or a different question?
No question?
It was not?
Okay, then we have a question here.
But this means, if we're not, I don't know, happy with our
results, we can re-upload the solution?
Yes.
So the question is, can you upload multiple
solutions? Yes, you can upload many
solutions, and for the discussion, we will pick the one
that you want to discuss.
So meaning, if you, in the end, do
some nasty hacks for performance, which you don't
really want to show, which make the code somehow
cluttered or something, then we can also go back.
Or if the last version was not the nicest version,
then we can go back.
OK, so now there's questions up here, left and then right.
Will we present in front of the class or just in front of you, two, three, four people?
Yeah, the presentation that you do will be individual in my office.
So it's not in front of the class because we need to basically have multiple
people present the same task.
And that then means we'll do this individually.
So do we get the task like a week before and then we can work on it at home
and then the session is just for presenting?
Or what do we do in the session?
Or do you discuss the ways to solve the task?
So what do we do in the session?
So in the sessions we introduce the task.
We'll also introduce the solutions.
And in the first session, we'll introduce the overall
setup, right? So how everything works.
And then in the individual session, it's really just
discussing your individual solution.
So then we'll ask questions about your solution.
Why did you program this this way?
How else could you have done this?
And we'll assign this randomly, meaning you will randomly be chosen for one of the tasks, not multiple.
But if we feel that you couldn't really answer
the task properly, so if we feel, okay,
this was somehow shaky, we're not really sure,
if you did this yourself,
we're gonna ask you about another task.
So if we're starting, say,
with the SIMD binary tree, and then we notice this didn't work out, you'll get another appointment later on.
If you're assigned to the lock-free skip list, and we figure out, okay, this didn't work out really well, then we're going to open up another of your tasks and ask questions on that one as well. And of course, that's basically the only other way, besides not handing in a functioning program,
that you can fail the course.
That should be a low bar, but if you really cannot explain anything of
your program, then we'll have to fail you.
Is the presentation going to influence the grading?
No.
Okay.
It's just a check.
So there was another question here.
So sometimes, especially if you want to make your code fast,
there are some solutions
that you find on Stack Overflow or something.
If we use some of those approaches
that were not taught in the class,
but we found them, I don't know, on Stack Overflow,
should we comment on that in the code?
Yes.
So if you use something, some other help,
write comments, right?
So we can basically see this.
You shouldn't use GitHub co-pilot.
I think that was one of the things that we didn't want
because that kind of produces false positives
in our automatic checks.
I mean, Marcel, you will go through the details there, about the tasks, what is okay and what is not okay.
I mean, you should program yourself.
You should really try it yourself first.
We have a task zero that gives you an overview.
I think I have this on my next slide.
It's a concurrent linked list.
And this will help you set up the environment
and also see if the course is for you, right?
I mean, it's not super hard.
You don't have to be a C++ whiz to do the course. But you will have to improve your C++ to a certain degree during this course, unless you're already really good.
But this doesn't hurt, right?
So as I said, it's a very good skill to have. It will help you in your life later on if you do this yourself.
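To give a rough idea of the shape of this task zero, here is a hedged sketch of the simplest possible approach, a single coarse-grained lock around a linked list. The class and method names are illustrative; this is not the actual assignment or its expected solution.

```cpp
#include <mutex>
#include <utility>

// Simplest thread-safe singly linked list: one mutex guards every
// operation. Real exercises typically ask for finer-grained or
// lock-free designs; this only illustrates the baseline idea.
template <typename T>
class ConcurrentList {
    struct Node {
        T value;
        Node* next;
    };
    Node* head_ = nullptr;
    mutable std::mutex mutex_;  // coarse-grained: serializes all access

public:
    void push_front(T value) {
        std::lock_guard<std::mutex> guard(mutex_);
        head_ = new Node{std::move(value), head_};
    }

    bool contains(const T& value) const {
        std::lock_guard<std::mutex> guard(mutex_);
        for (Node* node = head_; node != nullptr; node = node->next) {
            if (node->value == value) return true;
        }
        return false;
    }

    ~ConcurrentList() {
        while (head_ != nullptr) {
            Node* node = head_;
            head_ = node->next;
            delete node;
        }
    }
};
```

The single lock makes correctness easy to argue but throughput poor, which is exactly the kind of trade-off such a warm-up task lets you explore.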
Okay.
Further questions on the tasks? Not yet? You can also come back to that, right?
So task one is also already completely prepared, but the slide is not completely up to date. We still have the binary tree here, right? Or did we say it's the SIMD scan?
It's the SIMD scan.
So then this is correct. Here, we basically implement the SIMD scan. And you'll learn how to use in-memory column compression: how can we compress data in memory, how can we execute queries in a vectorized execution model, and how do we navigate SIMD code, which is not super easy. I mean, it's also not super hard, but there are just a lot of different ways in which you can program this. And you'll write some SIMD code.
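To illustrate the vectorized execution model mentioned here, below is a simplified sketch. The function name, chunk size, and predicate are illustrative assumptions, not the course's task code: a scan processes a column chunk at a time and emits a selection vector, instead of interpreting the predicate once per tuple.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Vectorized (chunk-at-a-time) scan sketch: run a tight loop over a
// whole chunk of the column and collect matching row positions into a
// selection vector, rather than handling one tuple at a time.
constexpr std::size_t kChunkSize = 1024;  // illustrative choice

std::vector<uint32_t> scan_less_than(const std::vector<int32_t>& column,
                                     int32_t threshold) {
    std::vector<uint32_t> selection;
    for (std::size_t base = 0; base < column.size(); base += kChunkSize) {
        const std::size_t end = std::min(column.size(), base + kChunkSize);
        // This inner loop is simple and branch-light, which is what
        // makes it friendly to SIMD auto-vectorization by the compiler.
        for (std::size_t i = base; i < end; ++i) {
            if (column[i] < threshold) {
                selection.push_back(static_cast<uint32_t>(i));
            }
        }
    }
    return selection;
}
```

Downstream operators can then work on the selection vector per chunk, which amortizes interpretation overhead across many tuples.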
So this will start in early May.
So it's still some time, right?
So you can still relax, sit back and enjoy.
And once this has started, then the real fun actually starts.
There's a lot of literature you can check out. There are also different things that I put into the lectures every now and then, which might be interesting.
If you're curious about computer organization, then two books that I can really recommend are Structured Computer Organization and Computer Architecture. They basically cover the same things; pick whichever approach and writing style you prefer. But this is something very worthwhile to know, how a computer internally works, and it is discussed in detail in both of these.
We also have Hasso Plattner's book here, which is an interesting read.
We have lots of performance analysis, and every now and then we add more books in the
lecture, as I said.
And most of these books are available in our library.
So we have a shelf where we have a couple of books.
So if you want to grab a physical copy of the book, at least one we have for each of
these books, and you can for sure have them for a while.
And some of them are also available, or I guess most of them are available at the university
library.
OK, so I'm refining my slides until the day or hour
before the lecture, typically.
And then I load them up into Moodle.
Sometimes I forget.
Then they are in Moodle right after the lecture
or whenever I finish lunch, etc.
Well, this is not a new course anymore, but please still be patient, right? So if you find something that didn't work out that well, then just let me know.
If you find errors in the slides, if you think we can correct something, send me an email.
I'm always happy to fix the slides.
I might not update this in Moodle, but I'm generally very happy to have an improved slide
deck for the next iteration.
If it's a major error, I'm also very happy to update the slides in Moodle.
And of course, everything is also available from previous years.
So there's recordings from previous years.
There's the slides from previous years.
I'm always trying to somehow refine it,
but the changes are not that major.
I already said this, right? Martin is not here this week. From next week on, he will jump around here and tell you about computer architecture or
CPU architecture and database basics.
And Marcel and Florian will help you with the programming.
Code of conduct, very important.
You can always ask questions.
You can also always discuss with each other. You can also discuss the programming exercises with each other, and you should, right? It's very good that we have a class here, a setting where we can actually meet and talk to each other. You can help each other; you might even be able to look at each other's code. But please don't share your code completely and just reuse somebody's code, because if you don't submit your homework individually, we'll have to fail you for the course. So as soon as you share your solutions and somebody copies your solution, or we find any other form of dishonesty, then we'll fail you for the course.
And this is important.
It hasn't happened in this course yet. But I had other courses where people basically just reused other people's exercises. Actually, that's not true; we did have people who just copied stuff. People pushed their code to GitHub, somebody else found the code and reused it. And then, of course, there are lots of discussions: oh, I didn't know, and whatever.
So don't do that, right?
So just be smart and program yourself.
Then everything is good.
Also, we're working on the hardware here.
We're giving you access to sometimes expensive hardware.
So please don't break anything.
So it might be easy to break out of our environment.
This is not a security course.
It's a performance course.
So if you try to break our stuff, again, we might have to fail you.
I mean, typically nothing should happen.
But just please do not try to escape the Docker environment and do not try to break the hardware.
This would not be good. And of course, in communication, so in our forums, emails, etc., try to be nice. I'm always trying to be nice, right?
So this is, for me, this is kind of a happy place.
I'm doing this because I like teaching, I like research,
and I want this to continue.
So this means I need happy and nice communication.
It doesn't always have to be happy, right? Sometimes things are problematic. It's still good to be nice and polite in emails. I'm trying to do that. Please also try to do that, not only with me and my colleagues, but also with each other. It just makes social life at HPI, and everywhere, much easier and much nicer. And I think this is a general rule. I know many people think it's okay to just write very short emails or something, but being nice and polite always helps, makes everybody happier.
So this, and in general,
you should always treat everybody
with respect and consideration.
This is even more important in an online setup
because there it's often not easy to read
what somebody else thought or what the feeling was,
the intention in which something was written.
So in that sense, always try to be extra polite online. In an in-person setting, the potential for misunderstanding each other is not as big, but it's still there, especially in the intercultural setting that we increasingly are in. So still try to be respectful and considerate to everybody. This course, and HPI in general, should be a safe space for everybody, right? So this is not just a course for a certain crowd of people at HPI who are interested in something.
But it should be open to everybody.
Everybody should be happy to come to this place
and come to this lecture.
And if there is something that you don't feel is good for you,
then feel free to let me know or let some of my colleagues
know or somehow get some information to us
so we can fix this
and change it so you feel good about the course.
Of course, there's the official course registration, so we'll get a list of everybody who signed up at a certain point. But please also sign up in Moodle, so we are aware that you exist. We also need your GitHub account; then we can give you access to our setup so you can do the programming tasks.
You find all slides, all resources, et cetera, on Moodle.
Please use the forum.
Most questions, especially about the setup, other people will also have, right? So if there's something you don't feel sure about, don't be shy, just write in the forum. I'm always happy when I see questions in the forum, and I'll be happy to tell Marcel or Florian to answer you as quickly as possible. If it's about the lecture content, then I'm also trying to answer quickly. Quickly means within a day or so; I'm trying not to do this at night too much, but we'll try to give you answers as soon as possible. Of course, for the programming, I know it always happens, right?
Most of the questions come all the way at the end,
but an easy way for you to make your life much easier
is even if you do the work towards the end,
just check out everything at the very beginning, right?
As soon as we open the task, just check if you can download it,
if you can basically get everything set up
to a certain degree.
Don't invest hours and hours if you
have some different kind of schedule.
Let's just invest half an hour or so
to check if you can access everything.
Because if you do this last minute,
probably it's going to be very stressful for you.
It's going to be stressful for us.
And so this is one of my major pieces of advice.
Even if you like to work close towards the deadlines, which
I also do, I still check everything
right from the beginning.
So I have all the material.
I can actually do the work close to the deadline and not fail in the end just because I didn't have the setup.
OK, so one other thing that I always want to make you aware of.
So if you've ever been in a course with me,
this is also something that's important.
So you as programmers,
engineers, and maybe data scientists, you have a lot of responsibility, right? Essentially, you can program many systems, or let's say very central systems, that other people's lives depend on, at least to a certain degree, or that influence other people's lives. And with that in mind, make sure that what you're doing is safe and, at least to some extent, morally good.
So most stuff is neutral.
Most systems that we build are somewhat neutral, but they might also not be safe.
Most people are not malicious; they don't build anything with bad intent. But people build things and don't think about the consequences. So whenever you build a system that deals with people, with people's data, etc., make sure that you are thoughtful, careful, and responsible with the system, and make people aware of its limitations. You can easily make mistakes, or other people can easily misuse or manipulate your system, and then other people get hurt. So my example is: a kettle is kind of safe for making hot water. If you build one at home yourself like this, you might be okay boiling water with it. But if somebody else tries it, they'll probably electrocute themselves. So make sure that your systems are safe and that they're used in a safe way. And in general, I also recommend thinking, later in life, about the consequences of whatever you're working on. Okay. Before we go into the motivation, a few contact people.
So I already said I'm one of the ombudspersons at HPI.
Another one is Holger Kahl.
And if there is any problem with good scientific conduct,
so meaning plagiarism, misuse of data, et cetera,
then feel free to reach out to us.
We'll be happy to help you.
If somebody misuses your data or misuses some other person's text, or if you're not sure: is what I'm doing here plagiarism or not?
Can I use ChatGPT to write my thesis?
Things like that you can ask us.
Not for everything can I give you a full-blown answer right away, but at least I know where to look.
And then, whenever we have equality issues, there are the Gleichstellungsbeauftragte, the equal opportunity officers, at HPI. So you can write them an email if you feel there's some problem with equality.
We also now have a diversity manager.
I forgot to put that on the slide.
And also very important, and I think very, very good that we have this.
There's a psychological counseling hotline.
So whenever you feel super stressed out. Of course, if you're super stressed out about this course, feel free to reach out to me; we'll try to help you and make sure that it's not getting too much for you. In general, I mean, this is just a course, right? If you don't do it, nothing happens in the end; there are other courses that you can take, or later repetitions. So don't get too stressed out about the course. We're also trying to make it not too stressful for you; it should not be way too much work. But it is a programming course, and the amount of work you have to put in depends, of course, on how much you programmed before. If you really didn't program a lot before, this will be harder than if you did it regularly in every course you take.
So anyway, if you feel stressed out,
this is one thing that you can do,
or have any other kind of issues where
psychological counseling might help.
And I recommend using it. If you don't think HPI can handle this safely, there's also the Nightline Potsdam, which you can also call if you need to talk to somebody. This is a service provided by students of the University of Potsdam.
Okay, with that we're going to do a five-minute break, but there's a question first before the break.
Yes? Can we step back from the course until the first assignment, or until the last assignment?
I think we said the first assignment or after the first assignment.
We have the date on the website.
But Marcel?
17th of June.
17th of June. Yeah. So we'll stick to that: 17th of June, basically. I mean, this is also something we can negotiate to some degree; at a certain point we'll just say this is the date and that's it. Again, if you feel there's a special circumstance or something, just reach out. We'll always try to help you with that.
Other questions?
No other questions?
Then five minute break, four minute break,
and then we'll talk a bit about motivation.
Why do we do this?
So some people already know this, right?
So I like to do these short breaks in between.
Usually I have one small break, often a bit too late,
because I somehow need too much time in the beginning.
But somewhere in the middle of the lecture,
I'm trying to have a short break just for some regeneration,
let's say, for everybody.
So with this, let me give you a quick motivation
on why we do this.
We're going to speed through this a bit,
because you will also hear a lot of this again.
But this gives you yet another overview of the topics
that you will see in the course, and maybe, hopefully,
motivates you to stay in the course.
Okay, so with this, let's get to one of the major questions: why do we need this kind of course? You've already heard Database Management Systems I, or Database Systems I and II, hopefully, or at least some of you have; if not, you will get the recap next week. And there's one major thing to know about database management systems, or classic database management systems.
So essentially, for a long time, basically, database architecture was just built around this access time gap in between
main memory and hard disk.
So for a long time there was nothing but spinning disks and then on top main memory and you
had a 10 to the power of 5 times latency and performance difference in between the two.
Meaning that accessing an individual data item
in main memory is 100,000 times faster
than accessing it on disk.
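As a quick back-of-envelope check of that factor, the latencies below are round, assumed textbook values, not numbers from the lecture:

```cpp
// Rough, assumed latencies: ~100 ns for a random DRAM access,
// ~10 ms for a random access on a spinning disk.
constexpr double dram_access_ns = 100.0;
constexpr double disk_access_ns = 10.0 * 1000.0 * 1000.0;  // 10 ms in ns

// The ratio is the 10^5 gap that classic database architecture
// was built around.
constexpr double gap = disk_access_ns / dram_access_ns;
static_assert(gap == 100000.0, "disk is ~100,000x slower than DRAM");
```

With such a gap, one saved disk access pays for an enormous amount of CPU work, which is why the classic architecture optimizes disk accesses above all else.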
And that again means if you have to,
like if your data is large enough
and it needs to go to disk, then everything is slow, right?
Everything basically, the CPU, everything just waits
for these disk accesses.
And that means the overall architecture
of a database system, of a classical database system,
is just designed to make this performant,
to make these individual accesses as worthwhile as possible.
So that's why we have a buffer pool.
That's why we have a row oriented layout.
That's why we have tuple-at-a-time processing in the upper layers: because nothing that happens up there matters if the disk is so slow, right? Everything that happens up there is negligible. So we don't have to deal with any performance up there, because the disk is where all the time goes.
And this is basically reflected in many database architectures.
And people looked at this and said, well, what if we have cheap RAM? This is also what Hasso Plattner said at a certain point, right? What if we put all of our data, the complete data set, in memory, in HANA?
So SAP HANA is one of these designs.
Well, all of a sudden, the architecture needs to change
because before, everything was designed just for this disk
access.
But now all of the infrastructure,
all of the buffer management, et cetera,
basically is unnecessary.
Or it's basically additional overhead.
So if you look at it, there was a paper a while ago that analyzed the different kinds of useful work and the different kinds of overhead. And they basically saw that less than 10% of the work that the database engine does is really useful work. The rest is just management around this, mostly around the disk bottleneck.
So traditional database architectures cannot utilize
in-memory setups.
And by this, they also cannot utilize modern hardware.
And modern hardware meaning just basic CPUs
that we have today.
So a classic database engine will always just
work on the disk, do a lot of disk access, et cetera.
And everything up there in the CPU will be slow.
The CPU is basically underutilized.
And with terabytes of main memory,
most databases can actually fit into main memory.
So if you think about student databases, etc.: we have around 1,000 students at HPI. If we just think about grades and so on, that's not that much data per student. So the active amount of data will be in the megabyte range. This doesn't even have to be in RAM, right? It basically fits into caches. So that means we can be super fast when dealing with this kind of data.
But if every access always goes to disk and maybe has like a couple of transactions, all
of a sudden, even that might be slow.
Okay, so and today, we don't just have, like, much larger RAM.
We also have multi-core CPUs.
And this means we have high parallelism. For task parallelism, multiple threads can perform different tasks at the same time: we can have management tasks, and we can have different parts of a query run in parallel. And we have data parallelism: we can run the same instructions, the same type of work, on multiple cores, but also, within a single core, on multiple data items in vector units.
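As a small illustration of data parallelism on a multi-core CPU, here is a hedged sketch with made-up names and chunking, not code from the course: one aggregation split over several threads, each summing its own chunk.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

// Data-parallelism sketch: each thread sums its own chunk into a
// private slot (no shared-state races), then the partial results are
// combined after all threads have joined.
int64_t parallel_sum(const std::vector<int32_t>& data, unsigned num_threads) {
    std::vector<int64_t> partial(num_threads, 0);
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = std::min(data.size(), t * chunk);
            const std::size_t end = std::min(data.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i) {
                partial[t] += data[i];  // each thread writes only its slot
            }
        });
    }
    for (auto& worker : workers) worker.join();  // wait for all chunks
    return std::accumulate(partial.begin(), partial.end(), int64_t{0});
}
```

The point is the restructuring: the sequential loop had to be split into independent chunks before the extra cores could help at all.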
But in order to use this effectively, we actually have to reprogram everything.
We have to design our program differently.
In a classical database, we would have a single thread for a single query, even a single thread per connection. There might be many queries, but each would be a single thread, and that will always underutilize a modern CPU. On a modern CPU, you might have 100 cores.
If you have just one task, then one core will be busy.
Everything else does nothing.
And you have much higher memory bandwidth as well. Even 51 gigabytes per second per CPU is low these days; we can be in the hundreds of gigabytes per second per CPU with DDR5, which we now have with multiple channels.
And then we might not just have a single CPU; we might have multiple processors, right? Each processor has multiple cores, and we can have two, four, or eight CPUs on a single motherboard. These all have memory connected, and each processor can access all of the memory across the different CPUs, but with different latencies. So again, we have to think about where we put data and how we access the data in order to be efficient across the system. This is also reflected in the processor trends.
A lot of this stuff you will see again, and I firmly believe in repetition to remember things. So please bear with me if you already got it the first time and can remember everything.
But I think every now and then it makes sense to look at this again, also from slightly different angles.
But as a first view on this, you can basically see how processors evolve right now.
So until the 2000s, mid-2000s, everything was single-threaded.
Essentially, we had a single core, single CPU, single core,
but we continuously increased the frequency.
And that basically means that makes everything easy.
So we have a single core that just gets faster and faster.
This is a logarithmic scale here on the side.
So this means you just wait a few years or two years
or something, and you get double the speed
or 10 times the speed.
You don't have to do anything with your program.
So you have your original program.
You wait a year.
Your new processor will be faster.
The same program will just run faster.
But this stopped in the mid-2000s, when the frequency had already gotten so high that it didn't make sense to make it much higher, just because of electricity and cooling issues. Higher frequency means more power,
means basically we have to cool more,
also means more power leakage,
meaning that we cannot get the chip as efficient.
So more density, like getting the chips smaller,
and at the same time heating up,
you basically have something like a small stove there.
So right now, this is basically like your cooking plate.
So this is what a CPU today is.
You have your cooking plate on maximum heat, and at the same time, you cool it down so that it keeps at 30 or 60 degrees, something like this.
That's kind of what you're doing with a processor today.
So it's not super, super efficient somehow.
And if we put more power into this,
the problem gets even bigger.
So we need to cool more.
Everything gets less efficient just for a little bit more
performance.
So with that, CPU manufacturers have started to add more cores.
And this is how we can still get performance.
But that means we cannot just use our program as is.
We have to parallelize.
We have to basically split up our task into smaller tasks in order to spread it out across multiple cores.
And then even multiple CPUs.
Because we can see, basically, the single thread performance
also doesn't go up that much anymore.
It still goes up, and that's because
of architectural changes.
So we're basically changing the CPU in certain ways
so that a single thread still can be faster,
even though the frequency is not faster anymore.
But that also means, of course, we
need to use these new techniques
or the new parts on the CPU that make things faster,
which, again, means reprogramming things.
And we're still in the range, at least here,
where the number of transistors goes up, right?
But also this is a problem.
So we're reaching physical limits. Remember, GPUs and CPUs are just getting bigger and bigger right now, and at a certain point there's a physical limit. Right now, a modern GPU or CPU will be about this size. Eventually, you're at basically what's called the reticle size: the maximum size that a processor physically can have, because that's the amount of silicon you can expose in one piece. And once you're there, manufacturing is already super error-prone; at this size, you get many errors just in manufacturing. So these are physical limits.
At a certain point, you just have to do things differently.
And that means rethinking the programming, rethinking also the hardware.
And that needs to be co-designed.
And what we have then, like additional trends, is fast networks.
So in the past, network was the slowest part.
Disk was already slow, but network was much slower.
Now we have things like InfiniBand and remote direct memory access,
and then CXL to some degree, which is slightly different,
but goes into the same range.
And this is close to memory speed.
So now it makes sense, to some degree or in certain setups, to write to a different node rather than writing to disk, because the network can be as fast as the memory, or at least have the same bandwidth. Of course not the same latency, but throughput-wise we can get the same kind of speed. We also have different kinds of processors, and this is the sort of specialization that we'll also look into. So rather than just making the chips bigger and bigger with more cores, we can also redesign them. And one of the first things that people started with
is graphic processing units.
They figured out, oh, it makes sense
to do a specialized processor for graphics
because graphics basically is very repetitive computation,
meaning you do lots of number crunching
on many pixels, essentially.
So we do the same thing on many cores.
We just split this up and all of a sudden
we can use this parallelism
with very simple programs in essence.
So a CPU can do a wide range of programs
and can easily switch contexts
and do basically different tasks in parallel.
A GPU is good if you do the same task in parallel across many cores. And modern GPUs we can also use for data processing, or just number crunching besides graphics processing; now they're basically general-purpose GPUs. And you can see there's a neat comparison here.
So this is basically if you look at a CPU with, say, four cores,
then you have a large cache, you have control infrastructure,
and you have your individual arithmetical logical units.
And on the GPU, you will have many of this
and very little kind of control infrastructure, less cache.
So it's really just you do everything in parallel.
All of these will basically execute.
I mean, on a modern GPU, you can somehow split it up.
But for simplicity reasons, you can think all of these
will do the same kind of computation,
maybe on different data items, but completely
in parallel in lockstep.
And with this, you get much higher throughput.
But you're not as flexible.
You cannot switch contexts as easily.
So that will basically take a long time.
But we can use this.
We can utilize this for data processing as well.
If we want to be more specialized, more tightly
catered to the exact program that we want,
we can use reprogrammable hardware.
So FPGAs, or field-programmable gate arrays, are a specific kind of chip constructed of logic units or logic blocks that you can reprogram: in the simplest form, you can program logic gates, AND or OR gates, etc., and directly wire up, from a programming point of view, your exact program.
And this is more like electrical engineering, right?
So you basically create a path through the CPU
or this processor that represents your program.
And that then makes it, of course, very fast.
It's a very different type of programming.
Of course, there's some abstractions that make it easier.
But it's much faster, though it has certain restrictions.
So the programming is slower, the reprogramming is slower,
and the layout, creating the layout for the chip is super slow.
So if you have a very large program that you want to place on the chip,
it might take days to compile the program, just to do basically the circuitry,
how you fit it on the FPGA.
But then it's highly efficient, it's highly parallel,
but somewhat hard to program.
Okay, and this is kind of an overview.
So the idea is, you see, there's many different hardwares.
There's trend in hardware.
And if we want to be efficient in data processing,
then we'll have to look at these trends.
We'll have to really look at the hardware,
and this is what we're going to do in this lecture.
So really see how the hardware works; and if the hardware works like this, how do we use it for data processing?
OK, so as a summary for today, I gave you
the introduction to the course, or let's say to the group,
and what else we do.
And here, as a quick reminder: there are also these competitions. So if you feel you don't have enough to do, look at the SIGMOD programming contest.
Then we talked about the course organization.
The slides are online in Moodle.
Also, we have lots of descriptions
about the course projects,
etc. So read through that. If there are still questions after that, feel free to send us a question by email. Ideally, if it's about course organization, etc., use the forum, because other people will surely have the same question, and we can answer it for everybody. Then we talked about exercises, et cetera, and then
I gave you a bit of motivation why it actually makes sense to think about hardware.
And the reason is that hardware is changing, and if we want to be efficient in data processing, then we also need to change our programming. Of course, this is also true for everything else, right? The hardware is changing not only for data processing; it's changing for any processing, and if you want to be efficient in any processing, you have to think about the hardware that you will be using.
So that makes sense in any case.
But in this course, besides looking at the hardware, we'll specifically look at the data processing parts.
Okay, so thank you very much for your attention.
Tomorrow, I'll talk about performance analysis.
And do we have questions so far?
No questions?
Very good.
Well, there's a question.
How are the dates for the individual sessions
determined?
Are these assigned randomly as well?
So the question was: how do we assign the individual feedback sessions, the individual interviews for the programming exercises?
And we select randomly.
So basically each of you, if you participate in the course, will get one of the programming
exercises. And then we'll find a block, maybe Wednesday
after the session or Tuesday after the session,
so right after the course, and ask you if this fits.
If it doesn't fit, we'll find another slot,
so something like this.
So closely after the programming task is done, we'll start with these interviews, let's say.
And some important aspects regarding Moodle: please also regularly check the announcement section, to which we will regularly post news.
And there is also the task zero that we mentioned before.
It's already online, so feel free to check it out and try to program it, because based on that you can better assess whether this course is for you or not. If you have a lot of trouble solving task zero, then this course might be quite hard for you. There's also a section for active programming participation. You can of course also just join the course, but if you want to get credits, you should do the tasks, and we also need to prepare the Docker setup for that. So it would be good to know as soon as possible who wants to do the tasks. Therefore, if you have already decided, just click the corresponding button in this section and let us know that you will participate in the task implementation.
But of course, there will be another announcement regarding this.
Yes?
So how hard would you say is task zero
compared to the tasks 1 to 4?
No, I mean, the tasks are not that much harder.
They're just a bit more, I would say.
Right?
So it's, I mean, if you can handle task 0,
most likely you can handle task 1, et cetera. So in the past, we kind of said, okay, you really need C++ programming, you really need to already know it. And then some of the feedback we got in EvaP was, well, it wasn't that hard after all, at least from some people. So now I'm a bit more careful about this.
So just try it.
If you can deal with task 0, you most likely
will also be able to deal with the other tasks.
And if you need help, just reach out.
And that's, of course, for everybody.
Okay. So, did you get the announcements? Whoever did, okay, perfect, then this works. Every now and then, at some point, I didn't correctly configure Moodle and people didn't get the announcements. But if that worked,
you will get all future announcements
and that means you will be informed
about all important things
around the course, hopefully in time. Okay.
So with that, I hope to see you tomorrow. Thank you very much.