Hardware-Conscious Data Processing (ST 2023) - tele-TASK - Introduction

Starting point is 00:00:00 So, welcome everybody to Hardware-Conscious Data Processing. Nice that you made it. So this is the first week, and we have this room today because of conflicting other events, as I said, but in the future we'll be in the other building. Tomorrow is also going to be here. And the idea is that we'll learn about computer architecture and how we use this architecture most efficiently in database systems. I teach this class in English, but if you have questions you don't feel confident asking in English, you can also ask them in German. We can basically also interact in German; it's just the lecture that I'm going to do all in English. But feel free to also practice your English while we're here. So this week is mostly introduction, especially today.

Starting point is 00:01:01 It's really motivation and introduction of the topic. And then we'll dig in deeper starting from next week. Okay, so as I said, I'm Tilman Rabl. I'll walk you through some things, and we will record all of the sessions and have them in tele-TASK. So for you to revisit this, or if you cannot make it in one of the weeks, then it's also there. We also have last year's course there. But if we ever should have a hybrid setup or something like this, then still feel free to end your video sharing, because we're not capturing this. And because I'm recording the lecture, there's no need for you to record the lecture, take any pictures, something like that. All of the slides, everything will be in Moodle.

Starting point is 00:01:57 So everything's there. There's no need for you to take pictures or something like that. And this is still a relatively new course. We prepared this last year in the summer, and so we will still change a lot. A lot of things worked out quite nicely. Some things I was not 100% happy with, so those we will change. Then also, of course, it's about current hardware and hardware trends. So this is something that's continuously evolving. So we also have to evolve the lecture and include new parts. And that means there's lots of new material, especially designed for you, right, to enjoy. And that means there might be errors here and there. If you find something like this, let us know. Ideally send me an email so I can fix it, and I'm very happy to do that, also for the

Starting point is 00:02:52 slides, etc. This also means sometimes you need some patience. Every now and then I'm not ready with everything well before the lecture, meaning the slides usually go online just before the lecture. So today's slides, for example, are already in Moodle. Tomorrow's slides are not there. And I will just update them as I go, basically. And always, any kind of actionable critique is highly appreciated. Actionable means you have something

Starting point is 00:03:27 you think we could do better, and you tell us how or what exactly we can do better. If you say this course stinks, well, I cannot do much about this. This is too vague for me to actually act on this. But if you have something specific that you would want to improve, or you think you have a good idea, let us know. I really highly appreciate this. And we

Starting point is 00:03:51 try to incorporate all these things all the time. Of course, there's a lot of diverging interests in how we do courses in general, and there we try to do the best. I mean, lectures are designed for you, and at least that's what I try to do, so you can really get the most out of it. Not everything is always, let's say, the easiest or the most pleasurable way to do things, but it's designed to be the most effective way for you to learn something. Just so you know.

Starting point is 00:04:32 So that's my intent. But anyway, any kind of criticism, feel free to shout it out. I'm always happy about this. Today, it's about our group, about our curriculum. So what else do we do? So you get to know us. Some of you I know already know us. So you can just sleep a few minutes.

Starting point is 00:04:53 And then when it's new content, you can start waking up again. Then we'll talk about the course organization. So how does this course work exactly? What is the idea? What is the grading, etc.? Most of this, or actually all of this, grading etc., is also in Moodle. If you find something that you're not sure about, we'll try to document this in Moodle as well. So all the content, timeline, textbooks, exercises, grading, registration,

Starting point is 00:05:27 and projects, etc. And then I'll talk about the motivation towards the end. So why do we do this course? And what is my learning goal for you? So that will be towards the end. So who am I? I'm Tilman Rabl, I already said this. I'm a professor here since 2019. Before I was at TU Berlin, before that I was in Toronto, and before that I was at the small University of Passau. And I always did databases and I kind of like complex systems. That's also why I really like this kind of course. I'm curious about hardware. I'm curious about benchmarking. And that's kind of, I've done this all through my career,

Starting point is 00:06:15 basically this kind of work. And because I like it myself, I want to teach it and I want to show you how this works. And then I do some other stuff as well. I'll point to this in a bit, or you get some details where it makes sense. Then this is my group currently. So if you see one of these people walking around at HPI, you can basically grab them. They should be able to tell you what we're up to, should be able

Starting point is 00:06:45 to tell you about this course and answer any kind of questions that you might have. And during this course, we'll mainly depend on these two people. I have them on a separate slide as well, to help you with the projects, etc. And otherwise, if you're curious, if you want to do something in research, always also reach out, right? At least for projects etc., there's always space. My research topics, I briefly touched on this already, or the group's research topics I want to say, are, well, not diverse. It's really database systems. So anything that has to do with how we process data in an efficient way and how we build systems, that's what I'm curious about. So this is all database systems. We always try to publish a lot at good conferences.

Starting point is 00:07:45 So if you're curious about our research, feel free to check it out. We do database systems on modern hardware or just database systems application, not application, database systems, really. And this is really where this course is, what this course is all about. So this kind of aspect. We also do stream processing. So how, if you have fast data coming in, how do we process this? On the fly, not really in a database, but just as it continuously is going.

Starting point is 00:08:20 We do machine learning systems, not machine learning. So how do we build a system that can do machine learning, train models, use models, et cetera, efficiently, not how do we build the best possible model for whatever. So not applications, but systems. And then also in all these aspects, and this is also important for this course, we do a lot of benchmarking and evaluation, meaning we're trying to figure out how fast these systems actually are, and where time goes if we use them. So if you have a certain system,

Starting point is 00:08:59 if you have a database, what is the performance this database can give you in sort of a realistic setup, and why does it have this performance? So where does this time go? This is really what this course is also a lot about. And I mean, I'll repeat this, if you're curious about this kind of stuff, feel free to reach out. And my approach for research is always kind of a multi-level approach. If I try to go to something new, like find something new, like attack a new angle, work with a new application, something like that. So I'll start with some kind of application scenario. So this could be actually a concrete application. I don't know, say something like a search engine or a medical application.
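The "where does time go" question can be illustrated with a very small timing harness. The following is only a sketch, not the course's tooling: the names `time_us`, `PhaseTimes`, and `run_toy_query` are hypothetical, and the "query" is a toy that fills a column and sums its even values. It just shows the idea of measuring each phase separately instead of one end-to-end number.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper: runs a callable once and returns the elapsed
// wall-clock time in microseconds, using a monotonic clock.
template <typename Fn>
std::int64_t time_us(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}

// Per-phase timings of the toy query, plus its result.
struct PhaseTimes {
    std::int64_t build_us = 0;  // time to materialize the column
    std::int64_t scan_us = 0;   // time to scan and aggregate it
    std::uint64_t result = 0;   // sum of all even values
};

// Toy "query" (an assumption for illustration): fill a column with
// 0..num_values-1, then sum the even values. Each phase is timed on its own.
PhaseTimes run_toy_query(std::size_t num_values) {
    assert(num_values > 0);
    std::vector<std::uint32_t> column;
    PhaseTimes times;
    times.build_us = time_us([&] {
        column.resize(num_values);
        std::iota(column.begin(), column.end(), 0u);
    });
    times.scan_us = time_us([&] {
        std::uint64_t sum = 0;
        for (std::uint32_t value : column) {
            if (value % 2 == 0) sum += value;
        }
        times.result = sum;
    });
    return times;
}
```

Calling `run_toy_query` with a large input and comparing `build_us` with `scan_us` gives a first, rough breakdown. A real analysis would repeat the runs and use a profiler, since a single wall-clock measurement is noisy.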

Starting point is 00:09:47 Could be more general, like a database system in general. And then what we do is we develop benchmarks. We try to figure out how fast a system actually is. So what are the applications, and what's a realistic setup for such a system? Then we actually benchmark the systems. We try to figure out how fast they actually are. And once we know this, we can basically see

Starting point is 00:10:16 how can we extend current systems? How can we improve current systems? And yeah, as I said, where does time go? We can exploit new hardware. So this, again, is what this course is about in many cases. And if this doesn't help, we can actually build new systems. So this is a quite nice way of doing research because this always has purpose, right? We always have a clear idea what we're talking about. It's not so much about having a, I don't know, miraculous idea and then later on, after a year, figuring out it didn't work out as well as we thought. So it's always very clearly determined. I always

Starting point is 00:10:57 have kind of, not really straight, but a clear idea where I want to go next. And what else do we do? So this is kind of an overview of everything if you want to do lectures with us. So every now and then, I'm involved in the database systems. This is Bachelor's, so it's not really interesting for you unless you've never heard a database course. Then one of these two might be interesting for you.

Starting point is 00:11:29 In the winter term, every second year, we have a lecture series. So this winter, again, we'll do the lecture series where we try to get a lot of quite interesting people to talk about their research. It's been industry and academia in the last iteration, and I actually like that quite a lot. We've been basically walking around through Germany, finding people who do nice database research and inviting them. And so this is probably going to happen again in the winter. So you can basically get an overview of the database community in

Starting point is 00:12:05 Germany, which is quite good. I mean the German database community is quite strong. We have a lot of, we have a good track record of research here. Then we have this course in the summer term, new since last year. We have big data systems every winter term. I know some of you already did this. We'll extend this winter semester, most likely, unless something weird happens. We'll have an add-on to this one to have more applications as well. And then we usually do some kind of project seminars on either machine learning systems topics or on modern hardware topics. And sometimes might also happen to have like a seminar in stream processing, something like this. So that's the current curriculum. And right now we have hardware conscious data processing and we have a seminar on machine learning systems.

Starting point is 00:13:09 So to give you a bit more detail. This is the lecture you're in. I'm not going to say anything about this, but we also have the seminar that's going to start today. So if you're curious about machine learning inference in database systems, so how we can use models in a database system efficiently, that's a seminar that's going to start today. It's a bit late, but it should be fun. And other than that, of course, we do bachelor projects and master projects. There's no master project this semester, but there might be one next semester. There's a question. Just a short question regarding the time slot: is this correct? Because it's not one of the usual HPI time slots.

Starting point is 00:14:00 Maybe I've messed it up. It should be on the website. So, Lawrence, can you check? While Lawrence checks, I'll continue elaborating on this part. We also do projects, meaning if you're curious, just doing research about something, not necessarily for credit points. I know few of you are interested in this kind of stuff, but every now and then somebody is. Feel free to reach out to us. Some stuff that's actually fun and useful are programming challenges. So there is, for example, the SIGMOD conference, which is the largest database conference. There is a student research competition. So if you do some research in databases,

Starting point is 00:14:49 this is something you can attend to and I'll be happy to mentor you. And if you get accepted, like we do collaborate on this, you get accepted, we'll pay the travel. It's an hour earlier. It's 3.15 to 4.45.

Starting point is 00:15:04 Okay, so then I missed this. Yeah, it's all wrong on our webpage. So maybe... I'll ping Ricardo. So this is Ricardo's work. So Ricardo will fix it and will hopefully send out a message on Moodle and put it on the website. Okay, so research competitions, there's also programming challenges at some of the conferences. So this is stuff that's fun, and it's a cool way of connecting with the community, learning about database systems, about programming, etc.

Starting point is 00:15:40 And if you pass the course, it's not going to be easy, but still you know what to do in order to participate in something like this. So the kind of programming that we do will give you good tools to work in this kind of area. Okay, so if you want to do this kind of research, I have a seven-step recipe for you. That's also my recipe for my PhD students. So if you want to do a research project, if you want to do a master project or a master thesis, at least from my experience, this is a smart way to go about it. So first of all, of course, you've got to know the literature, right?

Starting point is 00:16:34 So if you start from scratch, if you have limited idea, if you're not an expert already in the area, you really need to do a thorough research on the literature. So what have other people done? Because in any case, on anything you can think of, people already have done research. Still, in any angle that you can think of, there's still more research to be done. So we're never going to be able to finish research in that sense. There's always more to explore, there's always more to find out, be it systems, be it hardware, etc. So there's always stuff to do. But that's the first step. You have to figure out what's already there

Starting point is 00:17:09 in order not to reinvent the wheel all the time, because that's also happening a lot. Then we want to find out about a problem. So rather than having a neat idea of what we can do, I think it's much better to start with a problem. So basically I'm trying to find a challenge that I can solve, because then, if I want to do a project, I have no motivation problem. If I start with an idea, then somebody might say, yeah, this is actually a nice idea, but what is it good for?

Starting point is 00:17:47 And then you do a lot of programming, you do a lot of testing, you find out, well, it's a nice idea, but it doesn't really change much. It's more complicated than something else. The performance difference is marginal. And, well, your research is nice, but it doesn't really help anybody. If you start with a problem, then it's something you can solve. You have a clear agenda on what you want to solve. And the problem, even it might just be framing, right? The problem just might be the question, what is better, this or that, right?

Starting point is 00:18:21 So this is a clear thing, a clear question and a clear problem that you, a question you can answer, a problem that you can solve. While saying I have this idea, this might be good, might actually be a wrong hypothesis from the get-go. So starting with identifying a research problem is actually a neat way of going about this. And then you can describe your novel solution to this problem. It might just be comparing it to existing solutions. And you perform so-called back-of-the-envelope calculations. So we'll talk about this in the second to next session on how do we figure out quickly if my idea, if it's about performance, if my idea can actually work or not, even if my idea is viable in a realistic setup.
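A back-of-the-envelope calculation of this kind can be written down in a few lines. This is a minimal sketch under loudly stated assumptions: the function names `scan_seconds` and `is_viable` are made up for illustration, and the example numbers in the note below (data size, bandwidth, latency budget) are round placeholders, not measurements of any real machine.

```cpp
#include <cassert>

// Estimated time in seconds to read `bytes` exactly once, assuming the
// scan is limited only by a sustained bandwidth given in bytes per second.
double scan_seconds(double bytes, double bandwidth_bytes_per_s) {
    assert(bandwidth_bytes_per_s > 0.0);
    return bytes / bandwidth_bytes_per_s;
}

// True if the estimated scan time fits into the given latency budget.
// This ignores compute cost, caching, and parallelism on purpose: it is
// a quick plausibility check, not a performance model.
bool is_viable(double bytes, double bandwidth_bytes_per_s, double budget_s) {
    return scan_seconds(bytes, bandwidth_bytes_per_s) <= budget_s;
}
```

For example, with an assumed one billion 8-byte values (8e9 bytes) and an assumed bandwidth of 10e9 bytes per second, `scan_seconds(8e9, 10e9)` comes out at about 0.8 seconds. An idea that needs the answer within one second would pass this check, while one that needs it within half a second would not, before any code for the actual system has been written.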

Starting point is 00:19:16 And that's a back-of-the-envelope calculation, just quickly figuring out, does this make sense or not? Then we do actual experiments, we figure out if it works or not, and then we write up the report. And if we do this for a publication, well, then there's a lot of stuff even after this. So then there's a lot of revision cycles, etc. You have to present the work, etc. That's additional work, but again it's kind of straightforward, and if you do this kind of approach you're never in a dead end anyway. Well, if you're not doing proper literature research, eventually you will figure out that somebody has already done this. Then you have to shift focus a little bit, and it's going to be more tricky again. OK, so questions so far? Not so much on this.

Starting point is 00:20:08 But then coming to the course logistics, so those of you who already heard this, the previous part now will wake up. And we're talking about this course. So this course is today and tomorrow here. Then it will be in the lecture hall in building L. So this is a quite nice lecture hall. If we're going to just be five people in the end

Starting point is 00:20:33 or something like that, like with this group, it's fine. If we're going to be very small, we might just relocate it. But usually it's going to be there. Every now and then something happens. I have to do a presentation. I have to attend something else, whatever. There's a conference. Then we might do it online or I might use a video from previous year.

Starting point is 00:20:59 Shouldn't be the case, but can happen. Check Moodle for any kind of details, but I also send out announcements. So just like I did today, for example. And there's questions. The labs are on Tuesdays. Sorry? The labs are on Tuesdays if we do them.

Starting point is 00:21:18 We changed that last week. OK. Well, labs, true. So the lecture is Tuesdays and Wednesdays. The labs, as we changed it last week, exactly, will be on Tuesdays mostly. You'll see the detailed agenda. Hopefully I've updated it here; I think I have. If you have questions, send me an email. That's the easiest way; I usually answer emails.

Starting point is 00:21:52 So that's if you have a question for me. If you have a question about the course, put it in Moodle, because then everybody else also benefits from that. Most of the questions you have, other people will also have. And we're basically checking Moodle, not just every now and then. You're getting the Moodle information like you're getting emails. So we're trying to update this, not necessarily in the same hour, but usually within a day or so. I'm trying to not do this at night, but every now and then it doesn't

Starting point is 00:22:26 help, so then I have to do it at night. You can also come to my office if you have a question. If it's something that takes half an hour, we need to make an appointment, and that means send me an email or ask me after the lecture to make an appointment. And then we can discuss. If it's something that takes five minutes, you can always come to my office. If my door is open, you can just basically knock, briefly say, hi, I have this question, and come in. That's as long as I can. And I've done it this way since I've been here.

Starting point is 00:23:01 And so far, it's not been a problem. So if you have a question, feel free to drop by asking a question. Also, Lawrence's office and Marcel's office are in front of mine. Basically, if you walk in, you will pass them first. So some of the questions they can already handle before you come to me.

Starting point is 00:23:21 Exactly. So then an overview of the course contents. This is still rough. You will see more details in a bit. So we'll start with the introduction part. This is the first three sessions. We also have kind of an intro-y part somewhere in the middle here that Lawrence will give you. So first we'll start with the introduction today, then give a quick recap on database terminology, basically, just so that if I talk about a query optimizer or buffer manager, stuff like that, you know what I'm talking about. Then we do this performance analysis, where you also hear about back-of-the-envelope calculations.

Starting point is 00:24:07 We have an additional topic new this year, which is profiling, which kind of is also intro-y, but it will be a bit later. So in two or three weeks from now, just to give you a bit more details about how do you actually work with the code, etc. But I would also say this is kind of the intro part. Then we'll start with the CPU. And we'll start with kind of the CPU package and a single core actually. So just talking about how does the CPU work, how does the CPU, how do the caches work, the individual parts, how are they structured, also on different kind of hardware, not complete, because I mean if I do just this exhaustively, we will just talk about this in the course. So it's just about to give you enough details to explore further, right? So you basically know that there are caches, they behave this and this way. If

Starting point is 00:25:05 you're going to a new architecture, you will be able to find out how this architecture will work. So that's the kind of idea. But we'll talk first about the single core and the caches, how this works, how the instruction execution works. Then we'll start exploring multiple cores, how does the single CPU kind of work, how is the interaction with DRAM, then what about multiple CPUs, so what are the effects as soon as we have multiple sockets in our system. Then we'll also talk about persistent memory. This is a technology that was super hot for say two or three years. We did a lot of research on that recently. It's been discontinued by Intel, which is a shame,

Starting point is 00:25:49 but there will be new persistent memory eventually. So this is something that right now has kind of a legacy status, let's say. But it's still interesting, so that's why I didn't kick it out yet. Then we'll start going out of the CPU package. So the DRAM is still connected directly to the CPU. As soon as you're talking about storage etc., that will go through PCI Express today. Basically everything except for very few

Starting point is 00:26:21 very special architectures you will always have PCI Express and then we'll connect to this. So we'll talk about storage. We'll talk about network, different kinds of networks, different kind of interfaces, let's say, and how to use them. Then we'll also talk about GPUs. How do we use GPUs for data processing, single GPUs, multiple GPUs?

Starting point is 00:26:48 And finally, not finally, second to finally, we'll talk about FPGAs. So, reconfigurable hardware. How does this work? And then really finally, and this is all new, we'll talk about CXL. So, Compute Express Link is a new interconnect. It's actually a protocol, and a new type of hardware that supports this protocol, that sits on top of PCI Express and will make a couple of new kinds of systems possible, just by having cache coherency,

Starting point is 00:27:22 different kinds of accesses to the hardware through PCI Express. This is still super hot, super novel. That's why we keep it towards the end, because we don't know how it will actually evolve. It might just be the new thing that basically replaces some of the parts here in the middle, it might just die off. And then I'm not going to be super dependent on it in the lecture. We'll see about it. Okay, so what do I think you should learn from this? So what are my learning goals? I mean the first and foremost thing is efficient data processing, right? So this is what my research is about.

Starting point is 00:28:08 So this is also, of course, what I want to excite you about, to basically learn about efficient data processing on current CPU architectures. So whatever is out there right now, what's the next generation, how do we utilize this? Because current database systems that you're using, say MySQL, Postgres, whatever you can think of, even industry architectures, will not be able to get the full kick out of modern hardware or out of current CPUs. So we'll look into this, how do we actually get there,

Starting point is 00:28:47 what, how can we use this efficiently. We'll also look at current memory architectures. We'll look at how we can do this in parallel, so how parallelization works in the system in a distributed fashion, meaning multiple processors mostly, not necessarily multiple servers, but as soon as we're talking about networking, also multiple servers. And then also accelerators. So what if we have a heterogeneous setup? Currently people are thinking the new trend is going towards heterogeneity. So building new types of processors,

Starting point is 00:29:27 new types of accelerators for all kinds of things. And that's where you basically will figure out, if we do data processing, how can we use those? Where does it make sense? And the idea is to also give you the opportunity to experience this yourself. So the programming tasks will be mostly about these top three parts, meaning we're doing programming.

Starting point is 00:29:49 We're looking at the hardware in detail. We're not so much going into the accelerators. This is mainly because it's super stressful to set this up, and we don't have enough hardware to make this easily available to you. So that's the one reason why this part is not a strong focus of the programming tasks. But anyway, if you're interested, we have all this stuff here. So everything that we're talking about is available.

Starting point is 00:30:19 Meaning if you're curious, if you want to do something there, reach out and we'll make it happen. So that will be in the, let's say, hands-on part. However, let's say in parallel or orthogonal to that, this course is also a lot about computer architecture. So if you want to learn how a computer is structured, we'll talk about a lot of this. Of course, not holistically. This is not a pure computer architecture course.

Starting point is 00:30:51 But you get an understanding of how a CPU works, how instruction execution works, how accelerators work, et cetera. And then you will learn, as I already said, about data processing. You will learn about databases to some degree. It's not a holistic database course. So we're not talking about the complete database architecture, because there's a special course for this, Database Systems II.

Starting point is 00:31:18 But we're talking about, if you already know this, kind of how to move this there. For some of the parts, how can we actually make them super fast? Some other parts we're not covering, so this is not exhaustive. And you'll also learn, and this is more with the hands-on part, so for yourself, efficient C++ programming. Last year we kind of, let's say, scared away a couple of people by emphasizing the C++ programming. In the end, we got the feedback: you shouldn't have scared us so much, because it wasn't actually that hard. But it's going to be C++ programming. So you have to do some C++ programming to pass the course. But it's not that you have to be already

Starting point is 00:32:06 godlike coders to do this. So it's not super hard, but it's C++. But doing this, the projects are reasonably sized. So you'll learn a lot about programming. And in general, for me, this is always important. In any course that I teach, at least any lecture that I teach, you will have to program, because I think this is a skill

Starting point is 00:32:31 that will bring you very far right so whenever you want to have a job later on or wherever you want to have a job later on if you can program well you're going to get or your opportunities will be much bigger. You have much more different opportunities than if you cannot program. So that's why I think this is super useful. And it's not only about C++ programming. Of course, it's always about programming in general, but also about looking a bit deeper into the hardware. How can we actually do something efficiently with hardware? Okay, so with this quick rundown on the agenda, so I'll repeat this basically almost every week, you will see this, just to give you kind of an update where we are and what's going to happen. And every now and then, so this is subject to change.

Starting point is 00:33:30 If I'm sick, which rarely happens but might happen, then something might be moved, or if something else accidentally happens. If I'm way too slow one week, so I'm getting all caught up in talking about something or we start discussing, then things might shift a bit here and there. And we have some buffer, so no worries about this. So we're in the introduction. We're going to do basics and then instruction execution, then already parallelizing with SIMD. And then we start with the first task, which will already touch on SIMD. And you can see it's kind of alternating. Every now and then we'll have a task, and later on we'll start having Q&A sessions, meaning once I've covered the basics, so the CPU package and the caching, etc., in more

Starting point is 00:34:35 detail, we have a bit more time for you to also discuss about the larger programming tasks. So this will be the first time hopefully in June. We'll have a Q&A session. If I'm slow, we'll extend some of the lecture into here. But then there will still be some Q&A. Of course, if you have questions, there's always time for questions somewhere, or we make it happen. Then in the second part, basically after this first part, we've covered all of the CPU parts. So we basically have the complete CPU package,

Starting point is 00:35:14 everything that's on the main board, essentially. Then we're going out to the peripherals, which is basically all everything that's kind of plugged into the main board besides the RAM and the CPUs, which is, well, the system memory to some degree, then the storage, so the disks, the network, the GPUs, the FPGAs. We'll have an invited talk. I've not decided yet who I want to invite, but we'll find somebody who does research, probably industry research on this kind of data processing. So using modern hardware in their system and we'll have them probably talk here, but this is always tricky. So this might move somewhere. So usually as soon

Starting point is 00:36:06 as I invite them, I will announce who I'm asking, and then they will basically not have time here. They will have time somewhere else, in one of the other slots. This basically messes up all my calendar, but it's going to be fine because we have a couple of Q&A sessions and we're always flexible. And then towards the end, we have kind of a summary session. And if you're curious, some of you I know already visited, but for those of you who've never been, we can go to the HPI data center and check out the hardware there. So you can touch the servers, make the admins go crazy and see basically,

Starting point is 00:36:49 I mean, you don't see that much, but you can already get kind of an idea about how large are these servers, the networking, et cetera, the cabling. And I mean, for me, most interesting about the size of the data center itself. So, I mean, the server room itself is basically this size, and the rest is just infrastructure

Starting point is 00:37:10 for the building, et cetera. And we'll also walk into the cooling area so you can see what that looks like. So I think it's quite interesting, but I've already seen it a couple of times. So I'll do this if you're curious about it and want to see this. Okay, so grading in a nutshell. Very simple. We have four graded programming tasks. This

Starting point is 00:37:47 is your grade basically. Each of those are 24, this doesn't make sense, 25% of the final grade. No, not true. 20% of the final grade. So that's, sorry, we have four graded programming tasks. Each are 20%. You must pass all of them. And for one of them, we'll basically ask you to give an individual presentation just to us. So it's more like an interview, just to make sure that you actually coded this yourself. So that's all that we want to see. Because, of course, we know there's GitHub Copilot. We know that ChatGPT can do some programming now, et cetera. And of course, you might have friends who already did the course, or you find some online repository that does something similar.

Starting point is 00:38:30 So that's why we ask you to explain your code, essentially, not more, not less. And then we also have a leaderboard and some bonus points for the fastest solutions. This is just on top. It's just to gamify this a bit. Yes? Is this like the fastest solution in the sense that the first person that submits it? No. Or the first person that submits it? Yeah, the best performance. Yes. I mean, that would actually also be nice for some additional bonus points, but that would be stressful. So I don't like things to be stressful.

Starting point is 00:39:08 So there's enough time for doing the programming. At the same time, the tasks are difficult enough that if you try to do them the last day before the submission, you're going to fail. Very simple. So it makes sense to start early. But we try to somehow set this up for you to be able to do so as well. So most of it is already ready. We've done most of the, not most of the, half of the exercises last year already. So it's clear how it goes. But again, you will have to explain your code.

Starting point is 00:39:46 So if you cheat, you fail. Very simple. There's no exam. This also means for the programming, of course, not all of the lecture is relevant. Still, I really do this lecture because I think this stuff is really interesting. And it helps you a lot later on. If you do some kind of programming, if you work with systems in one way or the other,

Starting point is 00:40:10 meaning if you're just in a position where you're using systems, knowing how they work and knowing how they perform on hardware gives you so much more capabilities in using them efficiently that I think all of this makes a lot of sense to know about. And so that's why I think attending the lecture also makes sense. At the same time, I'll be honest, the programming covers part of the lecture,

Starting point is 00:40:40 and that's gonna be your grade. The tasks, we have four tasks. Some of them are fixed, some of them are still in flux and the first one I didn't update yet. So initially we had something else, we had a SIMD scan, then I thought we're going to do a binary tree. We're not going to do a binary tree now, we're going to do something else. I have updated this on the next slide. The second one that we've already done, that's an adaptive radix tree. It's a data structure that a person from TU Munich invented and published and that's actually quite a nice data structure. So it's the so-called ART. It's actually a trie data structure that's used in some real database systems nowadays.
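To make the trie idea a bit more concrete, here is a minimal C++ sketch. It is not from the lecture and it is not the ART itself: it is a plain byte-wise radix trie with a fixed fanout of 256, and the names `TrieNode` and `ByteTrie` are made up for illustration. The adaptive part of an ART, as far as the name suggests, is about not paying for 256 child slots in every node.

```cpp
#include <array>
#include <cstdint>
#include <memory>
#include <optional>
#include <string_view>

// Plain radix trie: one child slot per possible key byte.
// NOTE: illustrative only. This is NOT an adaptive radix tree; it always
// allocates 256 child pointers per node, which is the memory overhead
// that an adaptive node layout is meant to avoid.
struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 256> children{};
    std::optional<std::uint64_t> value;  // set if a key ends at this node
};

class ByteTrie {
public:
    // Walks the key byte by byte and creates missing nodes on the way.
    void insert(std::string_view key, std::uint64_t value) {
        TrieNode* node = &root_;
        for (unsigned char byte : key) {
            auto& child = node->children[byte];
            if (!child) child = std::make_unique<TrieNode>();
            node = child.get();
        }
        node->value = value;
    }

    // Returns the stored value, or std::nullopt if the key is absent.
    std::optional<std::uint64_t> lookup(std::string_view key) const {
        const TrieNode* node = &root_;
        for (unsigned char byte : key) {
            node = node->children[byte].get();
            if (!node) return std::nullopt;
        }
        return node->value;
    }

private:
    TrieNode root_;
};
```

A lookup touches one node per key byte, so the cost depends on the key length rather than on the number of stored keys.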

Starting point is 00:41:31 And it kind of gives you a nice view on how these data structures are built and learning them in depth. Then we'll have some kind of buffer manager with some kind of locking. We're still figuring this part out and we have a lock-free hash table. That's even more things to figure out yet. So the first, but still on top of that we have a task zero and that's basically just for you to try

Starting point is 00:42:01 out. So we have a concurrent linked list that's already on the website. You can basically download it, the framework, et cetera. You can see, you can set up your environment. And that's actually something that I would say, even if you don't intend to solve this in full, just setting up your environment with this one makes a lot of sense, because then you have everything that you will need later on in the course. And if you have a very hard time solving this, if you have no idea how to solve this, the next part will be even harder than that one, of course. I mean, still, it's a linked list, so you should have seen linked lists before in your life. So this is really, well, you

Starting point is 00:42:47 see if this course is for you. You set up a C++ environment. Still, if you're struggling, no problem, right? Talk to us, we'll try to help you through this. You'll learn running while you're walking or walking while you're crawling, whatever your current stage is, right? But that's fine, right? So this is, I mean, we actually want to include everybody here. So if you have fun in this, we'll try to help you with this. This is already online.

Starting point is 00:43:20 Still, if you have no idea, it's going to be hard. Let's say if you don't manage to set this up at all, it's going to be hard. It's not going to be possible. It's very simple. Okay, then the first task, the first real task is SIMD compression. So we're talking about specialized parts of the CPU, instruction sets and execution units in the CPU that we're using to speed up processing. And this is done in databases and all kinds of systems. All modern processors have this, these kind of SIMD units, and we'll use it for compression, for Delta compression in this case, but the task is not 100% finished. So we're still working on this,

Starting point is 00:44:11 but it's basically in memory compression. You learn about this kind of hard, close to hardware programming, different kind of aspects, how you can program this kind of hardware and navigate, let's say, complex instruction sets as well. So this is not a super hard task. It's just figuring out how to use it is basically the challenge. How do we use this kind of instruction sets? That's I would say. And the approximate start date will be

Starting point is 00:44:48 early May. So we will have a session on this where then Lawrence will explain how to set it up, etc., and give you the steps. And we have a quite neat setup with Google Classroom that usually works quite well. So if you've done a course with us, this usually works quite well. So you can basically upload everything. We can test everything quite nicely. Second task is the radix tree, very nice data structure, highly recommended.
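Going back to the first task for a moment: the delta compression mentioned there can be sketched in a few lines of plain C++. This is a scalar version under my own assumptions (32-bit unsigned values, hypothetical function names `delta_encode` and `delta_decode`), not the task's framework and not SIMD code. The SIMD variant would do the same subtraction on several values per instruction.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar delta encoding: store each value as the difference to its
// predecessor. For sorted or slowly changing data the deltas are small,
// so a later step (not shown here) can store them in fewer bits.
std::vector<std::uint32_t> delta_encode(const std::vector<std::uint32_t>& values) {
    std::vector<std::uint32_t> deltas(values.size());
    std::uint32_t previous = 0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        // Unsigned arithmetic wraps around, so this stays reversible
        // even if the input is not sorted.
        deltas[i] = values[i] - previous;
        previous = values[i];
    }
    return deltas;
}

// Decoding is a running (prefix) sum over the deltas.
std::vector<std::uint32_t> delta_decode(const std::vector<std::uint32_t>& deltas) {
    std::vector<std::uint32_t> values(deltas.size());
    std::uint32_t running = 0;
    for (std::size_t i = 0; i < deltas.size(); ++i) {
        running += deltas[i];
        values[i] = running;
    }
    return values;
}
```

For example, the values 100, 103, 104, 110 become the deltas 100, 3, 1, 6. Note that the decode loop carries a dependency from one element to the next, which is one reason the vectorized version is less obvious than the scalar one.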

Starting point is 00:45:18 So just doing this is basically, on the one hand, you'll enjoy doing this kind of data structure, learning about the data structure and building something like this, because this gives kind of the ideas behind this and setting up this kind of data structure, creating this kind of data structure, gives you new perspectives on how to design programs, how to design data structure. So that's why I really like this. Then, as we said, we'll have the buffer manager that gives you an idea how to go out of the caches,

Starting point is 00:45:56 how to go out of RAM even. What if we have to deal with disk accesses, which will happen if you have large data, and which also will happen if you have to persist data. So if, in one way or the other, you have to deal with some kind of data that you don't want to lose, you will have to deal with disk access, and this task gives you an idea. We'll start in late June. And then probably, so this is the most shaky one,

Starting point is 00:46:33 but I don't know how far we are. We'll discuss this soon. We will give you more details about it. So the idea was to have a task about a lock-free data structure, a lock-free hash table actually. But you learn how to program this, right? So this is a bit more complex to code, but then a very efficient data structure. So this is very, let's say, database specific. Well, I mean you can also use it in other code, but if you want some very highly efficient code,

Starting point is 00:47:13 then you might invest in a lock-free data structure. In many code bases you won't find this just because it's complex to program. And we'll start in early July with this. And then we're going to be done towards the end of the semester. So you're basically free to enjoy the semester, no exam that you have to learn for, and just basically the feedback session

Starting point is 00:47:43 where you have to explain the code at the end. Not at the end, we'll basically randomly assign you to explain one of the programming tasks. And then if you're lucky, you have an earlier one, or it depends on which one you like most, basically. But it's not going to be, oh well, next week you have to do this or something. We'll assign this very early on. As soon as we know who's going to attend, you will basically get one of the tasks assigned and you

Starting point is 00:48:17 know when you will have this session where you do the feedback or where you explain the task okay there's a lot of literature for this course. Usually the main literature for a single lecture I'll have in the beginning of the slides. Most of the books we have upstairs in our library as well, which is a, well, sorry excuse for a shelf with books. But we have some of them at least there. Some of them are somewhere scattered across different offices, but if you don't find them in the library, you can definitely

Starting point is 00:48:53 come check them out. We'll have them there. If you don't have enough, we'll just buy more. And usually the ones that I point out are also the ones that I think are worth reading. So say, for example, this structured computer organization or computer architecture books, these are actually really nice books. So there you learn a lot if you read them, for example. And the rest you can see. Some of them you can also find online. So freely available in one way or the other. Yeah, I already said this, right? So the slides are available in Moodle

Starting point is 00:49:38 or will be available in Moodle. We also have last year's course. You will find last year's slides. I always update something. So this means the slides from last year probably never will be 100% the same as the slides this year. Still, if you want to take a peek, you can also check out last year's slides. Otherwise, be patient, right? So the slides will be there once I'm ready. Sometimes there's an error, especially if I fix something last minute.

Starting point is 00:50:11 I might just put in more bugs. If you find something like this, notify me. Sometimes I just do it while I'm actually in the lecture. I'm going to fix it. There is like a really stupid mistake that might mainly confuse me, basically, while I'm trying to explain it. Still, if you find something, feel free to send me an email. I'll fix it.

Starting point is 00:50:37 The core of the labs will be taught by Lawrence and myself. So please stand up and wave so everybody sees you, right? So these two guys will help you with the labs and the project. And they also will have an idea about the course. So they basically know about the topics, et cetera. So you can also ask them about these things if I'm not available, which rarely happens. Then a few reminders for most of you probably.

Starting point is 00:51:13 I like discussions, right? So if you have a question, feel free, and I'm completely happy if we're not finishing a lecture in time, then we're just going to continue in the next lecture. So if this is an open discussion, ask questions. It's actually good. Sometimes I actually will ask questions to you just to make you more comfortable here speaking in this audience, right? So I think this is really helpful. At least for me, it was always very helpful while I was studying. If it's more of a discussion rather than a person in the front just talking, I fall asleep, right? So, and this is something I'm also not, you know, I'm not angry if somebody falls asleep

Starting point is 00:51:54 because I used to fall asleep all the time. So that's perfectly fine. But if you engage in a communication and discussion, you feel much more alert. Also discuss questions with each other. There's no exam, so forget this. But submit your projects individually. You can talk about the projects, but don't show each other code. I mean, of course, you can if there's something,

Starting point is 00:52:22 some stupid mistakes that you don't can't figure out, some code snippets, et cetera, again, fine. But if you have the same kind of code structure, we run a plagiarism checker. So basically code checker. If the code is the same, you fail. Very simple. And there's no debate or nothing, right? Explain your solutions. Don't share code.

Starting point is 00:52:44 Any kind of dishonesty means you failed the course. And I mean, I also don't, you know, I'm not angry about it or nothing, but that's basically it, I don't discuss this. Also don't break the programming set up. So this is something, it's clear how you're supposed to use it. If you try to cheat or if you try to find out how you can break our stuff, well, again, you're

Starting point is 00:53:12 going to fail because it's just going to make a lot more work for us. Again, if you find something that's not stable or something, talk to us. We're always open to any kind of discussion. We're open to any kind of improvement for whatever we're doing. But if you break it, if you try to hack it, if you break our hardware, you will fail, of course. Very simple. That's just the basic rules.

Starting point is 00:53:41 Again, if it's an honest mistake, no problem. But if you notice you try to somehow destroy something, you try to cheat or whatever, it's a problem. Then communication. I like emails. I get a lot of emails. But it's good to learn how to write emails. Just be nice in emails because

Starting point is 00:54:06 emails, all kinds of online and written communication, are hard to interpret, basically. You will always receive it, I don't know if you noticed yourself, in the mood that you're in yourself right now. So if somebody writes something that kind of sounds angry, if you're angry yourself, you will read it much worse than if you're in a good mood. And anything that's kind of critical already, that will be received much worse than you think you actually write it. So in general, it's good to just learn how to write

Starting point is 00:54:46 professional emails. This is, it's not super serious, but it's actually a nice structure on how to write professional emails. So the link, I find it quite nice because it also somehow reflects how I write emails. So that's why probably I like it. If somebody had used a completely different structure, I would not show this link. Anyway, you can check this out. So try to use netiquette in all kinds of online setups, even when chatting, etc. And not just to me or to Lawrence and myself, but also to each other. So this is really important. You should treat everyone with respect and consideration. More important in an online setup, but also in this course.

Starting point is 00:55:39 So if somebody doesn't know how to program this or that, that's not a shame or anything. We're here to help each other. This is a small course usually, so it should be a safe space for everyone. I'm serious about this stuff. So if you're unhappy in one way or the other, let me know. And also, if you feel something doesn't work out, or let's put it in a more positive way, if you have an idea how to improve this and make it more inclusive for you, for everybody else, let us know. I mean, I'm really happy if you enjoy this course, and this is what it's all about, right? I have to spend a lot of my time in teaching and in research, and if I don't have fun with it, then, well, I have the wrong job, right? So this is really

Starting point is 00:56:31 this is supposed to be fun for you, for me, for everybody. You register through Moodle. Of course you also have to do the official registration, whatever, but we only see what Moodle gives us. Eventually I will get a list from the Studienreferat with a lot of names. But for us, this is mainly, especially for the programming tasks, this is important. So if we have this, you will be part of the course. And then we can also deal with other stuff with the Studienreferat later in case you messed something up there. All the slides, all the resources, everything will be linked from there. If we're missing something, I will put it in there. So Moodle is kind of the base for everything. The HPI website, the general website, it's just

Starting point is 00:57:23 marketing, whatever. So you basically know where to find Moodle and all the contents are in the link. User forum, it's really helpful. We try to answer a lot. I hope we've configured it correctly so that we see it, because that does not necessarily always happen. But if you notice something gets lost, again, notify us, et cetera. Also, in all kinds of courses that do data processing, build systems, et cetera, you have to kind of think about the ethical considerations or ethical consequences of your work. So if you're building a system, if you're building an application,

Starting point is 00:58:07 whatever, you will have ideally at least some kind of impact. You will have users. And this means you have a lot of power. And this means there also is a lot of responsibility that comes with this, at least I think so. So a lot of people will tell you, well, this is just an algorithm. It depends on the data that comes in or, well, this is just a system. It depends on how you use it. I don't think so. Right. This is really, you can try to design safe systems. You can think about the uses of your system and you can make sure not necessarily, you cannot, of course,

Starting point is 00:58:44 always make sure that your system is not misused, but you can try to build your system in a safe way and guide people to use it in a safe way. So this is true: not everything is a tool. Some things are more like a weapon or something unsafe, and we want to build safe tools that are good to use. And everybody in the process, not only the users, everybody, also the people that build systems, need to be aware of the consequences that mistakes, misuse, and manipulation might have. So if we're building something, let's build something that's good for people,

Starting point is 00:59:26 not bad for people. And it's not only that people, like most people are not malicious, right? So not like if we build a database, we're not building it as a weapon. At least most people won't. However, a lot of systems are not safe to use. And if they're not safe to use, we should make sure people know this, right? If we're building a prototype, make sure we're building it or we're not selling it as a final product because otherwise things will go wrong in one way or the other.

Starting point is 00:59:59 So make sure that the goals for your systems, the use cases are on the one hand safe, and I think they also should be morally sound. So at least from my perspective, I always hope that if I teach you something, use it for making the world a bit better rather than making it a bit worse. So a lot of, let's say, machine learning use cases that we have to deal with, or even database use cases. I don't want to bash on other people. A lot of database use cases are not necessarily in the best intention for everybody. So let's not optimize our systems for those.

Starting point is 01:00:41 Let's optimize for those where people actually, most people, or hopefully all people benefit in one way or the other. And with that, also important for me, completely different topic, unrelated to the lecture, but in general, there's a couple of people that you can talk to if you feel things are not going well in general or in this course or in your studies or in your life. On the one hand, Oliver Carl and myself are ombudspersons at HPI. So if there's some problem with scientific conduct, so good scientific practice, that you feel you want to talk about, so how your thesis is going, how your project is going, how your supervision is going, whatever, scientifically.

Starting point is 01:01:31 Then you can talk to us and we can basically help you and try to guide you and think about ways or explain what we think good ways are, how to solve this. There's an email address you can also write to me directly. Eventually, I want to set up an anonymous form or something, but I've never gotten around. So eventually, there might also be some kind of anonymous way. But usually, ideally, you're not trying to solve other people's problems, but your own problems.

Starting point is 01:02:03 So if you have a problem with this, feel free to reach out. This is as ombudsperson. So this is basically by the DFG, so the German Research Foundation. We have to have this at HPI and it's private, meaning we're not going to disclose anything that you're discussing unless you want us to. If you have equality issues, there is the GBAs and the Frauenbeauftragte at HPI, so there's one email address here. There's also, I don't know if you've seen that, there's also psychological counseling.

Starting point is 01:02:40 So I mean, through Corona, everything's gotten a bit harder for everybody. And if you've been struck by anything like this, there's a special HPI counseling hotline that's free of charge, 8 a.m. to 8 p.m., Monday to Saturday. And there's also an email address that you can reach out to. And if you don't believe in HPI or you feel this might not be safe enough for you, you can also go to the nightline in Potsdam. So that's another option that you have where you can reach out. So this is a slide that they always give me

Starting point is 01:03:18 where you can also have a chat or send an email or even call them and figure something out if you have problems. And this is really, I mean, this is really for you as support. Of course, you can also always talk to me. With any kind of academic or scientific things, I can probably help or at least give my opinion. But as soon as it's more like general life troubles, feel free to use this, right? So this is really for this kind of cases. Everybody's struck by this every now and then, and if you feel this is helpful, try it out.

Starting point is 01:04:06 I encourage these kinds of things. And on this bright note, we'll have a quick, let's say, three-minute, four-minute break. So I usually do breaks in my lectures just to get kind of catch a breath, bring something. You can get up if you need a washroom break, take a washroom break, and then we'll continue and do a bit of motivation towards the end. So everybody ready again? Very good. So I hope you like this kind of break. I find it's quite helpful to get a bit of focus back again. Okay, so last few minutes, we're going to talk about motivation.

Starting point is 01:04:49 So why this course? Why think about hardware in general if you're doing data processing? And well, if we think about the classic database architecture, if you look at the classical, let's say cache, et cetera, hierarchy, storage hierarchy, we're having registers, we're having caches.

Starting point is 01:05:15 So we have the registers, where basically data is located on the CPU if we're processing it. Then we have the caches, where data is still located on the die but not in the processing units, just caching data in typically multiple levels, so one to four additional levels of caches. Then we have the main memory. So we're talking about nanoseconds here, we're talking about tens to hundreds of nanoseconds here. We're still in the hundreds of nanoseconds if we're talking about main memory. And all of a sudden, in a classical database system, like classical hardware, we have hard disk, right?

Starting point is 01:05:57 The spinning hard drive. And the spinning hard drive is in the milliseconds, the access. So this is a 10 to the power of five access gap. So this is way, way, way slower. And this means everything, like data used to be, or the main memory used to be small. Data was larger than the main memory. So this means all of the time, we have to access this.

Starting point is 01:06:28 So all of the database architecture is just about this. All of the classical database architecture is just revolving around this access time gap. So how do we make the disk access fast? Because everything up here doesn't matter. If this is so slow, right, if the hard drive is so slow, then the CPU is just waiting for the hard drive anyway. So if we spend a couple of extra cycles here and there, a couple of hundreds of cycles. So I mean, the cycles are in the nanoseconds. So this means we're spending thousands, tens of thousands,

Starting point is 01:07:03 even more of cycles just waiting basically. And this means how efficient we're up here doesn't really matter. And this didn't matter at all for a long time. How we use the CPU in the database system. And a lot of database architecture just reflects this. Meaning we're doing a lot of nice software architecture up here, making it, let's say, nicely composable, etc. Easy data structure, so it's more easy to manage the code, basically. We don't really deal with performance up here, but we're dealing with performance down here, making sure that this part is optimized as good as possible.
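The access gap and the waiting cycles can be put into numbers with a bit of arithmetic. The following C++ sketch uses round, assumed values that are in the ranges mentioned above (100 ns for main memory, 10 ms for a spinning disk, a 3 GHz clock). They are illustrative, not measurements, and the constant and function names are made up.

```cpp
// Assumed, illustrative numbers (orders of magnitude, not measurements).
constexpr double kMemoryAccessNs = 100.0;  // main memory: ~hundreds of ns
constexpr double kDiskAccessNs   = 1e7;    // spinning disk: ~10 ms
constexpr double kCpuGHz         = 3.0;    // 3 GHz = 3 cycles per nanosecond

// Ratio between disk and memory access time.
constexpr double access_gap() {
    return kDiskAccessNs / kMemoryAccessNs;
}

// Number of CPU cycles that pass while waiting for `ns` nanoseconds.
constexpr double cycles_waiting(double ns, double ghz) {
    return ns * ghz;
}

// With the assumptions above:
//   access_gap()                            -> 1e5 (10 to the power of five)
//   cycles_waiting(kDiskAccessNs, kCpuGHz)  -> 3e7 cycles per disk access
//   cycles_waiting(kMemoryAccessNs, kCpuGHz)-> 300 cycles per memory access
```

Under these assumptions, a few hundred extra cycles of CPU work per disk access are lost in the noise, which is the reason the classical designs did not optimize the part above the disk.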

Starting point is 01:07:51 And all of a sudden, stuff basically changed, right? So all of a sudden, we don't have hard drives anymore, or we still have hard drives, but we also have SSDs, which are kind of in between here. And the main memory is so large that a lot of the data sets actually completely fit into main memory. So a lot of the processing, we can just do completely in main memory.

Starting point is 01:08:14 I mean, think about it, right? So this has, I don't know, eight gigabytes of RAM. I mean, eight gigabytes of RAM is a lot of data in a structured way. If you're not storing audio, video, images, et cetera, I don't know, a lot of JSON, something, but structured database data, a gigabyte is a lot of data. I mean, you need kind of a big business tool to get into the gigabytes

Starting point is 01:08:43 if you're not storing random log data, right? So active database data, gigabyte will give you many, many customers or whatever you're working with, right? And so this means this is not timely anymore, right? So this architecture part, this is basically the problem today. So here we're still basically spending our time in the Postgres, MySQL, classical SQL Server, Oracle, et cetera.

Starting point is 01:09:15 So the old architectures spend their time making sure this works well, while we're actually only using this top part anymore. And this is also reflected in if you're analyzing the data. So you have large RAM sizes, you can have terabytes of main memory. And then if you look what the database is actually doing, you have a lot of logging, you have locking,

Starting point is 01:09:40 latching, the buffer manager, et cetera, lots of optimizations, a lot of code just revolving around how do we get this access time here, like handle this. And then you only do very little useful work. So below, let's say, 2% of what we're actually doing is the actual data processing, at least in the classical database design. So this means, well, we have to rewrite, right? So, and this is also what keeps us busy in systems. Essentially, there's a lot of different trends in hardware,

Starting point is 01:10:20 in whatever, in applications, etc. So all of a sudden the previous system is not really efficient anymore. So we have to redesign, we have to rewrite the systems to make them efficient. So we want to be somewhere here where we're actually doing the useful work rather than all of these code bases, a lot of the code that we did not actually want to use. Yeah, so traditional database architectures cannot really utilize this memory and we're going to talk about how we can. And there is actually stored. So when we have it only in the memory and then the system crashes, then everything is gone and our company is not good. So we still have to write to some slower disk,

Starting point is 01:11:13 so all these problems still are there because we just have to write. Well, yes and no, right? The question is how do we access the disk? Again, do we access the disk for the active data that we're doing processing on? So if we're reading the data, for example, we're doing joins, et cetera,

Starting point is 01:11:31 we're doing analysis over the data. Do we want to access the disk there? No, we don't, right? And then what kind of disk do we have? Do we have actually like classical hard drive where random access is super slow? We don't have any kind of parallelism. Or do we have an SSD that's actually parallel, right?

Starting point is 01:11:51 Where we can write multiple things at a time. And then, I mean, what you're talking about is basically logging, right? So you want to make sure that everything, like all of the interactions with the database, are somehow stored. So if the database crashes, we can go back in time to a point in time, or we can make sure that everything that was committed to some degree, everything that the database has seen is actually still there. Or where we told the user, this is in the database now, everything should be there.

Starting point is 01:12:28 And this we still have to ensure. You're right. So we will be limited to some degree by this. But everything else, all of the other processing, we can just do purely in memory. And the logging, we can, to some degree, make sure, for example, we're batching this, right? So rather than doing individual log entries, we're batching this. We can parallelize this. We can use different kind

Starting point is 01:12:52 of storage. So we talked about persistent memory. So this is something where we can have much faster access or we're just replicating. So rather than writing into disk, we can also just write to multiple nodes and hope that some of them survive a crash, for example. So there's many different ways of going about this. In any way, we have new architecture and we can use this to basically be much faster. And even SSDs, for example, they will have some buffers in there. So modern SSD for small writes, they will be very fast, basically. And you're basically just bound by PCI Express. Eventually, it will have to go to the chips. But then internally, they basically have like a little battery, etc., making sure that all of the writes there will eventually

Starting point is 01:13:46 go to, will be stored permanently. So it's a lot about knowing about the hardware. So what can I basically expect from the hardware? Then I can be much faster. If I expect nothing or if I expect this to hold here and I design everything around that, then I will be much slower than if I know, okay, all of this processing can just stay completely in memory, right? I can just make sure that this is fast on the CPU, then I'm going to be much faster.
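
The batching idea for logging mentioned above can be sketched as follows. This is only a minimal illustration under stated assumptions, not how any particular database implements its log: the class `BatchedLog`, the helper `write_records`, and the file name `wal.log` are hypothetical names made up for this example. The sketch buffers log records and forces them to storage with one write and one `fsync` per batch instead of one per record.

```python
import os
import tempfile


class BatchedLog:
    """Hypothetical log that makes records durable in batches (group commit)."""

    def __init__(self, path, batch_size):
        self._file = open(path, "ab")
        self._batch_size = batch_size
        self._pending = []    # records accepted but not yet durable
        self.sync_calls = 0   # how often we forced data to storage

    def append(self, record: bytes):
        """Buffer one record; flush once the batch is full."""
        self._pending.append(record)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self):
        """Write all pending records with a single write and a single fsync."""
        if not self._pending:
            return
        self._file.write(b"".join(r + b"\n" for r in self._pending))
        self._file.flush()
        os.fsync(self._file.fileno())
        self.sync_calls += 1
        self._pending.clear()

    def close(self):
        """Flush whatever is left and close the file."""
        self.flush()
        self._file.close()


def write_records(n_records, batch_size):
    """Log n_records and return (number of fsync calls, lines found on disk)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "wal.log")  # hypothetical file name
        log = BatchedLog(path, batch_size)
        for i in range(n_records):
            log.append(f"txn {i} committed".encode())
        log.close()
        with open(path, "rb") as f:
            return log.sync_calls, len(f.read().splitlines())
```

With `batch_size=1` every record pays for its own `fsync`; with a larger batch the same records reach storage with far fewer synchronous calls. The trade-off in this sketch is that a record is only durable once its batch has been flushed, so a real system would have to delay telling the user "this is in the database now" until then.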

Starting point is 01:14:20 That's what we're going to try to figure out, at least to some aspects in this course. Okay. Does this answer your question? Okay. Cool. So then, so what are these new hardware technologies? So we said large memory. Yes. One aspect. On top of that, multi-core CPU. So this has been going on for a bit already. Rather than having, and you probably know this all, right? Rather than having a single core on your CPU, rather than having a single core on your phone,

Starting point is 01:14:55 you will have multiple cores. And this is due to some problems with basically chip design. So at a certain point, basically you cannot really get any smaller and you cannot increase frequencies efficiently anymore. Of course you can increase frequencies, but there's some physical limits. We'll see this on the next slide where basically just using a single thread

Starting point is 01:15:22 or a single core won't give you more performance anymore. And that's why we have multi-core CPUs this time, or nowadays. And if we're talking about data processing, well, we'll have to figure out how to do multiple things in parallel at a time. Otherwise, we're not going to be able to use them efficiently. So if we have only one thing to do at a time, and again, thinking about one disk where we process one data item at a time, we're not going to be able to use this. So we need parallelism there.

Starting point is 01:15:54 And then we can actually be faster. And also, we have much faster memory these days. So up to 50 gigabytes per second per CPU, even more. So multiple tens of gigabytes per second per channel. We have multiple CPUs organized in non-uniform memory access fashion. So we have basically RAM connected to each CPU, and we can access all the RAM in one system, but with different latencies and different speeds.

Starting point is 01:16:33 So we have to think about this if we want to use this efficiently again. And we have coherent access or cache coherent access across all CPUs. So basically each of the CPUs has, or each of the cores even has caches, but if they access somewhere else the data, or there's concurrent accesses to the same data from another CPU, we basically don't have to deal with this. So the accesses will be coherent.

Starting point is 01:17:01 So I'm doing something on my CPU right now in my core. I'm changing the data. I don't have to deal with, oh, I need to write this back to memory before I do something new. The CPUs actually will deal with this; they have a protocol inside to make sure that the data is not corrupted while we're doing this. And in the future, this will also be, hopefully if everything goes well, this will not just be in between different CPUs,

Starting point is 01:17:32 but will also be in between different accelerators. So writing to GPU memory, reading from main memory from the GPU. But in order to utilize this, we have to make sure that our system, our program, knows about this. And so the processor trends across the years, so this is basically where hypothetically I could have started somewhere, right?

Starting point is 01:18:05 So then this is where I actually started. Everything was single core, right? So we basically, and it's much easier to think about programming in a single-threaded fashion. I just need to do this and this and this. So it's all very sequential. But around 2005 or something, even before that,

Starting point is 01:18:28 we saw that the frequency of the processors would not get faster anymore. So the problem is basically having a higher frequency on the CPU means we have more voltage, we have more power. And the higher the power, the more we're basically leaking current also. And the CPU gets warmer.
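
As a rough sketch of why higher frequency costs so much power, the standard first-order CMOS model can be written down. This is textbook background and not taken from the slides; the symbols (activity factor α, switched capacitance C, supply voltage V, clock frequency f, leakage current I_leak) are introduced here only for illustration.

```latex
% Dynamic (switching) power of a CMOS chip, first-order model:
P_{\text{dyn}} \approx \alpha \, C \, V^{2} f
% Static (leakage) power, paid even when nothing switches:
P_{\text{static}} \approx V \cdot I_{\text{leak}}
% Rough assumption: a higher f needs a proportionally higher V, so
P_{\text{dyn}} \propto f^{3}
```

Under that rough assumption, doubling the clock frequency would cost on the order of eight times the dynamic power, all of which has to be removed again as heat.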

Starting point is 01:18:51 So that's one thing we have to put more cooling, but also we're leaking more. So we need, basically, we're getting more inefficient. So at a certain point, the CPU vendors and manufacturers noticed this doesn't really work anymore. We cannot make them faster efficiently anymore. So they stopped increasing the frequency. So before, this is a logarithmic scale, right? So when I grew up, basically, you just bought a new processor. It would be twice the performance of the old processor. And then basically my program runs twice as fast.

Starting point is 01:19:31 Problem solved. So my program is slow, buy a new processor, program twice as fast, perfect. These days this doesn't work. If I have a single threaded program, I buy a new processor, it will be the same performance as the processor before, just because the frequency is not faster anymore. So it's just the same exact same performance, basically, just from the clock speed. And this is because we cannot put more power or much more

Starting point is 01:20:01 power into the CPUs anymore because cooling gets really problematic. So what we started basically, or what the vendors started is increasing the number of logical cores. So around 2005, basically, the CPU started having multiple cores. And these days you can have,

Starting point is 01:20:24 if you think about graphics cards, you have thousands of cores. And still, even the single thread performance is increasing to some degree. So while we're not increasing the frequency, we get some, let's say, optimizations in hardware design. Say, for example, the MacBook, the M1 chip is one

Starting point is 01:20:46 where we see some of the programs are just much faster because the design of the chip is a bit smarter in one way or the other for some applications. So we get a bit better performance, but all in all, this is also flattening out. But the number of transistors is still growing, right? We're just putting more stuff in the die, still more transistors, but organized in multiple cores. And this means we really need more parallelism. In order to utilize this, this is fine if we're doing 100,000 different things at a time, right?

Starting point is 01:21:22 So let's say I have, I don't know, gaming, not even gaming, lots and lots of multitasking on my CPU. But I mean, you're not going to do thousands of different applications at a time. So in the tens, not a problem, right? So this is something where my OS can start a lot of threads, whatever. But if I'm doing data processing,

Starting point is 01:21:45 all of a sudden I have to think about this, right? I don't have a hundred different maintenance threads that do some organization around my data while one thread is doing the actual processing. I really need to have many threads that do the data processing. I need to organize the data in a way to do this efficiently.
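
What "many threads that do the data processing" could look like is sketched below. It is only a structural illustration under stated assumptions: the function names `partition` and `parallel_sum` are hypothetical, and in CPython the global interpreter lock means these threads will not actually speed up a pure-Python sum. The point of the sketch is the organization of the work: split the data into partitions, let each worker aggregate its own partition, then combine the partial results.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(values, n_parts):
    """Split values into at most n_parts contiguous chunks of near-equal size."""
    if not values:
        return []
    chunk = (len(values) + n_parts - 1) // n_parts  # ceiling division
    return [values[i:i + chunk] for i in range(0, len(values), chunk)]


def parallel_sum(values, n_workers=4):
    """Sum each partition in its own worker, then combine the partial sums."""
    parts = partition(values, n_workers)
    if not parts:
        return 0
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Each worker only touches its own partition, so no locking is needed.
        partial_sums = list(pool.map(sum, parts))
    return sum(partial_sums)
```

Because every worker reads only its own chunk and writes only its own partial result, there is no shared state to coordinate until the final combine step. That is the kind of data organization the parallel hardware asks for.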

Starting point is 01:22:06 And I noticed I'm running out of time. So what we'll do, this will stop here. I'll do the rest of the motivation next time. Do we have, what kind of questions do we have? Let's ask it openly. For today. No questions for today. That's fine.

Starting point is 01:22:30 So we talked about the group. We talked about the course organization. If something's not clear to you regarding the course, no problem. We also have next time to ask. You can ask in Moodle, et cetera. I've started the motivation. I'll finish up the motivation next time. And next time we'll also talk about the database basics. So this is for those of you,

Starting point is 01:22:53 it still makes sense to come, right? But it's really just about what is a database? What is the classical database architecture? It's not only what is SQL, it's also what does a database internally look like. So if you've only seen Database Systems I, there's still a lot for you to learn because I'm basically doing a quick run through Database Systems II. If you've seen Database Systems II, it's just a refresher of terminology, so you know what I'm talking about. Okay, with that, that's it for

Starting point is 01:23:32 today. Thank you very much. I hope I'll see all of you tomorrow.
