Hardware-Conscious Data Processing (ST 2024) - tele-TASK - Performance Management & Benchmarking
Episode Date: April 10, 2024...
Transcript
Welcome to our second session.
I'm glad to still see a couple of faces here.
We're gonna talk about performance management
and benchmarking today
in hardware-conscious data processing.
But before we do, I have three small announcements.
One, Marcel already sent out something on Moodle.
We have this survey or questionnaire on active participation.
Please fill it out if you want to participate in the programming exercises.
It doesn't have to be today, but soon, so that we can start the setup for all of the
GitHub etc. stuff, so that you get your account, etc.
And you can do the programming exercises.
Today we're going to have the first session
of our project seminar on in-database machine...
That's not true.
It's machine learning systems.
Sorry, I didn't update this.
So we have a project seminar on machine learning systems done
by Elaine, Ricardo, and Niels from my group.
And today is the first session in room 111, I think.
So they also said they were trying to find something else
if the room is just too small.
But check Moodle.
If you're going to this room, definitely we will find you.
And if we're doing it somewhere else,
we'll reroute you to the right room.
And then finally, we have a bi-weekly research seminar
with TU Darmstadt.
So with Carsten Binnig and Zsolt István,
who are both the database professors at TU Darmstadt,
both in the systems domain as our group.
And we, well, for quite some time now,
I think it's the fifth or so incarnation of that seminar,
we basically do joint research sessions. So we invite guest speakers.
And today there's going to be Viktor Leis talking about co-designing database systems
and unikernels for the cloud. So this is basically how to do OS and database co-design.
That should be interesting.
And you're invited to join us.
So here you can find the Zoom link.
The Zoom link is in the slides.
The slides are in Moodle.
So if you want to join, feel free to join.
And I forgot to put the time here, but fortunately,
Niels also sent an announcement in Moodle so you can find the details
there as well. It's open for everyone to join. This is sort of every other week,
with some exceptions if we don't have time or have conflicts. But there are usually interesting talks on database-y research
or database-system-y research.
Okay, so with this, where are we?
We're in performance analysis.
I have lots of stuff on this.
I'll probably not be able to go all the way through,
so I might ask Martin to finish this up next week.
And it's mainly about understanding or getting
the tools to understand system performance.
This is still introduction.
Next time, next week, Martin will cover database basics.
And then we'll really dive into the
topic.
But the performance analysis is something that's useful, especially for systems development,
but of course, generally, if you do something where you want to be efficient, not waste
compute power and energy in general,
then performance analysis makes sense.
So you still remember the timeline, right?
So we're in the first week.
And I can skip this.
So we'll do a very quick introduction
to performance analysis,
then talk about these back of the envelope calculations
that I already hinted at yesterday. I'll give you two examples, one very concrete,
low-level example that I can also do on my laptop here. Then I'll talk about some measurement
and statistics details, and this is probably as far as I'll get.
Maybe we can cover some benchmarking and the FAIR benchmarking as well.
So that's interesting stuff.
Just to give you some ideas about how to do benchmarking, how to measure devices.
And why do we do this?
Why do we need measurement and benchmarking?
Remember the seven-step procedure for theses or papers that I showed you: right at steps four and five,
we need to do, or we should do, back-of-the-envelope calculations and experiments. Today, if you're writing a paper or a thesis in computer science, which is not on a
meta topic somehow, you will always have to do experiments, right? So you will always have to
benchmark something. You will always have to show that what you did works in practice to some extent.
And for that, you make experiments.
And in order to not be super surprised by the experiments, it makes sense to do some
back of the envelope calculations beforehand.
Because typically, people come up with ideas that seem reasonable if you think about it.
But thinking about performance is actually not so easy.
So a lot of it is basically integration.
So you have to basically integrate
over many different kinds of measurements,
and integrating is something our brain cannot do well.
So this is something where,
if we just guess, we're often very wrong.
And doing the calculations is very simple. So with that, we typically get
much better results. Okay, so why do we need benchmarking in general?
Well, if we build systems, we need to evaluate them. And as systems programmers, which we are in our group and you are in this course,
we're really always considered or interested in performance.
And this is not the only way you can experiment or how you can measure something.
I mean, usability would also be something interesting.
But for us, in most cases, it's really about,
how can we get most efficient?
Of course, that's not necessarily always
the best thing, right?
So just without looking left or right,
just running after the fastest performance often
will get you into some kind of corner where you lose the genericness of your
solution, where you really specialize for a very narrow use case. So that's not something you want,
right? You still want to be able to apply your solution to a broader set of problems, you still want to be able to use your system, right?
So that's, or also to evolve your system to new hardware, for example, at least newer
generations of the same hardware.
So that's something we should still have in mind.
But often the first goal is just to see, OK, how fast it is.
And this is really, if you think about database software,
this is often the first choice why
you would choose A versus B. So if somebody
buys Oracle over SQL Server, then this
is usually due to some bake-off performance benchmarking, right?
So they basically, both companies get the workload,
both vendors basically, so I don't know,
my super-duper company wants to buy a new database software,
I give to Oracle and I give to Microsoft my workload,
and then they are supposed to show me how
fast their system is and the faster one I will buy. At least that's how often it
works. Of course other things are important as well like energy efficiency
and price per performance. If you think about the cloud, the second one,
price per performance, will be super important.
If you think about EU regulations, energy efficiency is super important because with
new regulations, every company has to provide their energy bill or their energy consumption
and CO2 impact on paper and if the database is the thing that will basically really make this look
bad or give you a bad number, then you will probably not choose one product over the other.
Okay, still what we're thinking about is performance evaluation.
And within that, we can talk about benchmarking and analysis. And I really like to make this differentiation.
And I don't know if this is standard terminology.
For me, it's useful terminology.
So I mean, you probably can also say evaluation for analysis
or something like this.
But if you have a new system, if you build a new algorithm or a
special kind of operator or some hardware optimizations, then there's two things that
you can do. You can analyze it like the single thing that you did, right? Your operator,
your system, and check the individual optimizations. Basically, really look into where
does time go here, right? So, if I run my query on this, what actually happens inside and why does
it take the amount of time that it takes? And that's analysis, right? So, this is basically
doing a breakdown, checking what happens inside.
And then we have benchmarking.
And benchmarking typically means we're comparing different things.
So a classical benchmark is something that a carpenter has,
like somebody who produces furniture will have their workshop and a bench in there.
And if they make some
kind of specific type of furniture, they will mark the bench where to have the
right length. So we're basically checking the performance, we're marking the
performance and comparing it to others. So this is comparing multiple
systems on typically a standard or a real workload. So as I said, if you're a
company acquiring some kind of software, you will look at
standard benchmarks but you will also give them your workload and see what is
the performance on your workload. If you're writing a paper or a thesis, you
should do both, right? And you should always understand what these numbers mean. Way too
often I get proposals or theses where people say, well, the result of my benchmark is seven,
and I have no idea what that means, and the person who did the experiments has no idea what that
means. So then you know as much as you
knew before you did the benchmarking. So before you do an experiment, you
should have an idea what the result should be and how to interpret that result. And for this,
we often start with a basic modeling exercise, right? So we basically start with getting a rough
idea. And this is also, in order to understand system performance, we can do multiple different
things. So we can do these experiments or measurements in terms of analysis and benchmarks,
or we can just model the whole thing.
So we can just go about and say, oh, my system has these components,
and my system has, like my algorithm has these stages,
and this stages takes us, like, does these operations,
these operations typically take that amount of time.
So overall, my performance should be this and that.
We can also simulate, and in many cases this actually makes sense. In our case, usually, it doesn't. So in our case,
in terms of system performance, if we have the hardware and the software, then we do real experiments, or we model the software to get an understanding.
If we don't have the hardware,
or if we reason about something like,
oh, my peer-to-peer nice large-scale network something-something,
which uses a million different nodes,
I don't have a million different nodes that I can run something on,
so this stuff I have to actually simulate, right?
So I can use individual nodes, I can do benchmarking there,
but the whole large scale,
I will have to do some kind of emulation,
some kind of trace-driven simulation,
just have the whole thing.
And now we have these different routes that we can take.
We can do modeling, we can do measurement.
And the golden rule of validation tells us,
well, do not trust one of the results or a single technique,
but validate with another one.
And for us, this typically means
we're validating the measurements with a model.
So in essence, we're coming up with a very basic model
that gives us some rough performance idea,
and when we're doing the benchmarking,
we can check if the model or if our measurements are actually correct
because we have a mental model,
or a basic model, of what the performance should be like.
If this is completely off,
then either we know our model is wrong,
so we basically looked at like the bottleneck
is not what we thought the bottleneck is,
so we're spending time somewhere else.
Our measurement could be wrong.
I mean, what often happens, for example,
we have some kind of driver architecture that measures everything,
and the driver is too slow for
the system actually to provide enough data.
So then we're just measuring the driver rather than the system.
Or the measurements
themselves are not really collecting what we actually thought,
or there's some bug in our measurements, right?
Or there's a bug in our model also works.
So having the two gives you some safety.
You can basically reason about the performance
and then do the actual measurements.
And if there's something largely off,
if the trends that you expect are not
the same as what you actually see, well, then you know you can basically drill down
further. If everything is round about what you thought it should be, well, then you're basically
happy: okay, the benchmarks tell me what I actually expected, and we're good to go, we can
actually publish this. Okay, good. So with this, I'll show you a simple technique
for modeling. It's actually not really a technique, it's really just:
okay, do it. So you don't have to do much. You just come up with, OK, where
does time go in my problem or in my system?
So this, like, one person who really pushed this is Jeff Dean.
So he's a Google Systems guru.
So one of the, I don't even know his title right now,
chief architect senior or whatever.
So he's one of the guys who built MapReduce,
built many other systems, is behind TensorFlow, et cetera.
So one of the core systems guys in Google.
And he basically suggests these back-of-the-envelope calculations,
and I really like it and have also been doing it for a while
already.
And the idea is on the one hand,
just to get an understanding of performance
and also to get good enough performance.
So basically do I have a solution to my problem
that's good enough?
And often we come up with some kind of architecture.
We think, that might make sense, right?
So this and that, combining these and these techniques.
But then in the end, if I have too many parts, for example, my latency will go up.
Or I have one weakness in my system.
My bandwidth is too low, et cetera.
So things might get slow.
And as I said, just thinking about performance
will not give you a good understanding,
because our brains are not made for this.
So with our brains, if you just guess
what the performance of a system will be,
you're very often very wrong.
And that's very natural, happens to me as well.
And that's why just sitting down and doing some simple calculations on the back of the
envelope or if you're in Germany on a beer coaster, right?
So this should be the amount of calculation that you do, just basically picking the most important factors for your
performance and then calculating the performance within an order of magnitude.
So not 0.1 or 0.01 error on the performance, so I don't need like a very exact model. I just need to know within like
a factor of 10 that this is my performance. Because then I know, okay, if I'm 10 milliseconds
or 100 milliseconds, this is still interactive. If I'm 10 seconds or within a minute range, okay,
this is not interactive. So my users will probably walk away.
And if I can already estimate that something will be that slow, then okay, I have to do
something different.
And so this is why this is also called a bullshit filter, because you can filter out stupid
ideas early.
And as I said, this is not necessarily obvious right away.
So it actually is a good exercise.
Of course, if you then want to know for sure, you have to benchmark.
And then you can basically see what the actual performance will be.
And for this, there are a couple of latency numbers that are useful to know.
And I haven't updated this, right?
So this is from Jeff Dean.
This is probably 10 years old.
You will get slightly different and faster latency numbers here and there. Not everything is much faster, but it's
just to have an understanding of orders of magnitude. You don't have to memorize these;
it's just good to have an understanding, or have a way of looking up, what these
performance numbers are.
This is a bit small; probably not everybody can read everything.
It's not necessary. You can look at the slides. I'm going to tell you a few things here and there.
Right, so say for example you have an L1 cache reference. This is basically within a cycle, right? So this is
0.5 nanoseconds. If your frequency is a bit higher or a bit lower,
you can calculate how fast your CPU would do this. Then we have branch mispredictions. This
is basically whenever the CPU predicted, in an if statement, to go the true route.
So the CPU will always say,
I'm not waiting until I can actually calculate the branch,
I will just predict we're going that route.
And then it will already execute all the statements,
the instructions, within this branch.
If this is wrong, then we have to
go back and start the other branch. And if this happens a
lot, this is where a lot of time goes. So this is five nanoseconds,
roundabout ten times a single cache reference; it might even be up to 20 times a single cache
reference, so then up to 10 nanoseconds roundabout, depending again on the CPU
architecture. These are just averages, some rough ideas here. If we look
at an L3 cache reference, so not first-level cache but third-level cache, that could be something like 20 nanoseconds.
If we're going to main memory, we're in the 100 nanoseconds range.
If we're going to NVMe, we're in the 50 microseconds there already.
We can see a lot of shift today.
So newer systems might get even lower,
or newer hardware might get lower
latency numbers here.
If we're looking at disks, we're in the milliseconds.
Right, so this is, you can see this here, right?
So here we're already going further down.
And this is why the CPU basically has to wait so long,
because we're many orders of magnitude slower
than if we're still in memory.
So this just, as I said, some numbers.
And if you're really curious about current numbers,
write small programs and check yourself.
Write some small benchmark programs.
That's a good exercise to get an understanding of system performance.
There's also some throughput numbers that I have.
So here DDR4 channel
bandwidth is, for example, 20 gigabytes per second. So this is something that I would
have here on this system, for example. Again, this one is a bit special. So the M1 has a
unified memory architecture. You can get a bit higher general bandwidth on this system.
Other systems might be slower. Now there's DDR5.
You might get even faster performance there.
Old PCIe Gen 3 channels have 12 gigabytes per second.
Marcel will tell you a lot more about newer PCI Express
versions where you can get like much higher
throughput as well.
NVMe, this also has a large range, right?
So you can be in the 10 gigabytes per second range
these days already.
Yeah, this bandwidth and NVMe IOPS,
we'll talk more about this later.
So, but this is, as I said,
so having a rough understanding of these numbers
will later on help you to calculate or also to better estimate performance. And we'll do this
right now with a very simple example. So also from Jeff Dean, a simple example is on how long to generate an image result
page with 30 thumbnails.
So this is basically a Google image search problem.
So if you're looking for the newest whatever cat images,
you're going to Google and you're saying, well,
give me nice cat images.
So then Google has to produce one of these pages for you.
And then they have to think, OK, how do we do this?
So how can we do this with good enough performance?
And a very simple approach would be just read these individual images sequentially.
So each of the images is 256 kilobytes, for example,
and we're reading them serially, one by one, on the fly. Google's images are of course many, right? It's a
large database, so this will reside on disk because it's the cheapest.
because otherwise it's not going to be efficient cost-wise.
So if we just do this sequentially, then we'll have to seek for each read.
So remember rotating disks, right?
So the disk is spinning.
The arm first has to find the right track,
so the head moves to the right track. This kind of seek takes 10 milliseconds
per seek, and then we have to read these 256 kilobytes at around about 30 megabytes per
second; it can be up to 100 megabytes per second on a good disk, but not much
more. So you're ending up with 560 milliseconds, which is half a second, and that's just on the
server side. At this stage you haven't transferred a single image to the user; this is
really just on the server side. So this is clearly slow, right? Half a second.
If you have to wait for two seconds for the first image to pop up or something,
you will probably say, well, Google is long gone.
I'm going to Bing now, right?
So there they have to do something else.
Okay, so what else can we do?
We can issue the reads in parallel.
Because these are many images, right?
And they are stored on many disks.
We don't need to wait for a single disk to respond.
We can read all of the images in parallel.
There's going to be thousands of disks.
We can replicate the images to some degree.
So doing the 30 reads in parallel on 30 different disks
will not be a problem. And if we do this, then we count only one seek, right? Because all of the
seeks are going to be in parallel. And we only count one read, so one times the 256-kilobyte read,
because again, all of the reads will be in parallel.
We're doing the same amount of work, but we're parallelizing.
So in that sense, or in this case,
we have 80 milliseconds for the server-side processing,
which of course ignores any kind of variance, right?
So if you're doing this massively in parallel,
you will have some stragglers.
Some disks will be slower,
some stupid other job will go in your way
and will take some of your disk time, et cetera.
So this will take a bit longer,
but we're definitely much faster
than the initial sequential execution, right? So here
we're basically in 80 milliseconds, which is fine on the server side, then only the data transfer
is a problem for us. Okay, and of course we can do much more, right? So rather than doing this for
every single image every time, we can also do caching. We can cache individual images, we can
cache whole thumbnail data sets, we can pre-compute the thumbnails, so then we have to read less.
This still won't help us with the seeks, right? If we have the thumbnails on disk, we still
have to seek them, and this is a constant factor on a disk. So we'll still have to pay something, but again, the read itself will be lower then.
And you might come up with many other ideas,
but just guessing what might be a good idea
won't help you much.
If you calculate, you will actually find out
what is a promising idea, right?
So doing the back-of-the-envelope calculation helps you identify the most promising
things. And often you need a very simple prototype just to get these latency numbers initially.
If you are already aware about your infrastructure,
so you have a good understanding of the hardware,
then you know many of these numbers.
Before that, you will need to do some very simple benchmarking,
some very simple prototype to get an understanding how fast
is your disk actually.
So if you have a large RAID with many disks,
you have everything distributed
just the reads might take more time than the raw disk speed would suggest, for example. So these kinds of
organizations matter, but it doesn't have to be a complete system, and that's important, right? So
you can get these numbers fairly quickly. And to my students I always say, when we do some paper prototyping, etc., these kinds of numbers, you should be able to produce them within a day maximum.
Ideally, within a few hours.
You come up with a rough understanding of the performance of your solution.
So let me give you another example.
And that's sorting.
And sorting is something that we need in databases all the time.
So many database operators are based on sorting,
so this is something that we actually need to do.
And so my question right now to you,
if I have one gigabyte of four-byte numbers in memory, how long does it take to
sort them? Who has an idea? Just guess. What do you think?
Just a rough idea, guess.
One gigabyte would be in main memory?
Everything will be in main memory, yes.
And then we have...
Just guess. Don't calculate too much.
I really just want guesses.
10 seconds?
Do we get another number?
And don't be shy.
Really, this is not about pointing out stupid guesses or something.
It's really just about, do you have any idea?
Okay, I'm assuming everybody who doesn't say anything
has no idea how long this will take.
All right, so, yes?
Between 10 and 30 seconds.
Between 10 and 30 seconds.
OK.
Who says less than 10?
Who says more than 30?
Who says in between? Who already checked the slides?
OK.
Good.
We'll see.
So let me keep the tension for a bit, right?
Even though most of you, or some of you, already have an idea.
I got more variance in the numbers in previous years.
I should maybe change the example.
OK, so how can we go about this?
So one gigabyte is 2 to the power of 28 integer numbers
on most 32 or 64 bit machines.
So meaning, if we have int numbers,
that's four-byte numbers basically.
So if we do sorting, what do we have to do?
So who remembers how sorting works?
Yes?
That's about 30 or 28.
Yes.
And then I assume one second per gigabyte.
OK.
So that's good.
So 30 or 28.
We have n log n.
We have to basically read the whole thing.
We have to read the data
and we have to make these comparisons.
We have to compare the data, right?
We have to basically read and write the data
and we have to compare the individual numbers.
That's basically what we need to do.
And we need to do this n log n times, that's the number of comparisons,
and we need to go through the whole data set basically a couple of times. And you already
said it, we basically need to do this 28 times. Roundabout 28 times we need to go through the
whole data set for all the different comparisons.
So we said comparisons and reading data.
These are the two factors.
If we have comparisons, what does this result to
in terms of code?
I think we get an if statement and the if statement produces branches, yes, and branch mispredictions. Exactly.
How many branch mispredictions will we have? If you assume completely random data, the CPU hasn't got a clue what to predict. Will it be 50%
or 10% or 15%?
Very good.
So on the one hand, we have to read the data, we have to write the data, and
on each comparison, of which we will have n log n, the CPU
does a branch prediction, and since the data is
randomly distributed, half of the predictions will be wrong.
So that's about 50% will be wrong predictions.
Okay, so again, we have quicksort, for example,
we have n log n.
We do log of 2 to the power of 28 passes,
so 28 passes over 2 to the power of 28 numbers. And we do basically
these individual comparisons, so 2 to the power of 28 times the logarithm of 2
to the power of 28, which of course is 28.
And that is roundabout.
So just to calculate these numbers,
that is so 28 in powers of 2 would be roundabout 2
to the power of 5.
So it's a bit less.
So it's in between 2 to the power of 4
and 2 to the power of 5, but we say 2 to the power of 5. And then if we multiply 2 to the
power of 5 with 2 to the power of 28, we get these 2 to the power of 33 comparisons.
power of 5 with 2 to the power of 28, we get these 2 to the power of 33 comparisons.
If you say half of those are mispredictions, then we have 2 to the power of 32 mispredictions, right?
Because we just divide by 2.
And if we say 5 nanoseconds per mispredict, then all in all we get 21 seconds of mispredictions, right?
And now we just have to additionally go through the data.
So mostly sequentially read through the data.
So that's 28 passes of these two to the power of 30 bytes.
So of this one gigabyte,
that's 28 gigabytes roughly that we need to read.
If we say we have 20 gigabytes per second of memory bandwidth,
then this costs us round about 1.5 seconds.
So all in all, we're round about 22 seconds, 23 seconds
of what we assume this sorting of one gigabyte on a CPU will take.
So, quite good guesses.
I'm positively surprised you understand a lot of our performance already.
So, that's an assumption, right?
That's a model.
So, now maybe let's try.
So, let me quickly delete what I already had.
If I still figure out how to do this. Yes, so what I have here is just a simple program.
I hope you can read this and I hope I can find my mouse.
Let me see.
Okay, so what does this do?
So this is a very simple C++ program.
I also have it on GitHub.
You can basically check this at home.
I have an integer array or vector of size 2 to the power of 28.
So this is the number of integer numbers that we need. We'll just put this in memory.
We're keeping a counter for the time.
Then we're randomly distributing the data
across this array.
So it's first just randomly filled.
And then we basically start the timer.
Or no, we don't even start the timer yet.
First, I basically output a couple of lines, or the first couple of lines of the array.
So, you can see, okay, initially it's not sorted.
Then I start the timer.
Then I sort with just standard sort.
So, this is a fairly advanced sorting algorithm.
And this is also why we really need the whole domain.
So, we need to basically have a large enough data set
and we need to fill the data set fully randomly,
with the whole integer range,
because if we're using smaller numbers
then this will just be an insertion sort.
So then it will basically just be directly
sorted into a large enough area and will be faster than the n log n that we would expect.
So we're sorting. Then after the sorting, we're stopping the time. We're outputting again
a couple of lines to see that this is actually sorted,
and we're outputting the total sort duration.
And this is in Xcode, so I have to find, I'm just running this,
and you can also run this in command line.
It doesn't really make a huge difference in terms of performance.
So now we're building and running, and this is single-threaded, right?
So I'm also not super worried that PowerPoint will basically steal all the performance here.
And you can see, oh, we have one gigabyte of data, four byte integer size.
So I just put this out for reference. We have a couple of initial numbers, which you can see are randomly distributed.
Now it takes, of course, a bit of time.
So we said 20 seconds roundabout, hopefully not five minutes.
Otherwise, I'm going to wait here for quite some time. And then you can see, well, this was actually slow today,
but you can see here we have the numbers
that are now the first couple of numbers.
So this is basically, you can see,
well, we have only small numbers in the beginning.
It took 32 seconds, which is really slow.
So I've never had that slow of a number.
But it's within an order of magnitude of what we calculated.
So again, we have the complete sorting program.
If I run this while the computer is not doing PowerPoint presentation and everything,
I get something around 20 seconds, which is really close to what we see.
We can also do some counting.
We'll see that the mispredictions are roughly
the number that we would expect.
So, very nice, right?
So, this basically really shows us that
with some basic calculations,
you can estimate performance quite well,
or at least get roughly the right performance numbers.
And although you already had a good idea
about the performance here, in other cases you probably won't.
And I'm not sure how certain you were about these timings, right? So it makes sense to do these simple calculations to get a good understanding.
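To make the back-of-the-envelope idea concrete, the arithmetic can be written down in a few lines (the cycles-per-comparison and clock-rate constants below are illustrative assumptions, not measured values):

```cpp
#include <cmath>

// Rough n log2 n comparison count for sorting n elements.
double estimated_comparisons(double n) {
    return n * std::log2(n);
}

// Estimated runtime in seconds, assuming an average cost per comparison
// (in CPU cycles, including branch-misprediction penalties) and a clock rate.
double estimated_seconds(double n, double cycles_per_cmp, double ghz) {
    return estimated_comparisons(n) * cycles_per_cmp / (ghz * 1e9);
}
```

For 1 GB of 4-byte integers, n = 2^28, so log2 n = 28 and we get about 7.5 billion comparisons; with an assumed handful of cycles per comparison on a roughly 3 GHz core, that lands in the tens-of-seconds range, consistent with the measured 20 to 30 seconds.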
And with that, we're going to do a quick break here, four-ish minutes, and then we will talk
about measurements and metrics. Questions up to here?
Yes?
How much of this do you think can be parallelized?
How much speedup can we get?
Well, I mean, it depends on the
parallelization, right?
Is it parallelized completely?
Yes, so depending on the number of cores, of course,
to a certain degree we can probably
get up to an almost linear speedup
if we're doing it right in terms of parallelization. Overall, of course, you still need to
do the final sorting.
Off the top of my head, I'm not going to tell you a number
because it might be slightly off.
But I would say on this machine,
if we parallelize,
so I have 8 cores,
I would assume I can get a performance boost of a factor of six or something like that.
If you think of it as a geometric series,
then the first half takes one, the next half
takes one half, then a quarter, and so forth.
So the question
is what kind of performance with an arbitrary number of machines
would you get?
And you say the total performance would be a factor of two maximum?
Instead of multiplying everything with a factor of 28, you would only multiply with a factor
of two.
Oh, you mean for one gigabyte exactly, if you parallelize.
I'm not following what you're suggesting right now. Yes?
Basically, the assumption, the argument is
that if we go to infinity with the number of cores,
then we can't go below two seconds.
Oh, we can't go below two seconds of runtime.
Because of the performance that we have in terms of reading and branch mispredictions, you're saying?
Yes, because we have that final sweep, that's one second, and then we have...
Yeah.
Well, you're saying the final writing of the data,
or going through the data once at the end, that's 0.5.
We have 20 gigabytes per second.
If we have one gigabyte, we have 1 over 20, which is 0.05.
So then if you say a factor of 2, then we would be at 0.1
seconds, if you're saying that the reading or the writing would
be the dominant factor.
So the sorting of the last layer,
that's going through one gigabyte, costs one second?
No, going through one gigabyte doesn't cost a second.
Going through one gigabyte, we have 20 gigabytes of memory bandwidth.
Yes, but sorting, like the final step of sorting, the last merge operation.
The last merge operation, okay.
Only if we do a merge. I mean, if we're completely parallelizing, we're probably not going to use quicksort
as is, right?
So then we would do something else.
We would do some kind of merge step:
we would sort individual sub-ranges and then merge again.
We'll actually talk about parallel sorting
later in the lecture.
So let's maybe follow up on this once we're doing
parallelization strategies and we can do another
back of the envelope calculation for that one.
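As a preview, the sort-sub-ranges-then-merge idea could be sketched like this, a simplified two-way version rather than one of the full parallel sorting algorithms covered later:

```cpp
#include <algorithm>
#include <cstdint>
#include <future>
#include <vector>

// Sort the two halves on separate threads, then merge them.
void parallel_sort_2way(std::vector<int32_t>& data) {
    auto mid = data.begin() + data.size() / 2;
    // Sort the first half asynchronously while this thread sorts the second.
    auto fut = std::async(std::launch::async,
                          [&] { std::sort(data.begin(), mid); });
    std::sort(mid, data.end());
    fut.get();
    // Final sequential merge step -- this is the part that limits the speedup.
    std::inplace_merge(data.begin(), mid, data.end());
}
```

The sequential merge at the end is exactly why the speedup cannot be fully linear in the number of cores.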
And with that, you still get a few minutes of break.
Okay, so measurements and metrics. As I said just before, the back-of-the-envelope calculation really
is a very powerful tool to give you a good understanding of performance.
And the same goes for everything else.
Just reasoning
about these kinds of things on paper
is a lot more efficient than doing it only in your head.
So I always suggest doing this.
If you think about an architecture,
something like this, putting it on paper and
thinking about how something works, what the parts and steps are, and how much time each takes,
helps a lot. It makes your life much easier, and your thesis or your project becomes much more efficient.
It also gives you a good sense of when a research idea is a dead end.
And there are so many such research ideas out there. Not necessarily stupid, they might look nice,
but all too often we figure out that, well, it's a nice idea,
but it doesn't have any impact.
So it doesn't have a real performance benefit or any other
useful impact, and then the problem is that you basically have a nice idea but you cannot
motivate it, and that brings you to this problem of, oh, my whole thesis doesn't really
make sense, right? The idea is nice, but there's no performance benefit whatsoever. There's no practical benefit.
And you can rule that out very early, at least to a very high certainty.
Okay, so with that, let's go into measurement and metrics.
And so with this, a bit of terminology.
So if we do measurements or we do benchmarking
of a database system,
these are usually the kinds of things
that we have to take care of.
So we have data that's often delivered
by something like a data generator.
If we're doing benchmarking, we have a workload
that again is often delivered by a workload generator,
or both can also be some kind of traces or data sets
from real world setups.
And we have a driver that needs to feed this into the system
under test.
And the system under test, if we're doing benchmarking,
is usually some form of black box.
So we have it as is.
If we're doing analysis, we're trying to break this open
as much as possible.
If we're benchmarking an existing system,
I don't know, again, SQL Server,
Oracle, what have you,
or a cloud database,
there's little we can do to look
inside, right?
If we're analyzing something,
then we really try to crack this open.
If you're building your own algorithm, you really want to know where does time go when.
So say you're building a new improved version of Quicksort or a parallel version,
then you actually want to know where does time go.
And for that, you need some tooling, some connectors.
And then once you run the benchmark, you collect metrics,
different kinds of metrics, or certain metrics that you want to collect, and these come in the form of individual measurements.
So the system under test is a deployment, and that's also important to keep in mind, right?
So you cannot measure an algorithm or a system without hardware.
And the hardware will have an influence on the performance.
So running two different things on different kinds of hardware
will give you different results.
Or running the same thing on different hardware
will most likely give you different results.
And in our case, if we do this on a low level, the benchmark, then the hardware has a very
high impact, has a very clear impact on where time actually goes.
If we are completely bottlenecked by some larger factor, so, I don't know, our system is somewhere in the cloud
and the main factor is actually the network connection to the cloud,
then it doesn't really matter that much what the system itself does.
So, this will be a minor factor.
However, if I'm running something on my laptop,
then the CPU, and I'm somewhat hardware conscious,
then the CPU will have a large impact and the architecture of the CPU.
Then I need to somehow understand this.
The workload typically are some requests by users and more often than not, requests by
other systems. So typically you're not building a system that is kind of a huge monolith
where then actual people are working there and typing their queries
and interacting directly.
But often this is one part of a larger puzzle
and other systems will connect to this.
And the metrics then are the criteria that are used for evaluation.
And of course, if we do benchmarking, we need to answer a couple of questions before.
So which kind of scenarios do we want to evaluate?
What kind of data makes sense for that?
What kind of workloads should we use?
And once we have answered these questions,
we can see whether there are already benchmarks that we can use.
So I don't have that many slides on different kinds of benchmarks,
but I'll introduce some or at least name a few that you can use.
There are standard benchmarks that you can use.
And the good thing about standard benchmarks
is other people know them as well.
And if you produce numbers for them,
they know what these numbers should be
and what these numbers mean.
If you come up with your own benchmark,
only you can interpret the numbers, unless you give a
lot of details about the benchmark itself.
Then we can think about the hardware and of course the software that we should run and
the metrics, so what to measure, how to measure them and how to compare.
Some metrics are actually quite hard to collect. Something like a branch misprediction,
I cannot measure an individual branch misprediction.
For example, I mean there is no tooling on the PC
that gives me fine grained enough timers to say,
okay, one branch misprediction takes exactly
point four something nanoseconds, right? Or something like this,
or five or three or whatever nanoseconds. I have to do this somehow aggregated. I have to know
what's actually going on and then divide by something. Also, if I measure at this fine-grained
level, then the measurement itself will take most of the time,
because I'm writing data all the time while I'm actually trying to measure something that takes very few CPU cycles. So there, again, I have to be smart about how to measure.
And of course, the hardware and the software need to fit the problem that I'm looking at. If I'm talking about server-scale workloads and systems,
it doesn't make sense to run this on a laptop,
because the hardware is just different.
The memory is larger, etc.
And the workloads also need to fit whatever I want to do.
The common problem is running way too small
and way too short benchmarks. Then basically everything will be in caches, and of course
I get very different performance than if I actually have to hit memory or
disk with my setup. So basically, we want to find out what's going on.
There's a couple of common metrics.
If we're talking about performance, we usually want throughput or bandwidth, latency.
Then we could have something like accuracy or capacity,
and I'll say more about those in a moment. The minimum that we want,
if we do real performance measurement
as we do here,
usually is latency and throughput.
I don't want just one,
because basically I can game one
with neglecting the other, right?
If I don't consider latency,
I can drive up my throughput a lot just by queuing.
So by just accepting request after request, doing nothing, while having a very high latency.
If I just go on latency, basically, if I have a very low throughput, the CPU is almost idle.
I can get much better latency than if the CPU has to do a lot of other stuff.
So that's why looking at both definitely makes sense.
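To make the "measure both" point concrete, a toy benchmark driver might record per-request latencies and derive throughput from the same run (all names here are hypothetical, not from any real benchmark framework):

```cpp
#include <chrono>
#include <functional>
#include <vector>

struct RunStats {
    std::vector<double> latencies_s;  // one entry per request, in seconds
    double throughput_rps;            // requests per second over the whole run
};

// Run `request` n times, recording each latency and the overall throughput.
RunStats drive(const std::function<void()>& request, int n) {
    RunStats stats;
    auto run_start = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        request();
        auto t1 = std::chrono::steady_clock::now();
        stats.latencies_s.push_back(
            std::chrono::duration<double>(t1 - t0).count());
    }
    double total = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - run_start).count();
    stats.throughput_rps = n / total;
    return stats;
}
```

Because both numbers come from the same run, gaming one at the expense of the other shows up immediately.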
Accuracy is something that you often need in machine learning, of course.
And capacity is basically: where do we get a good trade-off of throughput
and latency?
That is where we're at our capacity.
Then things like fault tolerance are interesting.
We're not going to talk about this here, but this is something if you're thinking about
large scale setups, cloud data centers, stuff like this, time to failure and availability
is important for you.
Also, efficiency in terms of energy, cost, and maybe fairness
might be interesting to you.
And efficiency also generally makes sense to think about, right?
So while performance is always good,
thinking about cost performance makes sense.
I mean, we have lots and lots of expensive hardware.
But just getting like a few percent more performance
by spending twice the amount of money
that is not really efficient
or by spending twice the amount of energy
is not really efficient.
So in a practical setup,
we want to make sure that what we're doing
actually also makes sense size-wise and efficiency-wise.
And then we have scalability.
So can we scale up?
Again, scalability is easy to game.
I'll come to you in a second.
It's easy to game if your baseline performance is very bad.
So this is what many research papers also do.
So they basically have a baseline performance
on a single node that's super bad.
So then it's very easy to get linear or even super linear
scalability if your system is super slow on a single node.
You should always compare if you're scaling up,
you should compare it to a really good baseline.
So what's the best baseline that you can get on a single core
or a single thread?
That's your baseline.
And how much can you get performance over that
on multi-core, multi-node, et cetera?
That's usually a better comparison
than compared to your own maybe scalable but really poorly
performing algorithm.
Your question?
Just real quick, could you
say how efficiency relates to fairness?
Fairness, in essence,
means we have maybe many different kinds
of requests,
and we actually answer all of them.
We could also get better performance
by neglecting requests that cost us more.
Here, it's not about fairness as a broader meta-topic,
but really: if I have many services or many users
or many components that use a certain resource, do we schedule
it efficiently but also fairly, in the sense that everybody gets a share?
In general, we want to select metrics having low variability.
So if we have metrics that are all over the place,
every measurement is completely different,
then probably we cannot get much from these measurements.
Again, this is basically also in essence measurement practices.
Do we get good measurements or do we get stable measurements,
or not?
If we have very unstable measurements,
the numbers don't help us as much.
We also don't want to measure the same thing over and over
in different ways.
I mean, for databases, it's very common
to have something like requests per second
and throughput, something like this.
So throughput in terms of queries per second
and in terms of gigabytes per second.
That makes sense because they're not completely the same,
but there's already some redundancy in there.
So having yet another five metrics that would measure the same thing
would just be unnecessary.
So you don't really want to have this.
But at the same time, you want to be complete.
This is what I said about throughput and latency.
You want to make sure that you capture the whole performance
because you can game one without the other.
You can basically, if you're just, as I said, just looking at latency and not at throughput,
you can improve your latency a lot and vice versa.
Yeah, throughput and latency, some examples.
So throughput could be requests per second, could be concurrent users that you're having, could be gigabytes per second processed, etc. So there's many different ways. Of course,
you should always describe the whole setup, right? So gigabytes per second is different if a single
request is one byte or a single request is a megabyte.
You get very different numbers,
really depending on the setup basically.
Same of course for requests per second or concurrent users.
So you need to specify more in order to give useful numbers.
Also, it's very different if the dataset that I'm
using is one megabyte or 10 terabytes,
basically where I'm selecting from, for example. Latency can be the overall execution
time, so end-to-end latency, or it can be per-request latency. So I can either measure
some kind of average latency, or I can measure the latency of each individual request.
If I'm measuring end-to-end latency,
I might hide some kind of latency spikes in between
if I have some problems somewhere,
like some blocking or something in there.
So I might get better throughput and latency numbers
by just averaging out than if I really
look at everything in detail.
And something a lot of people report are percentiles.
And percentiles are nice, but percentiles are lossy.
So with percentiles, you're basically
looking at the 99th or
95th percentile latencies, which is good.
But if, say, you have a web search engine
where each individual request creates
a hundred different interactions with other services,
then you're constantly beyond the 99th percentile
with your requests.
So then targeting a 99th percentile on each individual server interaction will not be
useful for you.
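A percentile is itself simple to compute from recorded latencies; a common sketch (the nearest-rank method) looks like this:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Nearest-rank percentile: sort a copy and pick the value below which
// roughly a fraction p (0 < p <= 1) of the observations fall.
double percentile(std::vector<double> xs, double p) {
    std::sort(xs.begin(), xs.end());
    std::size_t rank = static_cast<std::size_t>(std::ceil(p * xs.size()));
    return xs[rank - 1];
}
```

Note that this throws away everything above the chosen rank, which is exactly the lossiness discussed here.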
Also, if you're looking at large scale, you will run into
the high tail latencies more quickly.
So this is basically something out of a paper
by Jeff Dean and Luiz Barroso,
where they looked at these tail latencies.
So if you have the 99th, the 99.9th, and the 99.99th percentile latencies,
or failures, for example, then this
shows how quickly you will actually run into these latencies,
that is, what the probability is of seeing a latency beyond the 99th percentile.
And here, for example, it's too small even for me to read,
but at 100 requests you already have about a 60% probability of hitting the tail latency.
And this, of course, very quickly approaches one.
For the higher percentiles
it might take longer, but eventually your tail latencies will hit you.
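The fan-out effect can be checked with one line of math: if each sub-request independently stays under its percentile with probability p, then at least one of k sub-requests hits the tail with probability 1 - p^k (a simplified independence assumption):

```cpp
#include <cmath>

// Probability that at least one of k parallel sub-requests exceeds the
// given per-request percentile (e.g. p = 0.99 for the 99th percentile),
// assuming independent sub-requests.
double tail_hit_probability(double p, int k) {
    return 1.0 - std::pow(p, k);
}
```

With p = 0.99 and k = 100 this gives about 0.63, consistent with the roughly 60% figure mentioned here.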
So that's why I'm actually not a big fan of tail latencies or percentiles. I'm
more a fan of averages and maximum and minimum numbers. Then we have capacity. Capacity is a bit more vaguely defined. It's basically how
much do I get into my system, right? Sometimes we call this sustainable throughput.
Basically, how much can I fill my system before it gets slower? And there's the nominal capacity, which is basically
the maximum throughput
under ideal conditions.
So this is how many requests per second
my system can do unless something happens,
unless somewhere garbage collection comes in
or somebody has PowerPoint open on their laptop, something like this.
So all of a sudden we're slightly slower.
But under ideal conditions, this is the maximum throughput.
Then we have the usable capacity.
This is basically, in a running system,
with everything else going on,
the throughput that I can achieve without
ever breaking the system. So here I already get slightly varying
and higher latencies, but we're never going to queue up.
We never reach the stage where, all of a sudden,
the system is not able to fulfill all of the requests.
At the nominal capacity, so the maximum
throughput, this can happen:
we're at the maximum throughput,
and as soon as something goes bad, I will just
start queuing, queuing, queuing.
I cannot fulfill the requests anymore.
At the usable capacity that just doesn't happen,
but my latencies might go up quite a bit.
And then we have something that's called the knee capacity.
And it's the question if you always can clearly see a knee.
This is basically where I'm starting
to fully fill my system, right?
Where I'm starting for the first time
to see some latency hits.
Well, as long as I'm underutilizing my system, basically I can just increase the throughput or the
number of requests without seeing any hits in latency.
So I can just go up.
At a certain point, I will start seeing the first bottlenecks,
and this is where the knee comes in,
where all of a sudden my throughput gains flatten out
and my latency starts going up.
Initially, as long as I'm underutilizing my system,
the latency should be stable,
while I can get more throughput.
And then, somewhere
in between these two points is where you want to be
with your throughput.
You don't want to be below this, because then
you're completely underutilizing your system.
You don't want to be above, because all of a sudden
you're overloading the system, and then for sure
your throughput will go down,
because all of a sudden you're queuing and the system just gets more and more management tasks to do.
Okay, so then some statistics. This is going to be the last thing we do today;
the benchmarking I'll leave for next week.
And this is just high level,
just for you to think about and get a bit of an understanding here.
If you really want to learn something about statistics, there is a really nice lecture, even this semester, on statistics. I really recommend
going there, because there you will actually learn something about this. Here it's more a
very basic understanding of how we can reason about experiments statistically. Also, I'm not a
statistician, so every time I want to use this, I basically have to open a book and
check how this works again. Okay. In general, if we're doing experiments, we at least try to set them up so that they are independent events, right? So that one
experiment does not completely influence the other.
So that's a precondition to reason about this from a statistic point of view.
I mean, of course, you can also have dependencies in there,
and we will have a lot of dependencies.
So the first run of our experiments, the server is still cold,
the caches are not warmed, etc., all this kind
of stuff, we get different performance.
We want to get to a point where we can run experiments in independent fashion.
Each experiment basically is its own independent run.
In general, independent events are something like successive coin throws.
Hopefully, they are independent.
In practice, of course, it also makes a difference.
The more often you have a coin in your hand, the more sweat is on the coin.
It will influence your coin throw in one way or the other.
From a statistics point of view, each coin throw is completely independent;
we don't see any effect of one coin throw on the next.
And if that's the case, then we can talk about this statistically. Then we have a random variable, which is basically a specified set of values,
each with a certain probability.
So this could be uniform probability.
It could be some kind of Gaussian distribution.
Body mass, for example,
has a Gaussian distribution.
How heavy are people?
How tall are people?
Execution times, for example,
usually have a Gaussian distribution,
and so do errors in experiments.
Then you can calculate a cumulative distribution function, which is basically
the probability of all of the values of a random variable
up to a certain point. So say, for example, body mass again:
if we have a Gaussian distribution,
it runs from zero to, say, 300 or something.
That's the range where we could see body masses,
and probably everything, for adults at least,
will be in the range of, say, 40 to 120 or something like this. So this is where the function gets
close to one, right, going from zero up close to one, but we'll also have some outliers here and there.
And we have a probability density function.
The cumulative distribution function tells us how many we see up to a certain point;
there we integrate, basically.
The probability density function tells us how many are at a given point, right?
So this would be, say, for example, our experiment.
If we have a certain error in our experiment, then the peak of the density is the actual measurement
that we want to see, and we'll have some positive and negative outliers around it; the values near the peak are the most
frequent ones. And the cumulative distribution function is at one
where we have seen all of the events, all of the experiments, and in between, here, we have already seen half of
the events. And this can also be shifted, which means we have, for example, a bias somewhere.
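For the normal-distribution case specifically, both functions have standard closed forms (a sketch; the CDF uses the error function):

```cpp
#include <cmath>

// Probability density of a normal distribution at x: how likely values
// near x are.
double normal_pdf(double x, double mean, double stddev) {
    const double pi = std::acos(-1.0);
    double z = (x - mean) / stddev;
    return std::exp(-0.5 * z * z) / (stddev * std::sqrt(2.0 * pi));
}

// Cumulative distribution: probability of observing a value <= x,
// i.e. the integral of the density from -infinity to x.
double normal_cdf(double x, double mean, double stddev) {
    return 0.5 * (1.0 + std::erf((x - mean) / (stddev * std::sqrt(2.0))));
}
```

At the mean the CDF is exactly 0.5, which is the "half of the events already seen" point described above.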
And then, rather than having this continuous,
we can also have a discrete random variable.
This is basically whenever we don't have a continuous space,
so when we have individual instances,
individual categories, or individual buckets.
Say we're sorting into hash buckets, for example:
we have individual buckets,
and the probability of these buckets will be a probability mass
function, not a probability density function,
because a probability density function is continuous.
Then we can calculate the mean or the average.
For a probability mass function, we can calculate this directly
by summing up the individual values weighted by their individual probabilities.
If we have a probability density function,
we have to integrate instead, right?
Because it's a continuous space.
The variance is basically how wide this function is.
So how wide is the distribution, right?
If we go back here, this is basically the
width of this probability density function, and how wide it is, we can calculate with
the variance. Again, in the interest of time, I'm not going to go through the formulas here.
Just so you have an idea, we can also calculate the coefficient of variation.
That's the standard deviation divided by the mean:
we get the standard deviation out of the variance, divide it by the mean, and that's called the
coefficient of variation.
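As a small sketch, these summary statistics can be computed directly from a set of measurements (using the sample variance with n - 1 in the denominator, the usual choice for measurements):

```cpp
#include <cmath>
#include <numeric>
#include <vector>

double mean(const std::vector<double>& xs) {
    return std::accumulate(xs.begin(), xs.end(), 0.0) / xs.size();
}

// Sample variance: average squared deviation from the mean, over n - 1.
double variance(const std::vector<double>& xs) {
    double m = mean(xs);
    double sq = 0.0;
    for (double x : xs) sq += (x - m) * (x - m);
    return sq / (xs.size() - 1);
}

double stddev(const std::vector<double>& xs) { return std::sqrt(variance(xs)); }

// Coefficient of variation: standard deviation relative to the mean.
double coeff_of_variation(const std::vector<double>& xs) {
    return stddev(xs) / mean(xs);
}
```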
And finally we have the normal distribution, which is the most frequent distribution
that we will see in our experiments.
I mean, you know the uniform distribution,
where all of the values have
the same probability; many times,
if we have randomly distributed data,
we'll find some kind of uniform distribution in there.
However, if we look at experiments, we will
often have some kind of little variations.
And these variations come from different kinds of errors
and different kinds of sources.
And there, through the central limit theorem,
and also in practice, what we actually see
is that we have many small observations
that aggregate together, right?
So if I'm doing my experiment here, it's not just the CPU that counts, or just
one part like the branch mispredictions, but also other things:
How are the instructions queued?
How are the caches warmed?
Where is the data actually located? All of these small things come together,
and they all have their individual distributions.
And since they all are aggregated,
so we have a large number of observations
that are aggregated,
in the end we will get a normal distribution.
So basically all these small experimental errors
will typically accumulate into some kind of normal distribution.
So we can usually measure these normal distributions,
and we should also report them.
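The aggregation argument can be illustrated with a tiny simulation: summing several independent uniform error contributions already yields an approximately normal total (the counts and seed below are arbitrary):

```cpp
#include <random>
#include <vector>

// Simulate `samples` measurements, each the sum of `k` independent
// uniform(0, 1) error contributions. By the central limit theorem, the
// resulting sums are approximately normal with mean k / 2.
std::vector<double> simulate_sums(int samples, int k, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> u(0.0, 1.0);
    std::vector<double> sums(samples);
    for (auto& s : sums) {
        double total = 0.0;
        for (int i = 0; i < k; ++i) total += u(gen);
        s = total;
    }
    return sums;
}
```

A histogram of the returned sums shows the familiar bell shape even though each individual contribution is uniform.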
So in many cases, in systems, database systems, the variance is quite low, such that we don't
even report it.
So if the variance is very low, our measurements are very stable, then we're kind of okay.
We're saying we're not looking at it.
As soon as there is a significant or measurable variance,
then you would usually report it.
So we would report something like two times the
standard deviation:
one standard deviation around the mean covers about 68% of the observations,
and two standard deviations cover about 95% of the observations.
And this is what we typically report if we do these kinds of experiments.
And you might see this with these small whiskers in your experiment.
And if your standard deviation is larger than your actual measurement,
then you know, okay, then the comparisons are going to be hard.
So there's a lot more statistics to this.
If you're curious about this, there's this nice book by Raj Jain that I had in the beginning.
If you want to summarize measurement data,
then you can use the sample mean, rather than calculating
the actual mean, which we cannot do, right?
I mean, we cannot do infinite measurements.
So we use the sample mean: we sum the observations and divide
by the number of observations, which
is the usual mean calculation.
Often, we also want to have the median.
So the median is basically we're sorting all of the observations
and take the middle one.
This helps us deal with very wide outliers.
Every now and then the computer might have a complete hiccup and do something completely different.
My thread is scheduled onto something else, and all of a sudden I have a very different
latency. Maybe I don't want these events
to change my results too much, so then the median might be a better measure than the average. We can also use the mode,
which is the most frequent value. I've really not seen a good use of it,
at least not in experiments,
but it's good to know what it is.
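The outlier argument for the median is easy to demonstrate (the latency numbers below are made up):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Median: sort a copy and take the middle element (average the two
// middle elements for an even count).
double median(std::vector<double> xs) {
    std::sort(xs.begin(), xs.end());
    std::size_t n = xs.size();
    return n % 2 ? xs[n / 2] : (xs[n / 2 - 1] + xs[n / 2]) / 2.0;
}
```

For latencies {1, 1, 1, 1, 100}, say one scheduling hiccup, the average is 20.8 but the median stays at 1.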
And with that, I'm going to skip the last one.
This is basically something on experimentation.
I'll explain this maybe at another point in time.
And then for today, do we have questions so far?
So this was a bit fast regarding the statistics.
If you want to learn more about this,
I really recommend this book,
where you would also learn
how to use statistics to calculate how many experiments we have to do in order to have a significant result.
In our case, usually the variance is low enough that we can just say, okay, we're going to do five or ten repetitions,
and the variance is low enough that we have a
significant result. If the variance is still very high and
we're comparing something, then more experiments might be needed, and how many are needed
is something we can use statistics for: it tells us what the significance of our results is.
Questions other than that?
No? Very good.
Then next week, Martin will do the lectures.
And the week after, I'm going to be back.
Please check Moodle for the programming tasks and the assignment,
and for picking up the programming task. And I'll see you in two weeks.
Thank you very much.