Hardware-Conscious Data Processing (ST 2023) - tele-TASK - Performance Management & Benchmarking
Episode Date: April 25, 2023...
Transcript
Welcome to our next session.
We're finally in the room that we want to actually do this lecture in.
And today we're going to talk about performance management and benchmarking.
I'm probably not going to be able to finish all of it today, but we'll see.
So this is an important topic.
It's near and dear to my heart.
I do a lot of research in benchmarking and evaluation.
And also, if you want to do efficient programming, if you want to build efficient systems, there is a strong need to do a lot of evaluation and a lot of benchmarking to actually see that things are working.
And so this is kind of a baseline or a foundation for your programming later on, et cetera.
And before I go into the details,
I have a few announcements for you.
One is we've set up something,
you should have gotten an email if you're a participant in the course,
something saying active participation in Moodle.
And we ask you, if you want to join the programming tasks, that you actually go in there and register in one way or the other.
So we just know you are actually one of the people that want to do the programming, because we do have to do extra setup for this. We do this in GitHub, and so we need it for the setup. So please sign up there. It's, I think, just a simple click and then you're there. Then, well, we still have a seminar.
I'm just announcing this again because last week there was a lot of confusion about timing, rooms, etc. So somehow everything got messed up a bit. And if you're curious, we're going to do another introduction session today at 3:15 in K102.
And then an additional thing that might be of interest to you.
We have a joint biweekly research seminar with TU Darmstadt.
So with Carsten Binnig and Zsolt István.
So these are two professors, database professors at TU Darmstadt,
which we collaborate a lot with. And
every second week we typically do an online talk.
This week is going to be one of the TU Darmstadt PhD students
talking about multi-modal DBMSs for seamless querying of text and tables.
So the idea is basically being able to query with your SQL queries not only tables, which we usually would do, but also text somehow, probably using language models.
And the details you will find there.
This week it is going to be a PhD student. Often we have industry experts presenting as well, and I will announce those too. So this will also come up, and this is online only. If you're interested, the Zoom link is here; you should also have received an announcement through Moodle, I suppose. At least I hope you received an announcement through Moodle. Then, well, where are we? We're at the end
of the introduction, so now we're in performance analysis, performance management, benchmarking,
so we'll look into how can we find out how performant our code is and what do we actually
want to see there.
How can we find this out?
And then next time, we'll actually start diving into this CPU realm, looking at all, like
really at the cores, etc.
How does this work internally?
Okay, so this is the current plan. I have a small change here already, just to notify you: further down here you can see that I've switched Q&A and NUMA, and this is because I'm not going to be here on that Tuesday.
So if you thought, okay, I'm going to take off this Wednesday and you still wanted to see NUMA, then, well, you have to switch and take off the Tuesday instead. And I will keep this updated. As I said, this is kind of subject to change. Things happen during the semester; I might have to do something else, and then we will realign. I just want to keep you updated there.
Okay, in this lecture, first I will give an introduction to performance analysis
and then one of my favorite tools, back of the envelope calculations.
This is something that I always bug my students with and anybody who does a master
thesis or a project with us, which is somehow performance related. It's basically a tool just
to figure out, do the numbers that I see actually make sense? And then some measurements, some
statistics. This is a very small piece of statistics, just so you know the names, etc.
If you want more details, there's a nice book.
We also have it in the library.
And so there's a lot in there about experimentation: how to measure things, how to set up experiments, especially how often to repeat them, and how to calculate, based on the variance, how many experiments you would need.
And I've skipped all of this. I don't have it in the lecture here because it gets really into statistics, but I see it more as a tool.
So rather than saying, okay, I need to understand all the mathematical details, I basically want you to know where to get it and how to use it in order to get sound experiments.
Most of our experiments in database systems typically are on a level where we would say
variance doesn't really matter that much because the variance is much, much smaller than the
performance differences that we see.
But say for example, if you go to machine learning, often
you really need to use statistics to figure out if your results are significant or not.
And sometimes we also see this, especially if we go to very small hardware parts, there all of a sudden variance becomes large and
performance gains become small. And then say if your performance improvement is, I don't know, 30%, your variance is
20% or something like that, then all of a sudden you have to actually figure out,
like you have to do enough experiments to see if your results are significant or not.
Maybe anecdotally, I will tell you some things here. Then we'll talk about benchmarks. This is kind of an application-level test where we can see what the performance of my system in the real world would be, more or less.
And then some examples of fair benchmarking, or let's say some examples of unfair benchmarking: what you kind of have to think about, and at what level you have to think, if you want to do fair benchmarking. And we might not even get there. If we don't get there, I will do this next time, meaning tomorrow.
Okay, so I come back to this slide. I showed you this in the first lecture already, because in two of the steps of my famous, well, of my seven-step recipe for papers or theses, we have these experimentations.
So first we have the back of the envelope calculations, and then we actually need to
do experiments.
And in, let's say, applied computer science as we do it, you will never have a paper publication without any kind of experiments.
So we need to do this also to verify what we're seeing, whether our assumptions actually hold in a practical setup.
And this is often not the case. If you go more towards theory, then you need to make clear assumptions. But maybe your assumptions can never be 100% proven or disproven; you're just hypothesizing.
Here, however, we're hypothesizing about CPUs and stuff like that, and this is stuff that we can actually measure.
So we basically can say, well, I think this would be my hardware setup. And in most cases, if I'm not talking about very large setups or something like that, I can actually benchmark this. I can test this. At least I can break it down into smaller chunks where I can apply this in the real world. And this is really important
and it will be demanded by anybody that you're working with.
Meaning if you're doing a thesis, your supervisor will demand it.
If you're writing a paper, your reviewers will demand it.
So that's why you need it.
And that's what we will talk about today.
And later on, if you work in a company and you want to submit a performance improvement patch or something like that, people will also want to see sound experiments like this.
And if you can show that this will work for a wide range of applications or use cases, then that's better than if you basically show, well, it works in this specific case, but in practice everything will basically crash.
So having a good mindset about good experimentation is really important. And in general, we're all about systems, right? And if we build a system, we need to evaluate it.
And well, in systems like database systems, whatever systems, we typically think about
performance in terms of speed.
Of course, there's other things as well.
If you think about machine learning, often performance doesn't matter at all, at least in research, right? People will say, well, I can have a marginal improvement in prediction quality; I don't care how much longer this takes. To a certain degree, this is what we're seeing right now with these large language models: they are insanely slow to process and to train. Of course, people will try to speed this up. But in the end, nobody will say, well, this ChatGPT took longer to train or to answer than that one, so I'm going to use the faster one. They're going to use the one that's better in terms of prediction quality.
And for us, in many cases, we don't have soft results; we have hard results. We basically say, I want to perform this query, and the result needs to be exact. Then there's no debate on what the result should be, and it's all about performance in the end.
Of course, other things are reasonable as well.
So usability could be a measure, but it's harder to measure, actually.
Energy efficiency, price performance.
Price performance actually in practical setups in industry is super
important, right? And in practical benchmarks or in standardized benchmarks, you will also need
to specify this. And we typically have a very hard time to specify price performance. So we're all
about just performance. And most of what I'm going to say is all about performance today.
We also make a distinction between benchmark and analysis. This is not necessarily standardized, but it's sort of what people would understand.
So, if you do evaluation or analysis, this means you're looking at a single system or algorithm.
And of course, there's everything in between as well.
Or you might look at individual optimizations.
So you come up with a nice algorithm, say a new join algorithm, and then you're analyzing it deeply.
So you're figuring out where does time go if I run this.
So how much time is spent on reading the values?
How much time is spent on comparing the values?
How many comparisons do I need to do?
Stuff like that.
And then I can do individual optimizations.
I can do micro benchmarks.
I can do breakdowns.
And this is basically stuff that you could see in one of our papers, right?
So we will have these breakdowns.
We will see where does time go actually.
And then we have a benchmark. A benchmark is a comparison of multiple systems, multiple approaches.
It could be a benchmark that's kind of small, just comparing individual systems on individual operations, or it could be this kind of application-level big benchmark, which could be standardized, or you could use a real workload to figure out how your system is behaving, and then you're comparing different kinds of systems. So here, for example, is the same paper again, basically, with different results.
On top we have the breakdowns, which I would call an analysis; on the bottom we have a comparison. It doesn't really matter what exactly is in the pictures, but you can see we're basically benchmarking a couple of different systems again on different kinds of workloads.
And a good paper or a good thesis has both of them.
So on the one hand, I want to see how good is this in relation to other work, especially
whatever is related to this.
And on the other hand, I want to see why this is good, right? Just getting the number that my system is 10 times faster than another system doesn't really cut it in many cases, because I also need to know why that is. It could just be a spurious setup; you could have done something wrong with the other systems, etc. So if you're building a new system yourself, you really want to dig inside and see what's going on in there.
So in order to understand system performance, we want to measure everything. This is kind of the thing that everybody will ask for, but on the other hand, it probably won't be enough on its own. Measuring something, especially something you build yourself, means you need to have built it already. And that also means you already had to make a lot of assumptions before you got a result.
And your assumptions just might be wrong, and because of that it often makes sense to start with some modeling. One of the types of modeling that we do is the back-of-the-envelope calculation, and we will come to this, but you can also have very advanced, very big models for any kind of thing. And then we also do measurements. For measurements you need some experimental design, you need workloads, you need benchmarks, etc.
Another thing you can do, which is again different, would be simulation. In a simulation you're not actually implementing your system completely, but you're simulating how it would behave. This is often useful if you are working with modern hardware, or hardware that doesn't exist yet or that's just being built. So you build some kind of prototype that simulates the hardware. Or you have a huge setup: say you're doing research in sensor networks that span across the globe. This is something you probably won't have access to, at least not in the beginning of your thesis, so in that case it makes sense to do some simulation. In our case we're doing stuff on servers, on laptops, etc., so this we can usually measure, and then simulation is not the best idea, because with measurement we don't have to make assumptions.
However, as I already hinted, we don't want to do just one of these, right? The rule of validation, by Raj Jain and others, basically says: do not trust a single technique, but validate your results with another technique.
And for us, this often means we're validating measurements with a model. Or vice versa: we're building a simple model that gives us a baseline result, and then we're validating the model with some measurements later on. We're making sure that both align. If we have a basic model about the performance that we expect to see, and we're doing the experimentation and we see something completely different, well, then our model was wrong, and our assumptions were wrong. The other way around is also possible: we're doing some measurements, we have a model, and the two don't align; we could just be measuring the wrong thing. What often happens is that we think we're measuring a system, but actually our driver, our tooling for the benchmarking, is so slow that this is basically what we're measuring. Then the system doesn't do anything; it's just our benchmark that's crappy, and because of that we might not see any difference between different techniques. That would be a typical example. Or we think we're fully utilizing a server, but we see everything's idle; well, then maybe something else is the bottleneck.
So we always want to do multiple things. For us, often the first step is the back-of-the-envelope calculation, and then the actual measurement. And if the two align, then hopefully we did something right and
we get the right performance. Okay, so let's get into the back of the envelope calculation.
And back of the envelope calculation is called back of the envelope calculation because it's so simple or it's supposed to be so simple that you can do it on the back of an envelope.
So it's basically the one piece of paper you might have flying around.
If you're in a bar at night, you might use a beer coaster and do your calculation there.
And that's the level of complexity that you should use there.
And the idea is like you're trying to understand your application, you're going for the most
important performance numbers and estimate your system performance within an order of
magnitude.
So that's actually quite coarse, right? This is a very rough estimate, but it gives you an idea, or at least a hint, whether your idea or whatever you're trying out might actually work or whether it's stupid. This is also called a bullshit filter because of this: often we think, well, this might be a good idea to start something with, we build a huge system, and in the end we find out that our base assumption about where the performance bottleneck actually is was wrong, and then we're getting stuck somewhere. So this is basically just to make sure that, at least in theory or at least from basic performance numbers, this might work.
And, well, eventually we will have to benchmark. But at least the very simple stuff we can check beforehand.
And so I will give you an example in a second.
But for this, and you will see this slide again, I didn't really update it: this is by Jeff Dean. These ideas, and at least some of the slides, are from Jeff Dean, and he's one of the Google systems people. He's one of the core engineers at Google who uses this all the time to basically reason about the performance of their systems. And you can see this a lot if you read their papers: the kind of thresholds they use, etc. This all comes out of these basic calculations.
Okay, so few latency numbers. And this is not for memorization.
It's just to get an idea and also maybe as a reference.
You can get newer numbers for this.
You can get more accurate numbers for your hardware, for example,
just to get a basic understanding.
So if we're in L1 cache, we're below nanosecond level.
If we have a branch misprediction in the CPU, and we'll see this in a bit, this is basically something where we pay five nanoseconds. Meaning, if the CPU knows what it wants to read or add, most operations will take a clock cycle or so; but if I have a branch misprediction, all of a sudden the CPU needs to do a bit more, a couple of additional cycles. So this will cost us more.
If we don't have our data in L1 cache, but we need to go through L2 cache to L3 cache, then we're in the 20 nanoseconds already. So this means, I don't know, 50 cycles, something like that, maybe even more.
Then, if we have mutexes, so some locks for something, that will take us some time. If we go to main memory, again many cycles, because we're going out of the caches all the way. The same if we want to do compression.
And then there's more examples, right?
So if you're reading sequentially from memory, say one megabyte, we're in the 100 microseconds.
If we go to NVMe, so if we're reading a small amount of data from an NVMe drive, meaning an SSD attached through PCI Express, then we're in the 50 microseconds.
A disk seek is 10,000 microseconds, or 10 milliseconds, already; reading something sequentially from disk, like one megabyte, is 20 milliseconds; and then 150 milliseconds if we're sending something like a round trip from California to Europe, for example.
And this is useful if you're basically planning on how to lay out your data, how to set up your system.
So as I said, I'm going to give you an example in a bit.
Then some throughput numbers. Say, if you're talking about DDR4 memory, then per channel we'll get a bandwidth of about 20 gigabytes per second.
PCI Express Gen 3, per lane or with 16 lanes; there we'll have to ask Marcel later, I'm always getting mixed up with the PCI Express numbers. But for PCI Express Gen 3, the maximum throughput that we get across all lanes would be 12.5 gigabytes per second.
A typical flash drive, NVMe, and the numbers vary a lot there: we can get two gigabytes per second, and with some newer drives we can get close to PCI Express speed. And the general bandwidth of this could also be 6 gigabits per second.
And so on and so forth. So how do we use this?
And so for this, I'm going to give you a simple back-of-the-envelope calculation example. This is by Jeff Dean as well; it's the Google result page example.
So how long would it take to generate an image result page with 30 thumbnails?
So in the first design, if you're just coding this down, I don't know, as a Python script or something, you read serially and you thumbnail the 256 kilobyte images on the fly. Then you basically have to read all these individual images, so 30 images from disk. And reading them from disk is the main cost actually, at least on the server side.
And here we're reading from a classical disk. Google still uses classical disks for large amounts of the data because it's just so much cheaper, or at least enough cheaper that it makes sense.
And we already said: on a disk, a spinning disk, you first need to seek and then you need to read. If you do this image by image, then each of the individual images will be a random read, and you will do one after the other. So first we're going to seek; a single seek is 10 milliseconds, and we're going to do 30 seeks. And then we're going to read 256 kilobytes with a read speed of, say, 30 megabytes per second. A fast drive would give you something like 100 megabytes per second, but if you have a RAID and whatnot, you might be slower. So then you do 30 individual reads of these 256 kilobytes.
So all in all, you're at 560 milliseconds.
And this is half a second, and this is just on the server side.
So this means basically half a second before I'm actually sending anything across the network,
just reading the data.
And then I might even have to still thumbnail this, etc.
So this is actually too slow, right?
So half a second for page generation for a website.
Most of the pages that you're clicking on
will probably already be loaded on your phone
within half a second, right?
So that's too slow.
So what could we else do?
We could do this in parallel.
What happens if we're issuing all these reads in parallel?
Hypothetically, we only have to pay the cost, the latency once.
Throughput is probably not a problem, because we're not going to read just from one disk; we're going to read from many disks. So we're going to do all of the seeks in parallel, and we're going to do all of the reads in parallel.
That means we're going to have this 10 millisecond seek, but we're doing all of them at the same time, and likewise this 256 kilobyte read at 30 megabytes per second. So there we're all of a sudden down to about 18 milliseconds, if we can use 100 percent full parallelization.
Right, so we're not going to fully get this; probably there will be some variance here and there. But we see we're just so much faster, more than an order of magnitude faster.
And then the server side is not a problem anymore.
All of a sudden, we just have to deal with basically the data transfer,
make sure that this works well.
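Just to make the arithmetic concrete, here is a minimal sketch of this estimate in C++; the latency and bandwidth constants are the rough numbers from the slides, not measurements of a particular system:

```cpp
#include <iostream>

int main() {
    // Rough numbers from the slides (orders of magnitude, not measurements).
    const double seek_ms       = 10.0;   // one disk seek
    const double read_mb_per_s = 30.0;   // effective sequential read speed
    const double image_kb      = 256.0;  // size of one thumbnail source image
    const int    num_images    = 30;

    const double read_ms = (image_kb / 1000.0) / read_mb_per_s * 1000.0;  // ~8.5 ms per image

    // Design 1: read the images one after the other (seek + read per image).
    const double serial_ms = num_images * (seek_ms + read_ms);            // ~560 ms

    // Design 2: issue all reads in parallel; ideally we pay seek + read only once.
    const double parallel_ms = seek_ms + read_ms;                         // ~18 ms

    std::cout << "serial:   " << serial_ms   << " ms\n"
              << "parallel: " << parallel_ms << " ms\n";
}
```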
And this is, of course, not the end. We can still do further variations, right?
So we can still do some caching.
We can have caching for whole sets.
We can pre-compute the thumbnails.
And this back-of-the-envelope calculation is a simple example, right? But with it we can figure out what makes sense and what doesn't make sense. Often you might think that just by thinking about it and doing quick math in your head, you will come up with the best result. But you might be wrong, because integration is something the brain can't do very well. If you have to sum up something large, that's done much more easily on a sheet of paper.
And often you need some kind of simple measurements.
So say for example, you don't know how fast your disk is,
you don't know how fast your network is, etc.
You might need to do some very simple experiments
to figure out the numbers that you're
using for your back of the envelope calculation.
If you have these kinds of numbers, fine; if you can work with those, then work with those. But if you need something else, if you have a more complex system, then you might have to go do some small experiments and then use those for the back-of-the-envelope calculation.
Typically you can get these numbers quickly and then have a good assumption about the performance that your system, or a subcomponent of it, might have, before you actually start building the whole thing, which often makes a lot of sense.
And so let me give you a second example that is more catered to what we're doing here. The question is: how long does it take to sort one gigabyte of data of four-byte numbers on a modern CPU, single threaded? So just guess, right? How long would you think this takes? 10 milliseconds, that's one guess. Yes? 100? 100 milliseconds? OK, so let's see.
Read one megabyte sequentially from memory.
We have a number for this, right?
So this would be 100 microseconds, for example.
So this we have to do 1,000 times.
So we're at 100 milliseconds already.
And we have to do this multiple times in order to actually sort the data.
But that's not the only thing that we need to do.
So, well, first we need to break it down. What do we do? We have one gigabyte of data, so 2 to the power of 28 integer numbers, since an integer is four bytes on most 32- or 64-bit machines. On this machine, for example, this will be 2 to the power of 28 numbers.
If we're using quicksort, which most sorting implementations today will use because it's one of the fastest algorithms, that means we do a logarithmic number of passes over the 2 to the power of 28 numbers. The logarithm of 2 to the power of 28 is basically 28, right? Makes sense, because that's the logarithm. And 28 is a bit less than, but more or less, 2 to the power of 5, since 32 would be the next power of 2. So all in all, we're doing 28 passes over the 2 to the power of 28 numbers, and in each pass we need to compare all of them. So we need roughly 2 to the power of 5 passes times 2 to the power of 28 numbers, which is 2 to the power of 33 comparisons.
And that's basically one of the major things that will take our time. The other thing is that with these 28 passes over the data, we also need to read the data 28 times.
And while we're doing the comparisons: a comparison itself is a single thing, but what modern CPUs do is branch prediction. They basically say, well, I guess this number will be smaller. If the guess is right, then the CPU is fast and already has all the instructions laid out for the next steps. If the CPU is wrong, then we have a branch misprediction, and we have to spend a couple of extra cycles to get back to the right branch that we need to take. And because this data is random, half of our comparisons will be mispredictions. Not all of them, but half of them. That means instead of 2 to the power of 33, we will have 2 to the power of 32 mispredictions. Again, this is a bit higher than what we actually have, because we rounded 28 up to 32, so 2 to the power of 5; in reality it's going to be a bit less than 2 to the power of 33 comparisons and 2 to the power of 32 mispredictions.
And each of these mispredicts we said is 5 nanoseconds.
So all in all, just the mispredictions would be in the order of 21 seconds, assuming that my CPU here actually takes five nanoseconds per mispredict. Then, as I said, we also need to read the data: 2 to the power of 30 bytes times 28 passes, so one gigabyte times 28, meaning we're going to read 28 gigabytes from memory. With memory being roughly 4 gigabytes per second, that's something like 7 seconds.
And these latency numbers, the numbers that Jeff Dean used, are already a bit older. It's not going to change a lot, so it's not going to be much faster on this machine, but it will be a bit faster. So the estimate is roughly 30 seconds to sort one gigabyte on one CPU.
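As a rough sketch, the same estimate in code, again using the coarse numbers from the slides; all constants here are assumptions rather than measurements:

```cpp
#include <cmath>
#include <iostream>

int main() {
    // Coarse back-of-the-envelope constants from the slides (assumptions).
    const double n             = std::pow(2.0, 28);  // 4-byte integers in 1 GB
    const double passes        = 32;                 // log2(2^28) = 28, rounded up to 2^5
    const double mispredict_ns = 5.0;                // cost of one branch misprediction
    const double mem_gb_per_s  = 4.0;                // sequential memory bandwidth
    const double data_read_gb  = 28.0;               // 1 GB read in each of the 28 passes

    const double comparisons    = n * passes;        // ~2^33
    const double mispredictions = comparisons / 2.0; // random data: about half are mispredicted

    const double mispredict_s = mispredictions * mispredict_ns * 1e-9;  // ~21 s
    const double memory_s     = data_read_gb / mem_gb_per_s;            // ~7 s

    std::cout << "mispredictions: " << mispredict_s << " s\n"
              << "memory reads:   " << memory_s << " s\n"
              << "total estimate: " << (mispredict_s + memory_s) << " s\n";  // ~28-30 s
}
```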
And now we can try, right? We can check if this still holds, and we said being within an order of magnitude is what we're aiming for. So I'm going to show you the program, and then I'm actually going to run it. The program is a very simple C++ program. Well, actually, I have it here; I'm going to show it to you. I hope you can read this, it's large enough. Okay. No, this is the wrong program. Where do I have my program? Oh, I'm not on split screen anymore. Let me just stop this briefly. So, here we go. Sorry about this.
Okay. We have our vector, so basically an array with 2 to the power of 28 numbers. What did I say? Yeah, 2 to the power of 28, because each number is 4 bytes, so 2 to the power of 2 bytes, giving 2 to the power of 30 bytes in total. Then we're reserving the space.
We're creating something that gives us the duration, so a variable for the duration. We're initializing the array with random numbers, just going through the whole array and initializing it.
Then I'm printing something out so I can see how large the array is: how large is an individual int, how large is the array in megabytes, and what does the array initially look like? So you can actually see that it's random in the beginning.
Then I'm going to start my timer, and I'm just going to use standard sort. Standard sort uses quicksort internally, but it's an introsort, and I'm going to give you access to this program. You can see that it's an introsort if you change this here, for example. If I'm using fewer distinct numbers, so if I'm sorting only a thousand or ten thousand different values, say from zero to nine thousand nine hundred ninety-nine, then it's not going to use quicksort but switch to an insertion sort, because it sees that there are not that many different numbers and the sorting will actually be faster that way. You can also see this in the way the sorting starts in the beginning: it will do fewer comparisons, and at some point quicksort doesn't really make sense anymore, so it will change to something else. You won't see it here, but if you play around with this, you can see it.
So then we're going to do the sort. Because the input is so large, it will actually use the quicksort. And then we're just going to take the time.
I'm going to output the start of the array, so the first values after the sorting, and then output the total duration of the whole thing. Now let's try this, and I hope I didn't break anything. It's going to take a bit.
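For reference, here is a minimal sketch of a program along the lines just described. This is my reconstruction from the description, not the original course code; the variable names, the random generator, and the output format are assumptions:

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

int main() {
    const std::size_t n = std::size_t{1} << 28;  // 2^28 four-byte ints = 1 GB

    std::vector<int32_t> data;
    data.reserve(n);

    // Fill the array with random numbers.
    std::mt19937 gen(42);
    std::uniform_int_distribution<int32_t> dist;
    for (std::size_t i = 0; i < n; ++i) {
        data.push_back(dist(gen));
    }

    std::cout << "int size: " << sizeof(int32_t) << " bytes, array size: "
              << (n * sizeof(int32_t)) / (1024 * 1024) << " MB\n";
    std::cout << "first values: " << data[0] << " " << data[1] << " " << data[2] << "\n";

    const auto start = std::chrono::steady_clock::now();
    std::sort(data.begin(), data.end());  // typically introsort: quicksort + heapsort + insertion sort
    const auto end = std::chrono::steady_clock::now();

    std::cout << "first values after sort: " << data[0] << " " << data[1] << " " << data[2] << "\n";
    std::cout << "total duration: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
              << " ms\n";
}
```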
So now we have the array set up. You can see these are the first numbers in the array; it's just random numbers. And then it should take a few seconds. So, our total duration in milliseconds is 19,000, so close to 20 seconds, basically.
You can see, well, it's not bad, right?
So now I want to show you that this is even better still; we can go into more detail. I'm not going to run it again, because we would have to wait again. I profiled this earlier, and I hope you can see it somehow. Here we can see that I measured the total instances of branching conditions, which is basically how often we compared, and the total instances of mispredictions, so how often the branch prediction did not succeed. And here we can see that we have 10 billion comparisons, close to 11 billion, and we have close to 3 billion mispredictions.
And so if we go back to the slides now.
Okay.
So then, well, when I did an earlier run, I had about 20,000 milliseconds, so 20 seconds. And if I calculate how many comparisons I had: I had 2 to the power of 33 comparisons, and I had 2 to the power of 31, a bit more than that, branch mispredictions. And this is fairly close to what we actually estimated: 2 to the power of 33 comparisons, 2 to the power of 32 mispredicts. So you can see that this really is where the time goes, and this also gives us a tool to estimate the performance of even close-to-hardware experiments.
And because we're so far in right now, it's 11.45, let's do a break here. So let's do a quick five-minute interruption,
and then we're going to continue with measurements and metrics.
So let's continue.
So as a reminder, we will have a session on profiling.
So meaning what I showed you just now with the profiler, et cetera, we'll have like a
more general session on how you use profiling, et cetera.
And Lawrence will do that with you.
So we've been doing a lot of really low-level performance work recently, so I think it makes sense to think about these things and to give you some guidance.
You will also be able to use this later on
in the programming tasks.
Okay, so talking about measurement and metrics.
And first, I want to give you some basic terminology.
If we're talking about benchmarks,
often you will find the term SUT, or system under test. This is basically what we're benchmarking, what we're evaluating.
And it's also important to always be aware that if we're doing experiments, it's always a deployment that we're measuring. I cannot just measure an algorithm in isolation if I do actual experiments. The numbers, the results that I see, are typically comprised of hardware, software, and data, so all three play a role, and of course a lot of other conditions as well. Even things like the temperature of the CPU will make a difference, some of it more, some of it less. In most experiments, as I told you, we're talking about large performance differences, and then it's not that problematic.
But as soon as you're talking about small differences, even the temperature of the CPU, or the stability of the power that the system gets, might make a difference in your performance numbers.
So it's important to be aware that everything you have there is basically part of your system under test. Then you have some benchmark tooling that connects your driver, whatever pushes your workload into the system under test, with the system under test, and that also collects some kind of metrics. The metrics will be stored as some form of measurements, hopefully in something like a database, in a way that you can easily read and hopefully also process later on.
And then, well, we have the workload and the data.
If we're talking about a database system, it's always the two. In other cases, it could just be a workload, or we're just iterating over the data. But in a database system, we have our tables, which would be the data, and we have the queries, which would be the workload. In many systems, we will have something similar.
And all of this, of course, influences your results. So you need to be aware that using this or that data makes a huge difference, using this or that workload makes a huge difference, and using this or that hardware makes a huge difference. You need to figure out how much flexibility you have, or whether you can benchmark in a very confined setup where you make sure that you're only measuring the differences that you actually want to measure.
And before you start measuring and benchmarking, you basically want to answer a couple of questions. Which kind of scenario do I want to evaluate? Meaning, what kind of data makes sense and what kind of workloads make sense? Do uniform random numbers cut it? Is this what I would see in reality? If I use uniformly distributed numbers, caches probably won't work well. If I have a Zipfian distribution, caches will work much better. If I have just the same number, or I'm just running the same query over and over, everything might already be in the cache, and I don't actually see real processing anymore.
And so the question is, can an existing benchmark already supply those? There are a lot of benchmarks; I'll introduce a few of them later, or at least give you pointers on where to look.
And so that's kind of one question.
Then which kind of hardware do you want to use
and which kind of software?
So if I'm benchmarking a server database system, it doesn't really make that much sense to run it on my laptop, right?
And then also, what kind of metrics do I want to measure? If I know which ones I want to measure, how do I measure them and how do I compare them? This is not necessarily easy: you can easily come up with measures that are not really comparable with each other. Typically we're looking at some kind of throughput numbers, but even if you go to database papers, you will find tuples per second, or rows per second, or gigabytes per second, or latency numbers in seconds.
So the question is: are the units comparable, and are the units understandable for somebody else? And even if the units are understandable, do I have enough information to compare my results with something else?
And when you're doing an analysis,
it's really this crime scene investigation.
Basically, we're going to try to figure out what is really going on.
To some degree, we can.
At a certain point, we can't.
So if we're talking about hardware, at a certain point the hardware vendors will make sure that we don't know what's going on, because these are trade secrets: how certain cache replacement strategies, certain protocols, etc. work internally. So there's a certain level below which we don't see anything anymore. But up to that level, we can investigate; we can do profiling, et cetera, and figure out what's going on in our system. Okay. So what do we measure? There's a couple of common metrics that we can
measure. Typically, throughput and latency. And I always want to see both because of course I can have unlimited
throughput if I have unlimited latency.
So that's very easy.
And then, if we're not talking about SQL results or relational database results, but about some form of prediction, then I might want to see some accuracy.
An interesting metric also is capacity.
So how much can I actually fit into my system
in terms of workload, in terms of users, et cetera?
And there's always a trade-off here; basically, you need to set certain thresholds. But throughput and latency are the typical ones that we always want to measure.
We could also have something like fault tolerance, so time to failure and availability. If you have a data center, this is interesting, for example. Our data center regularly has outages, not power outages but cooling outages, and this is detrimental for availability. As soon as we have to turn down the servers, your availability suffers; availability is usually expressed in the number of nines, as in how many percent of the time is the server available. And, well, we're probably not at many nines there.
Then efficiency, which is super important and becoming more important: cost efficiency, of course, but also energy efficiency, in the sense of how much energy the execution will cost all in all.
Also fairness: how fairly are different kinds of users treated, for example. Not only in terms of machine learning fairness, but also in terms of scheduling: if I have multiple jobs and I'm always only scheduling certain types of jobs, others might just starve. Then my system might still be faster overall, but some of the jobs are not treated fairly.
And then scalability is basically how well my system scales in relation to, say, a single node, two nodes, etc.
So with two nodes, do I get twice the performance that I have on a single node, for example?
There's different ways to measure this
and we're not going to go into detail here because we're not so much about
scalability.
How do I select these metrics? Well, I want low variability, meaning I don't want the numbers to have a very high variability or range, so that the numbers are not as clear, basically.
I also don't want them to be super redundant. Some numbers I will often have in redundancy, say tuples per second and gigabytes per second; that, for example, is a redundant metric, because it basically tells me the same thing: if I know how large a tuple is, then I can convert between the two, but it's often more convenient to have both. On the other hand, you might have more complicated metrics, where you combine different kinds of metrics, different kinds of measurements, into a single metric, and all of a sudden you have a lot of redundancy in there. I don't know, in prediction, something like the F1 score combined with something else. And then, all of a sudden, the information might become too redundant.
However, you definitely want to have completeness.
And this basically is exactly this example of throughput and latency.
If I'm just reporting throughput, I can basically have insane latency and just make sure that I measure the best possible throughput, but it would never be usable in an actual system because the latency is just way too high. So if there are two metrics that belong to each other, say throughput and latency, I definitely want to measure and report both of them. Okay, so a bit more on throughput and latency.
Throughput, again, we can measure in many different ways. We can have gigabytes per second, but we can also have the number of requests per second, concurrent users, or transfers per second. Say, if you're talking about PCI Express or these kinds of interconnects, they will often say gigatransfers per second, stuff like that. And then you have to know how to convert between these. So there are many different ways, and hopefully you give all the information so that I can convert between them easily. Then we're on the safe side.
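As a small illustration of such a conversion, a hedged sketch; the 100-byte tuple width and the 5 million tuples per second are made-up example values:

```cpp
#include <iostream>

int main() {
    // Hypothetical example: a system reports 5 million tuples per second,
    // and we know (or measured) that one tuple is about 100 bytes wide.
    const double tuples_per_s    = 5e6;
    const double bytes_per_tuple = 100.0;

    // Redundant but convenient second unit: the same throughput in GB/s.
    const double gb_per_s = tuples_per_s * bytes_per_tuple / 1e9;

    std::cout << tuples_per_s << " tuples/s * " << bytes_per_tuple
              << " B/tuple = " << gb_per_s << " GB/s\n";  // 0.5 GB/s
}
```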
Latency can be the complete execution time, so how long did everything take, or it could be per individual request, which is of course also interesting. Per-request latency is often hard to measure, especially if we're talking about fast systems. If I would measure, like in the sorting example, every individual operation that's done in the sort, then just the measuring itself would take more time than the actual execution, and the numbers that I get would mostly reflect the measurement overhead. So then you need to take larger samples, larger numbers of executions, and average over those. But still, you want to make sure that you somehow also catch the outliers, where you see, okay, all of a sudden we have latency spikes, etc. You don't want to miss those.
That's always a trade-off. In larger-scale setups and in companies, you will often see percentiles. That basically means: what is the 99th or 95th percentile? So the latency of my 99% fastest requests, this is the number that I'm reporting. And that's fine, but it's also, of course, incomplete, because I'm basically just dropping the slowest 1%. And if on average I always have a hundred requests or a hundred interactions per individual larger interaction, say we're talking about the 99th percentile of disk accesses and each of my jobs does 100 disk accesses, then for sure the 1% beyond the 99th percentile will hit me in almost every individual interaction, and there you will get higher tail latency. So I basically prefer averages or maximum numbers, for example, or at least you would want to report those additionally. Even if you don't want to report them, you might measure them just to do a sanity check that nothing is falling off somewhere.
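A minimal sketch of how one might compute such numbers from a set of latency samples; the sample values are made up, and percentile definitions vary slightly between tools:

```cpp
#include <algorithm>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    // Hypothetical per-request latencies in milliseconds, including one spike.
    std::vector<double> latencies_ms = {1.2, 0.9, 1.1, 1.0, 35.0, 1.3, 1.1, 0.8, 1.2, 1.0};

    std::sort(latencies_ms.begin(), latencies_ms.end());

    const double avg = std::accumulate(latencies_ms.begin(), latencies_ms.end(), 0.0) /
                       latencies_ms.size();
    const double max = latencies_ms.back();

    // A simple p99 approximation: index into the sorted samples.
    const std::size_t idx = static_cast<std::size_t>(0.99 * (latencies_ms.size() - 1));
    const double p99 = latencies_ms[idx];

    // With so few samples, p99 misses the 35 ms spike, while max catches it.
    std::cout << "avg: " << avg << " ms, p99: " << p99 << " ms, max: " << max << " ms\n";
}
```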
Something softer, or not as hard a metric, is capacity. This is basically how much throughput I can achieve under ideal conditions while not getting too much latency, for example. And this is, as I said, not a hard metric. Typically we're saying, well, what is my latency threshold? For example, my requests cannot take longer than a second. Then I would ask, what's the throughput that I can get there? And this would then be my target throughput.
And throughput might look something like this, not always, but often. Initially the throughput is increasing more or less linearly, because we're not contended anywhere, so everything is getting faster and faster. All of a sudden we hit the first contention points, so some of the requests might get slower, etc., and we're not gaining as much anymore. This point is called the knee capacity. But you don't necessarily always see a clear knee point where all of a sudden the performance drops off; typically it's a more rounded curve. And then at some point our system is saturated, so we're not going to get more performance. If we put even more pressure on the system, we're actually going to get less performance, because we're overloading the system.
And capacity-wise, we want to be somewhere not completely 100% saturated, but a bit below that, so we get good utilization without overloading the system. Along with this, the response time will of course increase. While the system is not fully utilized, we might not see a change in response time. But as soon as the system is fully utilized, or close to full utilization, the response time will go up; things will get queued at a certain point. As I said, the response time will become unbounded at some point, because we're just overloading the system and it's not doing anything useful anymore.
And usable capacity would then be the maximum throughput achievable without exceeding a certain response time. This is also sometimes called sustainable throughput. This is basically the maximum load I would want to put on my system, not more than that, because then I get a nice response time and my system is nicely utilized, but I'm not overloading it. And, well, as I said, the knee capacity is something you might see, or you might not see.
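A small sketch of how one could derive usable capacity from a load test, assuming we recorded offered load, achieved throughput, and a latency percentile per step; all values here are invented for illustration:

```cpp
#include <algorithm>
#include <iostream>
#include <vector>

struct LoadStep {
    double offered_rps;     // offered load (requests per second)
    double throughput_rps;  // achieved throughput
    double p99_latency_ms;  // observed tail latency at that load
};

int main() {
    // Hypothetical measurements from a load test, increasing the offered load step by step.
    const std::vector<LoadStep> steps = {
        {100, 100, 5},    {500, 500, 6},     {1000, 990, 9},
        {2000, 1900, 40}, {3000, 2100, 900}, {4000, 1800, 5000},
    };

    const double latency_threshold_ms = 100.0;  // our response-time requirement

    // Usable capacity: the highest throughput observed while latency stays under the threshold.
    double usable_capacity = 0.0;
    for (const auto& s : steps) {
        if (s.p99_latency_ms <= latency_threshold_ms) {
            usable_capacity = std::max(usable_capacity, s.throughput_rps);
        }
    }

    std::cout << "usable capacity: " << usable_capacity << " requests/s\n";  // 1900 in this example
}
```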
So, we did our break, so now let's go into some statistics. This is more so that you know the words and the right terminology. If we're talking about measurements,
we're talking about hopefully independent events.
An independent event in statistics basically means one occurrence does not affect the other. And again, if we're talking about measurements, we have to make sure that this is the case if we want to have statistical significance, or if we want to talk about our results in statistical terms. If everything is interfering with each other, then the results are not independent and we
cannot basically say we're talking about independent events. So, successive coin throws would be typical independent events.
Of course, never 100%, but more or less independent.
And hopefully, successive experiments are set up in a way that they're independent.
This is why we typically have some warm up tasks or something like that.
So we're heating up the caches, for example,
and we make sure every experiment is set up in the same way. If we do something like we have
like a fresh database, nothing done, and we do the first experiment and then we use the same thing
for the second experiment, the experiments will not be independent, right? Because we have caches that are set up in one way or the other.
If we have like a full disk and we're adding some more,
we're changing the disk behavior so it's not necessarily independent.
And that's something that you will quickly see.
So, I mean, SSDs are a bit more flexible here, but with hard drives, as soon as something is fragmented, performance can be vastly different. We're talking about a 20% performance difference or something, just depending on where on the disk your data is located.
So this is something that you got to keep in mind.
But now we're talking about independent events.
And if we have independent events, then we can talk about a random variable.
So random variable would be a set of values with a specified probability. And say, for
example, body mass. So our body mass, like each individual body mass, that could be specified by a random variable.
Or the execution time of an experiment would be a random variable.
Or the error of an experiment.
So how much are we wrong in the experiment actually?
So because of some other measurement errors, for example.
So this is a random variable that will follow a certain random distribution.
And body mass is kind of a good example.
So people have certain body masses, and there's certain averages.
But then we have a random distribution around this.
And we describe this with a random variable.
And then we can have a cumulative distribution function that just sums up all of the probabilities. So, staying with body masses: if we have all possible body masses on our X axis here, this would be the average body mass. I have no idea what the average body mass in Germany or in the world is; definitely less than mine. Here would be zero, and here would be, I don't know, 300 kilos or even more. And we know that if we go closer to 300 kilograms, we have definitely found all instances that we want to find, and if we go below a few grams, we definitely haven't found any instances yet.
So this is the cumulative distribution function, and it describes, up to a certain level, how many of the individual instances would be in there, or the corresponding fraction for a continuous distribution. If we say, I don't know, 65 kilograms or something like that, which part of our set would we find below this? Or, say, for experiments, a 10-second execution time: what would be the percentile that we have here? That's the typical use of a cumulative distribution function. Going here, for example, for our latency, say a website with one second latency, then we want to have this at the 99th percentile, for example, or even stricter, 99.99 or something like that. So that's the cumulative distribution function. Then
we have the probability density function of a continuous random variable. This basically tells us what probability density each individual value has. So again, body mass: say 60 or 65 kilos might have the highest density, and something like, I don't know, 300 kilos has a very low density, because there are few people around that weight. Body height would also be a typical example of this. And the same holds in our experiments.
So say we are assuming we have a one second latency.
So most results will be in there.
But we will also see some variance to left and right.
And depending on what this distribution looks like or depending on how we're measuring,
we might see different
types of distributions here.
So this would be a Gaussian distribution, and this can be around zero, can be around
anything.
And the interesting thing is that many probability distributions, eventually, if we're summing them up... well, I'll come to that on the next slide; let me not get ahead of myself here. This is the probability density function. Something similar also holds for discrete random variables; this one is continuous. In experimentation, the continuous version is the theory: we're basically saying this is how we expect things to behave. In practice, once we're measuring stuff, things are discrete, right? We have individual instances, we have a fixed set of experiments, and these would then have a probability mass function. So we don't have a density function, but a mass function, where each individual value that we've measured, or each bucket, has a certain mass. Also, we cannot measure at an infinitely fine granularity; just because of that, we'll already have a probability mass function.
So out of those, we can calculate the mean or expected value. If we have a certain probability density function, or a lot of different experiments, we can ask: what's the mean, what's the average, what is the expected value? I'm not going to go through the formulas here; you can look them up if you ever need them. This is basically just an average, and this is what you calculate. But more interestingly, you can also calculate the variance, and this is something that you will need in your papers, or to find out whether your results are significant. The variance is basically the mean squared deviation from the mean. And we can also calculate a coefficient of variation, which is the standard deviation divided by the mean.
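For reference, the standard sample versions of these formulas over n measurements x_1, ..., x_n (standard definitions, not lecture-specific notation):

\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad s = \sqrt{s^2}, \qquad \mathrm{CoV} = \frac{s}{\bar{x}}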
And for a normal distribution, for example, this is well defined: within one standard deviation of the mean we have 68% of the results, within two standard deviations 95%, and within three standard deviations 99.7%. So with this, you basically know how much of the results will be within one standard deviation, etc.
And the reason why this is interesting, and why the normal distribution is helpful for you, is that if we're summing up a large number of individual observations from different distributions, then in the end we're always going to end up with a normal distribution. A large number of observations from different distributions would be, for example, the different kinds of errors that we see in our experiments. In our experiments we have many different influence factors; I already talked about temperature, but also just whatever else the computer is doing, et cetera. There are a lot of influences on our experiments that give us a little bit of variance. And because these are many different influences, we will end up with a normal distribution. This means that we can work with the normal distribution, and with the normal distribution we then know when we get significant results. In the slides of the Big Data Systems lecture, I have some more information on this, where you can also figure out, given this or that variance and the significance I want, how many experiments I need to run in order to have meaningful results.
Okay, but as I said, I just want to give you some terminology here. If you're curious about this, you can go to the book, read all about it, and apply it to your experiments.
If we have all these measurements, we can compute the average, and this is very simple, right? We sum up the results and divide by the number of observations. This is our sample mean, the regular average that you already know. But sometimes the median is also interesting: if we line the results up, it is the value in the middle, so half of them are smaller and half of them are larger. This is especially important if you have outliers. The typical example is income, right?
Say you have a small village with one billionaire in it. If you look at the average income, it will be insanely high. But if you take the median, you get something that is much more reflective of the actual individuals in there.
Something else that is sometimes used is the mode, which is the most frequent value — the one that occurred most often. That is more interesting if you're talking about a single value that you want to optimize for, for example.
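Just to make the three measures concrete, here is a minimal sketch with made-up latency numbers; the single outlier plays the role of the billionaire.

```cpp
// Sketch: sample mean, median, and mode of a small set of measurements.
#include <algorithm>
#include <cstdio>
#include <map>
#include <vector>

int main() {
    // Made-up latencies in milliseconds; one large outlier.
    std::vector<double> latencies_ms = {12.1, 11.8, 12.0, 12.1, 250.0, 12.2, 11.9};

    // Sample mean: sum divided by the number of observations.
    double sum = 0.0;
    for (double v : latencies_ms) sum += v;
    const double mean = sum / latencies_ms.size();

    // Median: middle element after sorting (half smaller, half larger).
    std::vector<double> sorted = latencies_ms;
    std::sort(sorted.begin(), sorted.end());
    const double median = sorted[sorted.size() / 2];

    // Mode: the most frequent value (exact matches here; real measurements
    // would usually be bucketed first).
    std::map<double, int> counts;
    for (double v : latencies_ms) ++counts[v];
    double mode = latencies_ms[0];
    int best = 0;
    for (const auto& [value, count] : counts)
        if (count > best) { best = count; mode = value; }

    // The outlier drags the mean far away from the median and mode.
    printf("mean=%.1f ms  median=%.1f ms  mode=%.1f ms\n", mean, median, mode);
}
```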
Something else you can do with distributions — and something I use a lot — is estimating things like cache efficiency. Here is an example from a research project we did a while ago, which drove me crazy for a while, because initially we had this idea:
we were going to build a multi-level cache using HDD and SSD. And we thought, well, if we have a really skewed distribution, we're most likely going to get SSD performance while basically only paying HDD prices. However, we didn't really think about the influence of the CPU caches and, above all, the RAM, and about how much data we actually need to put into each level.
What I'm doing here is using probability density functions and cumulative distribution functions to estimate how many of my accesses go to which cache level. Assuming a uniform distribution, this is quite easy, right? With a uniform distribution — say we're talking about uniformly random accesses again — every next access can land anywhere in the data.
And then my caches won't really work well. Or rather, they will only work in proportion to how much data I have cached: say I have 10% of my data in the cache, then 10% of my accesses will hit the cache, because the next access can go anywhere with equal probability. So this is easy to read off, and I know exactly how much the cache will help me.
This is completely different if I'm talking about a Zipfian distribution. The Zipfian distribution is the typical distribution used for any kind of text data; it is based on the frequency of letters and words in text — the letter E, say, is much more frequent than the letter Y. It is also defined in a continuous way, so you can apply it to any kind of data, and here I plotted it for up to 10,000 items. If you have a cache that holds 1% of the data, then with a uniform distribution you get a 1% hit ratio, but with a standard Zipf distribution, about 50% of the accesses hit the cache for that 1% of the data. The distribution is heavily skewed towards the left, so in this case we're much more cache efficient, right? Just paying for a 1% cache, we get 50% fewer accesses going to disk. Super efficient.
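A minimal sketch of this estimate, assuming a classic Zipf distribution with exponent 1 over 10,000 distinct items; the exact percentages depend on the exponent and the data size, but the shape matches the plot. The cumulative weight of the hottest items gives the expected hit ratio of a cache that holds them.

```cpp
// Sketch: expected cache hit ratio when the cache holds the hottest fraction
// of the data, under uniform vs. Zipf(s=1) access distributions.
#include <cstdio>
#include <vector>

int main() {
    const int n = 10000;  // number of distinct items (assumption)

    // Zipf with exponent 1: the probability of item k is (1/k) / H_n.
    std::vector<double> weight(n);
    double norm = 0.0;
    for (int k = 1; k <= n; ++k) {
        weight[k - 1] = 1.0 / k;
        norm += weight[k - 1];
    }

    for (double cached_fraction : {0.01, 0.10, 0.50}) {
        const int cached_items = static_cast<int>(cached_fraction * n);
        double zipf_hits = 0.0;
        for (int i = 0; i < cached_items; ++i) zipf_hits += weight[i];
        zipf_hits /= norm;
        // Under a uniform distribution, the hit ratio equals the cached fraction.
        printf("cache holds %4.0f%% of data: uniform hit ratio %4.0f%%, Zipf hit ratio %4.1f%%\n",
               cached_fraction * 100, cached_fraction * 100, zipf_hits * 100);
    }
}
```

The cumulative numbers for 1%, 10%, and 50% are exactly what feeds the multi-level scenario that comes next: the RAM level gets the first share, the SSD level the difference up to its capacity, and the HDD the rest.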
So now the question is: how can we add more levels of cache in between? An initial idea would be to put, say, 10% of the data on SSD. With a uniform distribution, 1% of the data is in RAM and 10% is on SSD — typically these caches are inclusive, meaning that the next, larger level also contains everything the smaller level above it holds — so 9% of the accesses will hit the SSD and the rest will still go to disk, because we only have 10% of the data on SSD. Now with the Zipfian distribution, we might think: we already got 50% of the accesses just by having 1% of the data in RAM, so we should get an insane amount of improvement if we put 10% on SSD. Well, actually that's not true, right? Because we already have a 50% hit ratio in RAM, and under the Zipf distribution the next 10% of the data only covers about another 20% of the accesses. So we still have 30% HDD accesses. We get fewer HDD accesses, but not as few as you might have assumed.
And now the question is: how can we get even better? Well, we would have to put more data onto the SSD. Say, for example, we put 50% of the data on SSD — that would be the 50% case here. Then, with the Zipfian distribution, we get 40% of the accesses on the SSD and 10% on the HDD.
So while we still get the huge improvement that the RAM gives us, we don't actually get that much more efficiency, in relative terms, out of the SSD cache when we have a Zipfian distribution. With a uniform distribution, we go from 9% to roughly 50% of the accesses hitting the SSD — about a factor of five — when we grow the SSD from 10% to 50% of the data. But with a Zipfian distribution, we only get about a factor of two more than in the initial setup. And this is really just because of the skew and because of the long tail: the skewed distribution makes sure that a lot of the accesses go to memory, and the long tail ensures that some of the accesses will always go to disk. The cache in the middle just doesn't get as much of a performance improvement as we might think.
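Summarizing the numbers from the slide in one place (the Zipf figures are approximate and depend on the exact parameters): with a uniform distribution, 1% in RAM plus 10% of the data on SSD gives about 1% RAM hits, 9% SSD hits, and 90% HDD accesses, while 50% on SSD gives about 1% / 49% / 50%. With the Zipfian distribution, 1% in RAM plus 10% on SSD gives about 50% RAM hits, 20% SSD hits, and 30% HDD accesses, while 50% on SSD gives about 50% / 40% / 10%.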
For me, this is a good example. We built a whole system around this, with these multi-level caches, and it works — it was presented as a paper at some point — but it never worked as well as we initially thought.
Had we done these kinds of calculations beforehand, we would have known what to expect: going from here to here — from 10% to 50% of the data on SSD — we expect, for example, a factor of three performance improvement, and only about a factor of five improvement from using the SSD at all. If we don't use the SSD and just use RAM, we're still only about a factor of five slower.
If you know that the disk is, I don't know, a factor of 100 slower than the SSD, you would assume you can get much better performance than that. But if you do the math once up front — it's not an easy back-of-the-envelope calculation, because you have to deal with probability distributions, but it's still something you can do in a day — you save yourself a lot of trouble, because building the system takes months, or at least weeks until you get the first performance numbers, and only then do you find out that it's not working as well as you thought.
Okay, so with this as kind of a cautionary example, or a reminder to do your modeling alongside your experimentation, we're going to stop for today, and we're going to continue with the benchmarks next time.
Questions?
No questions?
Okay, well then, thank you very much and see you all tomorrow.