Hardware-Conscious Data Processing (ST 2023) - tele-TASK - Performance Management & Benchmarking
Episode Date: April 25, 2023...
Transcript
Welcome to our next session.
We're finally in the room that we want to actually do this lecture in.
And today we're going to talk about performance management and benchmarking.
I'm probably not going to be able to finish all of it today, but we'll see.
So this is an important topic.
It's near and dear to my heart.
I do a lot of research in benchmarking and evaluation.
And also, if you want to do efficient programming, if you want to build efficient systems, there is a strong need to do a lot of evaluation and a lot of benchmarking to actually see that things are working.
And so this is kind of a baseline or a foundation for your programming later on, et cetera.
And before I go into the details,
I have a few announcements for you.
One is we've set up something,
you should have gotten an email if you're a participant in the course,
something saying active participation in Moodle.
And we ask you, if you want to join the programming tasks, that you actually go in there and register in one way or the other.
So we just know you are actually one of the people that want to do the programming, because we do have to do extra setup for this. We do this in GitHub, and so we need it for the setup. So please sign up there. It's, I think, just a simple click and then you're there. Then, well, we still have a seminar.
I'm just announcing this again because last week there was a lot of confusion about timing, rooms, etc. So somehow everything got messed up a bit. And if you're curious, we're going to do another introduction session today at 3:15 in K102.
And then an additional thing that might be of interest to you.
We have a joint biweekly research seminar with TU Darmstadt.
So with Carsten Binnig and Zsolt István.
So these are two professors, database professors at TU Darmstadt,
which we collaborate a lot with. And
every second week we typically do an online talk.
This week is going to be one of the TU Darmstadt PhD students
talking about multi-modal DBMSs for seamless querying of text and tables.
So the idea is basically being able to query with your SQL queries not only tables, which we usually would do, but also text somehow, probably using language models.
And the details you will find there.
This week it is going to be a PhD student. Often we have industry experts presenting as well, and I will announce those too. So this will also come up, and this is online only. If you're interested, the Zoom link is here; you should also have received an announcement through Moodle, I suppose. At least I hope you received an announcement through Moodle. Then, well, where are we? We're at the end
of the introduction, so now we're in performance analysis, performance management, benchmarking,
so we'll look into how can we find out how performant our code is and what do we actually
want to see there.
How can we find this out?
And then next time, we'll actually start diving into this CPU realm, looking at all, like
really at the cores, etc.
How does this work internally?
Okay, so this is the current plan. I have a small change here already, just to notify you: further down here you can see that I've switched Q&A and NUMA, and this is because I'm not going to be here on that Tuesday.
So if you thought, okay, I'm going to take off this Wednesday and you still wanted to see NUMA, then, well, you have to switch and take off the Tuesday instead. And I will keep this updated. As I said, this is kind of subject to change. Things happen during the semester; I might have to do something else, and then we will realign. I just want to keep you updated there.
Okay, in this lecture, first I will give an introduction to performance analysis
and then one of my favorite tools, back of the envelope calculations.
This is something that I always bug my students with and anybody who does a master
thesis or a project with us, which is somehow performance related. It's basically a tool just
to figure out, do the numbers that I see actually make sense? And then some measurements, some
statistics. This is a very small piece of statistics, just so you know the names, etc.
If you want more details, there's a nice book.
We also have it in the library.
And so there's a lot in there about experimentation: how to measure things, how to set up experiments, especially how often to repeat them, and how to calculate, based on the variance, how many experiments you would need.
And I've skipped all of this. I don't have it in the lecture here because it gets really into statistics, but I see it more as a tool.
So rather than saying, okay, I need to understand all the mathematical details, I basically want you to know where to get it and how to use it in order to get sound experiments.
Most of our experiments in database systems typically are on a level where we would say
variance doesn't really matter that much because the variance is much, much smaller than the
performance differences that we see.
But say for example, if you go to machine learning, often
you really need to use statistics to figure out if your results are significant or not.
And sometimes we also see this, especially if we go to very small hardware parts, there all of a sudden variance becomes large and
performance gains become small. And then say if your performance improvement is, I don't know, 30%, your variance is
20% or something like that, then all of a sudden you have to actually figure out,
like you have to do enough experiments to see if your results are significant or not.
Maybe anecdotally, I will tell you some things here. Then we'll talk about benchmarks. This is kind of an application-level test where we can see what the performance of my system in the real world would be, more or less.
And then some examples of fair benchmarking, or let's say some examples of unfair benchmarking: what you kind of have to think about, and at what level you have to think, if you want to do fair benchmarking. And we might not even get there. If we don't get there, I will do this next time, meaning tomorrow.
Okay, so I come back to this slide. I showed you this in the first lecture already, because in two of the steps of my famous, well, of my seven-step recipe for papers or theses, we have these experimentations.
So first we have the back of the envelope calculations, and then we actually need to
do experiments.
And in, let's say, applied computer science as we do it, you will never have a paper publication without any kind of experiments.
So we need to do this also to verify what we're seeing, whether our assumptions actually hold in a practical setup.
And this is often not the case. If you go more towards theory, then you need to make clear assumptions. But maybe your assumptions can never be 100% proven or disproven; you're just hypothesizing.
Here, however, we're hypothesizing about CPUs and stuff like that, and this is stuff that we can actually measure.
So we basically can say, well, I think this would be my hardware setup. And in most cases, if I'm not talking about very large setups or something like that, I can actually benchmark this. I can test this. At least I can break it down into smaller chunks where I can apply this in the real world. And this is really important
and it will be demanded by anybody that you're working with.
Meaning if you're doing a thesis, your supervisor will demand it.
If you're writing a paper, your reviewers will demand it.
So that's why you need it.
And that's what we will talk about today.
And later on, if you work in a company and you want to submit a performance improvement patch or something like that, people will also want to see sound experiments like this.
And if you can show that this will work for a wide range of applications or use cases, then that's better than if you basically show, well, it works in this specific case, but in practice everything will basically crash.
So having a good mindset about good experimentation is really important. And in general, we're all about systems, right? And if we build a system, we need to evaluate it.
And well, in systems like database systems, whatever systems, we typically think about
performance in terms of speed.
Of course, there's other things as well.
If you think about machine learning, often performance doesn't matter at all, at least in research, right? People will say, well, I can have a marginal improvement in prediction quality; I don't care how much longer this takes. To a certain degree, this is what we're seeing right now with these large language models: they are insanely slow to process and to train. Of course, people will try to speed this up. But in the end, nobody will say, well, this ChatGPT took longer to train or to answer than that one, so I'm going to use the faster one. They're going to use the one that's better in terms of prediction quality.
And for us, in many cases, we don't have soft results; we have hard results. We basically say, I want to perform this query, and the result needs to be exact. Then there's no debate on what the result should be, and it's all about performance in the end.
Of course, other things are reasonable as well.
So usability could be a measure, but it's harder to measure, actually.
Energy efficiency, price performance.
Price performance actually in practical setups in industry is super
important, right? And in practical benchmarks or in standardized benchmarks, you will also need
to specify this. And we typically have a very hard time to specify price performance. So we're all
about just performance. And most of what I'm going to say is all about performance today.
We also make a distinction between benchmark and analysis. This is not necessarily standardized, but it's sort of what people would understand.
So, if you do evaluation or analysis, this means you're looking at a single system or algorithm.
And of course, there's everything in between as well.
Or you might look at individual optimizations.
So you come up with a nice algorithm, say a new join algorithm, and then you're analyzing it deeply.
So you're figuring out where does time go if I run this.
So how much time is spent on reading the values?
How much time is spent on comparing the values?
How many comparisons do I need to do?
Stuff like that.
And then I can do individual optimizations.
I can do micro benchmarks.
I can do breakdowns.
And this is basically stuff that you could see in one of our papers, right?
So we will have these breakdowns.
We will see where does time go actually.
And then we have a benchmark. A benchmark is a comparison of multiple systems, multiple approaches.
It could be a benchmark that's kind of small, just comparing individual systems on individual operations, or it could be this kind of application-level big benchmark, which could be standardized, or you could use a real workload to figure out how your system is behaving, and then you're comparing different kinds of systems. So here, for example, is the same paper again, basically, with different results.
On top we have the breakdowns, which I would call an analysis; on the bottom we have a comparison. It doesn't really matter what exactly is in the pictures, but you can see we're basically benchmarking a couple of different systems again on different kinds of workloads.
And a good paper or a good thesis has both of them.
So on the one hand, I want to see how good is this in relation to other work, especially
whatever is related to this.
And on the other hand, I want to see why this is good, right? Just getting the number that my system is 10 times faster than another system doesn't really cut it in many cases, because I also need to know why that is. It could just be a spurious setup; you could have done something wrong with the other systems, etc. So if you're building a new system yourself, you really want to dig inside and see what's going on in there.
So in order to understand system performance, we want to measure everything. This is kind of the thing that everybody will ask for, but on the other hand, it probably won't be enough on its own. Measuring something, especially something you build yourself, means you need to have built it already. And that also means you already had to make a lot of assumptions before you got a result.
And your assumptions just might be wrong, and because of that it often makes sense to start with some modeling. One of the types of modeling that we do is the back-of-the-envelope calculation, and we will come to this, but you can also have very advanced, very big models for any kind of thing. And then we also do measurements. For measurements you need some experimental design, you need workloads, you need benchmarks, etc.
Another thing you can do, which is again different, would be simulation. In a simulation you're not actually implementing your system completely, but you're simulating how it would behave. This is often useful if you are working with modern hardware, or hardware that doesn't exist yet or that's just being built. So you build some kind of prototype that simulates the hardware. Or you have a huge setup: say you're doing research in sensor networks that span across the globe. This is something you probably won't have access to, at least not in the beginning of your thesis, so in that case it makes sense to do some simulation. In our case we're doing stuff on servers, on laptops, etc., so this we can usually measure, and then simulation is not the best idea, because with measurement we don't have to make assumptions.
However, as I already hinted, we don't want to do just one of these, right? The rule of validation, by Raj Jain and others, basically says: do not trust a single technique, but validate your results with another technique.
And for us, this often means we're validating measurements with a model. Or vice versa: we're building a simple model that gives us a baseline result, and then we're validating the model with some measurements later on. We're making sure that both align. If we have a basic model about the performance that we expect to see, and we're doing the experimentation and we see something completely different, well, then our model was wrong, and our assumptions were wrong. The other way around is also possible: we're doing some measurements, we have a model, and the two don't align; we could just be measuring the wrong thing. What often happens is that we think we're measuring a system, but actually our driver, our tooling for the benchmarking, is so slow that this is basically what we're measuring. Then the system doesn't do anything; it's just our benchmark that's crappy, and because of that we might not see any difference between different techniques. That would be a typical example. Or we think we're fully utilizing a server, but we see everything's idle; well, then maybe something else is the bottleneck.
So we always want to do multiple things. For us, often the first step is the back-of-the-envelope calculation, and then the actual measurement. And if the two align, then hopefully we did something right and
we get the right performance. Okay, so let's get into the back of the envelope calculation.
And back of the envelope calculation is called back of the envelope calculation because it's so simple or it's supposed to be so simple that you can do it on the back of an envelope.
So it's basically the one piece of paper you might have flying around.
If you're in a bar at night, you might use a beer coaster and do your calculation there.
And that's the level of complexity that you should use there.
And the idea is like you're trying to understand your application, you're going for the most
important performance numbers and estimate your system performance within an order of
magnitude.
So that's actually quite coarse, right? This is a very rough estimate, but it gives you an idea, or at least a hint, whether your idea or whatever you're trying out might actually work or whether it's stupid. This is also called a bullshit filter because of this: often we think, well, this might be a good idea to start something with, we build a huge system, and in the end we find out that our base assumption about where the performance bottleneck actually is was wrong, and then we're getting stuck somewhere. So this is basically just to make sure that, at least in theory or at least from basic performance numbers, this might work.
And, well, eventually we will have to benchmark. But at least the very simple stuff we can check beforehand.
And so I will give you an example in a second.
But for this, and you will see this slide again, I didn't really update it: this is by Jeff Dean. These ideas, and at least some of the slides, are from Jeff Dean, and he's one of the Google systems people. He's one of the core engineers at Google who uses this all the time to basically reason about the performance of their systems. And you can see this a lot if you read their papers: the kind of thresholds they use, etc. This all comes out of these basic calculations.
Okay, so few latency numbers. And this is not for memorization.
It's just to get an idea and also maybe as a reference.
You can get newer numbers for this.
You can get more accurate numbers for your hardware, for example,
just to get a basic understanding.
So if we're in L1 cache, we're below nanosecond level.
If we have a branch misprediction in the CPU, and we'll see this in a bit, this is basically something where we pay five nanoseconds. Meaning, if the CPU knows what it wants to read or add, most operations will take a clock cycle or so; but if I have a branch misprediction, all of a sudden the CPU needs to do a bit more, a couple of additional cycles. So this will cost us more.
If we don't have our data in L1 cache, but we need to go through L2 cache to L3 cache, then we're in the 20 nanoseconds already. So this means, I don't know, 50 cycles, something like that, maybe even more.
Then, if we have mutexes, so some locks for something, that will take us some time. If we go to main memory, again many cycles, because we're going out of the caches all the way. The same if we want to do compression.
And then there's more examples, right?
So if you're reading sequentially from memory, say one megabyte, we're in the 100 microseconds.
If we go to NVMe, so if we're reading a small amount of data from an NVMe drive, meaning an SSD attached through PCI Express, then we're in the 50 microseconds.
A disk seek is 10,000 microseconds, or 10 milliseconds, already; reading something sequentially from disk, like one megabyte, is 20 milliseconds; and then 150 milliseconds if we're sending something like a round trip from California to Europe, for example.
And this is useful if you're basically planning on how to lay out your data, how to set up your system.
So as I said, I'm going to give you an example in a bit.
Then some throughput numbers. Say, if you're talking about DDR4 memory, then per channel we'll get a bandwidth of about 20 gigabytes per second.
PCI Express Gen 3, per lane or with 16 lanes; there we'll have to ask Marcel later, I'm always getting mixed up with the PCI Express numbers. But for PCI Express Gen 3, the maximum throughput that we get across all lanes would be 12.5 gigabytes per second.
A typical flash drive, NVMe, and the numbers vary a lot there: we can get two gigabytes per second, and with some newer drives we can get close to PCI Express speed. And the general bandwidth of this could also be 6 gigabits per second.
And so on and so forth. So how do we use this?
And so for this, I'm going to give you a simple back-of-the-envelope calculation example. This is by Jeff Dean as well; it's the Google result page example.
So how long would it take to generate an image result page with 30 thumbnails?
So in the first design, if you're just coding this down, I don't know, as a Python script or something, you read serially and you thumbnail the 256 kilobyte images on the fly. Then you basically have to read all these individual images, so 30 images from disk. And reading them from disk is the main cost actually, at least on the server side.
And here we're reading from a classical disk. Google still uses classical disks for large amounts of the data because it's just so much cheaper, or at least enough cheaper that it makes sense.
And we already said: on a disk, a spinning disk, you first need to seek and then you need to read. If you do this image by image, then each of the individual images will be a random read, and you will do one after the other. So first we're going to seek; a single seek is 10 milliseconds, and we're going to do 30 seeks. And then we're going to read 256 kilobytes with a read speed of, say, 30 megabytes per second. A fast drive would give you something like 100 megabytes per second, but if you have a RAID and whatnot, you might be slower. So then you do 30 individual reads of these 256 kilobytes.
So all in all, you're at 560 milliseconds.
And this is half a second, and this is just on the server side.
So this means basically half a second before I'm actually sending anything across the network,
just reading the data.
And then I might even have to still thumbnail this, etc.
So this is actually too slow, right?
So half a second for page generation for a website.
Most of the pages that you're clicking on
will probably already be loaded on your phone
within half a second, right?
So that's too slow.
So what could we else do?
We could do this in parallel.
What happens if we're issuing all these reads in parallel?
Hypothetically, we only have to pay the cost, the latency once.
Throughput is probably not a problem, because we're not going to read just from one disk; we're going to read from many disks. So we're going to do all of the seeks in parallel, and we're going to do all of the reads in parallel.
That means we're going to have this 10 millisecond seek, but we're doing all of them at the same time, and likewise this 256 kilobyte read at 30 megabytes per second. So there we're all of a sudden down to about 18 milliseconds, if we can use 100 percent full parallelization.
Right, so we're not going to fully get this; probably there will be some variance here and there. But we see we're just so much faster, more than an order of magnitude faster.
And then the server side is not a problem anymore.
All of a sudden, we just have to deal with basically the data transfer,
make sure that this works well.
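Just to make the arithmetic concrete, here is a minimal sketch of this estimate in C++; the latency and bandwidth constants are the rough numbers from the slides, not measurements of a particular system:

```cpp
#include <iostream>

int main() {
    // Rough numbers from the slides (orders of magnitude, not measurements).
    const double seek_ms       = 10.0;   // one disk seek
    const double read_mb_per_s = 30.0;   // effective sequential read speed
    const double image_kb      = 256.0;  // size of one thumbnail source image
    const int    num_images    = 30;

    const double read_ms = (image_kb / 1000.0) / read_mb_per_s * 1000.0;  // ~8.5 ms per image

    // Design 1: read the images one after the other (seek + read per image).
    const double serial_ms = num_images * (seek_ms + read_ms);            // ~560 ms

    // Design 2: issue all reads in parallel; ideally we pay seek + read only once.
    const double parallel_ms = seek_ms + read_ms;                         // ~18 ms

    std::cout << "serial:   " << serial_ms   << " ms\n"
              << "parallel: " << parallel_ms << " ms\n";
}
```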
And this is, of course, not the end. We can still do further variations, right?
So we can still do some caching.
We can have caching for whole sets.
We can pre-compute the thumbnails.
And this back-of-the-envelope calculation is a simple example, right? But with it we can figure out what makes sense and what doesn't make sense. Often you might think that just by thinking about it and doing quick math in your head, you will come up with the best result. But you might be wrong, because integration is something the brain can't do very well. If you have to sum up something large, that's done much more easily on a sheet of paper.
And often you need some kind of simple measurements.
So say for example, you don't know how fast your disk is,
you don't know how fast your network is, etc.
You might need to do some very simple experiments
to figure out the numbers that you're
using for your back of the envelope calculation.
If you have these kinds of numbers, fine; if you can work with those, then work with those. But if you need something else, if you have a more complex system, then you might have to go do some small experiments and then use those for the back-of-the-envelope calculation.
Typically you can get these numbers quickly and then have a good assumption about the performance that your system, or a subcomponent of it, might have, before you actually start building the whole thing, which often makes a lot of sense.
And so let me give you a second example that is more catered to what we're doing here. The question is: how long does it take to sort one gigabyte of data of four-byte numbers on a modern CPU, single threaded? So just guess, right? How long would you think this takes? 10 milliseconds, that's one guess. Yes? 100? 100 milliseconds? OK, so let's see.
Read one megabyte sequentially from memory.
We have a number for this, right?
So this would be 100 microseconds, for example.
So this we have to do 1,000 times.
So we're at 100 milliseconds already.
And we have to do this multiple times in order to actually sort the data.
But that's not the only thing that we need to do.
So, well, first we need to break it down. What do we do? We have one gigabyte of data, so 2 to the power of 28 integer numbers, since an integer is four bytes on most 32- or 64-bit machines. On this machine, for example, this will be 2 to the power of 28 numbers.
If we're using quicksort, which most sorting implementations today will use because it's one of the fastest algorithms, that means we do a logarithmic number of passes over the 2 to the power of 28 numbers. The logarithm of 2 to the power of 28 is basically 28, right? Makes sense, because that's the logarithm. And 28 is a bit less than, but more or less, 2 to the power of 5, since 32 would be the next power of 2. So all in all, we're doing 28 passes over the 2 to the power of 28 numbers, and in each pass we need to compare all of them. So we need roughly 2 to the power of 5 passes times 2 to the power of 28 numbers, which is 2 to the power of 33 comparisons.
And that's basically one of the major things that will take our time. The other thing is that with these 28 passes over the data, we also need to read the data 28 times.
And while we're doing the comparisons: a comparison itself is a single thing, but what modern CPUs do is branch prediction. They basically say, well, I guess this number will be smaller. If the guess is right, then the CPU is fast and already has all the instructions laid out for the next steps. If the CPU is wrong, then we have a branch misprediction, and we have to spend a couple of extra cycles to get back to the right branch that we need to take. And because this data is random, half of our comparisons will be mispredictions. Not all of them, but half of them. That means instead of 2 to the power of 33, we will have 2 to the power of 32 mispredictions. Again, this is a bit higher than what we actually have, because we rounded 28 up to 32, so 2 to the power of 5; in reality it's going to be a bit less than 2 to the power of 33 comparisons and 2 to the power of 32 mispredictions.
And each of these mispredicts we said is 5 nanoseconds.
So all in all, just the mispredictions would be in the order of 21 seconds, assuming that my CPU here actually takes five nanoseconds per mispredict. Then, as I said, we also need to read the data: 2 to the power of 30 bytes times 28 passes, so one gigabyte times 28, meaning we're going to read 28 gigabytes from memory. With memory being roughly 4 gigabytes per second, that's something like 7 seconds.
And these latency numbers, the numbers that Jeff Dean used, are already a bit older. It's not going to change a lot, so it's not going to be much faster on this machine, but it will be a bit faster. So the estimate is roughly 30 seconds to sort one gigabyte on one CPU.
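As a rough sketch, the same estimate in code, again using the coarse numbers from the slides; all constants here are assumptions rather than measurements:

```cpp
#include <cmath>
#include <iostream>

int main() {
    // Coarse back-of-the-envelope constants from the slides (assumptions).
    const double n             = std::pow(2.0, 28);  // 4-byte integers in 1 GB
    const double passes        = 32;                 // log2(2^28) = 28, rounded up to 2^5
    const double mispredict_ns = 5.0;                // cost of one branch misprediction
    const double mem_gb_per_s  = 4.0;                // sequential memory bandwidth
    const double data_read_gb  = 28.0;               // 1 GB read in each of the 28 passes

    const double comparisons    = n * passes;        // ~2^33
    const double mispredictions = comparisons / 2.0; // random data: about half are mispredicted

    const double mispredict_s = mispredictions * mispredict_ns * 1e-9;  // ~21 s
    const double memory_s     = data_read_gb / mem_gb_per_s;            // ~7 s

    std::cout << "mispredictions: " << mispredict_s << " s\n"
              << "memory reads:   " << memory_s << " s\n"
              << "total estimate: " << (mispredict_s + memory_s) << " s\n";  // ~28-30 s
}
```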
And now we can try, right? We can check if this still holds, and we said being within an order of magnitude is what we're aiming for. So I'm going to show you the program, and then I'm actually going to run it. The program is a very simple C++ program. Well, actually, I have it here; I'm going to show it to you. I hope you can read this, it's large enough. Okay. No, this is the wrong program. Where do I have my program? Oh, I'm not on split screen anymore. Let me just stop this briefly. So, here we go. Sorry about this.
Okay. We have our vector, so basically an array with 2 to the power of 28 numbers. What did I say? Yeah, 2 to the power of 28, because each number is 4 bytes, so 2 to the power of 2 bytes, giving 2 to the power of 30 bytes in total. Then we're reserving the space.
We're creating something that gives us the duration, so a variable for the duration. We're initializing the array with random numbers, just going through the whole array and initializing it.
Then I'm printing something out so I can see how large the array is: how large is an individual int, how large is the array in megabytes, and what does the array initially look like? So you can actually see that it's random in the beginning.
Then I'm going to start my timer, and I'm just going to use standard sort. Standard sort uses quicksort internally, but it's an introsort, and I'm going to give you access to this program. You can see that it's an introsort if you change this here, for example. If I'm using fewer distinct numbers, so if I'm sorting only a thousand or ten thousand different values, say from zero to nine thousand nine hundred ninety-nine, then it's not going to use quicksort but switch to an insertion sort, because it sees that there are not that many different numbers and the sorting will actually be faster that way. You can also see this in the way the sorting starts in the beginning: it will do fewer comparisons, and at some point quicksort doesn't really make sense anymore, so it will change to something else. You won't see it here, but if you play around with this, you can see it.
So then we're going to do the sort. Because the input is so large, it will actually use the quicksort. And then we're just going to take the time.
I'm going to output the start of the array, so the first values after the sorting, and then output the total duration of the whole thing. Now let's try this, and I hope I didn't break anything. It's going to take a bit.
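For reference, here is a minimal sketch of a program along the lines just described. This is my reconstruction from the description, not the original course code; the variable names, the random generator, and the output format are assumptions:

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

int main() {
    const std::size_t n = std::size_t{1} << 28;  // 2^28 four-byte ints = 1 GB

    std::vector<int32_t> data;
    data.reserve(n);

    // Fill the array with random numbers.
    std::mt19937 gen(42);
    std::uniform_int_distribution<int32_t> dist;
    for (std::size_t i = 0; i < n; ++i) {
        data.push_back(dist(gen));
    }

    std::cout << "int size: " << sizeof(int32_t) << " bytes, array size: "
              << (n * sizeof(int32_t)) / (1024 * 1024) << " MB\n";
    std::cout << "first values: " << data[0] << " " << data[1] << " " << data[2] << "\n";

    const auto start = std::chrono::steady_clock::now();
    std::sort(data.begin(), data.end());  // typically introsort: quicksort + heapsort + insertion sort
    const auto end = std::chrono::steady_clock::now();

    std::cout << "first values after sort: " << data[0] << " " << data[1] << " " << data[2] << "\n";
    std::cout << "total duration: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
              << " ms\n";
}
```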
So now we have the array set up. You can see these are the first numbers in the array; it's just random numbers. And then it should take a few seconds. So, our total duration in milliseconds is 19,000, so close to 20 seconds, basically.
You can see, well, it's not bad, right?
So now I want to show you that this is even better still; we can go into more detail. I'm not going to run it again, because we would have to wait again. I profiled this earlier, and I hope you can see it somehow. Here we can see that I measured the total instances of branching conditions, which is basically how often we compared, and the total instances of mispredictions, so how often the branch prediction did not succeed. And here we can see that we have 10 billion comparisons, close to 11 billion, and we have close to 3 billion mispredictions.
And so if we go back to the slides now.
Okay.
So then, well, when I did an earlier run, I had about 20,000 milliseconds, so 20 seconds. And if I calculate how many comparisons I had: I had 2 to the power of 33 comparisons, and I had 2 to the power of 31, a bit more than that, branch mispredictions. And this is fairly close to what we actually estimated: 2 to the power of 33 comparisons, 2 to the power of 32 mispredicts. So you can see that this really is where the time goes, and this also gives us a tool to estimate the performance of even close-to-hardware experiments.
And because we're so far in right now, it's 11.45, let's do a break here. So let's do a quick five-minute interruption,
and then we're going to continue with measurements and metrics.
So let's continue.
So as a reminder, we will have a session on profiling.
So meaning what I showed you just now with the profiler, et cetera, we'll have like a
more general session on how you use profiling, et cetera.
And Lawrence will do that with you.
So we've been doing a lot of really low-level performance work recently, so I think it makes sense to think about these things and to give you some guidance.
You will also be able to use this later on
in the programming tasks.
Okay, so talking about measurement and metrics.
And first, I want to give you some basic terminology.
If we're talking about benchmarks,
often you will find the term SUT, or system under test. This is basically what we're benchmarking, what we're evaluating.
And it's also important to always be aware that if we're doing experiments, it's always a deployment that we're measuring. I cannot just measure an algorithm in isolation if I do actual experiments. The numbers, the results that I see, are typically comprised of hardware, software, and data, so all three play a role, and of course a lot of other conditions as well. Even things like the temperature of the CPU will make a difference, some of it more, some of it less. In most experiments, as I told you, we're talking about large performance differences, and then it's not that problematic.
But as soon as you're talking about small differences, even the temperature of the CPU, or the stability of the power that the system gets, might make a difference in your performance numbers.
So it's important to be aware that everything you have there is basically part of your system under test. Then you have some benchmark tooling that connects your driver, whatever pushes your workload into the system under test, with the system under test, and that also collects some kind of metrics. The metrics will be stored as some form of measurements, hopefully in something like a database, in a way that you can easily read and hopefully also process later on.
And then, well, we have the workload and the data.
If we're talking about a database system, it's always the two. In other cases, it could just be a workload, or we're just iterating over the data. But in a database system, we have our tables, which would be the data, and we have the queries, which would be the workload. In many systems, we will have something similar.
And all of this, of course, influences your results. So you need to be aware that using this or that data makes a huge difference, using this or that workload makes a huge difference, and using this or that hardware makes a huge difference. You need to figure out how much flexibility you have, or whether you can benchmark in a very confined setup where you make sure that you're only measuring the differences that you actually want to measure.
And before you start measuring and benchmarking, you basically want to answer a couple of questions. Which kind of scenario do I want to evaluate? Meaning, what kind of data makes sense and what kind of workloads make sense? Do uniform random numbers cut it? Is this what I would see in reality? If I use uniformly distributed numbers, caches probably won't work well. If I have a Zipfian distribution, caches will work much better. If I have just the same number, or I'm just running the same query over and over, everything might already be in the cache, and I don't actually see real processing anymore.
And so the question is, can an existing benchmark already supply those? There are a lot of benchmarks; I'll introduce a few of them later, or at least give you pointers on where to look.
And so that's kind of one question.
Then which kind of hardware do you want to use
and which kind of software?
So if I'm benchmarking a server database system, it doesn't really make that much sense to run it on my laptop, right?
And then also, what kind of metrics do I want to measure? If I know which ones I want to measure, how do I measure them and how do I compare them? This is not necessarily easy: you can easily come up with measures that are not really comparable with each other. Typically we're looking at some kind of throughput numbers, but even if you go to database papers, you will find tuples per second, or rows per second, or gigabytes per second, or latency numbers in seconds.
So the question is: are the units comparable, and are the units understandable for somebody else? And even if the units are understandable, do I have enough information to compare my results with something else?
And when you're doing an analysis,
it's really this crime scene investigation.
Basically, we're going to try to figure out what is really going on.
To some degree, we can.
At a certain point, we can't.
So if we're talking about hardware, at a certain point the hardware vendors will make sure that we don't know what's going on, because these are trade secrets: how certain cache replacement strategies, certain protocols, etc. work internally. So there's a certain level below which we don't see anything anymore. But up to that level, we can investigate; we can do profiling, et cetera, and figure out what's going on in our system. Okay. So what do we measure? There's a couple of common metrics that we can
measure. Typically, throughput and latency. And I always want to see both because of course I can have unlimited
throughput if I have unlimited latency.
So that's very easy.
And then, if we're not talking about SQL results or relational database results, but about some form of prediction, then I might want to see some accuracy.
An interesting metric also is capacity.
So how much can I actually fit into my system
in terms of workload, in terms of users, et cetera?
And there's always a trade-off here; basically, you need to set certain thresholds. But throughput and latency are the typical ones that we always want to measure.
We could also have something like fault tolerance, so time to failure and availability. If you have a data center, this is interesting, for example. Our data center regularly has outages, not power outages but cooling outages, and this is detrimental for availability. As soon as we have to turn down the servers, your availability suffers; availability is usually expressed in the number of nines, as in how many percent of the time is the server available. And, well, we're probably not at many nines there.
Then efficiency, which is super important and becoming more important: cost efficiency, of course, but also energy efficiency, in the sense of how much energy the execution will cost all in all.
Also fairness: how fairly are different kinds of users treated, for example. Not only in terms of machine learning fairness, but also in terms of scheduling: if I have multiple jobs and I'm always only scheduling certain types of jobs, others might just starve. Then my system might still be faster overall, but some of the jobs are not treated fairly.
And then scalability is basically how well my system scales in relation to, say, a single node, two nodes, etc.
So with two nodes, do I get twice the performance that I have on a single node, for example?
There's different ways to measure this
and we're not going to go into detail here because we're not so much about
scalability.
How do I select these metrics? Well, I want low variability, meaning I don't want the numbers to have a very high variability or range, so that the numbers are not as clear, basically.
I also don't want them to be super redundant. Some numbers I will often have in redundancy, say tuples per second and gigabytes per second; that, for example, is a redundant metric, because it basically tells me the same thing: if I know how large a tuple is, then I can convert between the two, but it's often more convenient to have both. On the other hand, you might have more complicated metrics, where you combine different kinds of metrics, different kinds of measurements, into a single metric, and all of a sudden you have a lot of redundancy in there. I don't know, in prediction, something like the F1 score combined with something else. And then, all of a sudden, the information might become too redundant.
However, you definitely want to have completeness.
And this basically is exactly this example of throughput and latency.
If I'm just reporting throughput, I can basically have insane latency and just make sure that I measure the best possible throughput, but it would never be usable in an actual system because the latency is just way too high. So if there are two metrics that belong to each other, say throughput and latency, I definitely want to measure and report both of them. Okay, so a bit more on throughput and latency.
Throughput, again, we can measure in many different ways. We can have gigabytes per second, but we can also have the number of requests per second, concurrent users, or transfers per second. Say, if you're talking about PCI Express or these kinds of interconnects, they will often say gigatransfers per second, stuff like that. And then you have to know how to convert between these. So there are many different ways, and hopefully you give all the information so that I can convert between them easily. Then we're on the safe side.
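As a small illustration of such a conversion, a hedged sketch; the 100-byte tuple width and the 5 million tuples per second are made-up example values:

```cpp
#include <iostream>

int main() {
    // Hypothetical example: a system reports 5 million tuples per second,
    // and we know (or measured) that one tuple is about 100 bytes wide.
    const double tuples_per_s    = 5e6;
    const double bytes_per_tuple = 100.0;

    // Redundant but convenient second unit: the same throughput in GB/s.
    const double gb_per_s = tuples_per_s * bytes_per_tuple / 1e9;

    std::cout << tuples_per_s << " tuples/s * " << bytes_per_tuple
              << " B/tuple = " << gb_per_s << " GB/s\n";  // 0.5 GB/s
}
```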
Latency can be the complete execution time, so how long did everything take, or it could be per individual request, which is of course also interesting. Per-request latency is often hard to measure, especially if we're talking about fast systems. If I would measure, like in the sorting example, every individual operation that's done in the sort, then just the measuring itself would take more time than the actual execution, and the numbers that I get would mostly reflect the measurement overhead. So then you need to take larger samples, larger numbers of executions, and average over those. But still, you want to make sure that you somehow also catch the outliers, where you see, okay, all of a sudden we have latency spikes, etc. You don't want to miss those.
That's always a trade-off. In larger-scale setups and in companies, you will often see percentiles. That basically means: what is the 99th or 95th percentile? So the latency of my 99% fastest requests, this is the number that I'm reporting. And that's fine, but it's also, of course, incomplete, because I'm basically just dropping the slowest 1%. And if on average I always have a hundred requests or a hundred interactions per individual larger interaction, say we're talking about the 99th percentile of disk accesses and each of my jobs does 100 disk accesses, then for sure the 1% beyond the 99th percentile will hit me in almost every individual interaction, and there you will get higher tail latency. So I basically prefer averages or maximum numbers, for example, or at least you would want to report those additionally. Even if you don't want to report them, you might measure them just to do a sanity check that nothing is falling off somewhere.
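A minimal sketch of how one might compute such numbers from a set of latency samples; the sample values are made up, and percentile definitions vary slightly between tools:

```cpp
#include <algorithm>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    // Hypothetical per-request latencies in milliseconds, including one spike.
    std::vector<double> latencies_ms = {1.2, 0.9, 1.1, 1.0, 35.0, 1.3, 1.1, 0.8, 1.2, 1.0};

    std::sort(latencies_ms.begin(), latencies_ms.end());

    const double avg = std::accumulate(latencies_ms.begin(), latencies_ms.end(), 0.0) /
                       latencies_ms.size();
    const double max = latencies_ms.back();

    // A simple p99 approximation: index into the sorted samples.
    const std::size_t idx = static_cast<std::size_t>(0.99 * (latencies_ms.size() - 1));
    const double p99 = latencies_ms[idx];

    // With so few samples, p99 misses the 35 ms spike, while max catches it.
    std::cout << "avg: " << avg << " ms, p99: " << p99 << " ms, max: " << max << " ms\n";
}
```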
Something softer, or not as hard a metric, is capacity. This is basically how much throughput I can achieve under ideal conditions while not getting too much latency, for example. And this is, as I said, not a hard metric. Typically we're saying, well, what is my latency threshold? For example, my requests cannot take longer than a second. Then I would ask, what's the throughput that I can get there? And this would then be my target throughput.
And throughput might look something like this, not always, but often. Initially the throughput is increasing more or less linearly, because we're not contended anywhere, so everything is getting faster and faster. All of a sudden we hit the first contention points, so some of the requests might get slower, etc., and we're not gaining as much anymore. This point is called the knee capacity. But you don't necessarily always see a clear knee point where all of a sudden the performance drops off; typically it's a more rounded curve. And then at some point our system is saturated, so we're not going to get more performance. If we put even more pressure on the system, we're actually going to get less performance, because we're overloading the system.
And capacity-wise, we want to be somewhere not completely 100% saturated, but a bit below that, so we get good utilization without overloading the system. Along with this, the response time will of course increase. While the system is not fully utilized, we might not see a change in response time. But as soon as the system is fully utilized, or close to full utilization, the response time will go up; things will get queued at a certain point. As I said, the response time will become unbounded at some point, because we're just overloading the system and it's not doing anything useful anymore.
And usable capacity would then be the maximum throughput achievable without exceeding a certain response time. This is also sometimes called sustainable throughput. This is basically the maximum load I would want to put on my system, not more than that, because then I get a nice response time and my system is nicely utilized, but I'm not overloading it. And, well, as I said, the knee capacity is something you might see, or you might not see.
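A small sketch of how one could derive usable capacity from a load test, assuming we recorded offered load, achieved throughput, and a latency percentile per step; all values here are invented for illustration:

```cpp
#include <algorithm>
#include <iostream>
#include <vector>

struct LoadStep {
    double offered_rps;     // offered load (requests per second)
    double throughput_rps;  // achieved throughput
    double p99_latency_ms;  // observed tail latency at that load
};

int main() {
    // Hypothetical measurements from a load test, increasing the offered load step by step.
    const std::vector<LoadStep> steps = {
        {100, 100, 5},    {500, 500, 6},     {1000, 990, 9},
        {2000, 1900, 40}, {3000, 2100, 900}, {4000, 1800, 5000},
    };

    const double latency_threshold_ms = 100.0;  // our response-time requirement

    // Usable capacity: the highest throughput observed while latency stays under the threshold.
    double usable_capacity = 0.0;
    for (const auto& s : steps) {
        if (s.p99_latency_ms <= latency_threshold_ms) {
            usable_capacity = std::max(usable_capacity, s.throughput_rps);
        }
    }

    std::cout << "usable capacity: " << usable_capacity << " requests/s\n";  // 1900 in this example
}
```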
So, we did our break, so now let's go into some statistics. This is more so that you know the words and the right terminology. If we're talking about measurements,
we're talking about hopefully independent events.
An independent event in statistics basically means one occurrence does not affect the other. And again, if we're talking about measurements, we have to make sure that this is the case if we want to have statistical significance, or if we want to talk about our results in statistical terms. If everything is interfering with each other, then the results are not independent and we
cannot basically say we're talking about independent events. So, successive coin throws would be typical independent events.
Of course, never 100%, but more or less independent.
And hopefully, successive experiments are set up in a way that they're independent.
This is why we typically have some warm up tasks or something like that.
So we're heating up the caches, for example,
and we make sure every experiment is set up in the same way. If we do something like we have
like a fresh database, nothing done, and we do the first experiment and then we use the same thing
for the second experiment, the experiments will not be independent, right? Because we have caches that are set up in one way or the other.
If we have like a full disk and we're adding some more,
we're changing the disk behavior so it's not necessarily independent.
And that's something that you will quickly see.
So, I mean, SSDs are a bit more flexible here, but with hard drives, as soon as something is fragmented, performance can be vastly different. We're talking about a 20% performance difference or something, just depending on where on the disk your data is located.
So this is something that you got to keep in mind.
But now we're talking about independent events.
And if we have independent events, then we can talk about a random variable.
So random variable would be a set of values with a specified probability. And say, for
example, body mass. So our body mass, like each individual body mass, that could be specified by a random variable.
Or the execution time of an experiment would be a random variable.
Or the error of an experiment.
So how much are we wrong in the experiment actually?
So because of some other measurement errors, for example.
So this is a random variable that will follow a certain random distribution.
And body mass is kind of a good example.
So people have certain body masses, and there's certain averages.
But then we have a random distribution around this.
And we describe this with a random variable.
And then we can have a cumulative distribution function that just sums up all of the probabilities. So, staying with body masses: if we have all possible body masses on our X axis here, this would be the average body mass. I have no idea what the average body mass in Germany or in the world is; definitely less than mine. Here would be zero, and here would be, I don't know, 300 kilos or even more. And we know that if we go closer to 300 kilograms, we have definitely found all instances that we want to find, and if we go below a few grams, we definitely haven't found any instances yet.
So this is the cumulative distribution function, and it describes, up to a certain level, how many of the individual instances would be in there, or the corresponding fraction for a continuous distribution. If we say, I don't know, 65 kilograms or something like that, which part of our set would we find below this? Or, say, for experiments, a 10-second execution time: what would be the percentile that we have here? That's the typical use of a cumulative distribution function. Going here, for example, for our latency, say a website with one second latency, then we want to have this at the 99th percentile, for example, or even stricter, 99.99 or something like that. So that's the cumulative distribution function. Then
we have the probability density function of a continuous random variable. This basically tells us what probability density each individual value has. So again, body mass: say 60 or 65 kilos might have the highest density, and something like, I don't know, 300 kilos has a very low density, because there are few people around that weight. Body height would also be a typical example of this. And the same holds in our experiments.
So say we are assuming we have a one second latency.
So most results will be in there.
But we will also see some variance to left and right.
And depending on what this distribution looks like or depending on how we're measuring,
we might see different
types of distributions here.
So this would be a Gaussian distribution, and this can be around zero, can be around
anything.
And the interesting thing is that many probability distributions, eventually, if we're summing them up... well, I'll come to that on the next slide; let me not get ahead of myself here. This is the probability density function. Something similar also holds for discrete random variables; this one is continuous. In experimentation, the continuous version is the theory: we're basically saying this is how we expect things to behave. In practice, once we're measuring stuff, things are discrete, right? We have individual instances, we have a fixed set of experiments, and these would then have a probability mass function. So we don't have a density function, but a mass function, where each individual value that we've measured, or each bucket, has a certain mass. Also, we cannot measure at an infinitely fine granularity; just because of that, we'll already have a probability mass function.
So out of those, we can calculate the mean or expected value. If we have a certain probability density function, or a lot of different experiments, we can ask: what's the mean, what's the average, what is the expected value? I'm not going to go through the formulas here; you can look them up if you ever need them. This is basically just an average, and this is what you calculate. But more interestingly, you can also calculate the variance, and this is something that you will need in your papers, or to find out whether your results are significant. The variance is basically the mean squared deviation from the mean. And we can also calculate a coefficient of variation, which is the standard deviation divided by the mean.
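For reference, the standard sample versions of these formulas over n measurements x_1, ..., x_n (standard definitions, not lecture-specific notation):

\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad s = \sqrt{s^2}, \qquad \mathrm{CoV} = \frac{s}{\bar{x}}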
And for a normal distribution, for example, this is well defined: within one standard deviation of the mean we have 68% of the results, within two standard deviations 95%, and within three standard deviations 99.7%. So with this, you basically know how much of the results will be within one standard deviation, etc.
And the reason why this is interesting, and why the normal distribution is helpful for you, is that if we're summing up a large number of individual observations from different distributions, then in the end we're always going to end up with a normal distribution. A large number of observations from different distributions would be, for example, the different kinds of errors that we see in our experiments. In our experiments we have many different influence factors; I already talked about temperature, but also just whatever else the computer is doing, et cetera. There are a lot of influences on our experiments that give us a little bit of variance. And because these are many different influences, we will end up with a normal distribution. This means that we can work with the normal distribution, and with the normal distribution we then know when we get significant results. In the slides of the Big Data Systems lecture, I have some more information on this, where you can also figure out, given this or that variance and the significance I want, how many experiments I need to run in order to have meaningful results.
Okay, but as I said, I just want to give you some terminology here. If you're curious about this, you can go to the book, read all about it, and apply it to your experiments.
If we have all these measurements, we can compute the average, and this is very simple, right? We sum up the results and divide by the number of observations. This is our sample mean, the regular average that you already know. But sometimes the median is also interesting: if we line the results up, it is the value in the middle, so half of them are smaller and half of them are larger. This is especially important if you have outliers. The typical example is income, right?
Say you have a small village with one billionaire in it. If you look at the average income, it will be insanely high. But if you take the median, you get something that is much more reflective of the actual individuals in there.
Something else that is sometimes used is the mode, which is the most frequent value — the one that occurred most often. That is more interesting if you're talking about a single value that you want to optimize for, for example.
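Just to make the three measures concrete, here is a minimal sketch with made-up latency numbers; the single outlier plays the role of the billionaire.

```cpp
// Sketch: sample mean, median, and mode of a small set of measurements.
#include <algorithm>
#include <cstdio>
#include <map>
#include <vector>

int main() {
    // Made-up latencies in milliseconds; one large outlier.
    std::vector<double> latencies_ms = {12.1, 11.8, 12.0, 12.1, 250.0, 12.2, 11.9};

    // Sample mean: sum divided by the number of observations.
    double sum = 0.0;
    for (double v : latencies_ms) sum += v;
    const double mean = sum / latencies_ms.size();

    // Median: middle element after sorting (half smaller, half larger).
    std::vector<double> sorted = latencies_ms;
    std::sort(sorted.begin(), sorted.end());
    const double median = sorted[sorted.size() / 2];

    // Mode: the most frequent value (exact matches here; real measurements
    // would usually be bucketed first).
    std::map<double, int> counts;
    for (double v : latencies_ms) ++counts[v];
    double mode = latencies_ms[0];
    int best = 0;
    for (const auto& [value, count] : counts)
        if (count > best) { best = count; mode = value; }

    // The outlier drags the mean far away from the median and mode.
    printf("mean=%.1f ms  median=%.1f ms  mode=%.1f ms\n", mean, median, mode);
}
```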
Something else you can do with distributions — and something I use a lot — is estimating things like cache efficiency. Here is an example from a research project we did a while ago, which drove me crazy for a while, because initially we had this idea:
we were going to build a multi-level cache using HDD and SSD. And we thought, well, if we have a really skewed distribution, we're most likely going to get SSD performance while basically only paying HDD prices. However, we didn't really think about the influence of the CPU caches and, above all, the RAM, and about how much data we actually need to put into each level.
What I'm doing here is using probability density functions and cumulative distribution functions to estimate how many of my accesses go to which cache level. Assuming a uniform distribution, this is quite easy, right? With a uniform distribution — say we're talking about uniformly random accesses again — every next access can land anywhere in the data.
And then my caches won't really work well. Or rather, they will only work in proportion to how much data I have cached: say I have 10% of my data in the cache, then 10% of my accesses will hit the cache, because the next access can go anywhere with equal probability. So this is easy to read off, and I know exactly how much the cache will help me.
This is completely different if I'm talking about a Zipfian distribution. The Zipfian distribution is the typical distribution used for any kind of text data; it is based on the frequency of letters and words in text — the letter E, say, is much more frequent than the letter Y. It is also defined in a continuous way, so you can apply it to any kind of data, and here I plotted it for up to 10,000 items. If you have a cache that holds 1% of the data, then with a uniform distribution you get a 1% hit ratio, but with a standard Zipf distribution, about 50% of the accesses hit the cache for that 1% of the data. The distribution is heavily skewed towards the left, so in this case we're much more cache efficient, right? Just paying for a 1% cache, we get 50% fewer accesses going to disk. Super efficient.
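A minimal sketch of this estimate, assuming a classic Zipf distribution with exponent 1 over 10,000 distinct items; the exact percentages depend on the exponent and the data size, but the shape matches the plot. The cumulative weight of the hottest items gives the expected hit ratio of a cache that holds them.

```cpp
// Sketch: expected cache hit ratio when the cache holds the hottest fraction
// of the data, under uniform vs. Zipf(s=1) access distributions.
#include <cstdio>
#include <vector>

int main() {
    const int n = 10000;  // number of distinct items (assumption)

    // Zipf with exponent 1: the probability of item k is (1/k) / H_n.
    std::vector<double> weight(n);
    double norm = 0.0;
    for (int k = 1; k <= n; ++k) {
        weight[k - 1] = 1.0 / k;
        norm += weight[k - 1];
    }

    for (double cached_fraction : {0.01, 0.10, 0.50}) {
        const int cached_items = static_cast<int>(cached_fraction * n);
        double zipf_hits = 0.0;
        for (int i = 0; i < cached_items; ++i) zipf_hits += weight[i];
        zipf_hits /= norm;
        // Under a uniform distribution, the hit ratio equals the cached fraction.
        printf("cache holds %4.0f%% of data: uniform hit ratio %4.0f%%, Zipf hit ratio %4.1f%%\n",
               cached_fraction * 100, cached_fraction * 100, zipf_hits * 100);
    }
}
```

The cumulative numbers for 1%, 10%, and 50% are exactly what feeds the multi-level scenario that comes next: the RAM level gets the first share, the SSD level the difference up to its capacity, and the HDD the rest.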
So now the question is: how can we add more levels of cache in between? An initial idea would be to put, say, 10% of the data on SSD. With a uniform distribution, 1% of the data is in RAM and 10% is on SSD — typically these caches are inclusive, meaning that the next, larger level also contains everything the smaller level above it holds — so 9% of the accesses will hit the SSD and the rest will still go to disk, because we only have 10% of the data on SSD. Now with the Zipfian distribution, we might think: we already got 50% of the accesses just by having 1% of the data in RAM, so we should get an insane amount of improvement if we put 10% on SSD. Well, actually that's not true, right? Because we already have a 50% hit ratio in RAM, and under the Zipf distribution the next 10% of the data only covers about another 20% of the accesses. So we still have 30% HDD accesses. We get fewer HDD accesses, but not as few as you might have assumed.
And now the question is: how can we get even better? Well, we would have to put more data onto the SSD. Say, for example, we put 50% of the data on SSD — that would be the 50% case here. Then, with the Zipfian distribution, we get 40% of the accesses on the SSD and 10% on the HDD.
So while we still get the huge improvement that the RAM gives us, we don't actually get that much more efficiency, in relative terms, out of the SSD cache when we have a Zipfian distribution. With a uniform distribution, we go from 9% to roughly 50% of the accesses hitting the SSD — about a factor of five — when we grow the SSD from 10% to 50% of the data. But with a Zipfian distribution, we only get about a factor of two more than in the initial setup. And this is really just because of the skew and because of the long tail: the skewed distribution makes sure that a lot of the accesses go to memory, and the long tail ensures that some of the accesses will always go to disk. The cache in the middle just doesn't get as much of a performance improvement as we might think.
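Summarizing the numbers from the slide in one place (the Zipf figures are approximate and depend on the exact parameters): with a uniform distribution, 1% in RAM plus 10% of the data on SSD gives about 1% RAM hits, 9% SSD hits, and 90% HDD accesses, while 50% on SSD gives about 1% / 49% / 50%. With the Zipfian distribution, 1% in RAM plus 10% on SSD gives about 50% RAM hits, 20% SSD hits, and 30% HDD accesses, while 50% on SSD gives about 50% / 40% / 10%.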
For me, this is a good example. We built a whole system around this, with these multi-level caches, and it works — it was presented as a paper at some point — but it never worked as well as we initially thought.
Had we done these kinds of calculations beforehand, we would have known what to expect: going from here to here — from 10% to 50% of the data on SSD — we expect, for example, a factor of three performance improvement, and only about a factor of five improvement from using the SSD at all. If we don't use the SSD and just use RAM, we're still only about a factor of five slower.
If you know that the disk is, I don't know, a factor of 100 slower than the SSD, you would assume you can get much better performance than that. But if you do the math once up front — it's not an easy back-of-the-envelope calculation, because you have to deal with probability distributions, but it's still something you can do in a day — you save yourself a lot of trouble, because building the system takes months, or at least weeks until you get the first performance numbers, and only then do you find out that it's not working as well as you thought.
Okay, so with this as kind of a cautionary example, or a reminder to do your modeling alongside your experimentation, we're going to stop for today, and we're going to continue with the benchmarks next time.
Questions?
No questions?
Okay, well then, thank you very much and see you all tomorrow.