Hardware-Conscious Data Processing (ST 2024) - tele-TASK - Performance Management & Benchmarking
Episode Date: April 10, 2024...
Transcript
Welcome to our second session.
I'm glad to still see a couple of faces here.
We're gonna talk about performance management
and benchmarking today
in hardware-conscious data processing.
But before we do, I have three small announcements.
One, Marcel already sent out something on Moodle.
We have this survey or questionnaire on active participation.
Please fill it out if you want to participate in the programming exercises.
It doesn't have to be today, but soon, so that we can start the setup for all of the
GitHub etc. stuff, so that you get your account, etc.
And you can do the programming exercises.
Today we're going to have the first session
of our project seminar on in-database machine...
That's not true.
It's machine learning systems.
Sorry, I didn't update this.
So we have a project seminar on machine learning systems done
by Elaine, Ricardo, and Niels from my group.
And today is the first session in room 111, I think.
So they also said they were trying to find something else
if the room is just too small.
But check Moodle.
If you're going to this room, definitely we will find you.
And if we're doing it somewhere else,
we'll reroute you to the right room.
And then finally, we have a bi-weekly research seminar
with TU Darmstadt.
So with Carsten Binnig and Zsolt István,
who are both the database professors at TU Darmstadt,
both in the systems domain as our group.
And we, well, for quite some time now,
I think it's the fifth or so incarnation of that seminar,
we basically do joint research sessions. So we invite guest speakers.
And today there's going to be Viktor Leis talking about co-designing database systems
and unikernels for the cloud. So this is basically how to do OS and database co-design.
That should be interesting.
And you're invited to join us.
So here you can find the Zoom link.
The Zoom link is in the slides.
The slides are in Moodle.
So if you want to join, feel free to join.
And I forgot to put the time here, but fortunately,
Niels also sent an announcement in Moodle so you can find the details
there as well. It's open for everyone to join. This is sort of every other week,
with some exceptions if we don't have time or have conflicts. But there are usually interesting talks on database-y research
or database-system-y research.
Okay, so with this, where are we?
We're in performance analysis.
I have lots of stuff on this.
I'll probably not be able to go all the way through,
so I might ask Martin to finish this up next week.
And it's mainly about understanding or getting
the tools to understand system performance.
This is still introduction.
Next time, next week, Martin will cover database basics.
And then we'll really dive into the
topic.
But the performance analysis is something that's useful, especially for systems development,
but of course, generally, if you do something where you want to be efficient, not waste
compute power and energy in general,
then performance analysis makes sense.
So you still remember the timeline, right?
So we're in the first week.
And I can skip this.
So we'll do a very quick introduction
to performance analysis,
then talk about these back of the envelope calculations
that I already hinted at yesterday. I'll give you two examples, one very concrete,
low-level example that I can also do on my laptop here. Then I'll talk about some measurement
and statistics details, and this is probably as far as I'll get.
Maybe we can cover some benchmarking and the FAIR benchmarking as well.
So that's interesting stuff.
Just to give you some ideas about how to do benchmarking, how to measure devices.
And why do we do this?
Why do we need measurement and benchmarking?
Remember the seven-step procedure for theses or papers that I showed you: right at steps four and five,
we need to do, or we should do, back-of-the-envelope calculations and experiments. Today, if you're writing a paper or a thesis in computer science, which is not on a
meta topic somehow, you will always have to do experiments, right? So you will always have to
benchmark something. You will always have to show that what you did works in practice to some extent.
And for that, you make experiments.
And in order to not be super surprised by the experiments, it makes sense to do some
back of the envelope calculations beforehand.
Because typically, people come up with ideas that seem reasonable if you think about it.
But thinking about performance is actually not so easy.
So a lot of it is basically integration.
So you have to basically integrate
over many different kinds of measurements,
and integrating is something our brain cannot do well.
So this is something where,
if we just guess, we're often very wrong.
And doing the calculations is very simple. So with that, we typically get
much better results. Okay, so why do we need benchmarking in general?
Well, if we build systems, we need to evaluate them. And as systems programmers, which we are in our group and you are in this course,
we're really always considered or interested in performance.
And this is not the only way you can experiment or how you can measure something.
I mean, usability would also be something interesting.
But for us, in most cases, it's really about,
how can we get most efficient?
Of course, that's not necessarily always
the best thing, right?
So just without looking left or right,
just running after the fastest performance often
will get you into some kind of corner where you lose the genericness of your
solution, where you really specialize for a very narrow use case. So that's not something you want,
right? You still want to be able to apply your solution to a broader set of problems, you still want to be able to use your system, right?
So that's, or also to evolve your system to new hardware, for example, at least newer
generations of the same hardware.
So that's something we should still have in mind.
But often the first goal is just to see, OK, how fast it is.
And this is really, if you think about database software,
this is often the first choice why
you would choose A versus B. So if somebody
buys Oracle over SQL Server, then this
is usually due to some bake-off performance benchmarking, right?
So they basically, both companies get the workload,
both vendors basically, so I don't know,
my super-duper company wants to buy a new database software,
I give to Oracle and I give to Microsoft my workload,
and then they are supposed to show me how
fast their system is and the faster one I will buy. At least that's how often it
works. Of course other things are important as well like energy efficiency
and price per performance. If you think about the cloud, the second one,
price per performance, will be super important.
If you think about EU regulations, energy efficiency is super important because with
new regulations, every company has to provide their energy bill or their energy consumption
and CO2 impact on paper and if the database is the thing that will basically really make this look
bad or give you a bad number, then you will probably not choose one product over the other.
Okay, still what we're thinking about is performance evaluation.
And within that, we can talk about benchmarking and analysis. And I really like to make this differentiation.
And I don't know if this is standard terminology.
For me, it's useful terminology.
So I mean, you probably can also say evaluation for analysis
or something like this.
But if you have a new system, if you build a new algorithm or a
special kind of operator or some hardware optimizations, then there's two things that
you can do. You can analyze it like the single thing that you did, right? Your operator,
your system, and check the individual optimizations. Basically, really look into where
does time go here, right? So, if I run my query on this, what actually happens inside and why does
it take the amount of time that it takes? And that's analysis, right? So, this is basically
doing a breakdown, checking what happens inside.
And then we have benchmarking.
And benchmarking typically means we're comparing different things.
So a classical benchmark is something that a carpenter has,
like somebody who produces furniture will have their workshop and a bench in there.
And if they make some
kind of specific type of furniture, they will mark the bench where to have the
right length. So we're basically checking the performance, we're marking the
performance and comparing it to others. So this is comparing multiple
systems on typically a standard or a real workload. So as I said, if you're a
company acquiring some kind of software, you will look at
standard benchmarks but you will also give them your workload and see what is
the performance on your workload. If you're writing a paper or a thesis, you
should do both, right? And you should always understand what these numbers mean. Way too
often I get proposals or theses where people say, well, the result of my benchmark is seven,
and I have no idea what that means, and the person who did the experiments has no idea what that
means. So then you know as much as you
knew before you did the benchmarking. So before you do an experiment, you
should have an idea what the result should be and how to interpret that result. And for this,
we often start with a basic modeling exercise, right? So we basically start with getting a rough
idea. And this is also, in order to understand system performance, we can do multiple different
things. So we can do these experiments or measurements in terms of analysis and benchmarks,
or we can just model the whole thing.
So we can just go about and say, oh, my system has these components,
and my system has, like my algorithm has these stages,
and this stages takes us, like, does these operations,
these operations typically take that amount of time.
So overall, my performance should be this and that.
We can also simulate, and in many cases this actually makes sense. In our case, usually, it doesn't. So in our case,
in terms of system performance, if we have the hardware and the software, then we do real experiments, or we model the software to get an understanding.
If we don't have the hardware,
or if we reason about something like,
oh, my peer-to-peer nice large-scale network something-something,
which uses a million different nodes,
I don't have a million different nodes that I can run something on,
so this stuff I have to actually simulate, right?
So I can use individual nodes, I can do benchmarking there,
but the whole large scale,
I will have to do some kind of emulation,
some kind of trace-driven simulation,
just have the whole thing.
And now we have these different routes that we can take.
We can do modeling, we can do measurement.
And the golden rule of validation tells us,
well, do not trust one of the results or a single technique,
but validate with another one.
And for us, this typically means
we're validating the measurements with a model.
So in essence, we're coming up with a very basic model
that gives us some rough performance idea,
and when we're doing the benchmarking,
we can check if the model or if our measurements are actually correct
because we have a mental model,
or a basic model, of what the performance should be like.
If this is completely off,
then either we know our model is wrong,
so we basically looked at like the bottleneck
is not what we thought the bottleneck is,
so we're spending time somewhere else.
Our measurement could be wrong.
I mean, what often happens, for example,
we have some kind of driver architecture that measures everything,
and the driver is too slow for
the system actually to provide enough data.
So then we're just measuring the driver rather than the system.
Or the measurements
themselves are not really collecting what we actually thought,
or there's some bug in our measurements, right?
Or there's a bug in our model also works.
So having the two gives you some safety.
You can basically reason about the performance
and then do the actual measurements.
And if there's something largely off,
if the trends that you expect are not
the same as what you actually see, well, then you know you can basically drill down
further. If everything is round about what you thought it should be, well, then you're basically
happy: okay, the benchmarks tell me what I actually expected, and we're good to go, we can
actually publish this. Okay, good. So with this, I'll show you a simple technique
for modeling. It's actually not really a technique, it's really just:
okay, do it. So you don't have to do much. You just come up with, OK, where
does time go in my problem or in my system?
So this, like, one person who really pushed this is Jeff Dean.
So he's a Google Systems guru.
So one of the, I don't even know his title right now,
chief architect senior or whatever.
So he's one of the guys who built MapReduce,
built many other systems, is behind TensorFlow, et cetera.
So one of the core systems guys in Google.
And he basically suggests these back-of-the-envelope calculations,
and I really like it and have also been doing it for a while
already.
And the idea is on the one hand,
just to get an understanding of performance
and also to get good enough performance.
So basically do I have a solution to my problem
that's good enough?
And often we come up with some kind of architecture.
We think, that might make sense, right?
So this and that, combining these and these techniques.
But then in the end, if I have too many parts, for example, my latency will go up.
Or I have one weakness in my system.
My bandwidth is too low, et cetera.
So things might get slow.
And as I said, just thinking about performance
will not give you a good understanding,
because our brains are not made for this.
So with our brains, if you just guess
what the performance of a system will be,
you're very often very wrong.
And that's very natural, happens to me as well.
And that's why just sitting down and doing some simple calculations on the back of the
envelope or if you're in Germany on a beer coaster, right?
So this should be the amount of calculation that you do, just basically picking the most important factors for your
performance and then calculating the performance within an order of magnitude.
So not 0.1 or 0.01 error on the performance, so I don't need like a very exact model. I just need to know within like
a factor of 10 that this is my performance. Because then I know, okay, if I'm 10 milliseconds
or 100 milliseconds, this is still interactive. If I'm 10 seconds or within a minute range, okay,
this is not interactive. So my users will probably walk away.
And if I can already estimate that something will be that slow, then okay, I have to do
something different.
And so this is why this is also called a bullshit filter, because you can filter out stupid
ideas early.
And as I said, this is not necessarily obvious right away.
So it actually is a good exercise.
Of course, if you then want to know for sure, you have to benchmark.
And then you can basically see what the actual performance will be.
And for this, there are a couple of latency numbers that are useful to know.
And I haven't updated this, right?
So this is from Jeff Dean.
This is probably 10 years old.
You will get slightly different and faster latency numbers here and there. Not everything is much faster, but it's
just to have an understanding of orders of magnitude. You don't have to memorize these;
it's just good to have an understanding, or have a way of looking up, what these
performance numbers are.
This is a bit small; probably not everybody can read everything.
It's not necessary. You can look at the slides. I'm going to tell you a few things here and there.
Right, so say for example you have an L1 cache reference. This is basically within a cycle, right? So this is
0.5 nanoseconds. If your frequency is a bit higher or a bit lower,
you can calculate how fast your CPU would do this. Then we have branch mispredictions. This
is basically whenever the CPU predicted, in an if statement, to go the true route.
So the CPU will always say,
I'm not waiting until I can actually calculate the branch,
I will just predict we're going that route.
And then it will already execute all the statements,
the instructions, within this branch.
If this is wrong, then we have to
go back and start the other branch. And if this happens a
lot, this is where a lot of time goes. So this is five nanoseconds,
roundabout ten times a single cache reference; it might even be up to 20 times a single cache
reference, so then up to 10 nanoseconds roundabout, depending again on the CPU
architecture. These are just averages, some rough ideas here. If we look
at an L3 cache reference, so not first-level cache but third-level cache, that could be something like 20 nanoseconds.
If we're going to main memory, we're in the 100 nanoseconds range.
If we're going to NVMe, we're in the 50 microseconds there already.
We can see a lot of shift today.
So newer systems might get even lower,
or newer hardware might get lower
latency numbers here.
If we're looking at disks, we're in the milliseconds.
Right, so this is, you can see this here, right?
So here we're already going further down.
And this is why the CPU basically has to wait so long,
because we're many orders of magnitude slower
than if we're still in memory.
So this just, as I said, some numbers.
And if you're really curious about current numbers,
write small programs and check yourself.
Write some small benchmark programs.
That's a good exercise to get an understanding of system performance.
There's also some throughput numbers that I have.
So here DDR4 channel
bandwidth is, for example, 20 gigabytes per second. So this is something that I would
have here on this system, for example. Again, this one is a bit special. So the M1 has a
unified memory architecture. You can get a bit higher general bandwidth on this system.
Other systems might be slower. Now there's DDR5.
You might get even faster performance there.
Old PCIe Gen 3 channels have 12 gigabytes per second.
Marcel will tell you a lot more about newer PCI Express
versions where you can get like much higher
throughput as well.
NVMe, this also has a large range, right?
So you can be in the 10 gigabytes per second range
these days already.
Yeah, this bandwidth and NVMe IOPS,
we'll talk more about this later.
So, but this is, as I said,
so having a rough understanding of these numbers
will later on help you to calculate or also to better estimate performance. And we'll do this
right now with a very simple example. So also from Jeff Dean, a simple example is on how long to generate an image result
page with 30 thumbnails.
So this is basically a Google image search problem.
So if you're looking for the newest whatever cat images,
you're going to Google and you're saying, well,
give me nice cat images.
So then Google has to produce one of these pages for you.
And then they have to think, OK, how do we do this?
So how can we do this with good enough performance?
And a very simple approach would be just read these individual images sequentially.
So each of the images is 256 kilobytes, for example,
and we're reading them serially, one by one, on the fly. Google's images are of course many, right? It's a
large database, so this will reside on disk because it's the cheapest.
because otherwise it's not going to be efficient cost-wise.
So if we just do this sequentially, then we'll have to seek for each read.
So remember rotating disks, right?
So the disk is spinning.
The arm first has to find the right track,
so the head moves to the right track. This kind of seek takes 10 milliseconds
per seek, and then we have to read these 256 kilobytes at around about 30 megabytes per
second; it can be up to 100 megabytes per second on a good disk, but not much
more. So you're ending up with 560 milliseconds, which is half a second, and that's just on the
server side. At this stage you haven't transferred a single image to the user; this is
really just on the server side. So this is clearly slow, right? Half a second.
If you have to wait for two seconds for the first image to pop up or something,
you will probably say, well, Google is long gone.
I'm going to Bing now, right?
So there they have to do something else.
Okay, so what else can we do?
We can issue the reads in parallel.
Because these are many images, right?
And they are stored on many disks.
We don't need to wait for a single disk to respond.
We can read all of the images in parallel.
There's going to be thousands of disks.
We can replicate the images to some degree.
So doing the 30 reads in parallel on 30 different disks
will not be a problem. And if we do this, then we count only one seek, right? Because all of the
seeks are going to be in parallel. And we only count one read, so one times the 256-kilobyte read,
because again, all of the reads will be in parallel.
We're doing the same amount of work, but we're parallelizing.
So in that sense, or in this case,
we have 80 milliseconds for the server-side processing,
which of course ignores any kind of variance, right?
So if you're doing this massively in parallel,
you will have some stragglers.
Some disks will be slower,
some stupid other job will go in your way
and will take some of your disk time, et cetera.
So this will take a bit longer,
but we're definitely much faster
than the initial sequential execution, right? So here
we're basically in 80 milliseconds, which is fine on the server side, then only the data transfer
is a problem for us. Okay, and of course we can do much more, right? So rather than doing this for
every single image every time, we can also do caching. We can cache individual images, we can
cache whole thumbnail data sets, we can pre-compute the thumbnails, so then we have to read less.
This still won't help us with the seeks, right? If we have the thumbnails on disk, we still
have to seek them, and this is a constant factor on a disk. So we'll still have to pay something, but again, the read itself will be lower then.
And you might come up with many other ideas,
but just guessing what might be a good idea
won't help you much.
If you calculate, you will actually find out
what is a promising idea, right?
So doing the back-of-the-envelope calculation helps you identify the most promising
things. And often you need a very simple prototype just to get these latency numbers initially.
If you are already aware about your infrastructure,
so you have a good understanding of the hardware,
then you know many of these numbers.
Before that, you will need to do some very simple benchmarking,
some very simple prototype to get an understanding how fast
is your disk actually.
So if you have a large RAID with many disks,
you have everything distributed
just the reads might take more time than the raw disk speed would suggest, for example. So these kinds of
organizations matter, but it doesn't have to be a complete system, and that's important, right? So
you can get these numbers fairly quickly. And to my students I always say, when we do some paper prototyping, etc., these kinds of numbers, you should be able to produce them within a day maximum.
Ideally, within a few hours.
You come up with a rough understanding of the performance of your solution.
So let me give you another example.
And that's sorting.
And sorting is something that we need in databases all the time.
So many database operators are based on sorting,
so this is something that we actually need to do.
And so my question right now to you,
if I have one gigabyte of four-byte numbers in memory, how long does it take to
sort them? Who has an idea? Just guess. What do you think?
Just a rough idea, guess.
One gigabyte would be in main memory?
Everything will be in main memory, yes.
And then we have...
Just guess. Don't calculate too much.
I really just want guesses.
10 seconds?
Do we get another number?
And don't be shy.
Really, this is not about pointing out stupid guesses or something.
It's really just about, do you have any idea?
Okay, I'm assuming everybody who doesn't say anything
has no idea how long this will take.
All right, so, yes?
Between 10 and 30 seconds.
Between 10 and 30 seconds.
OK.
Who says less than 10?
Who says more than 30?
Who says in between? Who already checked the slides?
OK.
Good.
We'll see.
So let me keep the tension for a bit, right?
Even though most of you, or some of you, already have an idea.
I got more variance in the numbers in previous years.
I should maybe change the example.
OK, so how can we go about this?
So one gigabyte is 2 to the power of 28 integer numbers
on most 32 or 64 bit machines.
So meaning, if we have int numbers,
that's four-byte numbers basically.
So if we do sorting, what do we have to do?
So who remembers how sorting works?
Yes?
That's about 30 or 28.
Yes.
And then I assume one second per gigabyte.
OK.
So that's good.
So 30 or 28.
We have n log n.
We have to basically read the whole thing.
We have to read the data
and we have to make these comparisons.
We have to compare the data, right?
We have to basically read and write the data
and we have to compare the individual numbers.
That's basically what we need to do.
And we need to do this n log n times, that's the number of comparisons,
and we need to go through the whole data set basically a couple of times. And you already
said it, we basically need to do this 28 times. Roundabout 28 times we need to go through the
whole data set for all the different comparisons.
So we said comparisons and reading data.
These are the two factors.
If we have comparisons, what does this result to
in terms of code?
I think we get an if statement and the if statement produces branches, yes, and branch mispredictions. Exactly.
How many branch mispredictions will we have? If you assume completely random data, the CPU hasn't got a clue what to predict. Will it be 50%
or 10% or 15%?
Very good.
So on the one hand, we have to read the data, we have to write the data, and
on each comparison, of which we will have n log n, the CPU
does a branch prediction, and since the data is
randomly distributed, half of the predictions will be wrong.
So that's about 50% will be wrong predictions.
Okay, so again, we have quicksort, for example,
we have n log n.
We do log of 2 to the power of 28 passes,
so 28 passes over 2 to the power of 28 numbers. And we do basically
these individual comparisons, so 2 to the power of 28 times the logarithm of 2
to the power of 28, which of course is 28.
And that is roundabout.
So just to calculate these numbers,
that is so 28 in powers of 2 would be roundabout 2
to the power of 5.
So it's a bit less.
So it's in between 2 to the power of 4
and 2 to the power of 5, but we say 2 to the power of 5. And then if we multiply 2 to the
power of 5 with 2 to the power of 28, we get these 2 to the power of 33 comparisons.
power of 5 with 2 to the power of 28, we get these 2 to the power of 33 comparisons.
If you say half of those are mispredictions, then we have 2 to the power of 32 mispredictions, right?
Because we just divide by 2.
And if we say 5 nanoseconds per mispredict, then all in all we get 21 seconds of mispredictions, right?
And now we just have to additionally go through the data.
So mostly sequentially read through the data.
So that's 28 passes of these two to the power of 30 bytes.
So of this one gigabyte,
that's 28 gigabytes roughly that we need to read.
If we say we have 20 gigabytes per second of memory bandwidth,
then this costs us round about 1.5 seconds.
So all in all, we're round about 22 seconds, 23 seconds
of what we assume this sorting of one gigabyte on a CPU will take.
So, quite good guesses.
I'm positively surprised you understand a lot of our performance already.
So, that's an assumption, right?
That's a model.
So, now maybe let's try.
So, let me quickly delete what I already had.
If I still figure out how to do this. Yes, so what I have here is just a simple program.
I hope you can read this and I hope I can find my mouse.
Let me see.
Okay, so what does this do?
So this is a very simple C++ program.
I also have it on GitHub.
You can basically check this at home.
I have an integer array or vector of size 2 to the power of 28.
So this is the number of integer numbers that we need. We'll just put this in memory.
We're keeping a counter for the time.
Then we're randomly distributing the data
across this array.
So it's first just randomly filled.
And then we basically start the timer.
Or no, we don't even start the timer yet.
First, I basically output a couple of lines, or the first couple of lines of the array.
So, you can see, okay, initially it's not sorted.
Then I start the timer.
Then I sort with just standard sort.
So, this is a fairly advanced sorting algorithm.
And this is also why we really need the whole domain.
So, we need to basically have a large enough data set
and we need to fill the data set fully randomly,
with the whole integer range,
because if we're using smaller numbers
then this will just be an insertion sort.
So then it will basically just be directly
sorted into a large enough area and will be faster than the n log n that we would expect.
So we're sorting. Then after the sorting, we're stopping the time. We're outputting again
a couple of lines to see that this is actually sorted,
and we're outputting the total sort duration.
And this is in Xcode, so I have to find, I'm just running this,
and you can also run this in command line.
It doesn't really make a huge difference in terms of performance.
So now we're building and running, and this is single-threaded, right?
So I'm also not super worried that PowerPoint will basically steal all the performance here.
And you can see, oh, we have one gigabyte of data, four byte integer size.
So I just put this out for reference. We have a couple of initial numbers, which you can see are randomly distributed.
Now it takes, of course, a bit of time.
So we said 20 seconds roundabout, hopefully not five minutes.
Otherwise, I'm going to wait here for quite some time. And then you can see, well, this was actually slow today,
but you can see here we have the numbers
that are now the first couple of numbers.
So this is basically, you can see,
well, we have only small numbers in the beginning.
It took 32 seconds, which is really slow.
So I've never had that slow of a number.
But it's within an order of magnitude of what we calculated.
So again, we have the complete sorting program.
If I run this while the computer is not doing PowerPoint presentation and everything,
I get something around 20 seconds, which is really close to what we see.
We can also do some counting.
We'll see that the mispredictions are roughly
the number that we would expect.
So, very nice, right?
So, this basically really shows us that
with some basic calculations,
you can estimate performance quite well,
or at least get roughly the right performance numbers.
And although you already had a good idea
about the performance here, in other cases you probably won't.
And I'm not sure how certain you were about these timings, right? So it makes sense to do these simple calculations to get a good understanding.
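To make the back-of-the-envelope idea concrete, the arithmetic can be written down in a few lines (the cycles-per-comparison and clock-rate constants below are illustrative assumptions, not measured values):

```cpp
#include <cmath>

// Rough n log2 n comparison count for sorting n elements.
double estimated_comparisons(double n) {
    return n * std::log2(n);
}

// Estimated runtime in seconds, assuming an average cost per comparison
// (in CPU cycles, including branch-misprediction penalties) and a clock rate.
double estimated_seconds(double n, double cycles_per_cmp, double ghz) {
    return estimated_comparisons(n) * cycles_per_cmp / (ghz * 1e9);
}
```

For 1 GB of 4-byte integers, n = 2^28, so log2 n = 28 and we get about 7.5 billion comparisons; with an assumed handful of cycles per comparison on a roughly 3 GHz core, that lands in the tens-of-seconds range, consistent with the measured 20 to 30 seconds.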
And with that, we're going to do a quick break here, four-ish minutes, and then we will talk
about measurements and metrics. Questions up to here?
Yes?
How much of this do you think can be parallelized?
How much speedup can we get?
Well, I mean, it depends on the
parallelization, right?
Is it parallelized completely?
Yes, so depending on the number of cores, of course,
to a certain degree we can probably
get up to an almost linear speedup
if we're doing it right in terms of parallelization. Overall, of course, you still need to
do the final sorting.
Off the top of my head, I'm not going to tell you a number
because it might be slightly off.
But I would say on this machine,
if we parallelize,
so I have 8 cores,
I would assume I can get a performance boost of a factor of six or something like that.
If you think of it as a geometric series,
then the first half takes one, the next half
takes one half, then a quarter, and so forth.
So the question
is what kind of performance with an arbitrary number of machines
would you get?
And you say the total performance would be a factor of two maximum?
Instead of multiplying everything with a factor of 28, you would only multiply with a factor
of two.
Oh, you mean for one gigabyte exactly, if you parallelize.
I'm not following what you're suggesting right now. Yes?
Basically, the assumption, the argument is
that if we go to infinity with the number of cores,
then we can't go below two seconds.
Oh, we can't go below two seconds of runtime.
Because of the performance that we have in terms of reading and branch mispredictions, you're saying?
Yes, because we have that final sweep, that's one second, and then we have...
Yeah.
Well, you're saying the final writing of the data,
or going through the data once at the end, that's 0.5.
We have 20 gigabytes per second.
If we have one gigabyte, we have 1 over 20, which is 0.05.
So then if you say a factor of 2, then we would be at 0.1
seconds, if you're saying that the reading or the writing would
be the dominant factor.
So the sorting of the last layer,
that's going through one gigabyte, costs one second?
No, going through one gigabyte doesn't cost a second.
Going through one gigabyte, we have 20 gigabytes of memory bandwidth.
Yes, but sorting, like the final step of sorting, the last merge operation.
The last merge operation, okay.
Only if we do a merge. I mean, if we're completely parallelizing, we're probably not going to use quicksort
as is, right?
So then we would do something else.
We would do some kind of merge step:
we would sort individual sub-ranges and then merge again.
We'll actually talk about parallel sorting
later in the lecture.
So let's maybe follow up on this once we're doing
parallelization strategies and we can do another
back of the envelope calculation for that one.
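As a preview, the sort-sub-ranges-then-merge idea could be sketched like this, a simplified two-way version rather than one of the full parallel sorting algorithms covered later:

```cpp
#include <algorithm>
#include <cstdint>
#include <future>
#include <vector>

// Sort the two halves on separate threads, then merge them.
void parallel_sort_2way(std::vector<int32_t>& data) {
    auto mid = data.begin() + data.size() / 2;
    // Sort the first half asynchronously while this thread sorts the second.
    auto fut = std::async(std::launch::async,
                          [&] { std::sort(data.begin(), mid); });
    std::sort(mid, data.end());
    fut.get();
    // Final sequential merge step -- this is the part that limits the speedup.
    std::inplace_merge(data.begin(), mid, data.end());
}
```

The sequential merge at the end is exactly why the speedup cannot be fully linear in the number of cores.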
And with that, you still get a few minutes of break.
Okay, so measurements and metrics. As I said just before, the back-of-the-envelope calculation really
is a very powerful tool to give you a good understanding of performance.
And the same goes for everything else.
Just reasoning
about these kinds of things on paper
is a lot more efficient than doing it only in your head.
So I always suggest doing this.
If you think about an architecture,
something like this, putting it on paper and
thinking about how something works, what the parts and steps are, and how much time each takes,
helps a lot. It makes your life much easier, and your thesis or your project becomes much more efficient.
It also gives you a good sense of when a research idea is a dead end.
And there are so many such research ideas out there. Not necessarily stupid, they might look nice,
but all too often we figure out that, well, it's a nice idea,
but it doesn't have any impact.
So it doesn't have a real performance benefit or any other
useful impact, and then the problem is that you basically have a nice idea but you cannot
motivate it, and that brings you to this problem of, oh, my whole thesis doesn't really
make sense, right? The idea is nice, but there's no performance benefit whatsoever. There's no practical benefit.
And you can rule that out very early, at least to a very high certainty.
Okay, so with that, let's go into measurement and metrics.
And so with this, a bit of terminology.
So if we do measurements or we do benchmarking
of a database system,
these are usually the kinds of things
that we have to take care of.
So we have data that's often delivered
by something like a data generator.
If we're doing benchmarking, we have a workload
that again is often delivered by a workload generator,
or both can also be some kind of traces or data sets
from real world setups.
And we have a driver that needs to feed this into the system
under test.
And the system under test, if we're doing benchmarking,
is usually some form of black box.
So we have it as is.
If we're doing analysis, we're trying to break this open
as much as possible.
If we're benchmarking an existing system,
I don't know, again, SQL Server,
Oracle, what have you,
or a cloud database,
there's little we can do to look
inside, right?
If we're analyzing something,
then we really try to crack this open.
If you're building your own algorithm, you really want to know where does time go when.
So say you're building a new improved version of Quicksort or a parallel version,
then you actually want to know where does time go.
And for that, you need some tooling, some connectors.
And then once you run the benchmark, you collect metrics,
different kinds of metrics, or certain metrics that you want to collect, and these come in the form of individual measurements.
So the system under test is a deployment, and that's also important to keep in mind, right?
So you cannot measure an algorithm or a system without hardware.
And the hardware will have an influence on the performance.
So running two different things on different kinds of hardware
will give you different results.
Or running the same thing on different hardware
will most likely give you different results.
And in our case, if we do this on a low level, the benchmark, then the hardware has a very
high impact, has a very clear impact on where time actually goes.
If we are completely bottlenecked by some larger factor, so, I don't know, our system is somewhere in the cloud
and the main factor is actually the network connection to the cloud,
then it doesn't really matter that much what the system itself does.
So, this will be a minor factor.
However, if I'm running something on my laptop,
then the CPU, and I'm somewhat hardware conscious,
then the CPU will have a large impact and the architecture of the CPU.
Then I need to somehow understand this.
The workload typically are some requests by users and more often than not, requests by
other systems. So typically you're not building a system that is kind of a huge monolith
where then actual people are working there and typing their queries
and interacting directly.
But often this is one part of a larger puzzle
and other systems will connect to this.
And the metrics then are the criteria that are used for evaluation.
And of course, if we do benchmarking, we need to answer a couple of questions before.
So which kind of scenarios do we want to evaluate?
What kind of data makes sense for that?
What kind of workloads should we use?
And once we have answered these questions,
we can see whether there are already benchmarks that we can use.
So I don't have that many slides on different kinds of benchmarks,
but I'll introduce some or at least name a few that you can use.
There are standard benchmarks that you can use.
And the good thing about standard benchmarks
is other people know them as well.
And if you produce numbers for them,
they know what these numbers should be
and what these numbers mean.
If you come up with your own benchmark,
only you can interpret the numbers, unless you give a
lot of details about the benchmark itself.
Then we can think about the hardware and of course the software that we should run and
the metrics, so what to measure, how to measure them and how to compare.
Some metrics are actually quite hard to collect. Something like a branch misprediction,
I cannot measure an individual branch misprediction.
For example, I mean there is no tooling on the PC
that gives me fine grained enough timers to say,
okay, one branch misprediction takes exactly
point four something nanoseconds, right? Or something like this,
or five or three or whatever nanoseconds. I have to do this somehow aggregated. I have to know
what's actually going on and then divide by something. Also, if I measure at this fine-grained
level, then the measurement itself will take most of the time,
because I'm writing data all the time while I'm actually trying to measure something that takes very few CPU cycles. So there, again, I have to be smart about how to measure.
And of course, the hardware and the software need to fit the problem that I'm looking at. If I'm talking about server-scale workloads and systems,
it doesn't make sense to run this on a laptop,
because the hardware is just different.
The memory is larger, etc.
And the workloads also need to fit whatever I want to do.
The common problem is running way too small
and way too short benchmarks. Then basically everything will be in caches, and of course
I get very different performance than if I actually have to hit memory or
disk with my setup. So basically, we want to find out what's going on.
There's a couple of common metrics.
If we're talking about performance, we usually want throughput or bandwidth, latency.
Then we could have something like accuracy or capacity,
and I'll say more about those in a moment. The minimum that we want,
if we do real performance measurement
as we do here,
usually is latency and throughput.
I don't want just one,
because basically I can game one
with neglecting the other, right?
If I don't consider latency,
I can drive up my throughput a lot just by queuing.
So by just accepting request after request, doing nothing, while having a very high latency.
If I just go on latency, basically, if I have a very low throughput, the CPU is almost idle.
I can get much better latency than if the CPU has to do a lot of other stuff.
So that's why looking at both definitely makes sense.
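To make the "measure both" point concrete, a toy benchmark driver might record per-request latencies and derive throughput from the same run (all names here are hypothetical, not from any real benchmark framework):

```cpp
#include <chrono>
#include <functional>
#include <vector>

struct RunStats {
    std::vector<double> latencies_s;  // one entry per request, in seconds
    double throughput_rps;            // requests per second over the whole run
};

// Run `request` n times, recording each latency and the overall throughput.
RunStats drive(const std::function<void()>& request, int n) {
    RunStats stats;
    auto run_start = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        request();
        auto t1 = std::chrono::steady_clock::now();
        stats.latencies_s.push_back(
            std::chrono::duration<double>(t1 - t0).count());
    }
    double total = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - run_start).count();
    stats.throughput_rps = n / total;
    return stats;
}
```

Because both numbers come from the same run, gaming one at the expense of the other shows up immediately.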
Accuracy is something that you often need in machine learning, of course.
And capacity is basically: where do we get a good trade-off of throughput
and latency?
That is where we're at our capacity.
Then things like fault tolerance are interesting.
We're not going to talk about this here, but this is something if you're thinking about
large scale setups, cloud data centers, stuff like this, time to failure and availability
is important for you.
Also, efficiency in terms of energy, cost, and maybe fairness
might be interesting to you.
And efficiency also generally makes sense to think about, right?
So while performance is always good,
thinking about cost performance makes sense.
I mean, we have lots and lots of expensive hardware.
But just getting like a few percent more performance
by spending twice the amount of money
that is not really efficient
or by spending twice the amount of energy
is not really efficient.
So in a practical setup,
we want to make sure that what we're doing
actually also makes sense size-wise and efficiency-wise.
And then we have scalability.
So can we scale up?
Again, scalability is easy to game.
I'll come to you in a second.
It's easy to game if your baseline performance is very bad.
So this is what many research papers also do.
So they basically have a baseline performance
on a single node that's super bad.
So then it's very easy to get linear or even super linear
scalability if your system is super slow on a single node.
You should always compare if you're scaling up,
you should compare it to a really good baseline.
So what's the best baseline that you can get on a single core
or a single thread?
That's your baseline.
And how much can you get performance over that
on multi-core, multi-node, et cetera?
That's usually a better comparison
than compared to your own maybe scalable but really poorly
performing algorithm.
Your question?
Just real quick, could you
say how efficiency relates to fairness?
Fairness, in essence,
means we have maybe many different kinds
of requests,
and we actually answer all of them.
We could also get better performance
by neglecting requests that cost us more.
Here, it's not about fairness as a broader meta-topic,
but really: if I have many services or many users
or many components that use a certain resource, do we schedule
it efficiently but also fairly, in the sense that everybody gets a share?
In general, we want to select metrics having low variability.
So if we have metrics that are all over the place,
every measurement is completely different,
then probably we cannot get much from these measurements.
Again, this is basically also in essence measurement practices.
Do we get good measurements or do we get stable measurements,
or not?
If we have very unstable measurements,
the numbers don't help us as much.
We also don't want to measure the same thing over and over
in different ways.
I mean, for databases, it's very common
to have something like requests per second
and throughput, something like this.
So throughput in terms of queries per second
and in terms of gigabytes per second.
That makes sense because they're not completely the same,
but there's already some redundancy in there.
So having yet another five metrics that would measure the same thing
would just be unnecessary.
So you don't really want to have this.
But at the same time, you want to be complete.
This is what I said about throughput and latency.
You want to make sure that you capture the whole performance
because you can game one without the other.
You can basically, if you're just, as I said, just looking at latency and not at throughput,
you can improve your latency a lot and vice versa.
Yeah, throughput and latency, some examples.
So throughput could be requests per second, could be concurrent users that you're having, could be gigabytes per second processed, etc. So there's many different ways. Of course,
you should always describe the whole setup, right? So gigabytes per second is different if a single
request is one byte or a single request is a megabyte.
You get very different numbers,
really depending on the setup basically.
Same of course for requests per second or concurrent users.
So you need to specify more in order to give useful numbers.
Also, it's very different if the dataset that I'm
using is one megabyte or 10 terabytes,
basically where I'm selecting from, for example. Latency can be the overall execution
time, so end-to-end latency, or it can be per-request latency. So I can either measure
some kind of average latency, or I can measure the latency of each individual request.
If I'm measuring end-to-end latency,
I might hide some kind of latency spikes in between
if I have some problems somewhere,
like some blocking or something in there.
So I might get better throughput and latency numbers
by just averaging out than if I really
look at everything in detail.
And something a lot of people report are percentiles.
And percentiles are nice, but percentiles are lossy.
So with percentiles, you're basically
looking at the 99th or
95th percentile latencies, which is good.
But if, say, you have a web search engine
where each individual request creates
a hundred different interactions with other services,
then you're constantly beyond the 99th percentile
with your requests.
So then targeting a 99th percentile on each individual server interaction will not be
useful for you.
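A percentile is itself simple to compute from recorded latencies; a common sketch (the nearest-rank method) looks like this:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Nearest-rank percentile: sort a copy and pick the value below which
// roughly a fraction p (0 < p <= 1) of the observations fall.
double percentile(std::vector<double> xs, double p) {
    std::sort(xs.begin(), xs.end());
    std::size_t rank = static_cast<std::size_t>(std::ceil(p * xs.size()));
    return xs[rank - 1];
}
```

Note that this throws away everything above the chosen rank, which is exactly the lossiness discussed here.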
Also, if you're looking at large scale, you will run into
the high tail latencies more quickly.
So this is basically something out of a paper
by Jeff Dean and Luiz Barroso,
where they looked at these tail latencies.
So if you have the 99th, the 99.9th, and the 99.99th percentile latencies,
or failures, for example, then this
shows how quickly you will actually run into these latencies,
that is, what the probability is of seeing a latency beyond the 99th percentile.
And here, for example, it's too small even for me to read,
but at 100 requests you already have about a 60% probability of hitting the tail latency.
And this, of course, very quickly approaches one.
For the higher percentiles
it might take longer, but eventually your tail latencies will hit you.
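The fan-out effect can be checked with one line of math: if each sub-request independently stays under its percentile with probability p, then at least one of k sub-requests hits the tail with probability 1 - p^k (a simplified independence assumption):

```cpp
#include <cmath>

// Probability that at least one of k parallel sub-requests exceeds the
// given per-request percentile (e.g. p = 0.99 for the 99th percentile),
// assuming independent sub-requests.
double tail_hit_probability(double p, int k) {
    return 1.0 - std::pow(p, k);
}
```

With p = 0.99 and k = 100 this gives about 0.63, consistent with the roughly 60% figure mentioned here.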
So that's why I'm actually not a big fan of tail latencies or percentiles. I'm
more a fan of averages and maximum and minimum numbers. Then we have capacity. Capacity is a bit more vaguely defined. It's basically how
much do I get into my system, right? Sometimes we call this sustainable throughput.
Basically, how much can I fill my system before it gets slower? And there's the nominal capacity, which is basically
the maximum throughput
under ideal conditions.
So this is how many requests per second
my system can do unless something happens,
unless somewhere garbage collection comes in
or somebody has PowerPoint open on their laptop, something like this.
So all of a sudden we're slightly slower.
But under ideal conditions, this is the maximum throughput.
Then we have the usable capacity.
This is basically, in a running system,
with everything else going on,
the throughput that I can achieve without
ever breaking the system. So here I already get slightly varying
and higher latencies, but we're never going to queue up.
We never reach the stage where, all of a sudden,
the system is not able to fulfill all of the requests.
At the nominal capacity, so the maximum
throughput, this can happen:
we're at the maximum throughput,
and as soon as something goes bad, I will just
start queuing, queuing, queuing.
I cannot fulfill the requests anymore.
At the usable capacity that just doesn't happen,
but my latencies might go up quite a bit.
And then we have something that's called the knee capacity.
And it's the question if you always can clearly see a knee.
This is basically where I'm starting
to fully fill my system, right?
Where I'm starting for the first time
to see some latency hits.
Well, as long as I'm underutilizing my system, basically I can just increase the throughput or the
number of requests without seeing any hits in latency.
So I can just go up.
At a certain point, I will start seeing the first bottlenecks,
and this is where the knee comes in,
where all of a sudden my throughput gains flatten out
and my latency starts going up.
Initially, as long as I'm underutilizing my system,
the latency should be stable,
while I can get more throughput.
And then, somewhere
in between these two points is where you want to be
with your throughput.
You don't want to be below this, because then
you're completely underutilizing your system.
You don't want to be above, because all of a sudden
you're overloading the system, and then for sure
your throughput will go down,
because all of a sudden you're queuing and the system just gets more and more management tasks to do.
Okay, so then some statistics. This is going to be the last thing we do today;
the benchmarking I'll leave for next week.
And this is just high level,
just for you to think about and get a bit of an understanding here.
If you really want to learn something about statistics, there is a really nice lecture, even this semester, on statistics. I really recommend
going there, because there you will actually learn something about this. Here it's more a
very basic understanding of how we can reason about experiments statistically. Also, I'm not a
statistician, so every time I want to use this, I basically have to open a book and
check how this works again. Okay. In general, if we're doing experiments, we at least try to set them up so that they are independent events, right? So that one
experiment does not completely influence the other.
So that's a precondition to reason about this from a statistic point of view.
I mean, of course, you can also have dependencies in there,
and we will have a lot of dependencies.
So the first run of our experiments, the server is still cold,
the caches are not warmed, etc., all this kind
of stuff, we get different performance.
We want to get to a point where we can run experiments in independent fashion.
Each experiment basically is its own independent run.
In general, independent events are something like successive coin throws.
Hopefully, they are independent.
In practice, of course, it also makes a difference.
The more often you have a coin in your hand, the more sweat is on the coin.
It will influence your coin throw in one way or the other.
From a statistics point of view, each coin throw is completely independent;
we don't see any effect of one coin throw on the next.
And if that's the case, then we can talk about this statistically. Then we have a random variable, which is basically a specified set of values,
each with a certain probability.
So this could be uniform probability.
It could be some kind of Gaussian distribution.
Body mass, for example,
has a Gaussian distribution.
How heavy are people?
How tall are people?
Execution times, for example,
usually have a Gaussian distribution,
and so do errors in experiments.
Then you can calculate a cumulative distribution function, which is basically
the probability of all of the values of a random variable
up to a certain point. So say, for example, body mass again:
if we have a Gaussian distribution,
it runs from zero to, say, 300 or something.
That's the range where we could see body masses,
and probably everything, for adults at least,
will be in the range of, say, 40 to 120 or something like this. So this is where the function gets
close to one, right, going from zero up close to one, but we'll also have some outliers here and there.
And we have a probability density function.
The cumulative distribution function tells us how many we see up to a certain point;
there we integrate, basically.
The probability density function tells us how many are at a given point, right?
So this would be, say, for example, our experiment.
If we have a certain error in our experiment, then the peak of the density is the actual measurement
that we want to see, and we'll have some positive and negative outliers around it; the values near the peak are the most
frequent ones. And the cumulative distribution function is at one
where we have seen all of the events, all of the experiments, and in between, here, we have already seen half of
the events. And this can also be shifted, which means we have, for example, a bias somewhere.
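For the normal-distribution case specifically, both functions have standard closed forms (a sketch; the CDF uses the error function):

```cpp
#include <cmath>

// Probability density of a normal distribution at x: how likely values
// near x are.
double normal_pdf(double x, double mean, double stddev) {
    const double pi = std::acos(-1.0);
    double z = (x - mean) / stddev;
    return std::exp(-0.5 * z * z) / (stddev * std::sqrt(2.0 * pi));
}

// Cumulative distribution: probability of observing a value <= x,
// i.e. the integral of the density from -infinity to x.
double normal_cdf(double x, double mean, double stddev) {
    return 0.5 * (1.0 + std::erf((x - mean) / (stddev * std::sqrt(2.0))));
}
```

At the mean the CDF is exactly 0.5, which is the "half of the events already seen" point described above.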
And then, rather than having this continuous,
we can also have a discrete random variable.
This is basically whenever we don't have a continuous space,
so when we have individual instances,
individual categories, or individual buckets.
Say we're sorting into hash buckets, for example:
we have individual buckets,
and the probability of these buckets will be a probability mass
function, not a probability density function,
because a probability density function is continuous.
Then we can calculate the mean or the average.
For a probability mass function, we can calculate this directly
by summing up the individual values weighted by their individual probabilities.
If we have a probability density function,
we have to integrate instead, right?
Because it's a continuous space.
The variance is basically how wide this function is.
So how wide is the distribution, right?
If we go back here, this is basically the
width of this probability density function, and how wide it is, we can calculate with
the variance. Again, in the interest of time, I'm not going to go through the formulas here.
Just so you have an idea, we can also calculate the coefficient of variation.
That's the standard deviation divided by the mean:
we get the standard deviation out of the variance, divide it by the mean, and that's called the
coefficient of variation.
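As a small sketch, these summary statistics can be computed directly from a set of measurements (using the sample variance with n - 1 in the denominator, the usual choice for measurements):

```cpp
#include <cmath>
#include <numeric>
#include <vector>

double mean(const std::vector<double>& xs) {
    return std::accumulate(xs.begin(), xs.end(), 0.0) / xs.size();
}

// Sample variance: average squared deviation from the mean, over n - 1.
double variance(const std::vector<double>& xs) {
    double m = mean(xs);
    double sq = 0.0;
    for (double x : xs) sq += (x - m) * (x - m);
    return sq / (xs.size() - 1);
}

double stddev(const std::vector<double>& xs) { return std::sqrt(variance(xs)); }

// Coefficient of variation: standard deviation relative to the mean.
double coeff_of_variation(const std::vector<double>& xs) {
    return stddev(xs) / mean(xs);
}
```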
And finally we have the normal distribution, which is the most frequent distribution
that we will see in our experiments.
I mean, you know the uniform distribution,
where all of the values have
the same probability; many times,
if we have randomly distributed data,
we'll find some kind of uniform distribution in there.
However, if we look at experiments, we will
often have some kind of little variations.
And these variations come from different kinds of errors
and different kinds of sources.
And there, through the central limit theorem,
and also in practice, what we actually see
is that we have many small observations
that aggregate together, right?
So if I'm doing my experiment here, it's not just the CPU that counts, or just
one part like the branch mispredictions, but also other things:
How are the instructions queued?
How are the caches warmed?
Where is the data actually located? All of these small things come together,
and they all have their individual distributions.
And since they all are aggregated,
so we have a large number of observations
that are aggregated,
in the end we will get a normal distribution.
So basically all these small experimental errors
will typically accumulate into some kind of normal distribution.
So we can usually measure these normal distributions,
and we should also report them.
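The aggregation argument can be illustrated with a tiny simulation: summing several independent uniform error contributions already yields an approximately normal total (the counts and seed below are arbitrary):

```cpp
#include <random>
#include <vector>

// Simulate `samples` measurements, each the sum of `k` independent
// uniform(0, 1) error contributions. By the central limit theorem, the
// resulting sums are approximately normal with mean k / 2.
std::vector<double> simulate_sums(int samples, int k, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> u(0.0, 1.0);
    std::vector<double> sums(samples);
    for (auto& s : sums) {
        double total = 0.0;
        for (int i = 0; i < k; ++i) total += u(gen);
        s = total;
    }
    return sums;
}
```

A histogram of the returned sums shows the familiar bell shape even though each individual contribution is uniform.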
So in many cases, in systems, database systems, the variance is quite low, such that we don't
even report it.
So if the variance is very low, our measurements are very stable, then we're kind of okay.
We're saying we're not looking at it.
As soon as there is a significant or measurable variance,
then you would usually report it.
So we would report something like two times the
standard deviation:
one standard deviation around the mean covers about 68% of the observations,
and two standard deviations cover about 95% of the observations.
And this is what we typically report if we do these kinds of experiments.
And you might see this with these small whiskers in your experiment.
And if your standard deviation is larger than your actual measurement,
then you know, okay, then the comparisons are going to be hard.
So there's a lot more statistics to this.
If you're curious about this, there's this nice book by Raj Jain that I had in the beginning.
If you want to summarize measurement data,
then you can use the sample mean, rather than calculating
the actual mean, which we cannot do, right?
I mean, we cannot do infinite measurements.
So we use the sample mean: we sum the observations and divide
by the number of observations, which
is the usual mean calculation.
Often, we also want to have the median.
So the median is basically we're sorting all of the observations
and take the middle one.
This helps us deal with very wide outliers.
Every now and then the computer might have a complete hiccup and do something completely different.
My thread is scheduled onto something else, and all of a sudden I have a very different
latency. Maybe I don't want these events
to change my results too much, so then the median might be a better measure than the average. We can also use the mode,
which is the most frequent value. I've really not seen a good use of it,
at least not in experiments,
but it's good to know what it is.
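The outlier argument for the median is easy to demonstrate (the latency numbers below are made up):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Median: sort a copy and take the middle element (average the two
// middle elements for an even count).
double median(std::vector<double> xs) {
    std::sort(xs.begin(), xs.end());
    std::size_t n = xs.size();
    return n % 2 ? xs[n / 2] : (xs[n / 2 - 1] + xs[n / 2]) / 2.0;
}
```

For latencies {1, 1, 1, 1, 100}, say one scheduling hiccup, the average is 20.8 but the median stays at 1.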
And with that, I'm going to skip the last one.
This is basically something on experimentation.
I'll explain this maybe at another point in time.
And then for today, do we have questions so far?
So this was a bit fast regarding the statistics.
If you want to learn more about this,
I really recommend this book,
where you would also learn
how to use statistics to calculate how many experiments we have to do in order to have a significant result.
In our case, usually the variance is low enough that we can just say, okay, we're going to do five or ten repetitions,
and the variance is low enough that we have a
significant result. If the variance is still very high and
we're comparing something, then more experiments might be needed, and how many are needed
is something we can use statistics for: it tells us what the significance of our results is.
Questions other than that?
No? Very good.
Then next week, Martin will do the lectures.
And the week after, I'm going to be back.
Please check Moodle for the programming tasks and the assignment,
and for picking up the programming task. And I'll see you in two weeks.
Thank you very much.