Hardware-Conscious Data Processing (ST 2024) - tele-TASK - Performance Management & Benchmarking (2)
Episode Date: April 16, 2024...
Transcript
Welcome to this week's Hardware Conscious Data Processing.
My name is Martin. I'm also part of the research group of Professor Rabl.
And today I will present the contents, because Professor Rabl is not in Potsdam this week.
Before we start and continue with the topics of last week, short announcements.
There will be the lecture series on HPI research this evening.
This will be here in this room.
Today, Professor Felix Naumann will present about data quality and data preparation.
So this is probably pretty interesting.
If you're not aware of this lecture series, you can find more information on this website. Okay, so where were we last week?
We are still in the introduction of the lecture. Last week, Professor Rabl started with performance analysis.
We are going to continue with this today.
And afterwards, depending on how far we get,
we are going to cover database basics.
Just a quick reminder to bring us all on the same page
with database systems.
Concerning the timeline, you can see here we are now in the second week.
And this week, we will have those two topics today, performance management and DB basics, and then tomorrow we are going to start with some probably more interesting stuff and talk about CPUs. But today, as I've said, we are still in the introduction phase.
Okay, so what did we learn last week? Last week, Professor Rabl gave an introduction to performance analysis. That means: how can we properly analyze the performance of our system,
of our algorithm, whatever you are working on.
How can we properly do that and evaluate how fast it is,
how efficient it is, how cost-effective it is,
and how to do it properly.
Before we do something like that, or before we investigate a new idea, he introduced back-of-the-envelope calculations
to give a quick estimate of whether an idea is actually worth it,
worth investing more time in,
and we talked about how to measure things
and some statistics.
Moreover, Professor Rabl presented
his famous seven-step paper and thesis recipe.
And yeah, the rest of today's talk concerning performance measurement
will be about benchmarks,
and especially how to do fair benchmarks,
which is often harder than it might appear.
Okay.
So when we talk about benchmarks, there's a pretty wide spectrum of benchmarks and different
levels of benchmarking that you would usually do when you implement an algorithm or even
an entire system.
And those categories, or levels, are pretty blurry, but they are still the levels that people usually talk about.
And the very lowest level that we have are so-called micro-benchmarks.
These benchmarks, you usually execute them on your laptop or on a simple server,
and those are very low-level, simple benchmarks.
Just to get an initial idea or, for example, to give you those numbers
that we talked about last week.
If you want to know, okay, what is a rough estimate for my back-of-the-envelope calculation? For example: how long does it take to scan 1 million items?
So if you don't know this number, or if you don't have it at hand, you would simply write a very simple benchmark, such a micro-benchmark, and then just get the minimum, maximum, average, or whatever measurement you were looking for.
This is really mostly low level, and it's probably just a couple of lines of code; very low-level, simple benchmarking.
These are the so-called micro-benchmarks.
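As a concrete, hedged illustration (this is not code from the lecture), such a scan micro-benchmark could look like the following in C++; the data size, the number of repetitions, and the reported statistics are arbitrary example choices.

```cpp
// Minimal micro-benchmark sketch: how long does it take to scan/sum 1 million integers?
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    std::vector<int32_t> data(1'000'000);
    std::iota(data.begin(), data.end(), 0);          // fill with 0, 1, 2, ...

    constexpr int kRuns = 10;
    std::vector<double> times_ms;
    volatile int64_t sink = 0;                        // keep the compiler from removing the scan

    for (int r = 0; r < kRuns; ++r) {
        auto start = std::chrono::steady_clock::now();
        int64_t sum = 0;
        for (int32_t v : data) sum += v;              // the scan we want to measure
        auto end = std::chrono::steady_clock::now();
        sink = sink + sum;
        times_ms.push_back(std::chrono::duration<double, std::milli>(end - start).count());
    }

    auto [min_it, max_it] = std::minmax_element(times_ms.begin(), times_ms.end());
    double avg = std::accumulate(times_ms.begin(), times_ms.end(), 0.0) / kRuns;
    std::cout << "min " << *min_it << " ms, max " << *max_it
              << " ms, avg " << avg << " ms\n";
}
```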
The next level above, we are talking about functional and component benchmarks.
And now we have a high level function, but still a very specific aspect of the system.
One example might be sorting.
So while the sorting we had last week, just calling std::sort, might still be considered a micro-benchmark, if we now do something like TeraSort, this is probably more like a component benchmark.
If you're not aware, TeraSort was kind of a standard benchmark for Hadoop systems, and the task was to sort one terabyte of data. So this means you're not just having four lines of code, or it might still be four lines of code, but behind it there are probably different machines being spawned, data distributed across your Hadoop cluster, and so on. So this is not something we would do on a single laptop; this involves more components of the system.
Concerning SQL: while a micro-benchmark might be just a simple scan, so you have your vector of integers and you test how fast you can sum it or look for elements smaller than 17, now we would say, okay, I might test a simple select query.
So it is very simple, but we send an entire select query to the database.
So more components are involved, for example the parsing, the optimizer, and then the actual execution of this operation.
Again, one level above, we are talking about application level
benchmarks.
And here we measure the actual system performance, but in a pretty isolated way.
So here, for example, we would use that to evaluate hardware or software, but for a very focused aspect. For example, for a large application, we would take one query of this application and test it with different configurations, standalone, with nothing else running on the machine, just to evaluate how fast this query would be.
So this would be an actual application-level benchmark. We have somewhat real data, we have the workload of this application, but still, as we often do in research, you might only want to know, okay, how fast is query 5 of this particular application?
And then this would be, for example, an application-level benchmark, which is, or might be, part of a real application.
So this is usually the end goal.
If you have a system or if you build your system,
you want to have some advantages
or you want to provide some improvements
for a real application
that is actually really running in the real world.
Ideally, whatever you do might have an impact on a real application, for example an enterprise system, a machine learning system, a search engine, whatever this might be.
Usually you don't have access to such a real system, so it's really hard.
This might be the gold standard of showing that you have an impact, but it's really hard to do something like that.
Because the more important or relevant the application is, the harder it is to access its data, or at least to publish results or talk about whatever you did with this real application.
For this reason, there are very well-known standardized benchmarks, and there are different categories of benchmarks that we will talk about later. Probably the most well-known category here are standardized benchmarks. There is the TPC council, whose purpose is to standardize, together with industry, benchmarks that somewhat reflect real-world systems. These benchmarks specify how we create data, in which form this data has to be ingested into a system, which queries then run on the ingested data, and also which metrics have to be reported for the given benchmark. There's a wide variety of such benchmarks. Then there are
non standardized benchmarks which are widely used. We'll also give you later a
few examples. So these benchmarks usually come from research projects which had pretty specific problems
and formalized or specified the benchmark for that.
And then later other people also tried to tackle the same problem.
So these benchmarks then became widely used
and accepted in the field.
And then, very often, you see very focused, small benchmarks; pretty much every second paper that you find doing performance analysis has, or often needs, its own benchmark.
Sometimes you say, okay, there's TPC-H, and there's just one little aspect that I think is really relevant, and I can show that it's relevant in the real world, but TPC-H or some other well-known benchmark does not really cover that part of the spectrum. So I adapt this benchmark or write a completely new benchmark, and maybe nobody else ever uses it, but it is sometimes simply required to have your very own benchmark.
This is basically the third category that we have.
So here we have a typical example, another example than the sorting one from last week, of a micro-benchmark. The purpose of this code is that we want to add up all the numbers of some columns in a two-dimensional array. In this case, the two-dimensional array is just flattened into one very large array. And now the question is: how would you store the data in your array? Do you store it column by column, or row by row? And depending on how you store it, accessing it in row-wise order or column-wise order can make a pretty significant difference; it depends on your hardware and so on.
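The code on the slide itself is not in the transcript, but a minimal sketch of this traversal experiment might look like the following; the array dimensions are example choices, and the actual benchmark in the lecture may differ.

```cpp
// Sketch of the 2D-array summation experiment: one flat row-major array, traversed
// either row by row (sequential accesses) or column by column (strided accesses).
#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    const size_t rows = 10'000, cols = 10'000;        // example size, roughly 400 MB of int32
    std::vector<int32_t> data(rows * cols, 1);        // element (r, c) lives at r * cols + c

    auto time_ms = [](auto&& fn) {
        auto start = std::chrono::steady_clock::now();
        fn();
        return std::chrono::duration<double, std::milli>(
            std::chrono::steady_clock::now() - start).count();
    };

    volatile int64_t sink = 0;

    double row_wise = time_ms([&] {
        int64_t sum = 0;
        for (size_t r = 0; r < rows; ++r)
            for (size_t c = 0; c < cols; ++c)
                sum += data[r * cols + c];            // consecutive addresses
        sink = sink + sum;
    });

    double col_wise = time_ms([&] {
        int64_t sum = 0;
        for (size_t c = 0; c < cols; ++c)
            for (size_t r = 0; r < rows; ++r)
                sum += data[r * cols + c];            // stride of `cols` elements per access
        sink = sink + sum;
    });

    std::cout << "row-wise: " << row_wise << " ms, column-wise: " << col_wise << " ms\n";
}
```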
So this will be an example we will later come back to
when we talk about CPUs.
But there can be a quite significant difference between accessing sequential data, so data that is stored consecutively in memory, and jumping around with a so-called strided access.
As I've said, we will come back to this experiment, but you can see the numbers here in this plot. In this case, the row-wise traversal of this 2D array is much faster (note that the y-axis is a logarithmic scale) than a column-wise traversal.
So it really makes sense to think not just about which cells I need to add up, but also about how, and in which order, I'm going to access that data.
And this is a typical example: in case you have to do something like this, say you're thinking about some way to sum up a 2D array and you have no idea how fast it is, doing such a simple micro-benchmark helps you a lot to get an initial idea, and it already gives you a good sense of which things you need to be really aware of and have to consider. But again, we will talk about this in more detail later. So, what is good about micro-benchmarks, why do we do them?
They are very focused.
Often you don't start with the idea that you're going to invent a completely new system,
but you start with a very small problem that you want to attack.
So the microbenchmark gives you the chance to have a very early idea
if this idea might work out or will give you a first good feeling about performance numbers.
So let's say you're now following the seven-step recipe: you did your first analysis, the back-of-the-envelope evaluation, and it turned out to be a good idea.
The next step can be a micro benchmark,
which shows you if it actually makes sense to do that
when it's really implemented.
So it's a very focused thing.
It's a DR in a very controllable setting.
So you are the one that controls the workload.
You say which data you want to create for this benchmark.
Is it skewed?
Is it integer?
Whatever.
You can easily set the data size and so on,
which might be much harder if you already
use a standardized benchmark, and so on and so forth.
You can change all the parameters that might impact your system and so on.
And it's usually easy to run.
There are obviously some reasons why we cannot, at least why we cannot use them alone.
So they often neglect the larger picture. Whatever you do there, or whatever performance gains you achieve, they might be totally irrelevant for the eventual performance of the system. So you might optimize something that is not relevant, or that is just a tiny part of the overall execution time. It's really easy to focus on micro-benchmarks for quite some time and lose the bigger picture; you should keep that in mind.
And it's hard to generalize the results. Okay. So another level that we talked about are application-level benchmarks.
And as I've already said, there is the so-called TPC, the Transaction Processing Performance Council.
This organization tries to standardize benchmarks. There are well-known members in this council; these are pretty much all the large hardware vendors out there as well as the database vendors. So there's Oracle in it, there's Microsoft in it, CPU vendors like Intel and AMD, VMware, and so on.
So pretty much all the big players are in this council,
and they try to come up with good standardized benchmarks
which allow to compare systems on a fair basis.
When we talk about database application-level benchmarks, there are usually two, or even more, but two very relevant categories: OLAP and OLTP.
Is anybody not aware of what these are? Okay. So OLAP, this is online analytical processing. As the name says, analytical processing: we analyze lots of data. For example, we want to have the sum of all the products sold last year, separated by region. So those are large aggregations, usually large joins. This is what is often done in so-called data warehouses. So this is not your transactional system that every single sale goes into; these are often systems where once a day you throw in all the data that you recorded over the day, and then you run all your analyses on those specialized systems, often called data warehouses.
And two very well-known benchmarks there are TPC-H and TPC-DS.
And DS stands for decision support.
So these benchmarks, these two try to mimic what people do on
those systems, on those data warehouses. Scanning and analyzing large amounts of
data. OLTP is online transaction processing, and here we have TPC-C and TPC-E as well-known examples. What we do here is that we have multiple clients running concurrently, and they do typical transaction processing.
They insert a single line, for example: this item has been sold. Later, they might update it, or they might reference it from another sale, and so on.
So they write, delete, and update a lot, usually on a very small scale. This is so-called classical transaction processing.
TPC-H and TPC-C are rather old and somewhat outdated, but they are very often used in research because they are comparatively simple. They're enough to evaluate systems in many cases and still simple enough for research systems to play around with, while TPC-DS and TPC-E are much more complex but also more realistic. So if you actually want to evaluate a new database system, then you're probably going to compare using TPC-E and so on, using the more recent benchmarks.
More recently, there are benchmarks that come with an entire kit. For example, TPC-H and TPC-DS give you a data generator that allows you to generate those data sets. You can say, okay, give me a scale factor of 100, which asks the data generator to generate 100 gigabytes of data, and then you get a lot of CSV files with data. You load that, and then there is a query generator which generates the queries for you, but it doesn't run the queries. And it turned out that more recent, more complex benchmarks are quite hard to implement, because it's not only about generating data and running queries; there is much more involved in them. So the more recent benchmarks come with entire kits.
These are the two benchmarks listed at the bottom there, the express benchmarks: TPCx-BB for big data and TPCx-AI for artificial intelligence. Here we're talking about Hadoop systems, so there are scripts that help you set up and run those benchmarks on larger systems: creating the data, having a client that actually executes the queries in a reproducible way, and so on. So these benchmarks are more complex but much easier to set up; well, not really easier, but since the setup of those benchmark systems is quite hard, the kits help you a lot, because there is just much more going on in these systems.
The AI benchmark, for example, uses Spark in the background and has a couple of Python scripts that do the machine learning part. These kits help you run all of this.
Then we have a couple of benchmarks that are not standardized, or at least they have not been standardized by a council of different vendors, but by people, for example here Patrick O'Neil, who just found: okay, there are issues that I want to work on, but there is no existing benchmark that gives me such a setup to work on.
So what this paper back then did, or what they proposed, was the Star Schema Benchmark, because they wanted to have a classical star schema, a typical schema you find in data warehouses.
They took TPC-H and just changed the schema a little bit, added or rewrote some queries, so that it still looks kind of like TPC-H, but it has a different data schema.
So if you want to work on star schemas, then this is a very well-known, very well-accepted benchmark you can use there.
Another benchmark is the CH benchmark. As the name already gives away a little bit, this benchmark combines TPC-C and TPC-H.
I would say roughly ten years ago, there was this trend of new database systems which try to combine transaction processing and analytical processing. You might have heard that Professor Plattner works a lot on this topic; there is SAP HANA, which tries to do that, MemSQL is a typical example, and the TU Munich also did a lot there. They wrote this paper, or basically invented this benchmark, after a Dagstuhl seminar.
And the purpose of those systems is that they claim you no longer need your transactional system, which is maybe Oracle, plus an additional system, which is your data warehouse; these systems want to combine that.
In the beginning, this was often all about in-memory computing, so they said, okay, now we can use the performance advantages of in-memory computing to combine those two systems into one. Of course, now we need a benchmark for that, and the CH benchmark basically rewrote the TPC-H queries to run on a schema that was written to by TPC-C clients. So there is a fixed number of TPC-C clients, clients that do a lot of inserts and updates, typical transaction processing, and then there is another client which runs frequent analytics on top.
So they combined both workloads on the same system.
In the world of key-value stores, so almost schema-less database systems, there's a very well-known benchmark, the Yahoo Cloud Serving Benchmark, YCSB.
This is the standard benchmark if you want to evaluate your key-value store.
The benchmark does typical CRUD operations: it creates data, reads, updates, and deletes.
It is relatively simple, so it's comparatively easy to implement, because it's all just on key-value stores; there are no analytics with joins and so on.
But it's still realistic and heavily used.
And yeah, this is one example of a very well-known benchmark that is also not standardized, but if you're working in the field of key-value stores, you probably can't get around benchmarking with it.
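A hedged sketch of what a YCSB-style CRUD loop could look like is shown below; the KeyValueStore type and the operation mix are hypothetical stand-ins, not the actual YCSB client, which is a Java framework with pluggable bindings.

```cpp
// Sketch of a YCSB-style CRUD workload against a key-value store.
#include <iostream>
#include <random>
#include <string>
#include <unordered_map>

struct KeyValueStore {               // toy in-memory stand-in for the system under test
    std::unordered_map<std::string, std::string> data;
    void put(const std::string& k, const std::string& v) { data[k] = v; }
    bool get(const std::string& k, std::string& v) const {
        auto it = data.find(k);
        if (it == data.end()) return false;
        v = it->second;
        return true;
    }
    void erase(const std::string& k) { data.erase(k); }
};

int main() {
    KeyValueStore store;
    std::mt19937 gen(7);
    std::uniform_int_distribution<int> key_dist(0, 9'999);
    std::uniform_int_distribution<int> op_dist(0, 99);

    size_t reads = 0, writes = 0, deletes = 0;
    for (int i = 0; i < 100'000; ++i) {
        std::string key = "user" + std::to_string(key_dist(gen));
        int op = op_dist(gen);
        if (op < 50) {                       // 50% reads (example mix)
            std::string value;
            store.get(key, value);
            ++reads;
        } else if (op < 95) {                // 45% inserts/updates
            store.put(key, "payload");
            ++writes;
        } else {                             // 5% deletes
            store.erase(key);
            ++deletes;
        }
    }
    std::cout << reads << " reads, " << writes << " writes, " << deletes << " deletes\n";
}
```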
Last but not least, the last example of these relevant but not standardized benchmarks is the join order benchmark.
It is a quite recent benchmark, and this one also came from a research problem. People found that in the real world there are often queries that join lots of tables. TPC-H, I think, only has eight tables, and the largest queries there probably join five or so of them, or maybe all eight, but they don't really have those huge join trees that you find in the real world. So they took a real-world data set, a movie data set with quite a few tables, and then they wrote queries with different numbers of joins. The queries with the most joins have, I think, 13 joins, so 13 tables.
This is already quite challenging for most systems, and the idea of this paper back then was to show that databases easily struggle with those queries, because this is real data, so it is not uniformly distributed as in some benchmarks. If you scan for certain actors, depending on which actor you scan for, you might get lots of results or only a few results. The results might not join with certain countries because this actor was only locally known, and so on. So it is quite hard for most databases to handle such queries and to really properly optimize them.
This is not something we will care about, but we still want to make you aware of
those benchmarks, that those are benchmarks that tackle
a very specific problem,
but still a really relevant problem.
And this is also widely used, because join ordering remains one of the most pressing issues in database systems, at least in analytical database systems, nowadays. A bad join order can easily impact the performance by several orders of magnitude. So yeah, it's a very relevant problem that this benchmark tries to address.
And then we have real applications as one more option for performance measurements.
In this case, you take an actual application, put it on top of your implementation or your algorithm, and then try to evaluate how well it runs.
This is actually the best thing to do, because now you can really show that your system has a real impact on the real world.
And you don't focus on problems that are artificial or that nobody really cares about or that are insignificant.
But it's really, really hard to do that.
There are many real-life applications out there, so even if we have a real one, it might not generalize well, which is kind of a problem. And then, if you are lucky enough to have access to one of those systems, you still have the problem that you usually can't talk about it. If some company gives you access to data and workloads, you usually have to sign an NDA or something, so you're somewhat limited in talking about that system. And you can't give out the database, you can't give out the data sets; you might give out the workload. But without the data set, people are not able to reproduce your results, and that makes it really hard for other people to improve on your work or just verify what you did.
So it's very much limiting what you can do.
And also, this is not scalable.
As I've said, most database benchmarks, or at least the TPC benchmarks, allow you to set a certain scale factor, which defines the number of gigabytes of data that you create.
So I can easily say: I'm going to run a scale factor of 1,000 to have roughly 1,000 gigabytes, or something even larger or smaller, which is obviously not possible if you have a real application. Because if that application only has 100 gigabytes, that's your limit, unless you artificially blow it up, but then again it might no longer be a real application.
But what we can still do here, and what has often been done, is to take an artificial setup and, using the information about what a real system looks like, try to come up with a benchmark that mimics those systems.
So we look at the data: what is the data schema, is it a snowflake schema or not, how is the data distributed, what kinds of tables do we have? And then we want a data generator that mimics exactly those properties. For example, the BigBench benchmark uses distributions from the US Census Bureau, so that's real US data, and from publicly available Amazon data sets.
So this is still realistic data, and the data generator can try to mimic those properties of real systems, which often helps a lot.
Because one of the main problems that has been found, for example, is with TPC-H, which was, well, maybe not the first, but the first really well-known analytical benchmark with real impact.
People found that its data is all uniformly distributed, which we talked about last week, I think, right? That is nice and easy to handle. So working with uniformly distributed data is nice, but data in the real world is pretty much never uniformly distributed; that's rarely the case.
So taking those more realistic data distributions and working on skewed data sets helps a lot to make your system, or whatever you're implementing, robust and prepared for the real world.
And if you mimic just the real data, you obviously have no problem with putting your data online; you don't have to think about privacy, scalability, and so on. So taking this route has a lot of advantages.
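To make the contrast between uniform and skewed data concrete, here is a hedged sketch of how one might generate Zipf-distributed keys for a benchmark data set; the skew parameter, domain size, and data volume are example choices, not values from the lecture.

```cpp
// Sketch: generating skewed (Zipf-distributed) keys instead of uniform ones.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

int main() {
    const uint32_t domain = 1'000;     // number of distinct keys (example choice)
    const double   s      = 1.2;       // skew parameter; larger means more skew
    const size_t   n      = 1'000'000; // number of values to generate

    // Build the cumulative distribution with P(key = k) proportional to 1 / k^s.
    std::vector<double> cdf(domain);
    double norm = 0.0;
    for (uint32_t k = 1; k <= domain; ++k) norm += 1.0 / std::pow(k, s);
    double acc = 0.0;
    for (uint32_t k = 1; k <= domain; ++k) {
        acc += (1.0 / std::pow(k, s)) / norm;
        cdf[k - 1] = acc;
    }
    cdf.back() = 1.0;                  // guard against floating-point rounding

    std::mt19937 gen(42);
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    std::vector<uint32_t> keys(n);
    for (auto& key : keys) {
        // Inverse-CDF sampling: first key whose cumulative probability reaches u.
        double u = uni(gen);
        key = uint32_t(std::lower_bound(cdf.begin(), cdf.end(), u) - cdf.begin()) + 1;
    }

    std::cout << "generated " << keys.size() << " skewed keys, first key: " << keys[0] << "\n";
}
```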
And let's say that at some point you will write a paper or a thesis. Then, ideally, you cover all of these aspects. That might not always be possible, but ideally it's really good to, for example, show top-down that you can have a real impact. It might just be a tiny, tiny thing for a real application, but you show for a real application that whatever your idea is has a real impact: you improve the performance, you improve efficiency, or you reduce costs, and so on.
And then, if you want to analyze where the advantages come from, you usually have to dive deeper and run very specific benchmarks: selected benchmarks, for example application-level benchmarks only running single queries, and, to investigate where the time is going, you probably also need to run micro-benchmarks. So again, a good thesis or a good paper usually has all these flavors of benchmarks.
When you compare your work to other systems, which you should always do, because otherwise it's hard to evaluate what you actually achieved, you should ensure that all the systems have equal functionality, and that everybody, you and others, is able to reproduce the performance numbers.
So please keep that in mind.
That's highly relevant.
And if you do something like that, feel free to ask the other authors. So if you want to improve on some work that was published a couple of years ago, just try to write an email to the original authors.
Many of them might no longer work there because they have finished their PhD, but at least the professor or some colleagues might still be around, and they might help you get the system running. Nowadays, more and more papers publish their code on GitHub; try to find out how they set up the system, so that you have a really proper setup for their work and are able to make a fair comparison.
There is actually work on this, and there have been a lot of discussions about it, because fair benchmarking, really comparing different systems against each other, is and remains challenging. Probably the best example, or the best talk about it, was by the group from Amsterdam.
They had a talk and a paper called "Fair Benchmarking Considered Difficult", and they talked about those challenges.
They talked about how bad the current situation actually is.
What they found is that there are many, many problems in database benchmarking. There are a lot of papers that compare and evaluate performance, but many papers don't evaluate fairly, or don't compare against other systems at all, which is obviously a problem. And even when people do compare against other systems, those results are often not really reliable and reproducible. It has changed a little bit since, and there is a lot of work on this, but back in 2018 the paper complained that there were a lot of white papers online that are very much misleading. Does anyone not know what a white paper is?
So these are basically papers put online by companies. This is more like marketing, right? This is not something that reviewers have read, so it has not been peer-reviewed: some company writes a 10-page report and puts it online, and obviously the numbers are never going to be bad, because this is how you market your own product. It might be pretty obvious that white papers are not always the best way to go, or not always that trustworthy, but even in academia there were many numbers in papers that are simply not reproducible, so it's really hard to evaluate and trust these numbers. And the interesting thing was that more and more people published their results, so there were a lot of results and many, many comparisons, which at first sounds like a good idea, but only few of them have been really useful.
Many people tried to reproduce certain papers and found it really hard to reproduce them.
So there has been a lot of work on this topic.
There have been a lot of discussions and things are improving.
We'll talk about initiatives in a minute.
But it's still a problem: just having many results that you plot and many benchmarks that you run doesn't necessarily mean that the results are really valuable and that people can trust them.
And one reason for that are so-called benchmarking games: basically, you are gaming the system. What p-hacking is in some research fields, benchmark gaming is in benchmarking. So you try to find configurations and settings that just work out best for your system.
This can be, for example, different configurations: you compare your tuned configuration of Postgres against vanilla Postgres, so you tweak some configuration knobs for your system but don't do it for the other system that you compare against. You might hardwire configurations, which can be fine if you are very transparent about it, because you might not be able to automate certain optimizations, so you just have to state which optimizations you apply. But if you hardwire them and don't tell anyone, and the other systems that you compare against don't have these optimizations, then it's obviously not a fair comparison.
Also, the specification of your benchmark might be totally biased towards your system, which often just happens: you start with a problem, you try to come up with a benchmark, and you don't really realize from the very beginning that, because you wanted to show this specific problem, the setup is just highly biased. So you have to be aware of that, and maybe go one step back and think about, okay, how can I generalize my problem, for example, to have less bias there.
Another issue that the group found is synchronized workload queues.
What some systems did was have clients running and issuing queries, but using a synchronized workload queue. That means the system under test was probably never really fully loaded; there was some synchronization going on. And especially if you want to evaluate the effects for the real world, you are never going to have the situation where one query neatly comes after another. There might be peaks where lots of clients approach the system and run queries at the very same time, and then there might be times when nothing happens. So a synchronized workload queue might be something that does not make for a really fair comparison.
Then, obviously, arbitrary workloads are not a good thing to have, and very small benchmarks are often a problem. There are still papers out there which use, for example, TPC-H with a scale factor of one and then run benchmarks on that. That is not bad in general, because TPC-H is a well-known benchmark, so if you run all the TPC-H queries, it's probably a good thing to do, and it is standardized.
So if you run this on Postgres or whatever database, it at first looks like a pretty decent comparison. But if you look at how much data is actually touched by TPC-H and you have a large enough server, it might turn out that pretty much all the data for scale factor one is in your caches. Depending on the setup, you might have a large buffer cache in the database, so either all the data is in the buffer cache and you never really access the disk, or, if you keep your data in memory, much of the data might even be in the CPU caches. So you are not really benchmarking anything valuable, at least not for larger systems.
So this might also be a problem.
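As a rough, hedged back-of-the-envelope illustration with assumed numbers: scale factor 1 corresponds to roughly one gigabyte of raw data, with the lineitem table making up the largest part of it. If the database is given a buffer cache of, say, a few gigabytes, the whole data set fits in the buffer cache after the first run, so later runs never touch the disk, and the columns a query actually scans may only be a few hundred megabytes, which a modern server keeps close to the CPU anyway. So such a measurement mostly tells you something about in-memory scan speed, not about the behavior of the full system on realistic data volumes.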
And the root cause for those benchmarking games is that research often tries to have, on the first page of a paper, one figure that gives away that we are good, right? So the paper makes sense: here's one figure where we compared the performance of our system with their system, and we would like to show that we are just much better. You want to have that on the first page, and if you don't have that in your paper, then you don't get accepted, which is obviously a problem for you as a researcher, and the same goes for products. So you have to have this plot.
And this plot might be okay-ish, right, if you are pretty transparent about it. If you say, okay, for this particular query, with these five very specific knobs, we can improve performance by, let's say, a factor of 20, that can be fair if you later, very transparently, explain: okay, this might not always be the case, here is where we are not faster, and so on. So this can work out. But often, if you see those plots, especially such a simple plot, it is probably kind of fishy, and maybe the authors are trying to hide something, or the comparison is not really that fair and good.
So what are the most common pitfalls when we talk about database benchmarks?
First of all, there's non-reproducibility. That's definitely a problem, but one that is being addressed more and more, as we'll talk about in a minute.
Then failure to optimize.
We have apples to oranges comparisons.
Incorrect results, obviously.
So the black ones we are going to talk about
a little bit more in detail in a minute.
And then what can also be a problem are cold runs versus hot or warm runs. Does anybody not know what those are, what we mean by that? What is a cold run? Anybody?
A cold run is when I directly start a system and it hasn't really done anything yet, and then I...
Exactly.
So you might just start up the system and execute the first query. That is fair if you do all your benchmarks like that; it's a totally fair comparison, because often systems are cold. But the differences can be quite large if you compare that to a system that is not cold, that is, a system that is warm or hot. That means the caches are filled; the query has maybe already been optimized and its plan cached, so we don't have to optimize the query plan again; all the caches are filled with data that's actually relevant; and so on.
So this comparison can be really unfair. So at least you have to make sure, if you compare two systems, that either both are cold or both are hot, so they are in the same state, or a comparable state.
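A minimal sketch of how a benchmark driver might keep cold and warm runs apart is given below; run_query() is a hypothetical placeholder for whatever the system under test executes, and the number of warm runs is an example choice.

```cpp
// Sketch: report the cold (first) run separately from the warm runs instead of mixing them.
#include <chrono>
#include <iostream>
#include <vector>

// Placeholder for the actual system under test; a real harness would send the
// query to the database and wait for the result.
double run_query() {
    double x = 0.0;
    for (int i = 0; i < 1'000'000; ++i) x += i * 0.5;  // dummy work standing in for execution
    return x;
}

double timed_run_ms(double& result) {
    auto start = std::chrono::steady_clock::now();
    result = run_query();
    return std::chrono::duration<double, std::milli>(
        std::chrono::steady_clock::now() - start).count();
}

int main() {
    double result = 0.0;

    // First execution after startup: caches empty, no cached plan. Report it separately.
    double cold_ms = timed_run_ms(result);

    // Subsequent executions: buffers and plan caches are warm.
    std::vector<double> warm_ms;
    for (int i = 0; i < 5; ++i) warm_ms.push_back(timed_run_ms(result));

    std::cout << "cold: " << cold_ms << " ms, warm:";
    for (double w : warm_ms) std::cout << " " << w << " ms";
    std::cout << "\n";
}
```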
And then there's data pre-processing: you have to be very transparent about it. Did you do any data pre-processing? Ideally, if that is something the system does automatically, also show how it runs if it doesn't pre-process.
And then another problem is overly specific tuning, when you just tune for a very small aspect.
So, non-reproducibility. This has been a very frequent problem, and it is being tackled more and more. Often there's no code available. This is still very often the case for industry papers, because they can't provide the code of their product, but even in research it still happens that people publish papers with no code available, which is obviously a problem. And there are still very few consequences, so such papers might still be accepted.
There are efforts to counter that. One well-known effort, one of the biggest in database systems, is the SIGMOD reproducibility awards: there are awards for who did this best,
and there are different levels of reproducibility.
For example, "Artifacts Available", this is the green badge here. It is put on your paper if you make your artifacts available. Here the reviewers, or a dedicated reproducibility chair, look at your GitHub repository, for example, and check: are the things you talk about actually available? Is the data there, is the code there, is there a makefile? Is the code of your paper available? Can I check out the Git repository and, for example, execute the makefile?
So there is a little bit of effort put into it, but they only check that the artifacts are available.
Much better, for example, is this blue badge, which means that the results have been replicated.
This means that somebody really invested all the effort to try to replicate your results, and this can be a lot of work for both sides. So people try to get better and better at this. This might mean, for example, that you give out a Docker image or something, so that it's super simple to just pull an image, or whatever kind of code distribution you use, and then with a single click I can say: please, now, on this machine, run this entire code. And ideally the paper itself comes out at the end, or at least the plots are plotted, and I can see, okay, the plots look somewhat the same. I might not have the same machine, but the plots do show that the code really did what I expected it to do. Or it might even produce the entire paper: so there's the paper, all the LaTeX is built, the plots are completely reproduced and redone, and everything looks fine. So "results have been replicated" is basically the best level you can reach there.
Obviously, it's sometimes not possible: if you are working on hardware that not everybody has access to, you might not be able to give the reviewer access to it, so in some cases it's simply not possible to replicate your results. The same holds if you have data that nobody is allowed to access.
But if it's possible, you should strive for that.
Okay, here we have an issue.
Here we have a benchmark that compares three or four systems with each other.
In the first plot on the left, you see that they compare the median runtime of MariaDB and Postgres, showing that Postgres is a little bit faster than MariaDB.
In the middle plot, you can see that Postgres is a little bit slower than SQLite.
And on the right side, you see MariaDB Star, which is the authors' optimized version, and now it's much faster than SQLite, so it's the fastest database in this comparison.
Does anybody have an idea what might be the crime here? It's not the plot per se, so there's no trick in the plot, but does anybody have an idea what the authors might have done here which makes it not a fair comparison? Or do you have an idea what you could do to improve the performance of, let's say, MariaDB? No guesses?
Okay.
So it turned out that these were the same configuration parameters here. But that would be something that you could easily hide and that has a vast impact on MariaDB, Postgres, and so on. For example, you can usually tune the buffer size: some systems are limited by their buffers, so they don't keep all the data in memory, and if you just increase this buffer size and allow all the data to be in memory, this usually improves performance quite significantly. So if you had different configuration parameters, that might be benchmark gaming, but that has not been the case here.
Compilation flags. All these databases are open source, so you can usually just check them out and build them on your own.
Something else that is easy to tune, and that is not obvious in those comparisons, is that you might use different compilation flags. So you might compile all the other databases, just not yours, with debug flags or only with -O1, and only your database uses all the optimization flags, so -O3 or similar. This would obviously also be a very unfair comparison that you would never see in the plot; it is just hidden, and this would obviously be an unfair practice.
But this has also not been done in this case. Indexing could be a problem: you could just add an index. In this case, that would probably not help for this query, but you could add indexes for one database and not the others, which might help one database.
And it was the same version number here, which is another thing that you have to keep in mind, or that you have to be aware of, because people often start doing their work on, for example, Postgres. Your research is usually not done within two weeks, so you might work for one or two years on Postgres, and in the meanwhile Postgres gets two new major versions, so you end up comparing different Postgres versions against each other. You have to be very transparent about that; it can be a problem.
But what actually was the issue here is that a different schema was used: the authors, for this MariaDB Star, used double columns instead of decimal. That probably doesn't sound like much of a change, and it's allowed; TPC-H allows you to do that, it does not require an actual decimal type, for example, for dollar values. But as you probably know, having doubles for such decimals can be problematic: at some point, you will have rounding errors in your system.
And here, for benchmark comparisons, the issue is that double arithmetic is natively supported and optimized by the CPU; the CPU knows how to do that, and there's special hardware for it. For decimal, a little bit more work needs to be done, because it's simply not a native data type.
So it's simply not a fair comparison in that case. This would be benchmark gaming.
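To make the double-versus-decimal point concrete, here is a hedged sketch contrasting a plain double sum with a fixed-point sum over integer cents, which is one common way decimal values are implemented; the values and counts are arbitrary, not taken from the benchmark.

```cpp
// Sketch: summing monetary values as doubles vs. as fixed-point integers (cents).
// Doubles are handled natively by the FPU and are fast but accumulate rounding error;
// a decimal/fixed-point representation needs extra work but stays exact.
#include <cstdint>
#include <iomanip>
#include <iostream>

int main() {
    const int n = 10'000'000;

    double  double_sum = 0.0;
    int64_t cents_sum  = 0;          // fixed-point: 1 unit = 1 cent

    for (int i = 0; i < n; ++i) {
        double_sum += 0.10;           // 10 cents as a binary double: not exactly representable
        cents_sum  += 10;             // 10 cents, exact
    }

    std::cout << std::setprecision(17)
              << "double sum: " << double_sum << "\n"          // drifts away from 1,000,000
              << "exact sum:  " << cents_sum / 100.0 << "\n";  // exactly 1,000,000
}
```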
And the problem is that there is usually a low incentive to optimize the competition: if you just want to show that you are faster than another system, why should you invest days just to optimize their performance? Which is obviously problematic.
So, one thing we have already mentioned is the compiler optimizations that you can use. You should use the same compiler version, the same flags, and the same configuration. This is the shared-buffers setting that we list here; the cache size has a pretty large impact, at least for Postgres and most disk-resident database systems.
And the easiest fix, which not always, but very often works, is just to involve the competition. You can check out their Git repositories if they have something open source, where you can see how they benchmark their system, and just copy, or at least look at, their flags: how do they set up the system, which makefile flags or makefiles do they use to compile it? Whatever they use, it is probably a fair assumption that this is already a good configuration.
And then, if you have a script, an automated Python or bash script, for example, that produces the main comparison plot, maybe just write to one of the authors and ask if they think it's fair. Usually they are quite happy to answer, because they don't want to see their paper in an unfair comparison, so there's usually some interest in fixing those things. So just write to them and ask them; it usually works out, but it can be a lot of work. Sometimes people might complain that the entire comparison you do is unfair and you need to redo it, and this might be a lot of work.
Yeah, here in these two plots you can see the configuration impact for Postgres: we compared two Postgres versions with different configurations, and you can see that there is almost a factor of two that you can gain here with different configurations when running this TPC-H query.
And for MariaDB with different compilation flags, there is also a significant performance difference between using and not using those compilation flags.
Another problem is apples-versus-oranges comparisons.
This is, for example, always the case if you compare yourself against a full system. Let's say you implement a join: you have an idea how to do a join very efficiently, and you implement it. You do that in memory, in C++, on your laptop, and you just do the join; that is all you do. And then you compare it against running the same join in, let's say, Postgres.
First of all, it's probably not comparable, because Postgres is not really optimized for large joins; Postgres is disk-resident, so maybe data needs to be read from disk, which you might not do.
But even if you use a more recent analytical database, let's say DuckDB, which is in-memory and optimized for analytics, even then this might be an unfair comparison, because larger systems have to do a lot of work on top of just the join. They do transaction processing, so unless you deactivate that, they run transactions; there's a lot more going on than just this single join. There might be overflow checking which they have to do, so they have to check whether the sum might overflow. You probably don't do that in your micro-benchmark, where you just want to evaluate the performance of your join and might not even care about it, but this obviously makes the comparison unfair.
So the only fix here is actually to implement your code in a real system, which is often hard, but probably the only way to have a really fair comparison.
And since there are systems like Postgres and DuckDB, open-source systems that are well understood and also used by different universities, it makes sense to do that, because there's knowledge out there on how to do it, and it's often the best way to go.
Here, in this case, we have another extreme. Here, MonetDB, which is an in-memory database for analytics, is compared against TimDB. But TimDB is not actually a database; TimDB is just one function in C that does exactly one query. So there's no SQL being parsed, being optimized, being executed by different operators; there's just one tight loop that does the whole query, just a couple of lines of C code.
So obviously this is a very unfair comparison, and you should never claim that you are faster than MonetDB by whatever factor here, because this is simply not a comparison that can be done in a fair way.
But still, it is really interesting to have those numbers.
This would probably be more like a micro-benchmark, because it covers just a very simple aspect, even though this query does a little bit more, with aggregations and filtering. But it's still really interesting to have this query, or this implementation, because it can serve as an upper bound: if you have this optimized C code, which has been written just for query 1, then this might be the best you can achieve on your system.
So, for example, if you try to build a database that compiles queries, that at runtime produces code just for a specific query, this might still be helpful for you to see, okay, how far can we get. But just this comparison, without further information, is obviously not helpful and not fair.
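To give an idea of what such a hand-written, query-specific function can look like, here is a hedged sketch in the spirit of a hand-coded TPC-H query 1; the struct layout, the placeholder data, and the aggregates are simplified illustrations, not the actual TimDB code.

```cpp
// Sketch of a "query as one C/C++ function": a tight loop that computes a simplified
// TPC-H-Q1-style aggregation directly over an in-memory array.
// No parser, no optimizer, no operators; this is why it can serve as an upper bound.
#include <iostream>
#include <vector>

struct Lineitem {                 // simplified row layout (not the full TPC-H schema)
    char   returnflag;
    char   linestatus;
    double quantity;
    double extendedprice;
};

int main() {
    std::vector<Lineitem> lineitem = {
        {'A', 'F', 17, 1000.0}, {'N', 'O', 36, 2500.0}, {'A', 'F', 8, 400.0},
    };  // placeholder data; a real run would load millions of generated rows

    // Group by (returnflag, linestatus); both are single characters, so a small
    // fixed-size array indexed by the two bytes replaces a hash-based group-by.
    static double sum_qty[256][256]   = {};
    static double sum_price[256][256] = {};

    for (const auto& l : lineitem) {                       // the entire "query plan"
        auto f = static_cast<unsigned char>(l.returnflag);
        auto s = static_cast<unsigned char>(l.linestatus);
        sum_qty[f][s]   += l.quantity;
        sum_price[f][s] += l.extendedprice;
    }

    std::cout << "group (A,F): sum_qty=" << sum_qty['A']['F']
              << " sum_price=" << sum_price['A']['F'] << "\n";
}
```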
And then, there is something that will always hit you at some point, I think.
Sometimes your system might be surprisingly fast, really extremely fast, because you by accident throw away data, and scanning no data is pretty much always faster than scanning actual data. So you should always check your results. This happens a lot, and it can happen easily.
So keep in mind: always check your results.
Have something in your pipeline when you do your benchmarks.
For a simple paper, you might not have a fully fledged CI pipeline, but try to have something that just checks that whatever you do yields the same results as, for example, Postgres does.
It doesn't need to be the same results for scale factor 1000, because Postgres maybe runs forever on that, but have some sanity checks that verify that your results are actually correct.
This happens more often than you might think.
And obviously, if this is your implementation for query 1 of TPC-H and even TimDB can't beat it, yeah, please check that.
And run with different benchmarks and data sets too.
So whatever you do, if you're running a join, check that it also returns the correct results if you change the data set, if you have another data set, if you have another benchmark, and so on.
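A hedged sketch of such a sanity check is shown below: compare your system's results against a reference result, for example one exported from Postgres into a CSV file. The file names, format, and tolerance are example choices.

```cpp
// Sketch: sanity-check your benchmark results against a reference system.
// reference.csv / mine.csv are hypothetical exports, one numeric result per line.
#include <cmath>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

std::vector<double> load(const std::string& path) {
    std::ifstream in(path);
    std::vector<double> values;
    for (std::string line; std::getline(in, line); )
        values.push_back(std::stod(line));
    return values;
}

int main() {
    auto reference = load("reference.csv");  // e.g. exported from Postgres
    auto mine      = load("mine.csv");       // results of the system under test

    if (reference.size() != mine.size()) {
        std::cerr << "row count differs: " << reference.size()
                  << " vs " << mine.size() << "\n";
        return 1;
    }
    for (size_t i = 0; i < reference.size(); ++i) {
        if (std::fabs(reference[i] - mine[i]) > 1e-6) {   // example tolerance
            std::cerr << "mismatch in row " << i << ": "
                      << reference[i] << " vs " << mine[i] << "\n";
            return 1;
        }
    }
    std::cout << "results match the reference\n";
}
```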
This slide lists a few pitfalls to check when you write or review, and we are also by no means immune to them. You don't have to read them now, but this might be helpful for you in the future, to check all the boxes: did you check for that, did you talk about that, for example when you write your thesis later.
The second point is about reproducibility: you need to state the actual hardware that you used. Even today you find papers that just don't say what CPU they used. How did you tune the database parameters? Which database version? Did you compile it, or did you pull the binaries? How did you compile it, which optimization flags did you use? And then correctness.
So this is a very helpful overview; please keep it in mind whenever you, in the near future, write your master's thesis. It is a good checklist to make sure that you are benchmarking properly.
So summary so far.
This has been the introduction to performance analysis.
We talked about back of the envelope calculations.
They are extremely helpful to get a first idea about the performance of your idea or
the viability of your idea.
So if you can't show on the back-of-the-envelope calculation that your idea makes sense or improves over the state of the art, then you have to think more about it; then you have to go back to the drawing board. So this is the first very good filter to find out which ideas actually don't make sense, and it helps a lot.
And the numbers: for most things you can find pretty good estimates online; we listed a few of them. How fast your CPU is at scanning, or whatever GPU you assume, how many threads you have. You can use those numbers; just being in the right ballpark, calculate that, and then you probably have a pretty good estimate of whether it makes sense.
It can still be wrong, but it's a good filter.
Then we talked about measurements and some statistics: how to properly measure, and which statistics you should collect.
So don't always go only with the average. Depending on what you do: how many runs should you do, and should you also report the min and max? Is there a lot of noise in your results? Then it might make sense to report those as well. So usually you should always report them, but how far you go depends on your actual problem at hand. Think about that and be very transparent about it.
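As a small, hedged illustration of reporting more than just the average, a sketch like the following could be used; the run times in it are placeholders.

```cpp
// Sketch: report min, max, mean, and median instead of only the average.
#include <algorithm>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    // Placeholder measurements in milliseconds; in practice these come from your runs.
    std::vector<double> runs = {12.1, 11.9, 12.3, 35.7, 12.0, 12.2};

    std::sort(runs.begin(), runs.end());
    double mean   = std::accumulate(runs.begin(), runs.end(), 0.0) / runs.size();
    double median = runs.size() % 2 == 1
                        ? runs[runs.size() / 2]
                        : (runs[runs.size() / 2 - 1] + runs[runs.size() / 2]) / 2.0;

    std::cout << "min "     << runs.front() << " ms, max "    << runs.back()
              << " ms, mean " << mean        << " ms, median " << median << " ms\n";
    // The single outlier (35.7 ms) pulls the mean up noticeably while the median stays
    // robust, which is exactly why reporting only the average can be misleading.
}
```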
Then we talked today about benchmarks. There are many different benchmarks out there. Pick those that make sense for you to run. There are probably existing benchmarks that you can use, and if you have to write your own benchmark, you should have a good reason for that. It often makes sense to have your own benchmark if you are focusing on a very specific aspect that is not covered by other benchmarks, but then you should still show that your approach also works on standard benchmarks, even though they might not trigger your actual improvements.
And we talked about how to do fair benchmarks so that you don't end up with an apples versus
oranges comparison.
Any questions for now?
Okay, so then we will have a short break, and then we'll talk about, or at least start today with, the database basics.
Let's say a five-minute break.
Yeah, as I've said, now we're going to talk about database basics.
This should be mostly known to all of you; we're just going to repeat it so that everybody is on the same page and we use the same vocabulary throughout this lecture. Okay, so we are not going to talk about all of these things today in the remaining minutes.
I'll probably stop a little bit early, but today and tomorrow we're going to talk about the relational data model and SQL. You are probably aware of pretty much all of that, I guess.
We are talking about the classical architecture, so those layers that we usually have in a database architecture, and about indexes. And we briefly talk about query processing, so how a database processes queries. It's mostly a refresher. So if you have any questions, or if this is something you haven't heard about at all, you should probably approach us, because we expect that you know certain things about it. At least afterwards, you should all know, or you probably already know, what a join is, and we're still going to talk a little bit about it. But please tell us if this is totally new to you; that shouldn't be the case.
Okay, so there have been quite a few different data models, in the 70s and so on, but in the end the world agreed on the relational data model.
This is the model that pretty much all major database systems use today. So we put our data into relations.
Mostly those are tables in our databases nowadays, and you can see a simple example of such a table here.
Those things have a schema, so they have a structure. We are not talking about schema-less databases, NoSQL systems, or key-value stores; we are talking about relational databases. There are relational databases that also have nested data structures, but we keep it simple for now; the classical relational model does not have something like that, so for nested things you would have multiple tables that reference each other.
In this example here, we have a simple table which is called employee. It has five columns, or attributes. The first one might be the key, the identifier; this is the p_id. Then we have a given name, a last name, an age, and an address. So we have five columns, and sometimes a column is called an attribute, so attribute means the same as column. And, for example, we have marked Stefanie Meyer here; you can call that a row, or sometimes it's called a tuple. In the end, it's the same idea. And how this is stored is totally up to the database.
This is the view that the database presents to you: it allows you to access this table and to run queries on it. But the database decides whether it wants to store each row on its own, or each column on its own, or it might even do something in between; there are also systems that do hybrid approaches. This is usually independent of, so disconnected from, the relational model; it is how the database implements the relational model.
And yeah, we have columns, rows, attributes, and then we usually have what is known as the cardinality, which is four here. The cardinality describes the size of the table, not the width, but how many tuples we have. And then we sometimes have what is called the arity, which is five here, for the width, so we have five columns here.
And just a short question: do you see a problem with this table? Let's say you, as a large company, want to store your employees and their data in a relational table, let's say a physical table; do you see a problem with that?
There must be somebody.
Would you ever store the age like that?
Yeah?
Yeah, right.
So it doesn't make sense.
So this is just a simplified example, which you find in pretty much every database introduction course, but please never store an age like that; store a birth date instead. Maybe have a view on top of your table that does the age calculation. That probably makes a lot of sense, because it's easier for the user to get the age already calculated, but don't store the age itself. So age is something you should never store directly in a database.
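To make that concrete, here is a minimal sketch; the table and column names are just illustrative, and the date arithmetic uses PostgreSQL-style functions, so other systems spell this differently:

  -- Store the birth date, not the age (names are illustrative).
  CREATE TABLE employee (
      p_id       INTEGER PRIMARY KEY,
      given_name VARCHAR(100),
      last_name  VARCHAR(100),
      birth_date DATE,
      address    VARCHAR(200)
  );

  -- Derive the age on demand through a view.
  CREATE VIEW employee_with_age AS
  SELECT p_id,
         given_name,
         last_name,
         date_part('year', age(current_date, birth_date)) AS age,
         address
  FROM   employee;

This way the stored data never goes stale, and users who want the age simply query the view.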
And for each column, we also have data types.
And that's really important.
This is an important implementation aspect:
which data types you support and which data types you use.
And especially in research systems, you sometimes see systems that only support integers because
it's just much easier to handle.
But there's a huge difference between a column having strings or integers.
Strings are much harder to handle. We don't even have to go into collations
and different formats of how to store strings.
Strings are just much heavier to handle.
So be aware, coming back to the comparison topic: whenever you compare systems, try to also be fair in that regard and use comparable or the same data types, because there's a huge difference. Not only in terms of size, so using a BIGINT is different from a SMALLINT and so on, but there are many aspects of the data types that have a big impact on system performance.
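Just as a hedged illustration of how different the declarations can look for the same logical column (the table names are made up, and the byte sizes are the typical ones, they can differ per system):

  -- Three ways to declare an id-like column.
  CREATE TABLE t_small  (val SMALLINT);       -- typically 2 bytes
  CREATE TABLE t_big    (val BIGINT);         -- typically 8 bytes
  CREATE TABLE t_string (val VARCHAR(20));    -- variable length, needs parsing and collation rules

A scan over the VARCHAR variant touches more bytes and does more work per value than a scan over the SMALLINT variant, which is exactly why mixing such types across systems makes a comparison unfair.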
The standard way to access relational database systems, for decades now, is the Structured Query Language, SQL, often pronounced sequel.
There are many different standards. The latest one is, I think, from last year, SQL:2023.
And the main idea or the key feature of SQL is that it's declarative.
So you don't state how to execute a query.
You just state what you want to have: what should your result be? For example, you say that you want the results from two tables combined via a predicate on those tables, so there's a join. You don't need to say which table should be processed first, or, if there are filters, whether the filters run before or after the join, or which table is scanned first. You just say: I want these tables filtered and joined, and maybe I want an aggregate. And then it's the task of the database to come up with a query plan that executes your query.
And this is something that you don't have to care about.
And this is a main feature of all the modern database systems
because that's basically the secret sauce of the performance. Writing performant code for large datasets that might be distributed over multiple nodes would be extremely hard to get right if there weren't an optimizer doing that for you. So this is where much of the performance of modern database systems usually comes from.
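To make the declarative idea a bit more concrete, a small example with made-up table and column names:

  -- We only describe the desired result, not how to compute it.
  SELECT e.last_name, d.name
  FROM   employee e
  JOIN   department d ON e.dept_id = d.d_id
  WHERE  e.birth_date < DATE '1990-01-01';

  -- Whether the filter runs before the join, which table is scanned first,
  -- and whether an index is used is entirely up to the query optimizer.

The same query can therefore be executed very differently depending on data sizes, available indexes, and so on.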
And we have four basic commands. This is what we usually call CRUD: create, read, update, delete, or in more SQL-ish terms insert, select, update, and delete. On top of that there is the DDL, the data definition language, and the DML, the data manipulation language. With the definition language we can define the schema: create a table with columns of particular data types, with default values, with certain constraints on them or not. And with the manipulation language we modify data, for example update a tuple or add a row; schema changes such as adding or removing a column again belong to the definition side. There are also other sub-languages that we are not going into in more detail right now.
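A purely illustrative sketch of the two sides, reusing the employee table sketched above (the concrete values are made up):

  -- DDL: change the schema, for example add a column.
  ALTER TABLE employee ADD COLUMN department VARCHAR(100);

  -- DML: the CRUD operations on the data itself.
  INSERT INTO employee (p_id, given_name, last_name, birth_date, address)
  VALUES (1, 'Stefanie', 'Meyer', DATE '1990-05-01', 'Potsdam');
  SELECT given_name, last_name FROM employee WHERE p_id = 1;
  UPDATE employee SET address = 'Berlin'  WHERE p_id = 1;
  DELETE FROM employee WHERE p_id = 1;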
The most common form, which you will see again and again, is this form of a select query. I probably don't have to go into more details, but this is the shape you usually always see: we select from a number of tables. There might be a join involved if there are multiple tables, or a union. We have predicates, so we can filter the query. We can group by, so we get different result sets for different groups. HAVING is a predicate on a group by, so maybe you want to include only certain groups that qualify for a certain predicate. And then you can also order by, just to have a nice order of the results in the end.
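As a concrete, purely illustrative example that uses all of these clauses (the tables and columns are made up):

  SELECT   d.name, COUNT(*) AS num_employees
  FROM     employee e
  JOIN     department d ON e.dept_id = d.d_id
  WHERE    e.birth_date >= DATE '1980-01-01'
  GROUP BY d.name
  HAVING   COUNT(*) > 10
  ORDER BY num_employees DESC;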
just to have a nice order of the results in the end. And so the classical architecture which is still true today, so yeah it's still
today, you can see this five layers here that we have defined in pretty much
every database at least in the relational databases in some form. So on
the very bottom there's the operating system that
gives you the memory management so that you don't have to... it provides you just
this is the main operating system. So it provides you with file
management, this does your memory translation, so this manages
memory for you, virtual memory and so on. So this is the usual use.
There's a lot to discuss, which we won't do right now, about where databases try to avoid the operating system, because sometimes the operating system can be in your way when you are a database system: you want to use the machine to the full extent, and the operating system might think it would be fairer to run something else. So there is sometimes a conflict between databases and operating systems, but mostly the database system sits on top of the operating system and uses what it provides.
One level above, there is something called a buffer manager. Even most modern in-memory databases have a buffer manager, because at some point you don't want to have all your data in memory, but only the data that is frequently accessed or accessed right now, and data that is never touched can remain on disk. So the buffer manager is responsible for moving data between memory and disk.
And the idea here is that the third layer, the memory structure layer, which says how records are stored, ideally does not need to know how data is persisted on disk. It says: for me, records are stored in a vector, and now, dear buffer manager, please give me access to this vector. And now it's the job of the buffer manager to say: okay, I might need to read this from disk, I might need to pin it in memory, or maybe it's already there, so I can immediately return it. The third layer does not need to care which disk it was on, where it was stored, or how it might be striped for performance. It just says: please give me now this kind of data of this size. And the buffer manager returns that for you and does all the handling in the background, so you can separate concerns here. The memory structure layer, in turn, says how data is actually stored: how physical records are stored and in which format, and it manages the indexes that you might use.
that you might use. It does locking so if there are updates, concurrent updates, you usually need some way of locking records.
And if you want to be persistent, which is a nice feature of a database in case of crashes, you also want to log data and be able to recover in case something happens. And something will always happen.
Above that layer, there's the logical access layer. This is how you work on your records. For example, this does the sorting and the transaction management, so it coordinates what happens if we have two concurrent users: one is writing an item while the other one is reading the exact same table, and how can we avoid conflicts, and so on. It manages cursors and also has a data dictionary, or catalog, which keeps information about what tables we have, what is visible and what is not, and so on and so forth.
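As a rough illustration of what the transaction management takes care of (again using the illustrative employee table; BEGIN is spelled START TRANSACTION in some systems):

  -- Two updates grouped into one atomic transaction.
  BEGIN;
  UPDATE employee SET address = 'Berlin'  WHERE p_id = 1;
  UPDATE employee SET address = 'Potsdam' WHERE p_id = 2;
  COMMIT;

  -- A concurrent reader sees either both updates or neither of them,
  -- depending on the isolation level; the database handles the locking for us.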
And the level on the very top is the data model level. Here we have, for example, the query compilation: here we reformulate our queries, we optimize our queries, we decide how SQL queries might actually be executed. We have the access path selection. Sometimes we
need to decide: do we sequentially scan our data, or do we have an index? If we have an index, do we actually want to use it? Is it advantageous, or do we not want to use it? Access control obviously has to be done in a real database as well, since some people might not be allowed to access all the data.
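Two small, purely illustrative snippets for those last two points; EXPLAIN is spelled this way in PostgreSQL, other systems differ, and the role names are made up:

  -- Access path selection: ask the optimizer how it would execute the query.
  EXPLAIN SELECT * FROM employee WHERE p_id = 1;
  -- The resulting plan shows whether a sequential scan or an index is used.

  -- Access control: not everybody may read everything.
  GRANT  SELECT ON employee TO hr_role;
  REVOKE SELECT ON employee FROM intern_role;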
So those are the levels you usually always find in every database, at least
in more realistic databases. If you look at research databases, you often don't see something like integrity checks, because they would need to be done but are often not that interesting. Sometimes people also don't do logging and recovery, but at least if you have a real database, you need something like that.
But of course, there are many topics that you just can't assign to one single level.
So, logging needs to access different levels.
There's also recovery.
There are many things that you can't just pinpoint.
Okay, this is in this level, this is in the other level, and so on.
So this can be hard, and I'm already going over time. I'm sorry for that. Yeah, I'm not going through this slide now. Next week, sorry, tomorrow, we continue talking about hard drives and about indexes. Hard drives are still a thing. It might surprise you, but most of the data is actually still stored on hard drives. And then we cover the remaining database systems topics and continue with CPUs.
Okay, thank you and see you tomorrow.