Hardware-Conscious Data Processing (ST 2024) - tele-TASK - Performance Management & Benchmarking (2)
Episode Date: April 16, 2024...
Transcript
Welcome to this week's Hardware Conscious Data Processing.
My name is Martin. I'm also part of the research group of Professor Rabl.
And today I will present the contents, because Professor Rabl is not in Potsdam this week.
Before we start and continue with the topics of last week, short announcements.
There will be the lecture series on HPI research this evening.
This will be here in this room.
Today, Professor Felix Naumann will present about data quality and data preparation.
So this is probably pretty interesting.
If you're not aware of this lecture series, you can find more information on this website. Okay, so where were we last week?
We are still in the introduction of the lecture. Last week, Professor Rabl started with performance analysis.
We are going to continue with this today.
And afterwards, depending on how far we get,
we are going to cover database basics.
Just a quick reminder to bring us all on the same page
with database systems.
Concerning the timeline, you can see here we are now in the second week.
And this week, we will have those two topics today, performance management and DB basics, and then tomorrow we are going to start with some probably more interesting stuff and talk about CPUs. But today, as I've said, we are still in the introduction phase.
Okay, so what did we learn last week? Last week, Professor Rabl gave an introduction to performance analysis. That means: how can we properly analyze the performance of our system,
of our algorithm, whatever you are working on.
How can we properly do that and evaluate how fast it is,
how efficient it is, how cost-effective it is,
and how to do it properly.
Before we do something like that, or before we investigate a new idea, he introduced back-of-the-envelope calculations
to give a quick estimate of whether an idea is actually worth it,
worth investing more time in,
and we talked about how to measure things
and some statistics.
Moreover, Professor Rabl presented
his famous seven-step paper and thesis recipe.
And yeah, the rest of today's talk concerning performance measurement
will be about benchmarks,
and especially how to do fair benchmarks,
which is often harder than it might appear.
Okay.
So when we talk about benchmarks, there's a pretty wide spectrum of benchmarks and different
levels of benchmarking that you would usually do when you implement an algorithm or even
an entire system.
And those categories, or levels, are pretty blurry, but they are still the levels that people usually talk about.
And the very lowest level that we have are so-called micro-benchmarks.
These benchmarks, you usually execute them on your laptop or on a simple server,
and those are very low-level, simple benchmarks.
Just to get an initial idea or, for example, to give you those numbers
that we talked about last week.
If you want to know, okay, what is a rough estimate for my back-of-the-envelope calculation? For example: how long does it take to scan 1 million items?
So if you don't know this number, or if you don't have it at hand, you would simply write a very simple benchmark, such a micro-benchmark, and then just get the minimum, maximum, average, or whatever measurement you were looking for.
This is really mostly low level, and it's probably just a couple of lines of code; very low-level, simple benchmarking.
These are the so-called micro-benchmarks.
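As a concrete, hedged illustration (this is not code from the lecture), such a scan micro-benchmark could look like the following in C++; the data size, the number of repetitions, and the reported statistics are arbitrary example choices.

```cpp
// Minimal micro-benchmark sketch: how long does it take to scan/sum 1 million integers?
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    std::vector<int32_t> data(1'000'000);
    std::iota(data.begin(), data.end(), 0);          // fill with 0, 1, 2, ...

    constexpr int kRuns = 10;
    std::vector<double> times_ms;
    volatile int64_t sink = 0;                        // keep the compiler from removing the scan

    for (int r = 0; r < kRuns; ++r) {
        auto start = std::chrono::steady_clock::now();
        int64_t sum = 0;
        for (int32_t v : data) sum += v;              // the scan we want to measure
        auto end = std::chrono::steady_clock::now();
        sink = sink + sum;
        times_ms.push_back(std::chrono::duration<double, std::milli>(end - start).count());
    }

    auto [min_it, max_it] = std::minmax_element(times_ms.begin(), times_ms.end());
    double avg = std::accumulate(times_ms.begin(), times_ms.end(), 0.0) / kRuns;
    std::cout << "min " << *min_it << " ms, max " << *max_it
              << " ms, avg " << avg << " ms\n";
}
```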
The next level above, we are talking about functional and component benchmarks.
And now we have a high level function, but still a very specific aspect of the system.
One example might be sorting.
So while the sorting we had last week, just calling std::sort, might still be considered a micro-benchmark, if we now do something like TeraSort, this is probably more like a component benchmark.
If you're not aware, TeraSort was kind of a standard benchmark for Hadoop systems, and the task was to sort one terabyte of data. So this means you're not just having four lines of code, or it might still be four lines of code, but behind it there are probably different machines being spawned, data distributed across your Hadoop cluster, and so on. So this is not something we would do on a single laptop; this involves more components of the system.
Concerning SQL: while a micro-benchmark might be just a simple scan, so you have your vector of integers and you test how fast you can sum it or look for elements smaller than 17, now we would say, okay, I might test a simple select query.
So it is very simple, but we send an entire select query to the database.
So more components are involved, for example the parsing, the optimizer, and then the actual execution of this operation.
Again, one level above, we are talking about application level
benchmarks.
And here we measure the actual system performance, but in a pretty isolated way.
So here, for example, we would use that to evaluate hardware or software, but for a very focused aspect. For example, for a large application, we would take one query of this application and test it with different configurations, standalone, with nothing else running on the machine, just to evaluate how fast this query would be.
So this would be an actual application-level benchmark. We have somewhat real data, we have the workload of this application, but still, as we often do in research, you might only want to know, okay, how fast is query 5 of this particular application?
And then this would be, for example, an application-level benchmark, which is, or might be, part of a real application.
So this is usually the end goal.
If you have a system or if you build your system,
you want to have some advantages
or you want to provide some improvements
for a real application
that is actually really running in the real world.
Ideally, whatever you do might have an impact on a real application, for example an enterprise system, a machine learning system, a search engine, whatever this might be.
Usually you don't have access to such a real system, so it's really hard.
This might be the gold standard of showing that you have an impact, but it's really hard to do something like that.
Because the more important or relevant the application is, the harder it is to access its data, or at least to publish results or talk about whatever you did with this real application.
For this reason, there are very well-known standardized benchmarks, and there are different categories of benchmarks that we will talk about later. Probably the most well-known category here are standardized benchmarks. There is the TPC council, whose purpose is to standardize, together with industry, benchmarks that somewhat reflect real-world systems. These benchmarks specify how we create data, in which form this data has to be ingested into a system, which queries then run on the ingested data, and also which metrics have to be reported for the given benchmark. There's a wide variety of such benchmarks. Then there are
non standardized benchmarks which are widely used. We'll also give you later a
few examples. So these benchmarks usually come from research projects which had pretty specific problems
and formalized or specified the benchmark for that.
And then later other people also tried to tackle the same problem.
So these benchmarks then became widely used
and accepted in the field.
And then, very often, you see very focused, small benchmarks; pretty much every second paper that you find doing performance analysis has, or often needs, its own benchmark.
Sometimes you say, okay, there's TPC-H, and there's just one little aspect that I think is really relevant, and I can show that it's relevant in the real world, but TPC-H or some other well-known benchmark does not really cover that part of the spectrum. So I adapt this benchmark or write a completely new benchmark, and maybe nobody else ever uses it, but it is sometimes simply required to have your very own benchmark.
This is basically the third category that we have.
So here we have a typical example, another example than the sorting one from last week, of a micro-benchmark. The purpose of this code is that we want to add up all the numbers of some columns in a two-dimensional array. In this case, the two-dimensional array is just flattened into one very large array. And now the question is: how would you store the data in your array? Do you store it column by column, or row by row? And depending on how you store it, accessing it in row-wise order or column-wise order can make a pretty significant difference; it depends on your hardware and so on.
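The code on the slide itself is not in the transcript, but a minimal sketch of this traversal experiment might look like the following; the array dimensions are example choices, and the actual benchmark in the lecture may differ.

```cpp
// Sketch of the 2D-array summation experiment: one flat row-major array, traversed
// either row by row (sequential accesses) or column by column (strided accesses).
#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    const size_t rows = 10'000, cols = 10'000;        // example size, roughly 400 MB of int32
    std::vector<int32_t> data(rows * cols, 1);        // element (r, c) lives at r * cols + c

    auto time_ms = [](auto&& fn) {
        auto start = std::chrono::steady_clock::now();
        fn();
        return std::chrono::duration<double, std::milli>(
            std::chrono::steady_clock::now() - start).count();
    };

    volatile int64_t sink = 0;

    double row_wise = time_ms([&] {
        int64_t sum = 0;
        for (size_t r = 0; r < rows; ++r)
            for (size_t c = 0; c < cols; ++c)
                sum += data[r * cols + c];            // consecutive addresses
        sink = sink + sum;
    });

    double col_wise = time_ms([&] {
        int64_t sum = 0;
        for (size_t c = 0; c < cols; ++c)
            for (size_t r = 0; r < rows; ++r)
                sum += data[r * cols + c];            // stride of `cols` elements per access
        sink = sink + sum;
    });

    std::cout << "row-wise: " << row_wise << " ms, column-wise: " << col_wise << " ms\n";
}
```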
So this will be an example we will later come back to
when we talk about CPUs.
But there can be a quite significant difference between accessing sequential data, so data that is stored consecutively in memory, and jumping around with a so-called strided access.
As I've said, we will come back to this experiment, but you can see the numbers here in this plot. In this case, the row-wise traversal of this 2D array is much faster (note that the y-axis is a logarithmic scale) than a column-wise traversal.
So it really makes sense to think not just about which cells I need to add up, but also about how, and in which order, I'm going to access that data.
And this is a typical example: in case you have to do something like this, say you're thinking about some way to sum up a 2D array and you have no idea how fast it is, doing such a simple micro-benchmark helps you a lot to get an initial idea, and it already gives you a good sense of which things you need to be really aware of and have to consider. But again, we will talk about this in more detail later. So, what is good about micro-benchmarks, why do we do them?
They are very focused.
Often you don't start with the idea that you're going to invent a completely new system,
but you start with a very small problem that you want to attack.
So the microbenchmark gives you the chance to have a very early idea
if this idea might work out or will give you a first good feeling about performance numbers.
So let's say you're now following the seven-step recipe: you did your first analysis, the back-of-the-envelope evaluation, and it turned out to be a good idea.
The next step can be a micro benchmark,
which shows you if it actually makes sense to do that
when it's really implemented.
So it's a very focused thing.
It's a DR in a very controllable setting.
So you are the one that controls the workload.
You say which data you want to create for this benchmark.
Is it skewed?
Is it integer?
Whatever.
You can easily set the data size and so on,
which might be much harder if you already
use a standardized benchmark, and so on and so forth.
You can change all the parameters that might impact your system and so on.
And it's usually easy to run.
There are obviously some reasons why we cannot, at least why we cannot use them alone.
So they often neglect the larger picture. Whatever you do there, or whatever performance gains you achieve, they might be totally irrelevant for the eventual performance of the system. So you might optimize something that is not relevant, or that is just a tiny part of the overall execution time. It's really easy to focus on micro-benchmarks for quite some time and lose the bigger picture; you should keep that in mind.
And it's hard to generalize the results. Okay. So another level that we talked about are application-level benchmarks.
And as I've already said, there is the so-called TPC, the Transaction Processing Performance Council.
This organization tries to standardize benchmarks. There are well-known members in this council; these are pretty much all the large hardware vendors out there as well as the database vendors. So there's Oracle in it, there's Microsoft in it, CPU vendors like Intel and AMD, VMware, and so on.
So pretty much all the big players are in this council,
and they try to come up with good standardized benchmarks
which allow to compare systems on a fair basis.
When we talk about database application-level benchmarks, there are usually two, or even more, but two very relevant categories: OLAP and OLTP.
Is anybody not aware of what these are? Okay. So OLAP, this is online analytical processing. As the name says, analytical processing: we analyze lots of data. For example, we want to have the sum of all the products sold last year, separated by region. So those are large aggregations, usually large joins. This is what is often done in so-called data warehouses. So this is not your transactional system that every single sale goes into; these are often systems where once a day you throw in all the data that you recorded over the day, and then you run all your analyses on those specialized systems, often called data warehouses.
And two very well-known benchmarks there are TPC-H and TPC-DS.
And DS stands for decision support.
So these benchmarks, these two try to mimic what people do on
those systems, on those data warehouses. Scanning and analyzing large amounts of
data. OLTP is online transaction processing, and here we have TPC-C and TPC-E as well-known examples. What we do here is that we have multiple clients running concurrently, and they do typical transaction processing.
They insert a single line, for example: this item has been sold. Later, they might update it, or they might reference it from another sale, and so on.
So they write, delete, and update a lot, usually on a very small scale. This is so-called classical transaction processing.
TPC-H and TPC-C are rather old and somewhat outdated, but they are very often used in research because they are comparatively simple. They're enough to evaluate systems in many cases and still simple enough for research systems to play around with, while TPC-DS and TPC-E are much more complex but also more realistic. So if you actually want to evaluate a new database system, then you're probably going to compare using TPC-E and so on, using the more recent benchmarks.
More recently, there are benchmarks that come with an entire kit. For example, TPC-H and TPC-DS give you a data generator that allows you to generate those data sets. You can say, okay, give me a scale factor of 100, which asks the data generator to generate 100 gigabytes of data, and then you get a lot of CSV files with data. You load that, and then there is a query generator which generates the queries for you, but it doesn't run the queries. And it turned out that more recent, more complex benchmarks are quite hard to implement, because it's not only about generating data and running queries; there is much more involved in them. So the more recent benchmarks come with entire kits.
These are the two benchmarks listed at the bottom there, the express benchmarks: TPCx-BB for big data and TPCx-AI for artificial intelligence. Here we're talking about Hadoop systems, so there are scripts that help you set up and run those benchmarks on larger systems: creating the data, having a client that actually executes the queries in a reproducible way, and so on. So these benchmarks are more complex but much easier to set up; well, not really easier, but since the setup of those benchmark systems is quite hard, the kits help you a lot, because there is just much more going on in these systems.
The AI benchmark, for example, uses Spark in the background and has a couple of Python scripts that do the machine learning part. These kits help you run all of this.
Then we have a couple of benchmarks that are not standardized, or at least they have not been standardized by a council of different vendors, but by people, for example here Patrick O'Neil, who just found: okay, there are issues that I want to work on, but there is no existing benchmark that gives me such a setup to work on.
So what this paper back then did, or what they proposed, was the Star Schema Benchmark, because they wanted to have a classical star schema, a typical schema you find in data warehouses.
They took TPC-H and just changed the schema a little bit, added or rewrote some queries, so that it still looks kind of like TPC-H, but it has a different data schema.
So if you want to work on star schemas, then this is a very well-known, very well-accepted benchmark you can use there.
Another benchmark is the CH benchmark. As the name already gives away a little bit, this benchmark combines TPC-C and TPC-H.
I would say roughly ten years ago, there was this trend of new database systems which try to combine transaction processing and analytical processing. You might have heard that Professor Plattner works a lot on this topic; there is SAP HANA, which tries to do that, MemSQL is a typical example, and the TU Munich also did a lot there. They wrote this paper, or basically invented this benchmark, after a Dagstuhl seminar.
And the purpose of those systems is that they claim you no longer need your transactional system, which is maybe Oracle, plus an additional system, which is your data warehouse; these systems want to combine that.
In the beginning, this was often all about in-memory computing, so they said, okay, now we can use the performance advantages of in-memory computing to combine those two systems into one. Of course, now we need a benchmark for that, and the CH benchmark basically rewrote the TPC-H queries to run on a schema that was written to by TPC-C clients. So there is a fixed number of TPC-C clients, clients that do a lot of inserts and updates, typical transaction processing, and then there is another client which runs frequent analytics on top.
So they combined both workloads on the same system.
In the world of key-value stores, so almost schema-less database systems, there's a very well-known benchmark, the Yahoo Cloud Serving Benchmark, YCSB.
This is the standard benchmark if you want to evaluate your key-value store.
The benchmark does typical CRUD operations: it creates data, reads, updates, and deletes.
It is relatively simple, so it's comparatively easy to implement, because it's all just on key-value stores; there are no analytics with joins and so on.
But it's still realistic and heavily used.
And yeah, this is one example of a very well-known benchmark that is also not standardized, but if you're working in the field of key-value stores, you probably can't get around benchmarking with it.
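A hedged sketch of what a YCSB-style CRUD loop could look like is shown below; the KeyValueStore type and the operation mix are hypothetical stand-ins, not the actual YCSB client, which is a Java framework with pluggable bindings.

```cpp
// Sketch of a YCSB-style CRUD workload against a key-value store.
#include <iostream>
#include <random>
#include <string>
#include <unordered_map>

struct KeyValueStore {               // toy in-memory stand-in for the system under test
    std::unordered_map<std::string, std::string> data;
    void put(const std::string& k, const std::string& v) { data[k] = v; }
    bool get(const std::string& k, std::string& v) const {
        auto it = data.find(k);
        if (it == data.end()) return false;
        v = it->second;
        return true;
    }
    void erase(const std::string& k) { data.erase(k); }
};

int main() {
    KeyValueStore store;
    std::mt19937 gen(7);
    std::uniform_int_distribution<int> key_dist(0, 9'999);
    std::uniform_int_distribution<int> op_dist(0, 99);

    size_t reads = 0, writes = 0, deletes = 0;
    for (int i = 0; i < 100'000; ++i) {
        std::string key = "user" + std::to_string(key_dist(gen));
        int op = op_dist(gen);
        if (op < 50) {                       // 50% reads (example mix)
            std::string value;
            store.get(key, value);
            ++reads;
        } else if (op < 95) {                // 45% inserts/updates
            store.put(key, "payload");
            ++writes;
        } else {                             // 5% deletes
            store.erase(key);
            ++deletes;
        }
    }
    std::cout << reads << " reads, " << writes << " writes, " << deletes << " deletes\n";
}
```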
Last but not least, the last example of these relevant but not standardized benchmarks is the join order benchmark.
It is a quite recent benchmark, and this one also came from a research problem. People found that in the real world there are often queries that join lots of tables. TPC-H, I think, only has eight tables, and the largest queries there probably join five or so of them, or maybe all eight, but they don't really have those huge join trees that you find in the real world. So they took a real-world data set, a movie data set with quite a few tables, and then they wrote queries with different numbers of joins. The queries with the most joins have, I think, 13 joins, so 13 tables.
This is already quite challenging for most systems, and the idea of this paper back then was to show that databases easily struggle with those queries, because this is real data, so it is not uniformly distributed as in some benchmarks. If you scan for certain actors, depending on which actor you scan for, you might get lots of results or only a few results. The results might not join with certain countries because this actor was only locally known, and so on. So it is quite hard for most databases to handle such queries and to really properly optimize them.
This is not something we will care about, but we still want to make you aware of
those benchmarks, that those are benchmarks that tackle
a very specific problem,
but still a really relevant problem.
And this is also widely used, because join ordering remains one of the most pressing issues in database systems, at least in analytical database systems, nowadays. A bad join order can easily impact the performance by several orders of magnitude. So yeah, it's a very relevant problem that this benchmark tries to address.
And then we have real applications as one more option for performance measurements.
In this case, you take an actual application, put it on top of your implementation or your algorithm, and then try to evaluate how well it runs.
This is actually the best thing to do, because now you can really show that your system has a real impact on the real world.
And you don't focus on problems that are artificial or that nobody really cares about or that are insignificant.
But it's really, really hard to do that.
There are many real-life applications out there, so even if we have a real one, it might not generalize well, which is kind of a problem. And then, if you are lucky enough to have access to one of those systems, you still have the problem that you usually can't talk about it. If some company gives you access to data and workloads, you usually have to sign an NDA or something, so you're somewhat limited in talking about that system. And you can't give out the database, you can't give out the data sets; you might give out the workload. But without the data set, people are not able to reproduce your results, and that makes it really hard for other people to improve on your work or just verify what you did.
So it's very much limiting what you can do.
And also, this is not scalable.
As I've said, most database benchmarks, or at least the TPC benchmarks, allow you to set a certain scale factor, which defines the number of gigabytes of data that you create.
So I can easily say: I'm going to run a scale factor of 1,000 to have roughly 1,000 gigabytes, or something even larger or smaller, which is obviously not possible if you have a real application. Because if that application only has 100 gigabytes, that's your limit, unless you artificially blow it up, but then again it might no longer be a real application.
But what we can still do here, and what has often been done, is to take an artificial setup and, using the information about what a real system looks like, try to come up with a benchmark that mimics those systems.
So we look at the data: what is the data schema, is it a snowflake schema or not, how is the data distributed, what kinds of tables do we have? And then we want a data generator that mimics exactly those properties. For example, the BigBench benchmark uses distributions from the US Census Bureau, so that's real US data, and from publicly available Amazon data sets.
So this is still realistic data, and the data generator can try to mimic those properties of real systems, which often helps a lot.
Because one of the main problems that has been found, for example, is with TPC-H, which was, well, maybe not the first, but the first really well-known analytical benchmark with real impact.
People found that its data is all uniformly distributed, which we talked about last week, I think, right? That is nice and easy to handle. So working with uniformly distributed data is nice, but data in the real world is pretty much never uniformly distributed; that's rarely the case.
So taking those more realistic data distributions and working on skewed data sets helps a lot to make your system, or whatever you're implementing, robust and prepared for the real world.
And if you mimic just the real data, you obviously have no problem with putting your data online; you don't have to think about privacy, scalability, and so on. So taking this route has a lot of advantages.
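To make the contrast between uniform and skewed data concrete, here is a hedged sketch of how one might generate Zipf-distributed keys for a benchmark data set; the skew parameter, domain size, and data volume are example choices, not values from the lecture.

```cpp
// Sketch: generating skewed (Zipf-distributed) keys instead of uniform ones.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

int main() {
    const uint32_t domain = 1'000;     // number of distinct keys (example choice)
    const double   s      = 1.2;       // skew parameter; larger means more skew
    const size_t   n      = 1'000'000; // number of values to generate

    // Build the cumulative distribution with P(key = k) proportional to 1 / k^s.
    std::vector<double> cdf(domain);
    double norm = 0.0;
    for (uint32_t k = 1; k <= domain; ++k) norm += 1.0 / std::pow(k, s);
    double acc = 0.0;
    for (uint32_t k = 1; k <= domain; ++k) {
        acc += (1.0 / std::pow(k, s)) / norm;
        cdf[k - 1] = acc;
    }
    cdf.back() = 1.0;                  // guard against floating-point rounding

    std::mt19937 gen(42);
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    std::vector<uint32_t> keys(n);
    for (auto& key : keys) {
        // Inverse-CDF sampling: first key whose cumulative probability reaches u.
        double u = uni(gen);
        key = uint32_t(std::lower_bound(cdf.begin(), cdf.end(), u) - cdf.begin()) + 1;
    }

    std::cout << "generated " << keys.size() << " skewed keys, first key: " << keys[0] << "\n";
}
```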
And let's say that at some point you will write a paper or a thesis. Then, ideally, you cover all of these aspects. That might not always be possible, but ideally it's really good to, for example, show top-down that you can have a real impact. It might just be a tiny, tiny thing for a real application, but you show for a real application that whatever your idea is has a real impact: you improve the performance, you improve efficiency, or you reduce costs, and so on.
And then, if you want to analyze where the advantages come from, you usually have to dive deeper and run very specific benchmarks: selected benchmarks, for example application-level benchmarks only running single queries, and, to investigate where the time is going, you probably also need to run micro-benchmarks. So again, a good thesis or a good paper usually has all these flavors of benchmarks.
When you compare your work to other systems, which you should always do, because otherwise it's hard to evaluate what you actually achieved, you should ensure that all the systems have equal functionality, and that everybody, you and others, is able to reproduce the performance numbers.
So please keep that in mind.
That's highly relevant.
And if you do something like that, feel free to ask the other authors. So if you want to improve on some work that was published a couple of years ago, just try to write an email to the original authors.
Many of them might no longer work there because they have finished their PhD, but at least the professor or some colleagues might still be around, and they might help you get the system running. Nowadays, more and more papers publish their code on GitHub; try to find out how they set up the system, so that you have a really proper setup for their work and are able to make a fair comparison.
There is actually work on this, and there have been a lot of discussions about it, because fair benchmarking, really comparing different systems against each other, is and remains challenging. Probably the best example, or the best talk about it, was by the group from Amsterdam.
They had a talk and a paper called "Fair Benchmarking Considered Difficult", and they talked about those challenges.
They talked about how bad the current situation actually is.
What they found is that there are many, many problems in database benchmarking. There are a lot of papers that compare and evaluate performance, but many papers don't evaluate fairly, or don't compare against other systems at all, which is obviously a problem. And even when people do compare against other systems, those results are often not really reliable and reproducible. It has changed a little bit since, and there is a lot of work on this, but back in 2018 the paper complained that there were a lot of white papers online that are very much misleading. Does anyone not know what a white paper is?
So these are basically papers put online by companies. This is more like marketing, right? This is not something that reviewers have read, so it has not been peer-reviewed: some company writes a 10-page report and puts it online, and obviously the numbers are never going to be bad, because this is how you market your own product. It might be pretty obvious that white papers are not always the best way to go, or not always that trustworthy, but even in academia there were many numbers in papers that are simply not reproducible, so it's really hard to evaluate and trust these numbers. And the interesting thing was that more and more people published their results, so there were a lot of results and many, many comparisons, which at first sounds like a good idea, but only few of them have been really useful.
Many people tried to reproduce certain papers and found it really hard to reproduce them.
So there has been a lot of work on this topic.
There have been a lot of discussions and things are improving.
We'll talk about initiatives in a minute.
But it's still a problem: just having many results that you plot and many benchmarks that you run doesn't necessarily mean that the results are really valuable and that people can trust them.
And one reason for that are so-called benchmarking games: basically, you are gaming the system. What p-hacking is in some research fields, benchmark gaming is in benchmarking. So you try to find configurations and settings that just work out best for your system.
This can be, for example, different configurations: you compare your tuned configuration of Postgres against vanilla Postgres, so you tweak some configuration knobs for your system but don't do it for the other system that you compare against. You might hardwire configurations, which can be fine if you are very transparent about it, because you might not be able to automate certain optimizations, so you just have to state which optimizations you apply. But if you hardwire them and don't tell anyone, and the other systems that you compare against don't have these optimizations, then it's obviously not a fair comparison.
Also, the specification of your benchmark might be totally biased towards your system, which often just happens: you start with a problem, you try to come up with a benchmark, and you don't really realize from the very beginning that, because you wanted to show this specific problem, the setup is just highly biased. So you have to be aware of that, and maybe go one step back and think about, okay, how can I generalize my problem, for example, to have less bias there.
Another issue that the group found is synchronized workload queues.
What some systems did was have clients running and issuing queries, but using a synchronized workload queue. That means the system under test was probably never really fully loaded; there was some synchronization going on. And especially if you want to evaluate the effects for the real world, you are never going to have the situation where one query neatly comes after another. There might be peaks where lots of clients approach the system and run queries at the very same time, and then there might be times when nothing happens. So a synchronized workload queue might be something that does not make for a really fair comparison.
Then, obviously, arbitrary workloads are not a good thing to have, and very small benchmarks are often a problem. There are still papers out there which use, for example, TPC-H with a scale factor of one and then run benchmarks on that. That is not bad in general, because TPC-H is a well-known benchmark, so if you run all the TPC-H queries, it's probably a good thing to do, and it is standardized.
So if you run this on Postgres or whatever database, it at first looks like a pretty decent comparison. But if you look at how much data is actually touched by TPC-H and you have a large enough server, it might turn out that pretty much all the data for scale factor one is in your caches. Depending on the setup, you might have a large buffer cache in the database, so either all the data is in the buffer cache and you never really access the disk, or, if you keep your data in memory, much of the data might even be in the CPU caches. So you are not really benchmarking anything valuable, at least not for larger systems.
So this might also be a problem.
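As a rough, hedged back-of-the-envelope illustration with assumed numbers: scale factor 1 corresponds to roughly one gigabyte of raw data, with the lineitem table making up the largest part of it. If the database is given a buffer cache of, say, a few gigabytes, the whole data set fits in the buffer cache after the first run, so later runs never touch the disk, and the columns a query actually scans may only be a few hundred megabytes, which a modern server keeps close to the CPU anyway. So such a measurement mostly tells you something about in-memory scan speed, not about the behavior of the full system on realistic data volumes.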
And the root cause for those benchmarking games is that research often tries to have, on the first page of a paper, one figure that gives away that we are good, right? So the paper makes sense: here's one figure where we compared the performance of our system with their system, and we would like to show that we are just much better. You want to have that on the first page, and if you don't have that in your paper, then you don't get accepted, which is obviously a problem for you as a researcher, and the same goes for products. So you have to have this plot.
And this plot might be okay-ish, right, if you are pretty transparent about it. If you say, okay, for this particular query, with these five very specific knobs, we can improve performance by, let's say, a factor of 20, that can be fair if you later, very transparently, explain: okay, this might not always be the case, here is where we are not faster, and so on. So this can work out. But often, if you see those plots, especially such a simple plot, it is probably kind of fishy, and maybe the authors are trying to hide something, or the comparison is not really that fair and good.
So what are the most common pitfalls when we talk about database benchmarks?
First of all, there's non-reproducibility. That's definitely a problem, but one that is being addressed more and more, as we'll talk about in a minute.
Then failure to optimize.
We have apples to oranges comparisons.
Incorrect results, obviously.
So the black ones we are going to talk about
a little bit more in detail in a minute.
And then what can also be a problem are cold runs versus hot or warm runs. Does anybody not know what those are, what we mean by that? What is a cold run? Anybody?
A cold run is when I directly start a system and it hasn't really done anything yet, and then I...
Exactly.
So you might just start up the system and execute the first query. That is fair if you do all your benchmarks like that; it's a totally fair comparison, because often systems are cold. But the differences can be quite large if you compare that to a system that is not cold, that is, a system that is warm or hot. That means the caches are filled; the query has maybe already been optimized and its plan cached, so we don't have to optimize the query plan again; all the caches are filled with data that's actually relevant; and so on.
So this comparison can be really unfair. So at least you have to make sure, if you compare two systems, that either both are cold or both are hot, so they are in the same state, or a comparable state.
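A minimal sketch of how a benchmark driver might keep cold and warm runs apart is given below; run_query() is a hypothetical placeholder for whatever the system under test executes, and the number of warm runs is an example choice.

```cpp
// Sketch: report the cold (first) run separately from the warm runs instead of mixing them.
#include <chrono>
#include <iostream>
#include <vector>

// Placeholder for the actual system under test; a real harness would send the
// query to the database and wait for the result.
double run_query() {
    double x = 0.0;
    for (int i = 0; i < 1'000'000; ++i) x += i * 0.5;  // dummy work standing in for execution
    return x;
}

double timed_run_ms(double& result) {
    auto start = std::chrono::steady_clock::now();
    result = run_query();
    return std::chrono::duration<double, std::milli>(
        std::chrono::steady_clock::now() - start).count();
}

int main() {
    double result = 0.0;

    // First execution after startup: caches empty, no cached plan. Report it separately.
    double cold_ms = timed_run_ms(result);

    // Subsequent executions: buffers and plan caches are warm.
    std::vector<double> warm_ms;
    for (int i = 0; i < 5; ++i) warm_ms.push_back(timed_run_ms(result));

    std::cout << "cold: " << cold_ms << " ms, warm:";
    for (double w : warm_ms) std::cout << " " << w << " ms";
    std::cout << "\n";
}
```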
And then there's data pre-processing: you have to be very transparent about it. Did you do any data pre-processing? Ideally, if that is something the system does automatically, also show how it runs if it doesn't pre-process.
And then another problem is overly specific tuning, when you just tune for a very small aspect.
So, non-reproducibility. This has been a very frequent problem, and it is being tackled more and more. Often there's no code available. This is still very often the case for industry papers, because they can't provide the code of their product, but even in research it still happens that people publish papers with no code available, which is obviously a problem. And there are still very few consequences, so such papers might still be accepted.
There are efforts to counter that. One well-known effort, one of the biggest in database systems, is the SIGMOD reproducibility awards: there are awards for who did this best,
and there are different levels of reproducibility.
For example, "Artifacts Available", this is the green badge here. It is put on your paper if you make your artifacts available. Here the reviewers, or a dedicated reproducibility chair, look at your GitHub repository, for example, and check: are the things you talk about actually available? Is the data there, is the code there, is there a makefile? Is the code of your paper available? Can I check out the Git repository and, for example, execute the makefile?
So there is a little bit of effort put into it, but they only check that the artifacts are available.
Much better, for example, is this blue badge, which means that the results have been replicated.
This means that somebody really invested all the effort to try to replicate your results, and this can be a lot of work for both sides. So people try to get better and better at this. This might mean, for example, that you give out a Docker image or something, so that it's super simple to just pull an image, or whatever kind of code distribution you use, and then with a single click I can say: please, now, on this machine, run this entire code. And ideally the paper itself comes out at the end, or at least the plots are plotted, and I can see, okay, the plots look somewhat the same. I might not have the same machine, but the plots do show that the code really did what I expected it to do. Or it might even produce the entire paper: so there's the paper, all the LaTeX is built, the plots are completely reproduced and redone, and everything looks fine. So "results have been replicated" is basically the best level you can reach there.
Obviously, it's sometimes not possible: if you are working on hardware that not everybody has access to, you might not be able to give the reviewer access to it, so in some cases it's simply not possible to replicate your results. The same holds if you have data that nobody is allowed to access.
But if it's possible, you should strive for that.
Okay, here we have an issue.
Here we have a benchmark that compares three or four systems with each other.
In the first plot on the left, you see that they compare the median runtime of MariaDB and Postgres, showing that Postgres is a little bit faster than MariaDB.
In the middle plot, you can see that Postgres is a little bit slower than SQLite.
And on the right side, you see MariaDB Star, which is the authors' optimized version, and now it's much faster than SQLite, so it's the fastest database in this comparison.
Does anybody have an idea what might be the crime here? It's not the plot per se, so there's no trick in the plot, but does anybody have an idea what the authors might have done here which makes it not a fair comparison? Or do you have an idea what you could do to improve the performance of, let's say, MariaDB? No guesses?
Okay.
So it turned out that these were the same configuration parameters here. But that would be something that you could easily hide and that has a vast impact on MariaDB, Postgres, and so on. For example, you can usually tune the buffer size: some systems are limited by their buffers, so they don't keep all the data in memory, and if you just increase this buffer size and allow all the data to be in memory, this usually improves performance quite significantly. So if you had different configuration parameters, that might be benchmark gaming, but that has not been the case here.
Compilation flags. All these databases are open source, so you can usually just check them out and build them on your own.
Something else that is easy to tune, and that is not obvious in those comparisons, is that you might use different compilation flags. So you might compile all the other databases, just not yours, with debug flags or only with -O1, and only your database uses all the optimization flags, so -O3 or similar. This would obviously also be a very unfair comparison that you would never see in the plot; it is just hidden, and this would obviously be an unfair practice.
But this has also not been done in this case. Indexing could be a problem: you could just add an index. In this case, that would probably not help for this query, but you could add indexes for one database and not the others, which might help one database.
And it was the same version number here, which is another thing that you have to keep in mind, or that you have to be aware of, because people often start doing their work on, for example, Postgres. Your research is usually not done within two weeks, so you might work for one or two years on Postgres, and in the meanwhile Postgres gets two new major versions, so you end up comparing different Postgres versions against each other. You have to be very transparent about that; it can be a problem.
But what actually was the issue here is that a different schema was used: the authors, for this MariaDB Star, used double columns instead of decimal. That probably doesn't sound like much of a change, and it's allowed; TPC-H allows you to do that, it does not require an actual decimal type, for example, for dollar values. But as you probably know, having doubles for such decimals can be problematic: at some point, you will have rounding errors in your system.
And here, for benchmark comparisons, the issue is that double arithmetic is natively supported and optimized by the CPU; the CPU knows how to do that, and there's special hardware for it. For decimal, a little bit more work needs to be done, because it's simply not a native data type.
So it's simply not a fair comparison in that case. This would be benchmark gaming.
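To make the double-versus-decimal point concrete, here is a hedged sketch contrasting a plain double sum with a fixed-point sum over integer cents, which is one common way decimal values are implemented; the values and counts are arbitrary, not taken from the benchmark.

```cpp
// Sketch: summing monetary values as doubles vs. as fixed-point integers (cents).
// Doubles are handled natively by the FPU and are fast but accumulate rounding error;
// a decimal/fixed-point representation needs extra work but stays exact.
#include <cstdint>
#include <iomanip>
#include <iostream>

int main() {
    const int n = 10'000'000;

    double  double_sum = 0.0;
    int64_t cents_sum  = 0;          // fixed-point: 1 unit = 1 cent

    for (int i = 0; i < n; ++i) {
        double_sum += 0.10;           // 10 cents as a binary double: not exactly representable
        cents_sum  += 10;             // 10 cents, exact
    }

    std::cout << std::setprecision(17)
              << "double sum: " << double_sum << "\n"          // drifts away from 1,000,000
              << "exact sum:  " << cents_sum / 100.0 << "\n";  // exactly 1,000,000
}
```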
And the problem is that there is usually a low incentive to optimize the competition: if you just want to show that you are faster than another system, why should you invest days just to optimize their performance? Which is obviously problematic.
So, one thing we have already mentioned is the compiler optimizations that you can use. You should use the same compiler version, the same flags, and the same configuration. This is the shared-buffers setting that we list here; the cache size has a pretty large impact, at least for Postgres and most disk-resident database systems.
And the easiest fix, which not always, but very often works, is just to involve the competition. You can check out their Git repositories if they have something open source, where you can see how they benchmark their system, and just copy, or at least look at, their flags: how do they set up the system, which makefile flags or makefiles do they use to compile it? Whatever they use, it is probably a fair assumption that this is already a good configuration.
And then, if you have a script, an automated Python or bash script, for example, that produces the main comparison plot, maybe just write to one of the authors and ask if they think it's fair. Usually they are quite happy to answer, because they don't want to see their paper in an unfair comparison, so there's usually some interest in fixing those things. So just write to them and ask them; it usually works out, but it can be a lot of work. Sometimes people might complain that the entire comparison you do is unfair and you need to redo it, and this might be a lot of work.
Yeah, here in these two plots you can see the configuration impact for Postgres: we compared two Postgres versions with different configurations, and you can see that there is almost a factor of two that you can gain here with different configurations when running this TPC-H query.
And for MariaDB with different compilation flags, there is also a significant performance difference between using and not using those compilation flags.
Another problem is apples-versus-oranges comparisons.
This is, for example, always the case if you compare yourself against a full system. Let's say you implement a join: you have an idea how to do a join very efficiently, and you implement it. You do that in memory, in C++, on your laptop, and you just do the join; that is all you do. And then you compare it against running the same join in, let's say, Postgres.
First of all, it's probably not comparable, because Postgres is not really optimized for large joins; Postgres is disk-resident, so maybe data needs to be read from disk, which you might not do.
But even if you use a more recent analytical database, let's say DuckDB, which is in-memory and optimized for analytics, even then this might be an unfair comparison, because larger systems have to do a lot of work on top of just the join. They do transaction processing, so unless you deactivate that, they run transactions; there's a lot more going on than just this single join. There might be overflow checking which they have to do, so they have to check whether the sum might overflow. You probably don't do that in your micro-benchmark, where you just want to evaluate the performance of your join and might not even care about it, but this obviously makes the comparison unfair.
So the only fix here is actually to implement your code in a real system, which is often hard, but probably the only way to have a really fair comparison.
And since there are systems like Postgres and DuckDB, open-source systems that are well understood and also used by different universities, it makes sense to do that, because there's knowledge out there on how to do it, and it's often the best way to go.
Here, in this case, we have another extreme. Here, MonetDB, which is an in-memory database for analytics, is compared against TimDB. But TimDB is not actually a database; TimDB is just one function in C that does exactly one query. So there's no SQL being parsed, being optimized, being executed by different operators; there's just one tight loop that does the whole query, just a couple of lines of C code.
So obviously this is a very unfair comparison, and you should never claim that you are faster than MonetDB by whatever factor here, because this is simply not a comparison that can be done in a fair way.
But still, it is really interesting to have those numbers.
This would probably be more like a micro-benchmark, because it covers just a very simple aspect, even though this query does a little bit more, with aggregations and filtering. But it's still really interesting to have this query, or this implementation, because it can serve as an upper bound: if you have this optimized C code, which has been written just for query 1, then this might be the best you can achieve on your system.
So, for example, if you try to build a database that compiles queries, that at runtime produces code just for a specific query, this might still be helpful for you to see, okay, how far can we get. But just this comparison, without further information, is obviously not helpful and not fair.
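To give an idea of what such a hand-written, query-specific function can look like, here is a hedged sketch in the spirit of a hand-coded TPC-H query 1; the struct layout, the placeholder data, and the aggregates are simplified illustrations, not the actual TimDB code.

```cpp
// Sketch of a "query as one C/C++ function": a tight loop that computes a simplified
// TPC-H-Q1-style aggregation directly over an in-memory array.
// No parser, no optimizer, no operators; this is why it can serve as an upper bound.
#include <iostream>
#include <vector>

struct Lineitem {                 // simplified row layout (not the full TPC-H schema)
    char   returnflag;
    char   linestatus;
    double quantity;
    double extendedprice;
};

int main() {
    std::vector<Lineitem> lineitem = {
        {'A', 'F', 17, 1000.0}, {'N', 'O', 36, 2500.0}, {'A', 'F', 8, 400.0},
    };  // placeholder data; a real run would load millions of generated rows

    // Group by (returnflag, linestatus); both are single characters, so a small
    // fixed-size array indexed by the two bytes replaces a hash-based group-by.
    static double sum_qty[256][256]   = {};
    static double sum_price[256][256] = {};

    for (const auto& l : lineitem) {                       // the entire "query plan"
        auto f = static_cast<unsigned char>(l.returnflag);
        auto s = static_cast<unsigned char>(l.linestatus);
        sum_qty[f][s]   += l.quantity;
        sum_price[f][s] += l.extendedprice;
    }

    std::cout << "group (A,F): sum_qty=" << sum_qty['A']['F']
              << " sum_price=" << sum_price['A']['F'] << "\n";
}
```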
And then, there is something that will always hit you at some point, I think.
Sometimes your system might be surprisingly fast, really extremely fast, because you by accident throw away data, and scanning no data is pretty much always faster than scanning actual data. So you should always check your results. This happens a lot, and it can happen easily.
So keep in mind: always check your results.
Have something in your pipeline when you do your benchmarks.
For a simple paper, you might not have a fully fledged CI pipeline, but try to have something that just checks that whatever you do yields the same results as, for example, Postgres does.
It doesn't need to be the same results for scale factor 1000, because Postgres maybe runs forever on that, but have some sanity checks that verify that your results are actually correct.
This happens more often than you might think.
And obviously, if this is your implementation for query 1 of TPC-H and even TimDB can't beat it, yeah, please check that.
And run with different benchmarks and data sets too.
So whatever you do, if you're running a join, check that it also returns the correct results if you change the data set, if you have another data set, if you have another benchmark, and so on.
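A hedged sketch of such a sanity check is shown below: compare your system's results against a reference result, for example one exported from Postgres into a CSV file. The file names, format, and tolerance are example choices.

```cpp
// Sketch: sanity-check your benchmark results against a reference system.
// reference.csv / mine.csv are hypothetical exports, one numeric result per line.
#include <cmath>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

std::vector<double> load(const std::string& path) {
    std::ifstream in(path);
    std::vector<double> values;
    for (std::string line; std::getline(in, line); )
        values.push_back(std::stod(line));
    return values;
}

int main() {
    auto reference = load("reference.csv");  // e.g. exported from Postgres
    auto mine      = load("mine.csv");       // results of the system under test

    if (reference.size() != mine.size()) {
        std::cerr << "row count differs: " << reference.size()
                  << " vs " << mine.size() << "\n";
        return 1;
    }
    for (size_t i = 0; i < reference.size(); ++i) {
        if (std::fabs(reference[i] - mine[i]) > 1e-6) {   // example tolerance
            std::cerr << "mismatch in row " << i << ": "
                      << reference[i] << " vs " << mine[i] << "\n";
            return 1;
        }
    }
    std::cout << "results match the reference\n";
}
```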
This slide lists a few pitfalls to check when you write or review, and we are also by no means immune to them. You don't have to read them now, but this might be helpful for you in the future, to check all the boxes: did you check for that, did you talk about that, for example when you write your thesis later.
The second point is about reproducibility: you need to state the actual hardware that you used. Even today you find papers that just don't say what CPU they used. How did you tune the database parameters? Which database version? Did you compile it, or did you pull the binaries? How did you compile it, which optimization flags did you use? And then correctness.
So this is a very helpful overview; please keep it in mind whenever you, in the near future, write your master's thesis. It is a good checklist to make sure that you are benchmarking properly.
So summary so far.
This has been the introduction to performance analysis.
We talked about back of the envelope calculations.
They are extremely helpful to get a first idea about the performance of your idea or
the viability of your idea.
So if you can't show on the back-of-the-envelope calculation that your idea makes sense or improves over the state of the art, then you have to think more about it; then you have to go back to the drawing board. So this is the first very good filter to find out which ideas actually don't make sense, and it helps a lot.
And the numbers: for most things you can find pretty good estimates online; we listed a few of them. How fast your CPU is at scanning, or whatever GPU you assume, how many threads you have. You can use those numbers; just being in the right ballpark, calculate that, and then you probably have a pretty good estimate of whether it makes sense.
It can still be wrong, but it's a good filter.
Then we talked about measurements and some statistics: how to properly measure, and which statistics you should collect.
So don't always go only with the average. Depending on what you do: how many runs should you do, and should you also report the min and max? Is there a lot of noise in your results? Then it might make sense to report those as well. So usually you should always report them, but how far you go depends on your actual problem at hand. Think about that and be very transparent about it.
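As a small, hedged illustration of reporting more than just the average, a sketch like the following could be used; the run times in it are placeholders.

```cpp
// Sketch: report min, max, mean, and median instead of only the average.
#include <algorithm>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    // Placeholder measurements in milliseconds; in practice these come from your runs.
    std::vector<double> runs = {12.1, 11.9, 12.3, 35.7, 12.0, 12.2};

    std::sort(runs.begin(), runs.end());
    double mean   = std::accumulate(runs.begin(), runs.end(), 0.0) / runs.size();
    double median = runs.size() % 2 == 1
                        ? runs[runs.size() / 2]
                        : (runs[runs.size() / 2 - 1] + runs[runs.size() / 2]) / 2.0;

    std::cout << "min "     << runs.front() << " ms, max "    << runs.back()
              << " ms, mean " << mean        << " ms, median " << median << " ms\n";
    // The single outlier (35.7 ms) pulls the mean up noticeably while the median stays
    // robust, which is exactly why reporting only the average can be misleading.
}
```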
Then we talked today about benchmarks. There are many different benchmarks out there. Pick those that make sense for you to run. There are probably existing benchmarks that you can use, and if you have to write your own benchmark, you should have a good reason for that. It often makes sense to have your own benchmark if you are focusing on a very specific aspect that is not covered by other benchmarks, but then you should still show that your approach also works on standard benchmarks, even though they might not trigger your actual improvements.
And we talked about how to do fair benchmarks so that you don't end up with an apples versus
oranges comparison.
Any questions for now?
Okay, so then we will have a short break, and then we'll talk about, or at least start today with, the database basics.
Let's say a five-minute break.
Yeah, as I've said, now we're going to talk about database basics.
This should be mostly known to all of you; we're just going to repeat it so that everybody is on the same page and we use the same vocabulary throughout this lecture. Okay, so we are not going to talk about all of these things today in the remaining minutes.
I'll probably stop a little bit early, but today and tomorrow we're going to talk about the relational data model and SQL. You are probably aware of pretty much all of that, I guess.
We are talking about the classical architecture, so those layers that we usually have in a database architecture, and about indexes. And we briefly talk about query processing, so how a database processes queries. It's mostly a refresher. So if you have any questions, or if this is something you haven't heard about at all, you should probably approach us, because we expect that you know certain things about it. At least afterwards, you should all know, or you probably already know, what a join is, and we're still going to talk a little bit about it. But please tell us if this is totally new to you; that shouldn't be the case.
Okay, so there have been quite a few different data models, in the 70s and so on, but in the end the world agreed on the relational data model.
This is the model that pretty much all major database systems use today. So we put our data into relations.
Mostly those are tables in our databases nowadays, and you can see a simple example of such a table here.
Those things have a schema, so they have a structure. We are not talking about schema-less databases, NoSQL systems, or key-value stores; we are talking about relational databases. There are relational databases that also have nested data structures, but we keep it simple for now; the classical relational model does not have something like that, so for nested things you would have multiple tables that reference each other.
In this example here, we have a simple table which is called employee. It has five columns, or attributes. The first one might be the key, the identifier; this is the p_id. Then we have a given name, a last name, an age, and an address. So we have five columns, and sometimes a column is called an attribute, so attribute means the same as column. And, for example, we have marked Stefanie Meyer here; you can call that a row, or sometimes it's called a tuple. In the end, it's the same idea. And how this is stored is totally up to the database.
This is the view that the database presents to you: it allows you to access this table and to run queries on it. But the database decides whether it wants to store each row on its own, or each column on its own, or it might even do something in between; there are also systems that do hybrid approaches. This is usually independent of, so disconnected from, the relational model; it is how the database implements the relational model.
And yeah, we have columns, rows, attributes, and then we usually have what is known as the cardinality, which is four here. The cardinality describes the size of the table, not the width, but how many tuples we have. And then we sometimes have what is called the arity, which is five here, for the width, so we have five columns here.
And just a short question: do you see a problem with this table? Let's say you, as a large company, want to store your employees and their data in a relational table, let's say a physical table; do you see a problem with that?
There must be somebody.
Would you ever store the age like that?
Yeah?
Yeah, right.
So it doesn't make sense.
So this is just a simplified example, which you find in pretty much every database introduction course, but please never store an age like that; store a birth date instead. Maybe have a view on top of your table that does the age calculation. That probably makes a lot of sense, because it's easier for the user to get the age already calculated, but don't store the age itself. So age is something you should never store directly in a database.
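To make that concrete, here is a minimal sketch; the table and column names are just illustrative, and the date arithmetic uses PostgreSQL-style functions, so other systems spell this differently:

  -- Store the birth date, not the age (names are illustrative).
  CREATE TABLE employee (
      p_id       INTEGER PRIMARY KEY,
      given_name VARCHAR(100),
      last_name  VARCHAR(100),
      birth_date DATE,
      address    VARCHAR(200)
  );

  -- Derive the age on demand through a view.
  CREATE VIEW employee_with_age AS
  SELECT p_id,
         given_name,
         last_name,
         date_part('year', age(current_date, birth_date)) AS age,
         address
  FROM   employee;

This way the stored data never goes stale, and users who want the age simply query the view.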
And for each column, we also have data types.
And that's really important.
This is an important implementation aspect:
which data types you support and which data types you use.
And especially in research systems, you sometimes see systems that only support integers because
it's just much easier to handle.
But there's a huge difference between a column having strings or integers.
Strings are much harder to handle. We don't even have to go into collations
and different formats of how to store strings.
Strings are just much heavier to handle.
So be aware, coming back to the comparison topic: whenever you compare systems, try to also be fair in that regard and use comparable or the same data types, because there's a huge difference. Not only in terms of size, so using a BIGINT is different from a SMALLINT and so on, but there are many aspects of the data types that have a big impact on system performance.
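Just as a hedged illustration of how different the declarations can look for the same logical column (the table names are made up, and the byte sizes are the typical ones, they can differ per system):

  -- Three ways to declare an id-like column.
  CREATE TABLE t_small  (val SMALLINT);       -- typically 2 bytes
  CREATE TABLE t_big    (val BIGINT);         -- typically 8 bytes
  CREATE TABLE t_string (val VARCHAR(20));    -- variable length, needs parsing and collation rules

A scan over the VARCHAR variant touches more bytes and does more work per value than a scan over the SMALLINT variant, which is exactly why mixing such types across systems makes a comparison unfair.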
The standard way to access relational database systems, for decades now, is the Structured Query Language, SQL, often pronounced sequel.
There are many different standards. The latest one is, I think, from last year, SQL:2023.
And the main idea or the key feature of SQL is that it's declarative.
So you don't state how to execute a query.
You just state what you want to have: what should your result be? For example, you say that you want the results from two tables combined via a predicate on those tables, so there's a join. You don't need to say which table should be processed first, or, if there are filters, whether the filters run before or after the join, or which table is scanned first. You just say: I want these tables filtered and joined, and maybe I want an aggregate. And then it's the task of the database to come up with a query plan that executes your query.
And this is something that you don't have to care about.
And this is a main feature of all the modern database systems
because that's basically the secret sauce of the performance. Writing performant code for large datasets that might be distributed over multiple nodes would be extremely hard to get right if there weren't an optimizer doing that for you. So this is where much of the performance of modern database systems usually comes from.
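To make the declarative idea a bit more concrete, a small example with made-up table and column names:

  -- We only describe the desired result, not how to compute it.
  SELECT e.last_name, d.name
  FROM   employee e
  JOIN   department d ON e.dept_id = d.d_id
  WHERE  e.birth_date < DATE '1990-01-01';

  -- Whether the filter runs before the join, which table is scanned first,
  -- and whether an index is used is entirely up to the query optimizer.

The same query can therefore be executed very differently depending on data sizes, available indexes, and so on.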
And we have four basic commands. This is what we usually call CRUD: create, read, update, delete, or in more SQL-ish terms insert, select, update, and delete. On top of that there is the DDL, the data definition language, and the DML, the data manipulation language. With the definition language we can define the schema: create a table with columns of particular data types, with default values, with certain constraints on them or not. And with the manipulation language we modify data, for example update a tuple or add a row; schema changes such as adding or removing a column again belong to the definition side. There are also other sub-languages that we are not going into in more detail right now.
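A purely illustrative sketch of the two sides, reusing the employee table sketched above (the concrete values are made up):

  -- DDL: change the schema, for example add a column.
  ALTER TABLE employee ADD COLUMN department VARCHAR(100);

  -- DML: the CRUD operations on the data itself.
  INSERT INTO employee (p_id, given_name, last_name, birth_date, address)
  VALUES (1, 'Stefanie', 'Meyer', DATE '1990-05-01', 'Potsdam');
  SELECT given_name, last_name FROM employee WHERE p_id = 1;
  UPDATE employee SET address = 'Berlin'  WHERE p_id = 1;
  DELETE FROM employee WHERE p_id = 1;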
The most common form, which you will see again and again, is this form of a select query. I probably don't have to go into more details, but this is the shape you usually always see: we select from a number of tables. There might be a join involved if there are multiple tables, or a union. We have predicates, so we can filter the query. We can group by, so we get different result sets for different groups. HAVING is a predicate on a group by, so maybe you want to include only certain groups that qualify for a certain predicate. And then you can also order by, just to have a nice order of the results in the end.
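As a concrete, purely illustrative example that uses all of these clauses (the tables and columns are made up):

  SELECT   d.name, COUNT(*) AS num_employees
  FROM     employee e
  JOIN     department d ON e.dept_id = d.d_id
  WHERE    e.birth_date >= DATE '1980-01-01'
  GROUP BY d.name
  HAVING   COUNT(*) > 10
  ORDER BY num_employees DESC;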
just to have a nice order of the results in the end. And so the classical architecture which is still true today, so yeah it's still
today, you can see this five layers here that we have defined in pretty much
every database at least in the relational databases in some form. So on
the very bottom there's the operating system that
gives you the memory management so that you don't have to... it provides you just
this is the main operating system. So it provides you with file
management, this does your memory translation, so this manages
memory for you, virtual memory and so on. So this is the usual use.
There's a lot to discuss, which we won't do right now, about where databases try to avoid the operating system, because sometimes the operating system can be in your way when you are a database system: you want to use the machine to the full extent, and the operating system might think it would be fairer to run something else. So there is sometimes a conflict between databases and operating systems, but mostly the database system sits on top of the operating system and uses what it provides.
One level above, there is something called a buffer manager. Even most modern in-memory databases have a buffer manager, because at some point you don't want to have all your data in memory, but only the data that is frequently accessed or accessed right now, and data that is never touched can remain on disk. So the buffer manager is responsible for moving data between memory and disk.
And the idea here is that the third layer, the memory structure layer, which says how records are stored, ideally does not need to know how data is persisted on disk. It says: for me, records are stored in a vector, and now, dear buffer manager, please give me access to this vector. And now it's the job of the buffer manager to say: okay, I might need to read this from disk, I might need to pin it in memory, or maybe it's already there, so I can immediately return it. The third layer does not need to care which disk it was on, where it was stored, or how it might be striped for performance. It just says: please give me now this kind of data of this size. And the buffer manager returns that for you and does all the handling in the background, so you can separate concerns here. The memory structure layer, in turn, says how data is actually stored: how physical records are stored and in which format, and it manages the indexes that you might use.
that you might use. It does locking so if there are updates, concurrent updates, you usually need some way of locking records.
And if you want to be persistent, which is a nice feature of a database in case of crashes, you also want to log data and be able to recover in case something happens. And something will always happen.
Above that layer, there's the logical access layer. This is how you work on your records. For example, this does the sorting and the transaction management, so it coordinates what happens if we have two concurrent users: one is writing an item while the other one is reading the exact same table, and how can we avoid conflicts, and so on. It manages cursors and also has a data dictionary, or catalog, which keeps information about what tables we have, what is visible and what is not, and so on and so forth.
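As a rough illustration of what the transaction management takes care of (again using the illustrative employee table; BEGIN is spelled START TRANSACTION in some systems):

  -- Two updates grouped into one atomic transaction.
  BEGIN;
  UPDATE employee SET address = 'Berlin'  WHERE p_id = 1;
  UPDATE employee SET address = 'Potsdam' WHERE p_id = 2;
  COMMIT;

  -- A concurrent reader sees either both updates or neither of them,
  -- depending on the isolation level; the database handles the locking for us.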
And the level on the very top is the data model level. Here we have, for example, the query compilation: here we reformulate our queries, we optimize our queries, we decide how SQL queries might actually be executed. We have the access path selection. Sometimes we
need to decide: do we sequentially scan our data, or do we have an index? If we have an index, do we actually want to use it? Is it advantageous, or do we not want to use it? Access control obviously has to be done in a real database as well, since some people might not be allowed to access all the data.
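Two small, purely illustrative snippets for those last two points; EXPLAIN is spelled this way in PostgreSQL, other systems differ, and the role names are made up:

  -- Access path selection: ask the optimizer how it would execute the query.
  EXPLAIN SELECT * FROM employee WHERE p_id = 1;
  -- The resulting plan shows whether a sequential scan or an index is used.

  -- Access control: not everybody may read everything.
  GRANT  SELECT ON employee TO hr_role;
  REVOKE SELECT ON employee FROM intern_role;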
So those are the levels you usually always find in every database, at least
in more realistic databases. If you look at research databases, you often don't see something like integrity checks, because they would need to be done but are often not that interesting. Sometimes people also don't do logging and recovery, but at least if you have a real database, you need something like that.
But of course, there are many topics that you just can't assign to one single level.
So, logging needs to access different levels.
There's also recovery.
There are many things that you can't just pinpoint.
Okay, this is in this level, this is in the other level, and so on.
So this can be hard, and I'm already going over time. I'm sorry for that. Yeah, I'm not going through this slide now. Next week, sorry, tomorrow, we continue talking about hard drives and about indexes. Hard drives are still a thing. It might surprise you, but most of the data is actually still stored on hard drives. And then we cover the remaining database systems topics and continue with CPUs.
Okay, thank you and see you tomorrow.