Hardware-Conscious Data Processing (ST 2024) - tele-TASK - DBMS Recap & CPU and Caching
Episode Date: April 17, 2024...
Transcript
Welcome back to Hardware Conscious Data Processing. Today we are going to continue discussing database management systems, and then we start with the actual content and talk about CPUs. We are a little bit behind schedule, but it shouldn't be a problem. We should have enough time to cover everything that we plan to cover in this lecture.
Okay, so briefly, what we discussed yesterday: first we talked about performance management, how to do proper benchmarks, and how to do fair comparisons. The second part was about database systems. We discussed how relations look in those systems and introduced a few terms: columns, rows, and tuples.
Then one minimal change to the slides: there have been frequent SQL standards, I think there are about ten SQL standards out there. The most recent one is from last year, it's called SQL:2023, and that might be interesting to you. SQL doesn't really change a lot, so what might be the thing that they still add? The big change with SQL:2023 was JSON. So now there is native syntax for how to query and how to add JSON data in a relational database system. So this is one example of how SQL still changes. There are also recent versions that added graph syntax, and such things are now being added.
Yeah, we talked about the different layers that we have in the database system, and then we stopped at this slide. What you see here is a hard drive, a hard disk drive, so an HDD. It might sound pretty old fashioned and outdated to you, but it's not. Most of the data you frequently access, apart from what is on your laptop, is probably residing on such disks.
So much data in the cloud is still stored on such disks and there are various reasons for that.
The obvious reasons are price efficiency and capacity, but there are more interesting reasons that we might cover later on in the lecture. So it's still very relevant how these work. Compared to the probably modern SSDs that you have in a laptop, HDDs are quite slow. The main reason for that is that there is actual physical movement in there. So if you want to access a certain sector, which is marked here a little bit darker, on a given track, you might have to move your head. The head needs to move to the right track. Then you have to wait for the disk to rotate to the sector that you want to read to actually fetch the data that you want to access.
And that can take some time, depending on how fast the disk spins. This might be a simple disk that spins with 5,400 rounds per minute; the example here has 7,200; and there are disks with 10K and even 15,000 rotations per minute. Depending on that, you can say how long it takes on average to spin the disk to the right position. If you have 7,200 rotations per minute, it means one rotation takes roughly 8 milliseconds. So, on average, half of that is what I can expect for the rotational delay. And then you have to transfer the data and so on and so forth.
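As a small back-of-the-envelope sketch (not from the slides, just illustrating the arithmetic for an assumed 7,200 RPM disk):

```cpp
#include <iostream>

int main() {
    // Average rotational delay for a disk spinning at 7200 RPM:
    // one full rotation takes 60 s / 7200 ≈ 8.33 ms, and on average
    // the target sector is half a rotation away from the head.
    double rpm = 7200.0;
    double full_rotation_ms = 60.0 * 1000.0 / rpm;            // ~8.33 ms
    double avg_rotational_delay_ms = full_rotation_ms / 2.0;  // ~4.17 ms
    std::cout << full_rotation_ms << " ms per rotation, "
              << avg_rotational_delay_ms << " ms average rotational delay\n";
}
```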
And the very big difference to SSDs, or the most significant difference to SSDs, apart from obvious things such as the bandwidth, is the difference between random and sequential I/O. What you can probably imagine is: if you have sequential I/O, so you read one sector after the other, and the disk is spinning, that's pretty much the best thing that can happen, right? You just let the disk spin, read one sector after another, so that's pretty fast, and that's the best scenario you can have here. On the other hand, you have random I/O. That means you may have to move the head to another track, you have to reach some very distant sector. So that's basically the worst-case performance you can achieve with HDDs. Especially in the old days, people, or rather systems, very much focused on doing as much sequential I/O as they could. This is no longer always the case with SSDs; also there it's usually better to have sequential I/O, but back then, and still today, the difference for HDDs is much more significant.
So how does a sequential file actually look? Now we have a table. In this example here, the table has an ID and a name. And there might also be an auxiliary ID that is not shown to the user, which is just used by the database: a record ID, for example, a record or tuple identifier. So how does the file look? The file is just a sequential bunch of bytes, right? If we now have such a file and we want to add a record, that usually means going to the end of the file and then appending the new tuple that you just inserted. If you want to seek a certain tuple identifier, so you want to fetch one tuple, that means you have to scan the file, right? You don't know where in the file this record ID might be stored. At least for this example here it looks like the table might be sorted by the last name, but if you're now looking for a TID, there's no way you know where this ID is unless you have another data structure on top. So you just have to scan the entire file and look for it. The same if you want to delete it or replace it.
What could we do if the file were sorted, or if the tuple identifiers were increasing? What could we do there?
Yeah.
So we could jump into the file, right? We could go to the middle, and then halve the range again and again. So we could do a binary search there.
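A minimal sketch of that idea, assuming fixed-size records stored in increasing TID order and already read into memory; the struct and field names are hypothetical:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical fixed-size record layout, sorted by tid within the file.
struct Record {
    uint64_t tid;       // tuple identifier, assumed to be increasing
    char     name[56];  // payload
};

// Binary search over the sorted records: logarithmic number of probes
// instead of a full scan.
std::optional<Record> seek(const std::vector<Record>& file, uint64_t tid) {
    size_t lo = 0, hi = file.size();
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (file[mid].tid == tid) return file[mid];
        if (file[mid].tid < tid)  lo = mid + 1;
        else                      hi = mid;
    }
    return std::nullopt;  // not found
}
```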
And if that's not the case, what I typically do, or what is often done, is that we have an index file. So we now have a data structure that is no longer just a sequential file; we organize the data differently, for example in an index file. And this is pretty much a tree: the file internally is stored as a tree, with a root, with internal nodes, and with leaves. And now, depending on what the criterion is that your tree is built on, you are able to seek much more efficiently, so in logarithmic time.
How we insert and delete in this case should be obvious, I guess. So what are typical trees we are using here? I don't think there's any system that uses a classical binary tree. Does anybody have an idea what the main problem with a binary tree is? Like, each internal node just says larger or smaller.
So...
Yeah? For insertion or deletion, do you have to update the...?
This is also...
Modifications can also be a problem,
but the main problem here is that we... It actually depends on how we implement it, but what you usually do is that you have different... So let me back up. The thing is, sorry, I forgot to mention that: if you talk about an HDD and you want to access, let's say, five bytes, what you always will end up doing is that you read an entire page. No matter how small the data is that you read, even if it's a single bit somewhere, you need to fetch an entire page. This is usually four kilobytes of data. And that means, if getting this page from disk is your main bottleneck, which can be really slow, so we are in the milliseconds range here, you want to have as much usable, valuable data as possible within each page that you get.
So that is the main problem with classical binary trees, and the goal of other trees, for example the B-tree, is that you have as much usable data in each page as possible. The B-tree, which is not a binary tree, was introduced by a researcher called, I think, Rudolf Bayer, so it may be a Bayer tree. I don't think he coined it that way, but that's what everybody assumes, if I'm not mistaken.
So in a B-tree, we have much larger nodes. Depending on how large the page size is and what your actual key is, we try to put as many keys in each node as we can, so that each page fetch covers many keys and the depth of your tree shrinks: nodes get larger, and even for large data sets you only need very few page fetches and you're already at the location that you're looking for.
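A rough back-of-the-envelope sketch of that effect, with assumed (hypothetical) key and pointer sizes:

```cpp
#include <iostream>

int main() {
    // Fan-out estimate for a B-tree node that fills one 4 KiB page,
    // assuming 8-byte keys and 8-byte child pointers (made-up sizes).
    const int page_bytes    = 4096;
    const int key_bytes     = 8;
    const int pointer_bytes = 8;
    int fanout = page_bytes / (key_bytes + pointer_bytes);  // 256 entries per node
    std::cout << "fan-out ~" << fanout << "\n";
    // With fan-out 256, three levels already address 256^3 ≈ 16.7 million
    // leaf pages, so even large tables need only a handful of page fetches
    // per lookup, compared to dozens of fetches with a binary tree.
}
```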
And there are different alternatives of this tree. There are, for example, the B+ tree and the B* tree, and the main difference is how large you allow the nodes to get, when you reorder and reorganize the tree, and where the actual data is stored. So for different cases, different things work out. For example, a B+ tree doesn't store the data in the inner nodes: if, say, you store tuples in this index, the B+ tree would not store more than the keys in the inner nodes, and the actual tuples are then in the leaves, while this might be different, for example, for the classical B-tree, where tuples can also sit in the inner nodes. And if you have all the tuples only in the leaves and the leaves are connected, that allows you to scan the entire table in a sorted fashion just by traversing all the leaves, which can be quite nice if you expect sequential traversal to be a somewhat frequent operation.
We have a B+ tree example here; I guess it's something you can look at tomorrow. It shouldn't be too surprising to you, but keep in mind: B+ trees, or B-trees in general, many forms of B-trees, are still basically the way to do indexing on block-based storage. Even with very fast SSDs, even the most modern, most recent database systems still often use B-trees. So even though it's a pretty old data structure, and there are many, many small tweaks to it, it's still very widely used, and it's still the state of the art if your data is on a block-based device.
Another example that is also heavily used is hashing. So we store hash buckets in files. In this case, for whatever we use hashing for, we have different buckets, into which we add the items depending on the hash value. Each bucket can be a page, for example, and then it might happen, because there is skew in the data or the data is distributed differently than we expected, that one page is full. So what you see here are so-called overflow pages: we have more items for a given bucket than we expected, so we have to account for that somehow, and then we have an overflow page.
Yes?
JAN-FELIX SCHWARTZMANN- Hopefully that's better. I assume hashing is well known: how hash tables are used, what different hash tables there are. You have to account for collisions. That should not be surprising, right?
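To make the overflow-page idea from the slide concrete, here is a minimal sketch; the page size, field names, and hash function are made up for illustration, not taken from any real system:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical page-sized hash bucket with a chained overflow page.
constexpr int kEntriesPerPage = 100;

struct BucketPage {
    uint64_t keys[kEntriesPerPage];
    uint64_t tids[kEntriesPerPage];
    int      count = 0;
    BucketPage* overflow = nullptr;  // allocated only when the bucket overflows
};

// Insert into the bucket that the hash value maps to; allocate an overflow
// page when the primary page (and all existing overflow pages) are full.
// Assumes buckets is non-empty.
void insert(std::vector<BucketPage>& buckets, uint64_t key, uint64_t tid) {
    BucketPage* page = &buckets[key % buckets.size()];  // hash(key) = key % #buckets
    while (page->count == kEntriesPerPage) {
        if (page->overflow == nullptr) page->overflow = new BucketPage();
        page = page->overflow;
    }
    page->keys[page->count] = key;
    page->tids[page->count] = tid;
    page->count++;
}
```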
Okay, so now we know how data is stored: the database system stores data in files. The next level that we talked about is the buffer manager, or buffer management, and that is what usually does the caching in the database system. As we said yesterday, the components above the buffer manager usually don't, and shouldn't, know how data is stored or how to cache data efficiently and effectively. They just say: okay, I want to have this page, I want to have this data. Dear buffer manager, please provide me with this page. And now this buffer manager checks if the data is already in main memory, in its buffer; the buffer manager controls this buffer. If it is, we just have to return the pointer to this page and we are done.
If not, we have to fetch it from disk. And in case our main memory buffer is full, we first have to free some space in the buffer so that we can pull the page from disk into our buffer. There are various so-called replacement strategies for how to do that and how to decide which page should now be removed from the buffer to make room for a new page. What to do there is actually often not super trivial. You can use very simple approaches; probably the best-known, or the most trivial, one is the FIFO approach, which means first in, first out. So the oldest page will be evicted. That usually makes sense, but might not always be the best strategy.
One of the standard strategies you can use is LRU, which means least recently used. So you not only store when a page was added to your buffer, like in FIFO; you keep a timestamp for each page of when it was last accessed, and whatever now is the oldest, the least recently accessed, page is the candidate that I will evict before I can put in new pages. But there is an issue with all these approaches that some more sophisticated replacement strategies try to avoid, and this is scanning.
Imagine you have a buffer that holds, let's say, 1,000 pages, and now there is a scan over a large table. So you scan the entire table just once. The buffer manager usually doesn't know what the components above do; they just request one page after another. Maybe you can give it a hint; sometimes it's possible in a system to give the hint that you know you're doing a scan. But if you don't have those hints, you will just load every page sequentially, put it in the buffer, and after you have processed 1,000 pages, you will start evicting the first page and so on. Since you are doing a scan, there is probably no second access to these pages. So in the end, you will just read all the pages, evict all the pages, and you have basically destroyed the entire buffer. For that reason, there are different approaches for replacement strategies.
For example, LRU-2. In this case, it tries to counter such scans by deciding on the second most recent access timestamp. You keep two timestamps per page, and then the second most recent access timestamp is used to decide what to evict, so that the last timestamp, which might come from the scan, is somewhat accounted for. There are many other strategies, but keep in mind: scanning can be kind of a problem if you have such a buffer manager. You have to be aware of those things, because a scan is the worst case. For a scan, a buffer manager, or buffering in general, doesn't make any sense; it doesn't give you any benefits.
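A minimal LRU sketch, just to illustrate the bookkeeping; page contents and actual disk I/O are omitted, and the class and method names are made up:

```cpp
#include <cstdint>
#include <list>
#include <unordered_map>

// Buffer manager skeleton with LRU replacement: a list ordered from most to
// least recently used, plus a hash map from page ID to list position.
class LruBuffer {
public:
    explicit LruBuffer(size_t capacity) : capacity_(capacity) {}

    // Returns true on a buffer hit; on a miss the page is "fetched" and,
    // if the buffer is full, the least recently used page is evicted first.
    bool get_page(uint64_t page_id) {
        auto it = map_.find(page_id);
        if (it != map_.end()) {                       // hit: move page to the front
            lru_.splice(lru_.begin(), lru_, it->second);
            return true;
        }
        if (lru_.size() == capacity_) {               // miss and full: evict LRU page
            map_.erase(lru_.back());
            lru_.pop_back();
        }
        lru_.push_front(page_id);                     // "fetch" the page from disk
        map_[page_id] = lru_.begin();
        return false;
    }

private:
    size_t capacity_;
    std::list<uint64_t> lru_;                                    // front = most recent
    std::unordered_map<uint64_t, std::list<uint64_t>::iterator> map_;
};
```

A single large scan through such a buffer touches every page exactly once, so each entry it pushes to the front is useless for future hits; that is exactly the behavior LRU-2 and friends try to dampen.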
Okay, so now, how do we access data that is stored on disk? This means we somehow need a mapping: for example, our index now tells us that the data we are looking for is on this and that page, so we need a mapping from the page ID and the location within the page to the actual location on disk. One way this is done is absolute addressing. That means the record ID is a page ID and an offset. So we know: okay, now we have to fetch page 17 and go to offset 100 to read the actual tuple that I'm looking for. Or there is the alternative of addressing just the page and then searching within it.
Now my record identifiers are smaller, because I'm only storing the page ID. But I have to scan this page, so usually 4 kilobytes, and search for the tuple that I'm looking for. Both make sense. The second alternative saves you space, which is good. And if your disk is really slow, and nowadays we have slow disks and very fast CPUs, scanning 4 kilobytes is often not really a problem, considering that fetching the page takes milliseconds. So this might be a viable idea. If you have very fast SSDs, it might make more sense to store the offsets. So it depends on the workload and many other factors, but both approaches are used in modern systems.
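Sketched as data structures, the two record-identifier layouts could look roughly like this; names and bit widths are made up for illustration:

```cpp
#include <cstdint>

// Absolute addressing: page ID plus byte offset inside the page.
struct RidWithOffset {
    uint32_t page_id;
    uint16_t offset;   // e.g. "page 17, offset 100"
};

// Page-only addressing: smaller identifier, but the 4 KiB page must be
// scanned to find the tuple after it has been fetched.
struct RidPageOnly {
    uint32_t page_id;
};
```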
A few more notes on the data model level. What you see here are three levels that you see in a database. In the middle is the conceptual schema. This is the level of how you represent the world when you create a database. So you might think: okay, I need a relation that stores my employees; there might be a relation that is called locations. This is basically your world view; this is how you store the data, the tables, the attributes, and so on. Then above, for different users, we might have different views. Depending on access control and so on, or on the views those users can access, the world might look a little bit different. It might be a condensed view, it might be aggregated, it might be that the salary column is not visible to certain users. They can still see who is an employee, but the salary is not shown, and so on. So basically different users have different views on the data you have in the database.
And at the very bottom, there is the internal schema. This is something that you don't see when you use a database system. You basically have to inspect the database system to get some knowledge about it, but usually the system tries to hide that completely from you. It doesn't necessarily tell you what index it's using, how data is stored, or where it is stored. This is something that you want to hide from the user, because the user has their tables and so on and can access them, but on which disk it is stored, and in which way, with a B-tree or a hash index, is usually hidden from the user. Something that you don't have to care about.
Okay, so now we basically know how our database looks conceptually, how the architecture looks, which layers we have. Now let's talk about how queries are processed. We have a user that sends us a query. As we said yesterday, this is a SQL query. In this case, we have four columns that we are selecting, or projecting: name, address, checking, and balance, from two tables, customer and account. And we are looking for the name Bond. And we have another predicate, which is the join: it says that the account number needs to be equal in both tables. So, first, the database system now needs to somehow find a plan that it can execute, right? This is just the SQL query; it doesn't really tell you a lot about how to execute it. Now it's the task of the database to come up with a plan that it can execute and that eventually returns your result.
Here we have, in pseudocode, a potential query plan that would process this query. We have two loops here. First we go over each customer; for every customer that we load, so this would be a sequential file, we scan the sequential file and get every customer. Now we check if the customer's name is Bond; if not, we continue. Then we have another loop, and in this one we loop over account, over each tuple in account. And now we check for both tuples: do they match? If they match, we return the result.
Does anybody know what this kind of join is called?
Hmm?
Inner?
This is an inner join, yeah. And the way that we join, is it a hash join?
I think it's a nested loop join.
Yeah, exactly. So this is a nested loop join. It's usually the most inefficient join, so you will barely see such a join in a real system, because systems try to avoid that. But in this case, this is a very simple example, this would be a nested loop join.
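A minimal sketch of that nested loop plan for the Bond query; the row layouts and names are hypothetical, only the columns follow the query:

```cpp
#include <string>
#include <vector>

// Hypothetical row layouts for the two tables in the example query.
struct Customer { std::string name, address; long account; };
struct Account  { long account; double checking, balance; };
struct Row      { std::string name, address; double checking, balance; };

// Naive nested loop join: for every customer named "Bond", scan the whole
// account table and emit the rows with a matching account number.
std::vector<Row> bond_query(const std::vector<Customer>& customers,
                            const std::vector<Account>& accounts) {
    std::vector<Row> result;
    for (const auto& c : customers) {
        if (c.name != "Bond") continue;        // selection on name
        for (const auto& a : accounts) {       // inner loop: full scan of account
            if (a.account == c.account)        // join predicate
                result.push_back({c.name, c.address, a.checking, a.balance});
        }
    }
    return result;
}
```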
And once we have this execution plan, there's usually another step that translates it to an actual physical execution plan. The physical plan then says, for example, use this operator, or use this kind of scan, because checking a predicate on an attribute can be done in different ways. This is another translation step, which we will talk about in a minute.
So what are the main steps? The first step is: we just get a SQL query, so we have to parse it. In this case, someone needs to build up a structure, for example a tree, that allows us to first of all check the syntax, and then also check the semantics, and check if everything that the user wants to access is actually available. Does the table account exist? Is the syntax correct? And so on. Then you usually end up with something like an AST that you later process.
And once you have this tree, so your SQL string is now parsed and is in a machine-readable format, or in a format that your database understands, you start with generic rewriting. For example, you might expand views if a view is selected, so you just replace the view with its actual definition. You might look for common sub-expressions that you can replace, and so on. And once you've done the simple stuff, or the preparation stuff, then you try to find an optimal plan. The plan that you just generate from parsing your SQL statement is rarely the most efficient one. And the main task now is to find maybe not the optimal plan, because that's very hard in many cases and there might be thousands or millions of different plans, but you want to have a pretty good plan. In many systems there are two steps. First there is a rule-based step, where we just apply simple rules that are usually always good; we'll have examples later. So we just go through the plan and see: okay, can we pull up or push down some predicates, and so on. And in the second step we need to estimate costs and then say, okay, which is maybe a good join order. For that, simple rules are no longer enough, so we have to apply some cost functions and see what the estimated cost of this join and of the other join would actually be, what might be a good order, and so on, and that often requires statistics. Because if you don't have statistics, you basically don't know how expensive a predicate might be. And the last step is executing the query plan.
Okay, so let's say you have now parsed your query. You have checked that both tables, S and R, are actually part of your system, so everything's fine; you can continue with the query. Now we have a logical plan that says we are going to join R and S in this example. Then we have a filter on column X of R that selects all the values that are smaller than 5, and then we project the two columns R.x and S.z. And there might be, as I've said, many, many different physical plans that are possible. They have to be equivalent, so they have to yield the same results. And the task of the database is to select the most efficient one: the fastest one, the most efficient one, or whatever you're optimizing for.
One alternative would be, for example, if you have an index on one of the join columns: then you would say, okay, now I scan R, so I get all the data of the table R, and while I'm fetching all this data, I can already execute my predicate, so I can filter early; pretty early in the query I reduce my data set by filtering. When I have R loaded, I can now use an index nested loop join, which is much better than a plain nested loop join, because now I just have index probes. So I use the index on column S.a here and see if there is a matching value, which tuples match, and then I return these two columns.
Now, there is something we maybe have to care about, we didn't put it on the slides, but depending on how much data is stored in your index, you might still have to fetch the tuple, right? Because we are projecting the column z of S. Assuming you scan the entire tuples of R, you have the entire R tuple in place. But the index might only return: okay, you have a match in S for those five tuple IDs. You also know the value of a, because you have an equality match. But you don't have the z column for those tuples, so that might require another random lookup on your disk just to fetch the single attribute for each matching tuple.
Another alternative that can also be very efficient, depending on how large the data is and whether there are indexes or not, might be a merge join. In this case, we have a sorted scan of both input tables. That means we read the tables and sort them. We again apply the predicate on X early, so that X is smaller than five. And once we have both input relations sorted, we can do an efficient merge join. Is anybody not familiar with the merge join? Okay, so yeah, it's a sequential operation, right? You just linearly run over the two now sorted relations. So that's usually very efficient.
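A minimal sort-merge join sketch on bare integer keys, including the handling of duplicate keys on both sides; this is just an illustration, not any particular system's implementation:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Sort both inputs, then advance two cursors in lockstep; equal keys are
// emitted once per (left, right) combination.
std::vector<int64_t> sort_merge_join(std::vector<int64_t> left,
                                     std::vector<int64_t> right) {
    std::sort(left.begin(), left.end());    // sorted scan of both inputs
    std::sort(right.begin(), right.end());
    std::vector<int64_t> matches;
    size_t i = 0, j = 0;
    while (i < left.size() && j < right.size()) {
        if      (left[i] < right[j]) ++i;
        else if (left[i] > right[j]) ++j;
        else {
            // Equal keys: find the end of the equal run on both sides.
            size_t i_end = i, j_end = j;
            while (i_end < left.size()  && left[i_end]  == left[i])  ++i_end;
            while (j_end < right.size() && right[j_end] == right[j]) ++j_end;
            for (size_t a = i; a < i_end; ++a)
                for (size_t b = j; b < j_end; ++b)
                    matches.push_back(left[a]);
            i = i_end;
            j = j_end;
        }
    }
    return matches;
}
```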
Okay, so here's an overview of how we map the logical operations to their physical counterparts. If you reference a relation, that is usually called a scan. It just fetches all the tuples; it scans the entire relation. Maybe it's an index file, then it sequentially gets all the data leaves of the index file; or it is a sequential file, then it just runs over the sequential file and gets the relation. Filters might be executed by a sequential filter, or by an index scan. That depends obviously on whether there is an index, and if there are multiple indexes, you have to decide which index to choose. But you need some information about whether there is an index to decide if you want a normal filter or if you can do it more efficiently via an index scan
or index access. A projection, if you allow duplicates, then it's trivial. You just take
the tuple and just keep all the columns that are projected. For the Cartesian product, that's
usually a nested loop join. So that's not very efficient, but Cartesian
products never are, so that's fine. If you have an inner join or a theta join, so we have some predicate, we can do a hash or sort-merge join, or an index nested loop join. This is the standard join that you usually see, and there are usually at least two different implementations in every database system, and the database system needs to decide whichever it assumes to be the most efficient alternative for your query. For aggregations, there are usually hash and sorted aggregations; it also depends on how the query or the statistics look. If, like in the example, we had a sort-merge join where we sort the data, and you happen to also have an aggregation afterwards which groups on the attribute that you are already sorted on, then you can do a sorted aggregation and you don't have to sort anything: you just sequentially run over your join result and you're done.
So that's very efficient. In other cases, a hash aggregation might be much more efficient because you don't have to sort upfront. If you have projections and you need to eliminate duplicates, that's usually done via aggregation, so you just group by all the columns that you want to project. Intersection is a special case of a join, so we have to find the matches. For difference, there is something called an anti-join, which produces the tuples that do not match; this can be used for the difference. And the union is a union. I hope that's kind of clear.
Okay, so now a few notes on cost-based optimization. What we do in this step, now that we have, for example, an already slightly pre-processed query plan, is to enumerate plans. We come up with different plans, and now we want to estimate the execution costs. For that we need statistics. For example, for a hash join there is usually a probe side and a build side, where you build your hash table. It helps you a lot to have an estimate, or to know, how large the tables are, because for a hash join you usually build the hash table on the small table, so that it can nicely be cached, and then you probe the large table against it. For that, you need table cardinalities, which is usually the most trivial thing, because in some way you have to know how large the tables are.
Column cardinalities and frequent values might help you a lot when you want to estimate a predicate. Let's say you have a table R, and now we have a predicate on column A and one on column B. The question now is: how do I execute these two filters? Do I first do the filter on A or on B? For that, you need to estimate the equality predicates. Let's say you have frequent values, and now you see that the predicate on column A selects the most frequent value in this column, so we would have a huge intermediate result. This might already help you to decide: okay, I'd rather start with the predicate on B, because that reduces my intermediate result quite significantly, and then apply A; or maybe it's the other way around. But to be able to estimate the equality predicates, you need to have some statistics.
You might have high-low keys, or min-max keys. These might already allow you to throw away the query entirely: if you have an equality selection on a value that is below your minimum value, you know there's no tuple that can ever match it. So that might allow you to just throw out the query and say, okay, there will be no result for that. There are also histograms in most databases, and index information to estimate how fast an index access could be. As usual, statistics are flawed, because you don't want to keep them updated all the time. If you keep your histogram updated with every insert, that slows down your inserts, and so on. So usually you have something like, in Postgres, certain analyze functions that will update and create the statistics. This is usually not done with every data modification, and thus the statistics are usually somewhat off, so you have to account for errors there. You can also do sampling, but sampling is often very expensive unless you piggyback it: if you scan your entire relation anyway, that's a good chance to collect some samples, which can also be used to estimate predicates and joins and so on.
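A toy illustration of why column cardinalities matter for filter ordering, assuming uniform value distributions; all numbers are made up:

```cpp
#include <iostream>

int main() {
    // Estimated cardinality after an equality predicate, under the uniform
    // assumption: selectivity = 1 / (number of distinct values).
    double table_rows = 1'000'000;
    double distinct_a = 10;       // column A has only 10 distinct values
    double distinct_b = 100'000;  // column B has 100,000 distinct values

    double rows_after_a = table_rows / distinct_a;  // ~100,000 rows remain
    double rows_after_b = table_rows / distinct_b;  // ~10 rows remain

    std::cout << "filter on A first: " << rows_after_a << " intermediate rows\n"
              << "filter on B first: " << rows_after_b << " intermediate rows\n";
    // The optimizer would prefer to evaluate the predicate on B first,
    // since it shrinks the intermediate result much more.
}
```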
Rule-based optimization: for this one, I should have changed the order; rule-based optimizations are usually the step before cost-based optimizations. Here we use simple heuristics. In general, they are not necessarily always the best idea, but pretty much always you want to minimize the intermediate results. That means if there is some operation that reduces the result set, do it as early as possible. A filter always helps you: a filter will never blow up your result, right? It can only reduce it. Maybe it doesn't reduce it, but it will never increase the result, and usually it makes your results smaller. So if you have a filter, push it down to execute it as early as possible. We also want to minimize the materialization that we need to do, so we can do projections early on, and we want to avoid access to secondary storage, because disks are slow.
So, typical examples: as I've already said, if there are selections, push them down, so do them before the joins. A join might be very selective, so its result might be small, but joins can also blow up the result, and depending on how good your statistics are, you might be wrong. Selections never blow up your result, so it's a good idea to just push them down as much as possible. The same with projections. You also need statistics for that, and those heuristics might be wrong. There will always be cases where a join is very selective, so the join has a very small result and will be quite fast, while a selection might be a pretty expensive wildcard LIKE match, which is actually more expensive to execute on a string column than a nice join on an integer column. That can be the case. Most often you will be pretty good with those heuristics, but there will always be cases where you're wrong. Maybe for LIKE cases you can somewhat estimate it, but you will pretty much never have the optimal plan.
For joins, we already talked a little bit about them. We have different join methods; I don't think there's any database system that does not have at least two of them. We have nested loop joins. The complexity is m times n, so for large data sets you want to avoid that at any cost; you never want to do a nested loop join there. For small joins, this is frequently done, because you avoid sorting the tables or having to build a hash table first. If you're just joining maybe 20 tuples and 30 tuples, maybe two pages, just do a nested loop join. That's totally fine; that's probably the fastest thing you can do. If you have larger data, avoid it at any cost.
Other methods for larger joins are, for example, the sort-merge join. We already discussed it: we have the cost of sorting both input relations and then sequentially joining them. This can be pretty efficient and can be parallelized, but it can really hurt if you need to do external sorting, so if main memory is not large enough and you need to spill data to disk and do some merging afterwards. But again, it can be really fast if everything is in main memory, or if a previous operator, maybe another join before, already produced a sorted result; in that case the sorting is basically for free, so sort-merge joins can be really fast. And usually a good fallback, or the good join for most cases, is the hash join. It works only for equality predicates, so if you have a join with a different predicate, you usually can't use the hash join, but in most cases you will end up with a hash join.
Yeah, so we listed the complexities here, which is often what you do if you talk about joins. But keep in mind that it's not only about the comparisons that we have to do while joining; it's also relevant how we actually get the data in the first place. So there might be a join alternative that is a little bit slower, or a little bit less efficient, but somehow reads less data from secondary storage; this might still be the better alternative. Just looking at the complexity is something you would do in a main-memory database, where all the data is always in memory and you don't have to care about secondary storage accesses. If you care about those accesses, then you probably also want to consider I/O costs.
If you're running in exactly that situation, so you want to join, you have a small buffer, and now you need to join two very large tables, then one typical hash join implementation is the so-called grace hash join. The idea of the grace hash join is basically that we partition our data. In this case, let's assume we have those two relations here and we join by month, so let's just say there are 12 partitions. First, we partition both input relations by month, so we have one partition for January, one for February, and so on; 12 partitions for each input relation. We can spill them to disk, so we don't need to have the entire relations in main memory, which we would need with a standard hash join, where at least the hash table would be entirely in memory. We first partition both inputs, and then we only load the matching partitions: we know there will be no match between January tuples and tuples that are in the February bucket; with an equality predicate, this won't match. So if you have an equi-join, we know we only have to read the January partition from R and the January partition from S, and now we need to join these two partitions, for example with a hash join, and this allows us to keep the memory usage somewhat limited. In case we have a little bit more memory, we can do something called the hybrid hash join, but that's something we're not going to talk about in more detail now.
Yeah, in case we need many, many partitions, what you also sometimes do is a multi-step partitioning; this is shown here at the bottom. For example, let's say the input is really massive, so you're not creating 12 partitions; to be able to join the partitions, you want to have many thousands of partitions, because it's a huge data set. In this case, you probably don't want to start by partitioning into many thousands of partitions at once, because you would just be sending single tuples to new files; each time you would append to another file, so there would be no positive effects from caching and so on. What is shown here is a multi-step partitioning: you first partition into, for example, 100 partitions, something you can keep in your buffer, and then for each of these partitions you continue and partition it further into more partitions.
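A minimal sketch of the partitioning phase described above, using the join-by-month example; spilling partitions to disk is omitted and the struct is hypothetical:

```cpp
#include <cstdint>
#include <vector>

// Tuple with the join key (month) and some payload.
struct Tuple { int month; int64_t payload; };

// Split one input into num_partitions partitions by hashing on the join key.
// Both relations are partitioned with the same function, so only matching
// partition pairs need to be loaded and joined later.
std::vector<std::vector<Tuple>> partition(const std::vector<Tuple>& input,
                                          int num_partitions) {
    std::vector<std::vector<Tuple>> parts(num_partitions);
    for (const auto& t : input)
        parts[t.month % num_partitions].push_back(t);  // hash on the join key
    return parts;
}

// Usage idea: partition R and S into 12 partitions each, then for each i join
// parts_r[i] with parts_s[i] (e.g. with an in-memory hash join); January
// tuples can only ever match January tuples.
```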
Let's have a short five-minute break and then we will continue with database topics. Database stuff.
This component where we store all the statistics that are usually needed and required to optimize queries is sometimes called the data dictionary. Sometimes it's part of a catalog; so, data dictionary or catalog. There you store general statistics. Statistics are very useful, but they're also expensive: we need to store them, so they count towards memory usage and disk usage. They have to be kept at least a little bit current to avoid too many misestimations, and that's actually quite hard; there's a lot of research on that. Having good statistics is not easy. But to do real transformations, cost-based optimizations, there's no way to avoid having some statistics. The obvious one here, for example: you see a table that tells us which columns we have in our system and what the size, or the average size, is. Some of that we need even just to parse the query: we need to know which columns we have and which data types they have, to check whether the query that a user sent to our system is actually correct. Others, like which indices we have, are needed for optimizing queries, for cardinality estimations, and so on.
Okay, so then, at the very top level of our five-layer architecture, another part was access control. It's kind of boring, but it's part of every real database system. With this component we can say: okay, what can each user do? In research systems, you usually just ignore that, because it's not really that relevant, but for real systems, there's no way around it. First of all, you can say which users we have: we can create users, we can remove users, we can kill sessions in case you see that something is going on with some user session. And then you can grant or revoke rights. In this example here, we grant all privileges on the account relation to the user Lawrence, and also with grant option, which means that Lawrence has read-write access on the account relation and can also further grant these rights to others.
Still, even with those access controls, that's no complete protection against malicious actors. There might be different vulnerabilities in the access rights, so problems might stem from there. There might also be access to data without going through the database management system; that's often the problem. You have to store your backups, you have to store your logs somewhere, and if you have the entire log of the database system, you don't even need access to the system, because you can just take the log and replay everything that was done, every insert and every update, and now you basically have the entire system at hand, right? So the database needs to make sure that nobody can access this auxiliary data, this backup data or log data, or you encrypt it in a way that other people can access it but can't make sense of it. So you also have to take care that not only your database system itself is secured, but also all the data that you write to places that other people might access. And sometimes it's even possible to derive protected data just by asking questions: you might not have access to the account relation, but by asking good questions you might still get an idea of what is actually in this account relation.
Transactions are usually the logical unit of work. This is how you as a user pose your queries, your updates, your inserts, your selects; they are usually run in the context of a transaction. Here we have an example of a transaction where we update an account. So we have, for example, a bank account here. We take one million from the checking account and put it into the savings account for account 007, so I guess for James Bond or so, and then we insert into the journal that there has been some transfer. The important thing here for transactions is that this is a unit of work, and as we will see later, it either runs through and succeeds completely, or we have to ensure that the database can roll back all the changes that it might have made if it does not run successfully. So if we have already subtracted the one million from the checking account and something goes wrong, we have to make sure that the database somehow recognizes that and is able to put this one million back, because we should never end up in a state where parts of the transaction have been executed and other parts have not, and later other users and other transactions see the effects only partially. This is something the database needs to make sure never happens.
And how we do that concurrently is by synchronization and locking. For example, we have two schedules here of two users concurrently running transactions, adding, removing, and so on, on their accounts, or on this table that we have seen previously with the transaction. And now the question is: can a database system determine whether, or when, two of those schedules are conflict-free? And the trick here is: they are conflict-free if they are so-called serializable. That means they are equivalent to running them sequentially, that is, in a serial schedule; then they are conflict-free, which is nice. If it's not conflict-free, we need to make sure that we can resolve those conflicts. This might be locking: we just lock something, and one transaction can't run at all. Or we might do it after the execution: we just let both run, and before we let those two transactions commit, we check if they conflicted. That might be expensive, but this is usually the way that most database systems do it today. This is called optimistic: we just assume a transaction does not conflict, so we run it until the end, and afterwards we have data structures that are somewhat efficient for checking whether there really was no conflict. If there was a conflict, then we do kind of a lot of work, but we don't want to lock for every transaction. We are optimistic and assume, okay, nothing will happen; first execute them and then check if there might have been a conflict. Often that's the most efficient way to do it.
There are different synchronization protocols for that. There are issues with most protocols that you see out there in database systems, or they require certain well-behavedness of transactions. So if you take your Oracle database, or Postgres, with the default protocol, you will be able to run into issues, if you want to run into issues. That's something you should keep in your head. No matter what method they use, they usually do it for efficiency reasons. For most databases, you can say, okay, I want to have the maximum guarantees, but this is usually much slower than what you get out of the box, which is for most cases just good enough. But remember, you can run into problems.
Then there's the transaction manager. I already talked about transactions and atomicity. The A in ACID is atomicity. That means a transaction either runs entirely or does not run; or, regarding its visibility, either it commits entirely or it does not commit, right? There's no half commit or something like that. So this is the A. Then we have C, that's consistency: when the database changes from one state to another, for example by a transaction, the state should always be consistent, so we need to remain in a consistent state. And the I is for isolation, and this is what the transaction manager does. Basically, to you as a user, the transaction should look as if you are running in isolation. You might run concurrently with many, many other users, but from your perspective, it looks like your transaction is isolated; there's nothing else going on.
And this is what a transaction manager does. It might lock, it might synchronize. It is responsible for detecting deadlocks and resolving them if there are deadlocks. And it's also responsible for logging and recovery in case something happens. Let's say you do everything right with commits, but your system crashes within a transaction: you have already written half of the transaction's modifications, and now you have a power outage. Logging and recovery is responsible for logging all the changes that have been made, so that you can then recover into a state that is consistent. What that means is that you will have to store data redundantly: there are the actual tuples in the files, and then you have this log that you write to, where all the changes are recorded. The log usually looks vastly different from the data files, because it serves a different purpose. The log is something that blocks you: when you have an insert, for example, you have to write it into this log first, and only as soon as the log record is written can you say, okay, now I can allow the user to commit; not before, because it's not yet persisted and you haven't logged it. So for the log, you want to have something that you can just very quickly append to. And since you hopefully might never read it, it doesn't have to be read-optimized.
Your data file, on the other hand, the one you query on, needs to be at least somewhat read-optimized if you want to run selections on it. For the log, that's very different. And you can see that you basically build up the log in memory, and then what most systems do: you would not flush every single write. If you have a file, you need to flush your writes to the disk, but you would not do the flush for every single write. Often you do a little bit of batching for a short amount of time: you have, for example, five transactions waiting until they are allowed to commit, and then after five, or after you've waited for one millisecond, you flush; every transaction that has been waiting for that flush can then be told, okay, now you're allowed to commit.
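A minimal sketch of that write-ahead-log batching idea (group commit); all class and method names are hypothetical and the actual file I/O and fsync are only indicated in comments:

```cpp
#include <chrono>
#include <string>
#include <vector>

// Log records are appended to an in-memory batch and flushed together, either
// when enough records have accumulated or when a short timeout expires.
class LogWriter {
public:
    // Append a change record; the transaction may only commit after the
    // record has been flushed (written and synced) to stable storage.
    void append(const std::string& record) {
        batch_.push_back(record);
        auto now = std::chrono::steady_clock::now();
        bool timeout = now - last_flush_ >= std::chrono::milliseconds(1);
        if (batch_.size() >= 5 || timeout)   // e.g. 5 waiting transactions or 1 ms
            flush();
    }

private:
    void flush() {
        // Here the batch would be appended to the log file and fsync'ed;
        // afterwards all transactions waiting on this flush may commit.
        batch_.clear();
        last_flush_ = std::chrono::steady_clock::now();
    }

    std::vector<std::string> batch_;
    std::chrono::steady_clock::time_point last_flush_ =
        std::chrono::steady_clock::now();
};
```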
And this is the last slide here. This is an overview of all the parts that we have in modern database systems. The very top is not part of the database; this is how users interact with the system. And we usually have different users here. On the very left, you see the inexperienced user. These are sometimes users that don't even know that there's a database. This might be a website user: the user just uses the website, with no idea that there is a database in the background; there's an application, and this application uses a database, but the user does not see it. Then we have advanced users. This might be analysts, or somebody who really, for example, poses SQL queries to the database. So they are aware of the database, they know what data there is in the database, and they might read or modify data. We have software engineers; in this case, for example, they might write UDFs, so user-defined functions. They might write code that is running within the database system, so usually some kind of expert for the system. And then we have database administrators; at least for very large enterprise systems you have administrators, and their responsibility is to ensure that everything is running as expected, as efficiently as possible. So they are responsible for creating indices, for vacuuming, for checking that all the backups are running as expected, and so on.
In the middle, this is basically the main database process. Here we have schema management, we have the database manager that handles requests, the DML compilers, so the compilers that take the queries and compile them to something the database can actually execute. We have recovery, transactions, and so on, so the components that we just talked about. And at the very bottom we have files. This is our disk. The disk needs to store our indices, the log, very important, maybe snapshots, the dictionary, and so on. So that's the complete stack you see here for the database management system, which pretty much every system kind of has.
Okay, so what did we talk about today? Yesterday we started with database management systems, where we talked about the relational model, talked about SQL, the latest version of SQL, very briefly, and we talked about the five-layer architecture. Today we continued with the five-layer architecture and the indexes. We talked about B-trees, so one of the most efficient indexes if you are on persistent block-based storage. And we talked about how to process queries and, for example, how to optimize a query plan into something that runs efficiently.
And, yeah, next week, I think it doesn't make sense to start.
No, wait, it's 12.30, right?
Yeah.
Yeah, no, we will start with CPU.
Not next week.
Sorry, today.
This will be the next topic.
I'm gonna start in a minute.
Are there any questions so far for the database part? SQL, relational, that should be kind of well known, but are there any remaining questions there? No? Okay. So, let's at least start with CPU and caching.
First of all, all the code examples — there will be a couple of code examples — you can find them online. I will put the slides online later, so you don't have to type anything; you will have the link in the slides. But the slides are not yet online; I will do that later on. So what are we discussing now, today and next week, Professor Rabl?
We have three main parts in this lecture. We have the intro, which is just finished, yay. We have CPUs, and then we are going to talk about peripherals. Actually, I don't know how to pronounce that, but yeah. Now we are just going to talk about CPUs for the next couple of days, and we will talk about the things that we can see on the right. So we're talking about DRAM and the CPU; we'll learn what cores are, what caches we have, and how this system actually looks, what a CPU looks like. Today we start with CPU architecture and caching, which are fundamentally important things if you want to have good performance. You should be aware of what CPUs look like, how they execute code, what caching is. In the next weeks we are also going to talk about how instructions are executed, what vectorized execution is, what execution models there are, and more. So we're basically almost on time, a little bit behind schedule, but it should be fine.
Today, let's see how far we get, but first we will talk about computer architecture: we will talk about buses and the memory hierarchy. And then the second part is about memory access: what is virtual memory, what caches do we have in modern CPUs, how is the data laid out, and what is alignment. The lecture is based on different books, which you can see here on the right and which we can recommend, and also on other lectures, for example by Guido Moerkotte, Sebastian Breß, and Jana Giceva. So these are the sources for the following slides.
Okay, computer architecture. This is not every system you nowadays see, but this is the kind of system that most modern servers look like, especially if you rent them from Amazon EC2 and so on. So you have a CPU here in the middle, the central processing unit; everybody should know that. And modern CPUs, as you probably also know, have multiple cores. We no longer have single-core CPUs; modern CPUs have dozens, many dozens, of cores. Right next to it we have the DRAM, and then there's a PCIe interface which basically connects other components to our system. This might be the GPU, the network, disks, FPGAs, and so on, which we cover later in this lecture. If we are talking about larger systems, larger servers, we might not only have one CPU, we might have multiple CPUs. In this case, we are talking about a NUMA architecture, or a NUMA system; NUMA means non-uniform memory access. We'll cover that later, but basically it means that if your thread is running on the CPU shown here, access to memory is not uniform: if you're accessing memory that is attached to this CPU, or located near this CPU, this might be faster than accessing data that is located next to the other CPU. So this is something you need to be aware of. If you have large servers with multiple CPUs, and there are servers with four and eight CPUs, you might need multiple hops to access data even though it's DRAM. Usually you consider DRAM to always be fast, and it is, but remote DRAM is slower than your local DRAM; something to be aware of.
And one of the trends that continues to be a problem, and that we should address, and that's why we have this lecture, is that there's a growing performance gap between CPUs and the speed with which we can access DRAM. And the problem is, yeah, it's increasing. It has led to the situation that, for example, for in-memory databases and other systems, we often now have vast amounts of DRAM. I think even at Amazon, you can simply rent a server with at least hundreds of gigabytes, sometimes terabytes, of DRAM. So you have vast amounts of DRAM, you have many CPUs with many cores, but the performance is actually bound by accessing DRAM. So this might be your main bottleneck, unless your data is really, really huge and you have to access disk; but if you have hundreds of gigabytes, many data sets entirely fit in main memory. And in this case, the problem that you often see is that your main bottleneck for performance is DRAM. Even though DRAM is supposed to be really fast, this gap is growing. So we have many cores that concurrently access data, and they wait more and more; they could do more and more, but they have to wait for DRAM. And yeah, this is something that we will discuss today: how this can be addressed a little bit by modern architectures, and also later how this can be addressed by you writing code that is efficient. Coming back, for example, to the example we had yesterday in the micro-benchmark with accessing the matrix in different ways.
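To recall the idea of that micro-benchmark, here is a minimal sketch of the two traversal orders; the matrix size and types are arbitrary, this is just an illustration of the access pattern, not the benchmark code from the lecture:

```cpp
#include <cstdint>
#include <vector>

// Same sum, two traversal orders over a row-major matrix stored in one vector.
// Row-by-row access touches consecutive addresses (cache friendly, the
// prefetcher can help); column-by-column access is strided, so once the matrix
// no longer fits in the caches, almost every element costs a new cache line.
int64_t sum_row_major(const std::vector<int32_t>& m, size_t rows, size_t cols) {
    int64_t sum = 0;
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c)
            sum += m[r * cols + c];   // consecutive addresses
    return sum;
}

int64_t sum_col_major(const std::vector<int32_t>& m, size_t rows, size_t cols) {
    int64_t sum = 0;
    for (size_t c = 0; c < cols; ++c)
        for (size_t r = 0; r < rows; ++r)
            sum += m[r * cols + c];   // jumps of cols * 4 bytes between accesses
    return sum;
}
```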
So, not all memory is the same. We have different kinds of memory, and the one everybody knows is dynamic RAM, so DRAM; here each bit is just kept in a capacitor. And the interesting thing is that this needs refreshing. So DRAM, just to keep its state, to keep whatever is stored in it, needs to be refreshed. That means it uses energy, and it also has an impact on performance. This is usually what you have in a DIMM; this is your main memory in a system. Then we have static RAM, and this is another design. It's much more complex and much more expensive, so it's larger and more expensive per bit, and for that reason SRAM is used in CPU caches. This is the small memory that is part of each CPU, while the dynamic RAM is usually used in a DIMM: something that you put on a DIMM and that is connected via a bus. So that's not something that sits directly on the CPU; this is something you have outside of the CPU, in much larger capacities.
And the reason for that is that DRAM is comparatively slow, at least if you compare it to SRAM. As I've said, we need to refresh it periodically; that means there is certain work to do just to keep the DRAM in its state, and you have to account for that. This refreshing takes hundreds of cycles, so accessing DRAM can basically be slow. But you don't have to refresh it right before you can read the data; this happens all the time, and depending on how you access data, you might be able to hide that latency very well. For example, let's say you want to access a single byte: what the CPU does for you is that it reads an entire cache line, so it reads more bytes than the actual request. So this is one thing that helps a little bit. And if you have, for example, a sequential workload, so you're running sequentially over data, one byte after another, and the CPU recognizes that, it might automatically pull multiple cache lines for you at once, so you can totally hide this latency that you have when accessing DRAM, if you're programming it in the correct way, which we'll talk about later. But keep in mind: accessing DRAM is actually comparatively expensive and comparatively slow. Yeah, so, right, we'll talk about that later.
SRAM, on the other hand, can be really fast, right? It's almost instantaneous, so we don't have to wait hundreds of cycles, but it's too expensive. You wouldn't use SRAM to put hundreds of gigabytes of it on a DIMM; it's just not possible. This is something that is so fast that you want to use it, but only in very small sizes, and for that reason you put it on the chip. This is the local on-chip memory that you have, in much smaller sizes than your actual DRAM.
And the way we combine these two different kinds of memory is that we have a hierarchy, which you have probably heard about before. We want to have very fast memory near the CPU, so once we load data, we can access it very fast. And then we have different levels of that, growing in size and getting a little bit slower every time, until we are at the DRAM level, and then the next level after that will be slower still.
How that looks in a modern system is shown here. The size of the pyramid, so basically the size of each level, shows how large the capacity is, and on the right side you can see the latency. For registers, it's basically instantaneous: as fast as your CPU runs, the data is immediately there. For caches, as we have said about SRAM, the data is also there almost immediately, but the caches do need to do some work: they need to check if the data is there or whether they have to pull it from the next level. So there's some more effort that caches need to make, and we take a little bit of a latency hit there when accessing caches, depending on which cache. As you've seen on the slides before, modern Intel CPUs have three levels of caches; IBM, I think, has four; Apple has, I think, two right now. So it depends on the CPU that you actually have, but you always have multiple levels, and also different latencies with that. For memory, we often have something like 50 to 100 nanoseconds, so definitely less than a microsecond. Hard disk, we had that before, is in the millisecond range. And then we have tape, much slower. Not shown here, but in between main memory and hard disk would now be your modern SSD, and that is roughly 1,000 times faster than a hard disk. So you can sometimes get into the single-digit microsecond range, mostly between 50 and 100 microseconds of latency when accessing modern SSDs, depending on the SSD. So this would sit between main memory and hard disk.
Okay, and I think we'd better stop now; Professor Rabl is going to continue with virtual memory next week. Are there any questions so far about general architecture and memory hierarchies in modern systems? No? Okay, if there are no questions, then thank you, and we'll see each other again. You will see Professor Rabl next week. Okay, thank you.