Hardware-Conscious Data Processing (ST 2024) - tele-TASK - DBMS Recap & CPU and Caching
Episode Date: April 17, 2024...
Transcript
Welcome back to Hardware Conscious Data Processing. Today we are going to continue discussing database management systems, and then we start with the actual content and talk about CPUs. We are a little bit behind schedule, but it shouldn't be a problem. We should have enough time to cover everything that we plan to cover in this lecture.
Okay, so briefly, what we discussed yesterday: first we talked about performance management, how to do proper benchmarks, and how to do fair comparisons. The second part was about database systems. We discussed how relations look in those systems and introduced a few terms: columns, rows, and tuples.
Then one minimal change to the slides: there have been frequent SQL standards, I think there are about ten SQL standards out there. The most recent one is from last year, it's called SQL:2023, and that might be interesting to you. SQL doesn't really change a lot, so what might be the thing that they still add? The big change with SQL:2023 was JSON. So now there is native syntax for how to query and how to add JSON data in a relational database system. So this is one example of how SQL still changes. There are also recent versions that added graph syntax, and such things are now being added.
Yeah, we talked about the different layers that we have in the database system, and then we stopped at this slide. What you see here is a hard drive, a hard disk drive, so an HDD. It might sound pretty old fashioned and outdated to you, but it's not. Most of the data you frequently access, apart from what is on your laptop, is probably residing on such disks.
So much data in the cloud is still stored on such disks and there are various reasons for that.
The obvious reasons are price efficiency and capacity, but there are more interesting reasons that we might cover later on in the lecture. So it's still very relevant how these work. Compared to the probably modern SSDs that you have in a laptop, HDDs are quite slow. The main reason for that is that there is actual physical movement in there. So if you want to access a certain sector, which is marked here a little bit darker, on a given track, you might have to move your head. The head needs to move to the right track. Then you have to wait for the disk to rotate to the sector that you want to read to actually fetch the data that you want to access.
And that can take some time, depending on how fast the disk spins. This might be a simple disk that spins with 5,400 rounds per minute; the example here has 7,200; and there are disks with 10K and even 15,000 rotations per minute. Depending on that, you can say how long it takes on average to spin the disk to the right position. If you have 7,200 rotations per minute, it means one rotation takes roughly 8 milliseconds. So, on average, half of that is what I can expect for the rotational delay. And then you have to transfer the data and so on and so forth.
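As a small back-of-the-envelope sketch (not from the slides, just illustrating the arithmetic for an assumed 7,200 RPM disk):

```cpp
#include <iostream>

int main() {
    // Average rotational delay for a disk spinning at 7200 RPM:
    // one full rotation takes 60 s / 7200 ≈ 8.33 ms, and on average
    // the target sector is half a rotation away from the head.
    double rpm = 7200.0;
    double full_rotation_ms = 60.0 * 1000.0 / rpm;            // ~8.33 ms
    double avg_rotational_delay_ms = full_rotation_ms / 2.0;  // ~4.17 ms
    std::cout << full_rotation_ms << " ms per rotation, "
              << avg_rotational_delay_ms << " ms average rotational delay\n";
}
```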
And the very big difference to SSDs, or the most significant difference to SSDs, apart from obvious things such as the bandwidth, is the difference between random and sequential I/O. What you can probably imagine is: if you have sequential I/O, so you read one sector after the other, and the disk is spinning, that's pretty much the best thing that can happen, right? You just let the disk spin, read one sector after another, so that's pretty fast, and that's the best scenario you can have here. On the other hand, you have random I/O. That means you may have to move the head to another track, you have to reach some very distant sector. So that's basically the worst-case performance you can achieve with HDDs. Especially in the old days, people, or rather systems, very much focused on doing as much sequential I/O as they could. This is no longer always the case with SSDs; also there it's usually better to have sequential I/O, but back then, and still today, the difference for HDDs is much more significant.
So how does a sequential file actually look? Now we have a table. In this example here, the table has an ID and a name. And there might also be an auxiliary ID that is not shown to the user, which is just used by the database: a record ID, for example, a record or tuple identifier. So how does the file look? The file is just a sequential bunch of bytes, right? If we now have such a file and we want to add a record, that usually means going to the end of the file and then appending the new tuple that you just inserted. If you want to seek a certain tuple identifier, so you want to fetch one tuple, that means you have to scan the file, right? You don't know where in the file this record ID might be stored. At least for this example here it looks like the table might be sorted by the last name, but if you're now looking for a TID, there's no way you know where this ID is unless you have another data structure on top. So you just have to scan the entire file and look for it. The same if you want to delete it or replace it.
What could we do if the file were sorted, or if the tuple identifiers were increasing? What could we do there?
Yeah.
So we could jump into the file, right? We could go to the middle, and then halve the range again and again. So we could do a binary search there.
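A minimal sketch of that idea, assuming fixed-size records stored in increasing TID order and already read into memory; the struct and field names are hypothetical:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical fixed-size record layout, sorted by tid within the file.
struct Record {
    uint64_t tid;       // tuple identifier, assumed to be increasing
    char     name[56];  // payload
};

// Binary search over the sorted records: logarithmic number of probes
// instead of a full scan.
std::optional<Record> seek(const std::vector<Record>& file, uint64_t tid) {
    size_t lo = 0, hi = file.size();
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (file[mid].tid == tid) return file[mid];
        if (file[mid].tid < tid)  lo = mid + 1;
        else                      hi = mid;
    }
    return std::nullopt;  // not found
}
```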
And if that's not the case, what I typically do, or what is often done, is that we have an index file. So we now have a data structure that is no longer just a sequential file; we organize the data differently, for example in an index file. And this is pretty much a tree: the file internally is stored as a tree, with a root, with internal nodes, and with leaves. And now, depending on what the criterion is that your tree is built on, you are able to seek much more efficiently, so in logarithmic time.
How we insert and delete in this case should be obvious, I guess. So what are typical trees we are using here? I don't think there's any system that uses a classical binary tree. Does anybody have an idea what the main problem with a binary tree is? Like, each internal node just says larger or smaller.
So...
Yeah? For insertion or deletion, do you have to update the...?
This is also...
Modifications can also be a problem,
but the main problem here is that we... It actually depends on how we implement it, but what you usually do is that you have different... So let me back up. The thing is, sorry, I forgot to mention that: if you talk about an HDD and you want to access, let's say, five bytes, what you always will end up doing is that you read an entire page. No matter how small the data is that you read, even if it's a single bit somewhere, you need to fetch an entire page. This is usually four kilobytes of data. And that means, if getting this page from disk is your main bottleneck, which can be really slow, so we are in the milliseconds range here, you want to have as much usable, valuable data as possible within each page that you get.
So that is the main problem with classical binary trees, and the goal of other trees, for example the B-tree, is that you have as much usable data in each page as possible. The B-tree, which is not a binary tree, was introduced by a researcher called, I think, Rudolf Bayer, so it may be a Bayer tree. I don't think he coined it that way, but that's what everybody assumes, if I'm not mistaken.
So in a B-tree, we have much larger nodes. Depending on how large the page size is and what your actual key is, we try to put as many keys in each node as we can, so that each page fetch covers many keys and the depth of your tree shrinks: nodes get larger, and even for large data sets you only need very few page fetches and you're already at the location that you're looking for.
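A rough back-of-the-envelope sketch of that effect, with assumed (hypothetical) key and pointer sizes:

```cpp
#include <iostream>

int main() {
    // Fan-out estimate for a B-tree node that fills one 4 KiB page,
    // assuming 8-byte keys and 8-byte child pointers (made-up sizes).
    const int page_bytes    = 4096;
    const int key_bytes     = 8;
    const int pointer_bytes = 8;
    int fanout = page_bytes / (key_bytes + pointer_bytes);  // 256 entries per node
    std::cout << "fan-out ~" << fanout << "\n";
    // With fan-out 256, three levels already address 256^3 ≈ 16.7 million
    // leaf pages, so even large tables need only a handful of page fetches
    // per lookup, compared to dozens of fetches with a binary tree.
}
```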
And there are different alternatives of this tree. There are, for example, the B+ tree and the B* tree, and the main difference is how large you allow the nodes to get, when you reorder and reorganize the tree, and where the actual data is stored. So for different cases, different things work out. For example, a B+ tree doesn't store the data in the inner nodes: if, say, you store tuples in this index, the B+ tree would not store more than the keys in the inner nodes, and the actual tuples are then in the leaves, while this might be different, for example, for the classical B-tree, where tuples can also sit in the inner nodes. And if you have all the tuples only in the leaves and the leaves are connected, that allows you to scan the entire table in a sorted fashion just by traversing all the leaves, which can be quite nice if you expect sequential traversal to be a somewhat frequent operation.
We have a B+ tree example here; I guess it's something you can look at tomorrow. It shouldn't be too surprising to you, but keep in mind: B+ trees, or B-trees in general, many forms of B-trees, are still basically the way to do indexing on block-based storage. Even with very fast SSDs, even the most modern, most recent database systems still often use B-trees. So even though it's a pretty old data structure, and there are many, many small tweaks to it, it's still very widely used, and it's still the state of the art if your data is on a block-based device.
Another example that is also heavily used is hashing. So we store hash buckets in files. In this case, for whatever we use hashing for, we have different buckets, into which we add the items depending on the hash value. Each bucket can be a page, for example, and then it might happen, because there is skew in the data or the data is distributed differently than we expected, that one page is full. So what you see here are so-called overflow pages: we have more items for a given bucket than we expected, so we have to account for that somehow, and then we have an overflow page.
Yes?
JAN-FELIX SCHWARTZMANN- Hopefully that's better. I assume hashing is well known: how hash tables are used, what different hash tables there are. You have to account for collisions. That should not be surprising, right?
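To make the overflow-page idea from the slide concrete, here is a minimal sketch; the page size, field names, and hash function are made up for illustration, not taken from any real system:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical page-sized hash bucket with a chained overflow page.
constexpr int kEntriesPerPage = 100;

struct BucketPage {
    uint64_t keys[kEntriesPerPage];
    uint64_t tids[kEntriesPerPage];
    int      count = 0;
    BucketPage* overflow = nullptr;  // allocated only when the bucket overflows
};

// Insert into the bucket that the hash value maps to; allocate an overflow
// page when the primary page (and all existing overflow pages) are full.
// Assumes buckets is non-empty.
void insert(std::vector<BucketPage>& buckets, uint64_t key, uint64_t tid) {
    BucketPage* page = &buckets[key % buckets.size()];  // hash(key) = key % #buckets
    while (page->count == kEntriesPerPage) {
        if (page->overflow == nullptr) page->overflow = new BucketPage();
        page = page->overflow;
    }
    page->keys[page->count] = key;
    page->tids[page->count] = tid;
    page->count++;
}
```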
Okay, so now we know how data is stored: the database system stores data in files. The next level that we talked about is the buffer manager, or buffer management, and that is what usually does the caching in the database system. As we said yesterday, the components above the buffer manager usually don't, and shouldn't, know how data is stored or how to cache data efficiently and effectively. They just say: okay, I want to have this page, I want to have this data. Dear buffer manager, please provide me with this page. And now this buffer manager checks if the data is already in main memory, in its buffer; the buffer manager controls this buffer. If it is, we just have to return the pointer to this page and we are done.
If not, we have to fetch it from disk. And in case our main memory buffer is full, we first have to free some space in the buffer so that we can pull the page from disk into our buffer. There are various so-called replacement strategies for how to do that and how to decide which page should now be removed from the buffer to make room for a new page. What to do there is actually often not super trivial. You can use very simple approaches; probably the best-known, or the most trivial, one is the FIFO approach, which means first in, first out. So the oldest page will be evicted. That usually makes sense, but might not always be the best strategy.
One of the standard strategies you can use is LRU, which means least recently used. So you not only store when a page was added to your buffer, like in FIFO; you keep a timestamp for each page of when it was last accessed, and whatever now is the oldest, the least recently accessed, page is the candidate that I will evict before I can put in new pages. But there is an issue with all these approaches that some more sophisticated replacement strategies try to avoid, and this is scanning.
Imagine you have a buffer that holds, let's say, 1,000 pages, and now there is a scan over a large table. So you scan the entire table just once. The buffer manager usually doesn't know what the components above do; they just request one page after another. Maybe you can give it a hint; sometimes it's possible in a system to give the hint that you know you're doing a scan. But if you don't have those hints, you will just load every page sequentially, put it in the buffer, and after you have processed 1,000 pages, you will start evicting the first page and so on. Since you are doing a scan, there is probably no second access to these pages. So in the end, you will just read all the pages, evict all the pages, and you have basically destroyed the entire buffer. For that reason, there are different approaches for replacement strategies.
For example, LRU-2. In this case, it tries to counter such scans by deciding on the second most recent access timestamp. You keep two timestamps per page, and then the second most recent access timestamp is used to decide what to evict, so that the last timestamp, which might come from the scan, is somewhat accounted for. There are many other strategies, but keep in mind: scanning can be kind of a problem if you have such a buffer manager. You have to be aware of those things, because a scan is the worst case. For a scan, a buffer manager, or buffering in general, doesn't make any sense; it doesn't give you any benefits.
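A minimal LRU sketch, just to illustrate the bookkeeping; page contents and actual disk I/O are omitted, and the class and method names are made up:

```cpp
#include <cstdint>
#include <list>
#include <unordered_map>

// Buffer manager skeleton with LRU replacement: a list ordered from most to
// least recently used, plus a hash map from page ID to list position.
class LruBuffer {
public:
    explicit LruBuffer(size_t capacity) : capacity_(capacity) {}

    // Returns true on a buffer hit; on a miss the page is "fetched" and,
    // if the buffer is full, the least recently used page is evicted first.
    bool get_page(uint64_t page_id) {
        auto it = map_.find(page_id);
        if (it != map_.end()) {                       // hit: move page to the front
            lru_.splice(lru_.begin(), lru_, it->second);
            return true;
        }
        if (lru_.size() == capacity_) {               // miss and full: evict LRU page
            map_.erase(lru_.back());
            lru_.pop_back();
        }
        lru_.push_front(page_id);                     // "fetch" the page from disk
        map_[page_id] = lru_.begin();
        return false;
    }

private:
    size_t capacity_;
    std::list<uint64_t> lru_;                                    // front = most recent
    std::unordered_map<uint64_t, std::list<uint64_t>::iterator> map_;
};
```

A single large scan through such a buffer touches every page exactly once, so each entry it pushes to the front is useless for future hits; that is exactly the behavior LRU-2 and friends try to dampen.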
Okay, so now, how do we access data that is stored on disk? This means we somehow need a mapping: for example, our index now tells us that the data we are looking for is on this and that page, so we need a mapping from the page ID and the location within the page to the actual location on disk. One way this is done is absolute addressing. That means the record ID is a page ID and an offset. So we know: okay, now we have to fetch page 17 and go to offset 100 to read the actual tuple that I'm looking for. Or there is the alternative of addressing just the page and then searching within it.
Now my record identifiers are smaller, because I'm only storing the page ID. But I have to scan this page, so usually 4 kilobytes, and search for the tuple that I'm looking for. Both make sense. The second alternative saves you space, which is good. And if your disk is really slow, and nowadays we have slow disks and very fast CPUs, scanning 4 kilobytes is often not really a problem, considering that fetching the page takes milliseconds. So this might be a viable idea. If you have very fast SSDs, it might make more sense to store the offsets. So it depends on the workload and many other factors, but both approaches are used in modern systems.
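Sketched as data structures, the two record-identifier layouts could look roughly like this; names and bit widths are made up for illustration:

```cpp
#include <cstdint>

// Absolute addressing: page ID plus byte offset inside the page.
struct RidWithOffset {
    uint32_t page_id;
    uint16_t offset;   // e.g. "page 17, offset 100"
};

// Page-only addressing: smaller identifier, but the 4 KiB page must be
// scanned to find the tuple after it has been fetched.
struct RidPageOnly {
    uint32_t page_id;
};
```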
A few more notes on the data model level. What you see here are three levels that you see in a database. In the middle is the conceptual schema. This is the level of how you represent the world when you create a database. So you might think: okay, I need a relation that stores my employees; there might be a relation that is called locations. This is basically your world view; this is how you store the data, the tables, the attributes, and so on. Then above, for different users, we might have different views. Depending on access control and so on, or on the views those users can access, the world might look a little bit different. It might be a condensed view, it might be aggregated, it might be that the salary column is not visible to certain users. They can still see who is an employee, but the salary is not shown, and so on. So basically different users have different views on the data you have in the database.
And at the very bottom, there is the internal schema. This is something that you don't see when you use a database system. You basically have to inspect the database system to get some knowledge about it, but usually the system tries to hide that completely from you. It doesn't necessarily tell you what index it's using, how data is stored, or where it is stored. This is something that you want to hide from the user, because the user has their tables and so on and can access them, but on which disk it is stored, and in which way, with a B-tree or a hash index, is usually hidden from the user. Something that you don't have to care about.
Okay, so now we basically know how our database looks conceptually, how the architecture looks, which layers we have. Now let's talk about how queries are processed. We have a user that sends us a query. As we said yesterday, this is a SQL query. In this case, we have four columns that we are selecting, or projecting: name, address, checking, and balance, from two tables, customer and account. And we are looking for the name Bond. And we have another predicate, which is the join: it says that the account number needs to be equal in both tables. So, first, the database system now needs to somehow find a plan that it can execute, right? This is just the SQL query; it doesn't really tell you a lot about how to execute it. Now it's the task of the database to come up with a plan that it can execute and that eventually returns your result.
Here we have, in pseudocode, a potential query plan that would process this query. We have two loops here. First we go over each customer; for every customer that we load, so this would be a sequential file, we scan the sequential file and get every customer. Now we check if the customer's name is Bond; if not, we continue. Then we have another loop, and in this one we loop over account, over each tuple in account. And now we check for both tuples: do they match? If they match, we return the result.
Does anybody know what this kind of join is called?
Hmm?
Inner?
This is an inner join, yeah. And the way that we join, is it a hash join?
I think it's a nested loop join.
Yeah, exactly. So this is a nested loop join. It's usually the most inefficient join, so you will barely see such a join in a real system, because systems try to avoid that. But in this case, this is a very simple example, this would be a nested loop join.
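A minimal sketch of that nested loop plan for the Bond query; the row layouts and names are hypothetical, only the columns follow the query:

```cpp
#include <string>
#include <vector>

// Hypothetical row layouts for the two tables in the example query.
struct Customer { std::string name, address; long account; };
struct Account  { long account; double checking, balance; };
struct Row      { std::string name, address; double checking, balance; };

// Naive nested loop join: for every customer named "Bond", scan the whole
// account table and emit the rows with a matching account number.
std::vector<Row> bond_query(const std::vector<Customer>& customers,
                            const std::vector<Account>& accounts) {
    std::vector<Row> result;
    for (const auto& c : customers) {
        if (c.name != "Bond") continue;        // selection on name
        for (const auto& a : accounts) {       // inner loop: full scan of account
            if (a.account == c.account)        // join predicate
                result.push_back({c.name, c.address, a.checking, a.balance});
        }
    }
    return result;
}
```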
And once we have this execution plan, there's usually another step that translates it to an actual physical execution plan. The physical plan then says, for example, use this operator, or use this kind of scan, because checking a predicate on an attribute can be done in different ways. This is another translation step, which we will talk about in a minute.
So what are the main steps? The first step is: we just get a SQL query, so we have to parse it. In this case, someone needs to build up a structure, for example a tree, that allows us to first of all check the syntax, and then also check the semantics, and check if everything that the user wants to access is actually available. Does the table account exist? Is the syntax correct? And so on. Then you usually end up with something like an AST that you later process.
And once you have this tree, so your SQL string is now parsed and is in a machine-readable format, or in a format that your database understands, you start with generic rewriting. For example, you might expand views if a view is selected, so you just replace the view with its actual definition. You might look for common sub-expressions that you can replace, and so on. And once you've done the simple stuff, or the preparation stuff, then you try to find an optimal plan. The plan that you just generate from parsing your SQL statement is rarely the most efficient one. And the main task now is to find maybe not the optimal plan, because that's very hard in many cases and there might be thousands or millions of different plans, but you want to have a pretty good plan. In many systems there are two steps. First there is a rule-based step, where we just apply simple rules that are usually always good; we'll have examples later. So we just go through the plan and see: okay, can we pull up or push down some predicates, and so on. And in the second step we need to estimate costs and then say, okay, which is maybe a good join order. For that, simple rules are no longer enough, so we have to apply some cost functions and see what the estimated cost of this join and of the other join would actually be, what might be a good order, and so on, and that often requires statistics. Because if you don't have statistics, you basically don't know how expensive a predicate might be. And the last step is executing the query plan.
Okay, so let's say you have now parsed your query. You have checked that both tables, S and R, are actually part of your system, so everything's fine; you can continue with the query. Now we have a logical plan that says we are going to join R and S in this example. Then we have a filter on column X of R that selects all the values that are smaller than 5, and then we project the two columns R.x and S.z. And there might be, as I've said, many, many different physical plans that are possible. They have to be equivalent, so they have to yield the same results. And the task of the database is to select the most efficient one: the fastest one, the most efficient one, or whatever you're optimizing for.
One alternative would be, for example, if you have an index on one of the join columns: then you would say, okay, now I scan R, so I get all the data of the table R, and while I'm fetching all this data, I can already execute my predicate, so I can filter early; pretty early in the query I reduce my data set by filtering. When I have R loaded, I can now use an index nested loop join, which is much better than a plain nested loop join, because now I just have index probes. So I use the index on column S.a here and see if there is a matching value, which tuples match, and then I return these two columns.
Now, there is something we maybe have to care about, we didn't put it on the slides, but depending on how much data is stored in your index, you might still have to fetch the tuple, right? Because we are projecting the column z of S. Assuming you scan the entire tuples of R, you have the entire R tuple in place. But the index might only return: okay, you have a match in S for those five tuple IDs. You also know the value of a, because you have an equality match. But you don't have the z column for those tuples, so that might require another random lookup on your disk just to fetch the single attribute for each matching tuple.
Another alternative that can also be very efficient, depending on how large the data is and whether there are indexes or not, might be a merge join. In this case, we have a sorted scan of both input tables. That means we read the tables and sort them. We again apply the predicate on X early, so that X is smaller than five. And once we have both input relations sorted, we can do an efficient merge join. Is anybody not familiar with the merge join? Okay, so yeah, it's a sequential operation, right? You just linearly run over the two now sorted relations. So that's usually very efficient.
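A minimal sort-merge join sketch on bare integer keys, including the handling of duplicate keys on both sides; this is just an illustration, not any particular system's implementation:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Sort both inputs, then advance two cursors in lockstep; equal keys are
// emitted once per (left, right) combination.
std::vector<int64_t> sort_merge_join(std::vector<int64_t> left,
                                     std::vector<int64_t> right) {
    std::sort(left.begin(), left.end());    // sorted scan of both inputs
    std::sort(right.begin(), right.end());
    std::vector<int64_t> matches;
    size_t i = 0, j = 0;
    while (i < left.size() && j < right.size()) {
        if      (left[i] < right[j]) ++i;
        else if (left[i] > right[j]) ++j;
        else {
            // Equal keys: find the end of the equal run on both sides.
            size_t i_end = i, j_end = j;
            while (i_end < left.size()  && left[i_end]  == left[i])  ++i_end;
            while (j_end < right.size() && right[j_end] == right[j]) ++j_end;
            for (size_t a = i; a < i_end; ++a)
                for (size_t b = j; b < j_end; ++b)
                    matches.push_back(left[a]);
            i = i_end;
            j = j_end;
        }
    }
    return matches;
}
```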
Okay, so here's an overview of how we map the logical operations to their physical counterparts. If you reference a relation, that is usually called a scan. It just fetches all the tuples; it scans the entire relation. Maybe it's an index file, then it sequentially gets all the data leaves of the index file; or it is a sequential file, then it just runs over the sequential file and gets the relation. Filters might be executed by a sequential filter, or by an index scan. That depends obviously on whether there is an index, and if there are multiple indexes, you have to decide which index to choose. But you need some information about whether there is an index to decide if you want a normal filter or if you can do it more efficiently via an index scan
or index access. A projection, if you allow duplicates, then it's trivial. You just take
the tuple and just keep all the columns that are projected. For the Cartesian product, that's
usually a nested loop join. So that's not very efficient, but Cartesian
products never are, so that's fine. If you have an inner join or a theta join, so we have some predicate, we can do a hash or sort-merge join, or an index nested loop join. This is the standard join that you usually see, and there are usually at least two different implementations in every database system, and the database system needs to decide whichever it assumes to be the most efficient alternative for your query. For aggregations, there are usually hash and sorted aggregations; it also depends on how the query or the statistics look. If, like in the example, we had a sort-merge join where we sort the data, and you happen to also have an aggregation afterwards which groups on the attribute that you are already sorted on, then you can do a sorted aggregation and you don't have to sort anything: you just sequentially run over your join result and you're done.
So that's very efficient. In other cases, a hash aggregation might be much more efficient because you don't have to sort upfront. If you have projections and you need to eliminate duplicates, that's usually done via aggregation, so you just group by all the columns that you want to project. Intersection is a special case of a join, so we have to find the matches. For difference, there is something called an anti-join, which produces the tuples that do not match; this can be used for the difference. And the union is a union. I hope that's kind of clear.
Okay, so now a few notes on cost-based optimization. What we do in this step, now that we have, for example, an already slightly pre-processed query plan, is to enumerate plans. We come up with different plans, and now we want to estimate the execution costs. For that we need statistics. For example, for a hash join there is usually a probe side and a build side, where you build your hash table. It helps you a lot to have an estimate, or to know, how large the tables are, because for a hash join you usually build the hash table on the small table, so that it can nicely be cached, and then you probe the large table against it. For that, you need table cardinalities, which is usually the most trivial thing, because in some way you have to know how large the tables are.
Column cardinalities and frequent values might help you a lot when you want to estimate a predicate. Let's say you have a table R, and now we have a predicate on column A and one on column B. The question now is: how do I execute these two filters? Do I first do the filter on A or on B? For that, you need to estimate the equality predicates. Let's say you have frequent values, and now you see that the predicate on column A selects the most frequent value in this column, so we would have a huge intermediate result. This might already help you to decide: okay, I'd rather start with the predicate on B, because that reduces my intermediate result quite significantly, and then apply A; or maybe it's the other way around. But to be able to estimate the equality predicates, you need to have some statistics.
You might have high-low keys, or min-max keys. These might already allow you to throw away the query entirely: if you have an equality selection on a value that is below your minimum value, you know there's no tuple that can ever match it. So that might allow you to just throw out the query and say, okay, there will be no result for that. There are also histograms in most databases, and index information to estimate how fast an index access could be. As usual, statistics are flawed, because you don't want to keep them updated all the time. If you keep your histogram updated with every insert, that slows down your inserts, and so on. So usually you have something like, in Postgres, certain analyze functions that will update and create the statistics. This is usually not done with every data modification, and thus the statistics are usually somewhat off, so you have to account for errors there. You can also do sampling, but sampling is often very expensive unless you piggyback it: if you scan your entire relation anyway, that's a good chance to collect some samples, which can also be used to estimate predicates and joins and so on.
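A toy illustration of why column cardinalities matter for filter ordering, assuming uniform value distributions; all numbers are made up:

```cpp
#include <iostream>

int main() {
    // Estimated cardinality after an equality predicate, under the uniform
    // assumption: selectivity = 1 / (number of distinct values).
    double table_rows = 1'000'000;
    double distinct_a = 10;       // column A has only 10 distinct values
    double distinct_b = 100'000;  // column B has 100,000 distinct values

    double rows_after_a = table_rows / distinct_a;  // ~100,000 rows remain
    double rows_after_b = table_rows / distinct_b;  // ~10 rows remain

    std::cout << "filter on A first: " << rows_after_a << " intermediate rows\n"
              << "filter on B first: " << rows_after_b << " intermediate rows\n";
    // The optimizer would prefer to evaluate the predicate on B first,
    // since it shrinks the intermediate result much more.
}
```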
Rule-based optimization: for this one, I should have changed the order; rule-based optimizations are usually the step before cost-based optimizations. Here we use simple heuristics. In general, they are not necessarily always the best idea, but pretty much always you want to minimize the intermediate results. That means if there is some operation that reduces the result set, do it as early as possible. A filter always helps you: a filter will never blow up your result, right? It can only reduce it. Maybe it doesn't reduce it, but it will never increase the result, and usually it makes your results smaller. So if you have a filter, push it down to execute it as early as possible. We also want to minimize the materialization that we need to do, so we can do projections early on, and we want to avoid access to secondary storage, because disks are slow.
So, typical examples: as I've already said, if there are selections, push them down, so do them before the joins. A join might be very selective, so its result might be small, but joins can also blow up the result, and depending on how good your statistics are, you might be wrong. Selections never blow up your result, so it's a good idea to just push them down as much as possible. The same with projections. You also need statistics for that, and those heuristics might be wrong. There will always be cases where a join is very selective, so the join has a very small result and will be quite fast, while a selection might be a pretty expensive wildcard LIKE match, which is actually more expensive to execute on a string column than a nice join on an integer column. That can be the case. Most often you will be pretty good with those heuristics, but there will always be cases where you're wrong. Maybe for LIKE cases you can somewhat estimate it, but you will pretty much never have the optimal plan.
For joins, we already talked a little bit about them. We have different join methods; I don't think there's any database system that does not have at least two of them. We have nested loop joins. The complexity is m times n, so for large data sets you want to avoid that at any cost; you never want to do a nested loop join there. For small joins, this is frequently done, because you avoid sorting the tables or having to build a hash table first. If you're just joining maybe 20 tuples and 30 tuples, maybe two pages, just do a nested loop join. That's totally fine; that's probably the fastest thing you can do. If you have larger data, avoid it at any cost.
Other methods for larger joins are, for example, the sort-merge join. We already discussed it: we have the cost of sorting both input relations and then sequentially joining them. This can be pretty efficient and can be parallelized, but it can really hurt if you need to do external sorting, so if main memory is not large enough and you need to spill data to disk and do some merging afterwards. But again, it can be really fast if everything is in main memory, or if a previous operator, maybe another join before, already produced a sorted result; in that case the sorting is basically for free, so sort-merge joins can be really fast. And usually a good fallback, or the good join for most cases, is the hash join. It works only for equality predicates, so if you have a join with a different predicate, you usually can't use the hash join, but in most cases you will end up with a hash join.
Yeah, so we listed the complexities here, which is often what you do if you talk about joins. But keep in mind that it's not only about the comparisons that we have to do while joining; it's also relevant how we actually get the data in the first place. So there might be a join alternative that is a little bit slower, or a little bit less efficient, but somehow reads less data from secondary storage; this might still be the better alternative. Just looking at the complexity is something you would do in a main-memory database, where all the data is always in memory and you don't have to care about secondary storage accesses. If you care about those accesses, then you probably also want to consider I/O costs.
If you're running in exactly that situation, so you want to join, you have a small buffer, and now you need to join two very large tables, then one typical hash join implementation is the so-called grace hash join. The idea of the grace hash join is basically that we partition our data. In this case, let's assume we have those two relations here and we join by month, so let's just say there are 12 partitions. First, we partition both input relations by month, so we have one partition for January, one for February, and so on; 12 partitions for each input relation. We can spill them to disk, so we don't need to have the entire relations in main memory, which we would need with a standard hash join, where at least the hash table would be entirely in memory. We first partition both inputs, and then we only load the matching partitions: we know there will be no match between January tuples and tuples that are in the February bucket; with an equality predicate, this won't match. So if you have an equi-join, we know we only have to read the January partition from R and the January partition from S, and now we need to join these two partitions, for example with a hash join, and this allows us to keep the memory usage somewhat limited. In case we have a little bit more memory, we can do something called the hybrid hash join, but that's something we're not going to talk about in more detail now.
Yeah, in case we need many, many partitions, what you also sometimes do is a multi-step partitioning; this is shown here at the bottom. For example, let's say the input is really massive, so you're not creating 12 partitions; to be able to join the partitions, you want to have many thousands of partitions, because it's a huge data set. In this case, you probably don't want to start by partitioning into many thousands of partitions at once, because you would just be sending single tuples to new files; each time you would append to another file, so there would be no positive effects from caching and so on. What is shown here is a multi-step partitioning: you first partition into, for example, 100 partitions, something you can keep in your buffer, and then for each of these partitions you continue and partition it further into more partitions.
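A minimal sketch of the partitioning phase described above, using the join-by-month example; spilling partitions to disk is omitted and the struct is hypothetical:

```cpp
#include <cstdint>
#include <vector>

// Tuple with the join key (month) and some payload.
struct Tuple { int month; int64_t payload; };

// Split one input into num_partitions partitions by hashing on the join key.
// Both relations are partitioned with the same function, so only matching
// partition pairs need to be loaded and joined later.
std::vector<std::vector<Tuple>> partition(const std::vector<Tuple>& input,
                                          int num_partitions) {
    std::vector<std::vector<Tuple>> parts(num_partitions);
    for (const auto& t : input)
        parts[t.month % num_partitions].push_back(t);  // hash on the join key
    return parts;
}

// Usage idea: partition R and S into 12 partitions each, then for each i join
// parts_r[i] with parts_s[i] (e.g. with an in-memory hash join); January
// tuples can only ever match January tuples.
```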
Let's have a short five-minute break and then we will continue with database topics. Database stuff.
This component where we store all the statistics that are usually needed and required to optimize queries is sometimes called the data dictionary. Sometimes it's part of a catalog; so, data dictionary or catalog. There you store general statistics. Statistics are very useful, but they're also expensive: we need to store them, so they count towards memory usage and disk usage. They have to be kept at least a little bit current to avoid too many misestimations, and that's actually quite hard; there's a lot of research on that. Having good statistics is not easy. But to do real transformations, cost-based optimizations, there's no way to avoid having some statistics. The obvious one here, for example: you see a table that tells us which columns we have in our system and what the size, or the average size, is. Some of that we need even just to parse the query: we need to know which columns we have and which data types they have, to check whether the query that a user sent to our system is actually correct. Others, like which indices we have, are needed for optimizing queries, for cardinality estimations, and so on.
Okay, so then, at the very top level of our five-layer architecture, another part was access control. It's kind of boring, but it's part of every real database system. With this component we can say: okay, what can each user do? In research systems, you usually just ignore that, because it's not really that relevant, but for real systems, there's no way around it. First of all, you can say which users we have: we can create users, we can remove users, we can kill sessions in case you see that something is going on with some user session. And then you can grant or revoke rights. In this example here, we grant all privileges on the account relation to the user Lawrence, and also with grant option, which means that Lawrence has read-write access on the account relation and can also further grant these rights to others.
Still, even with those access controls, that's no complete protection against malicious actors. There might be different vulnerabilities in the access rights, so problems might stem from there. There might also be access to data without going through the database management system; that's often the problem. You have to store your backups, you have to store your logs somewhere, and if you have the entire log of the database system, you don't even need access to the system, because you can just take the log and replay everything that was done, every insert and every update, and now you basically have the entire system at hand, right? So the database needs to make sure that nobody can access this auxiliary data, this backup data or log data, or you encrypt it in a way that other people can access it but can't make sense of it. So you also have to take care that not only your database system itself is secured, but also all the data that you write to places that other people might access. And sometimes it's even possible to derive protected data just by asking questions: you might not have access to the account relation, but by asking good questions you might still get an idea of what is actually in this account relation.
Transactions are usually the logical unit of work. This is how you as a user pose your queries, your updates, your inserts, your selects; they are usually run in the context of a transaction. Here we have an example of a transaction where we update an account. So we have, for example, a bank account here. We take one million from the checking account and put it into the savings account for account 007, so I guess for James Bond or so, and then we insert into the journal that there has been some transfer. The important thing here for transactions is that this is a unit of work, and as we will see later, it either runs through and succeeds completely, or we have to ensure that the database can roll back all the changes that it might have made if it does not run successfully. So if we have already subtracted the one million from the checking account and something goes wrong, we have to make sure that the database somehow recognizes that and is able to put this one million back, because we should never end up in a state where parts of the transaction have been executed and other parts have not, and later other users and other transactions see the effects only partially. This is something the database needs to make sure never happens.
And how we do that concurrently is by synchronization and locking. For example, we have two schedules here of two users concurrently running transactions, adding, removing, and so on, on their accounts, or on this table that we have seen previously with the transaction. And now the question is: can a database system determine whether, or when, two of those schedules are conflict-free? And the trick here is: they are conflict-free if they are so-called serializable. That means they are equivalent to running them sequentially, that is, in a serial schedule; then they are conflict-free, which is nice. If it's not conflict-free, we need to make sure that we can resolve those conflicts. This might be locking: we just lock something, and one transaction can't run at all. Or we might do it after the execution: we just let both run, and before we let those two transactions commit, we check if they conflicted. That might be expensive, but this is usually the way that most database systems do it today. This is called optimistic: we just assume a transaction does not conflict, so we run it until the end, and afterwards we have data structures that are somewhat efficient for checking whether there really was no conflict. If there was a conflict, then we do kind of a lot of work, but we don't want to lock for every transaction. We are optimistic and assume, okay, nothing will happen; first execute them and then check if there might have been a conflict. Often that's the most efficient way to do it.
There are different synchronization protocols for that. There are issues with most protocols that you see out there in database systems, or they require certain well-behavedness of transactions. So if you take your Oracle database, or Postgres, with the default protocol, you will be able to run into issues, if you want to run into issues. That's something you should keep in your head. No matter what method they use, they usually do it for efficiency reasons. For most databases, you can say, okay, I want to have the maximum guarantees, but this is usually much slower than what you get out of the box, which is for most cases just good enough. But remember, you can run into problems.
Then there's the transaction manager. I already talked about transactions and atomicity. The A in ACID is atomicity. That means a transaction either runs entirely or does not run; or, regarding its visibility, either it commits entirely or it does not commit, right? There's no half commit or something like that. So this is the A. Then we have C, that's consistency: when the database changes from one state to another, for example by a transaction, the state should always be consistent, so we need to remain in a consistent state. And the I is for isolation, and this is what the transaction manager does. Basically, to you as a user, the transaction should look as if you are running in isolation. You might run concurrently with many, many other users, but from your perspective, it looks like your transaction is isolated; there's nothing else going on.
And this is what a transaction manager does. It might lock, it might synchronize. It is responsible for detecting deadlocks and resolving them if there are deadlocks. And it's also responsible for logging and recovery in case something happens. Let's say you do everything right with commits, but your system crashes within a transaction: you have already written half of the transaction's modifications, and now you have a power outage. Logging and recovery is responsible for logging all the changes that have been made, so that you can then recover into a state that is consistent. What that means is that you will have to store data redundantly: there are the actual tuples in the files, and then you have this log that you write to, where all the changes are recorded. The log usually looks vastly different from the data files, because it serves a different purpose. The log is something that blocks you: when you have an insert, for example, you have to write it into this log first, and only as soon as the log record is written can you say, okay, now I can allow the user to commit; not before, because it's not yet persisted and you haven't logged it. So for the log, you want to have something that you can just very quickly append to. And since you hopefully might never read it, it doesn't have to be read-optimized.
Your data file, on the other hand, the one you query on, needs to be at least somewhat read-optimized if you want to run selections on it. For the log, that's very different. And you can see that you basically build up the log in memory, and then what most systems do: you would not flush every single write. If you have a file, you need to flush your writes to the disk, but you would not do the flush for every single write. Often you do a little bit of batching for a short amount of time: you have, for example, five transactions waiting until they are allowed to commit, and then after five, or after you've waited for one millisecond, you flush; every transaction that has been waiting for that flush can then be told, okay, now you're allowed to commit.
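A minimal sketch of that write-ahead-log batching idea (group commit); all class and method names are hypothetical and the actual file I/O and fsync are only indicated in comments:

```cpp
#include <chrono>
#include <string>
#include <vector>

// Log records are appended to an in-memory batch and flushed together, either
// when enough records have accumulated or when a short timeout expires.
class LogWriter {
public:
    // Append a change record; the transaction may only commit after the
    // record has been flushed (written and synced) to stable storage.
    void append(const std::string& record) {
        batch_.push_back(record);
        auto now = std::chrono::steady_clock::now();
        bool timeout = now - last_flush_ >= std::chrono::milliseconds(1);
        if (batch_.size() >= 5 || timeout)   // e.g. 5 waiting transactions or 1 ms
            flush();
    }

private:
    void flush() {
        // Here the batch would be appended to the log file and fsync'ed;
        // afterwards all transactions waiting on this flush may commit.
        batch_.clear();
        last_flush_ = std::chrono::steady_clock::now();
    }

    std::vector<std::string> batch_;
    std::chrono::steady_clock::time_point last_flush_ =
        std::chrono::steady_clock::now();
};
```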
And this is the last slide here. This is an overview of all the parts that we have in modern database systems. The very top is not part of the database; this is how users interact with the system. And we usually have different users here. On the very left, you see the inexperienced user. These are sometimes users that don't even know that there's a database. This might be a website user: the user just uses the website, with no idea that there is a database in the background; there's an application, and this application uses a database, but the user does not see it. Then we have advanced users. This might be analysts, or somebody who really, for example, poses SQL queries to the database. So they are aware of the database, they know what data there is in the database, and they might read or modify data. We have software engineers; in this case, for example, they might write UDFs, so user-defined functions. They might write code that is running within the database system, so usually some kind of expert for the system. And then we have database administrators; at least for very large enterprise systems you have administrators, and their responsibility is to ensure that everything is running as expected, as efficiently as possible. So they are responsible for creating indices, for vacuuming, for checking that all the backups are running as expected, and so on.
In the middle, this is basically the main database process. Here we have schema management, we have the database manager that handles requests, the DML compilers, so the compilers that take the queries and compile them to something the database can actually execute. We have recovery, transactions, and so on, so the components that we just talked about. And at the very bottom we have files. This is our disk. The disk needs to store our indices, the log, very important, maybe snapshots, the dictionary, and so on. So that's the complete stack you see here for the database management system, which pretty much every system kind of has.
Okay, so what did we talk about today? Yesterday we started with database management systems, where we talked about the relational model, talked about SQL, the latest version of SQL, very briefly, and we talked about the five-layer architecture. Today we continued with the five-layer architecture and the indexes. We talked about B-trees, so one of the most efficient indexes if you are on persistent block-based storage. And we talked about how to process queries and, for example, how to optimize a query plan into something that runs efficiently.
And, yeah, next week, I think it doesn't make sense to start.
No, wait, it's 12.30, right?
Yeah.
Yeah, no, we will start with CPU.
Not next week.
Sorry, today.
This will be the next topic.
I'm gonna start in a minute.
Are there any questions so far for the database part? SQL, relational, that should be kind of well known, but are there any remaining questions there? No? Okay. So, let's at least start with CPU and caching.
First of all, all the code examples — there will be a couple of code examples — you can find them online. I will put the slides online later, so you don't have to type anything; you will have the link in the slides. But the slides are not yet online; I will do that later on. So what are we discussing now, today and next week, Professor Rabl?
We have three main parts in this lecture. We have the intro, which is just finished, yay. We have CPUs, and then we are going to talk about peripherals. Actually, I don't know how to pronounce that, but yeah. Now we are just going to talk about CPUs for the next couple of days, and we will talk about the things that we can see on the right. So we're talking about DRAM and the CPU; we'll learn what cores are, what caches we have, and how this system actually looks, what a CPU looks like. Today we start with CPU architecture and caching, which are fundamentally important things if you want to have good performance. You should be aware of what CPUs look like, how they execute code, what caching is. In the next weeks we are also going to talk about how instructions are executed, what vectorized execution is, what execution models there are, and more. So we're basically almost on time, a little bit behind schedule, but it should be fine.
Today, let's see how far we get, but first we will talk about computer architecture: we will talk about buses and the memory hierarchy. And then the second part is about memory access: what is virtual memory, what caches do we have in modern CPUs, how is the data laid out, and what is alignment. The lecture is based on different books, which you can see here on the right and which we can recommend, and also on other lectures, for example by Guido Moerkotte, Sebastian Breß, and Jana Giceva. So these are the sources for the following slides.
Okay, computer architecture. This is not every system you nowadays see, but this is the kind of system that most modern servers look like, especially if you rent them from Amazon EC2 and so on. So you have a CPU here in the middle, the central processing unit; everybody should know that. And modern CPUs, as you probably also know, have multiple cores. We no longer have single-core CPUs; modern CPUs have dozens, many dozens, of cores. Right next to it we have the DRAM, and then there's a PCIe interface which basically connects other components to our system. This might be the GPU, the network, disks, FPGAs, and so on, which we cover later in this lecture. If we are talking about larger systems, larger servers, we might not only have one CPU, we might have multiple CPUs. In this case, we are talking about a NUMA architecture, or a NUMA system; NUMA means non-uniform memory access. We'll cover that later, but basically it means that if your thread is running on the CPU shown here, access to memory is not uniform: if you're accessing memory that is attached to this CPU, or located near this CPU, this might be faster than accessing data that is located next to the other CPU. So this is something you need to be aware of. If you have large servers with multiple CPUs, and there are servers with four and eight CPUs, you might need multiple hops to access data even though it's DRAM. Usually you consider DRAM to always be fast, and it is, but remote DRAM is slower than your local DRAM; something to be aware of.
And one of the trends that continues to be a problem, and that we should address, and that's why we have this lecture, is that there's a growing performance gap between CPUs and the speed with which we can access DRAM. And the problem is, yeah, it's increasing. It has led to the situation that, for example, for in-memory databases and other systems, we often now have vast amounts of DRAM. I think even at Amazon, you can simply rent a server with at least hundreds of gigabytes, sometimes terabytes, of DRAM. So you have vast amounts of DRAM, you have many CPUs with many cores, but the performance is actually bound by accessing DRAM. So this might be your main bottleneck, unless your data is really, really huge and you have to access disk; but if you have hundreds of gigabytes, many data sets entirely fit in main memory. And in this case, the problem that you often see is that your main bottleneck for performance is DRAM. Even though DRAM is supposed to be really fast, this gap is growing. So we have many cores that concurrently access data, and they wait more and more; they could do more and more, but they have to wait for DRAM. And yeah, this is something that we will discuss today: how this can be addressed a little bit by modern architectures, and also later how this can be addressed by you writing code that is efficient. Coming back, for example, to the example we had yesterday in the micro-benchmark with accessing the matrix in different ways.
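To recall the idea of that micro-benchmark, here is a minimal sketch of the two traversal orders; the matrix size and types are arbitrary, this is just an illustration of the access pattern, not the benchmark code from the lecture:

```cpp
#include <cstdint>
#include <vector>

// Same sum, two traversal orders over a row-major matrix stored in one vector.
// Row-by-row access touches consecutive addresses (cache friendly, the
// prefetcher can help); column-by-column access is strided, so once the matrix
// no longer fits in the caches, almost every element costs a new cache line.
int64_t sum_row_major(const std::vector<int32_t>& m, size_t rows, size_t cols) {
    int64_t sum = 0;
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c)
            sum += m[r * cols + c];   // consecutive addresses
    return sum;
}

int64_t sum_col_major(const std::vector<int32_t>& m, size_t rows, size_t cols) {
    int64_t sum = 0;
    for (size_t c = 0; c < cols; ++c)
        for (size_t r = 0; r < rows; ++r)
            sum += m[r * cols + c];   // jumps of cols * 4 bytes between accesses
    return sum;
}
```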
So, not all memory is the same. We have different kinds of memory, and the one everybody knows is dynamic RAM, so DRAM; here each bit is just kept in a capacitor. And the interesting thing is that this needs refreshing. So DRAM, just to keep its state, to keep whatever is stored in it, needs to be refreshed. That means it uses energy, and it also has an impact on performance. This is usually what you have in a DIMM; this is your main memory in a system. Then we have static RAM, and this is another design. It's much more complex and much more expensive, so it's larger and more expensive per bit, and for that reason SRAM is used in CPU caches. This is the small memory that is part of each CPU, while the dynamic RAM is usually used in a DIMM: something that you put on a DIMM and that is connected via a bus. So that's not something that sits directly on the CPU; this is something you have outside of the CPU, in much larger capacities.
And the reason for that is that DRAM is comparatively slow, at least if you compare it to SRAM. As I've said, we need to refresh it periodically; that means there is certain work to do just to keep the DRAM in its state, and you have to account for that. This refreshing takes hundreds of cycles, so accessing DRAM can basically be slow. But you don't have to refresh it right before you can read the data; this happens all the time, and depending on how you access data, you might be able to hide that latency very well. For example, let's say you want to access a single byte: what the CPU does for you is that it reads an entire cache line, so it reads more bytes than the actual request. So this is one thing that helps a little bit. And if you have, for example, a sequential workload, so you're running sequentially over data, one byte after another, and the CPU recognizes that, it might automatically pull multiple cache lines for you at once, so you can totally hide this latency that you have when accessing DRAM, if you're programming it in the correct way, which we'll talk about later. But keep in mind: accessing DRAM is actually comparatively expensive and comparatively slow. Yeah, so, right, we'll talk about that later.
SRAM, on the other hand, can be really fast, right? It's almost instantaneous, so we don't have to wait hundreds of cycles, but it's too expensive. You wouldn't use SRAM to put hundreds of gigabytes of it on a DIMM; it's just not possible. This is something that is so fast that you want to use it, but only in very small sizes, and for that reason you put it on the chip. This is the local on-chip memory that you have, in much smaller sizes than your actual DRAM.
And the way we combine these two different kinds of memory is that we have a hierarchy, which you have probably heard about before. We want to have very fast memory near the CPU, so once we load data, we can access it very fast. And then we have different levels of that, growing in size and getting a little bit slower every time, until we are at the DRAM level, and then the next level after that will be slower still.
How that looks in a modern system is shown here. The size of the pyramid, so basically the size of each level, shows how large the capacity is, and on the right side you can see the latency. For registers, it's basically instantaneous: as fast as your CPU runs, the data is immediately there. For caches, as we have said about SRAM, the data is also there almost immediately, but the caches do need to do some work: they need to check if the data is there or whether they have to pull it from the next level. So there's some more effort that caches need to make, and we take a little bit of a latency hit there when accessing caches, depending on which cache. As you've seen on the slides before, modern Intel CPUs have three levels of caches; IBM, I think, has four; Apple has, I think, two right now. So it depends on the CPU that you actually have, but you always have multiple levels, and also different latencies with that. For memory, we often have something like 50 to 100 nanoseconds, so definitely less than a microsecond. Hard disk, we had that before, is in the millisecond range. And then we have tape, much slower. Not shown here, but in between main memory and hard disk would now be your modern SSD, and that is roughly 1,000 times faster than a hard disk. So you can sometimes get into the single-digit microsecond range, mostly between 50 and 100 microseconds of latency when accessing modern SSDs, depending on the SSD. So this would sit between main memory and hard disk.
Okay, and I think we'd better stop now; Professor Rabl is going to continue with virtual memory next week. Are there any questions so far about general architecture and memory hierarchies in modern systems? No? Okay, if there are no questions, then thank you, and we'll see each other again. You will see Professor Rabl next week. Okay, thank you.