Hardware-Conscious Data Processing (ST 2023) - tele-TASK - DBMS Recap
Episode Date: April 19, 2023...
Transcript
Welcome everybody to our second session. Today we'll talk about database systems.
Kind of give you an overview again of what is a database system, like traditional architecture
and some of the terminology. But before we do so, I will actually switch back, because last time we didn't quite finish where we stopped. I think we were here. And I'm just going to wrap up the introduction and motivation, to give you a preview of what's going to happen next. So I told you all this, so I hope you remember the motivation. So architecture is
changing. Computer architecture, chip design is changing. And if you want to use current chips efficiently, we have to reprogram our systems. Everything that's been done for single-threaded CPUs needs to be adapted. And database systems, actually, that's something that I didn't tell you last time: these are systems that run for a long time.
So you can still find like really old COBOL systems in companies deployed, crunching old
customer data and stuff like that, because it's just hard to move to a new system eventually.
So, I mean, of course, eventually people will move and you will use new
hardware once the old mainframes are not working anymore or your data is getting too large or for
new applications. But still, at some point, we need to redesign basically. As soon as we have
a new system or as soon as we want to deploy something that's efficient, we need to adjust to this.
And besides looking at the CPU and the memory, we also have to look at what else is there in the computer.
So we have our server and it's not just the CPU and the memory, but it's also the storage and everything else around it. And there's a couple of different accelerators that are frequently deployed in current servers
as well.
And these, of course, we also want to have and want to use efficiently in our system.
And some of this I will show you now and we'll go deeper in the lecture as well.
So one of the things that was kind of really hot for a couple of years is persistent memory.
Right now it's kind of going down again, because the one memory device that was available for purchase, or is still available for purchase, will be discontinued.
So this is the so-called non-volatile memory
or storage class memory.
And it just looks like a DIMM.
So if you've ever opened up a server,
you will see stuff like this just
without the clamps on top, which are the memory DIMMs.
So this is what your memory looks like if you have a larger server.
In a notebook, if you can replace it, it will be smaller, but it essentially has a similar structure.
And these ones are special in the way that they have a different technology inside
that keeps the data persistent if you switch off the power.
So that's the cool thing about it.
This means, conceptually, you don't need disk anymore, because everything can stay in memory and it won't be lost if the computer is turned off. So that's the cool thing about it. And the technology is actually also cheaper than RAM, so you can put more data, you can pack it more densely than RAM, by just paying with a bit of latency, basically.
So it's similar in addressing.
So it's also byte addressable.
It's not block-based, but it's a bit slower.
And there's a question, yes. So there's a factor of two, three, something like that. So that's still quite good. So it's not like a factor of 10 or 100 that you would see with disk. With disk, we said, it's a factor of 10 to the power of 4 or 5, so 10,000 to 100,000, something like that. But this is 2, 3, something like this. Well, I'll have more detailed numbers in the corresponding section.
So and it's, as I said, within the same order of magnitude,
it's 10 times more dense. So, meaning you can put 10 times as much memory into one of these DIMMs
as you would have on your regular memory DIMMs. So, this can go up to 512 gigabytes, one of these.
And then you can have, I think, something like eight of them in one server. So that's actually a lot of memory then. And you will use it just
like you use RAM, more or less, with some intricacies that we'll talk about later.
It's discontinued right now. But I mean, there is new technology on the horizon, stuff like this.
You can still buy it, but it's not manufactured anymore right now.
The other thing that we also see, and that also changes a lot of things, is similar to what we talked about with storage, with disk. Traditional wisdom was: if we have to access a network, then the network is the bottleneck, period. So if you have a traditional network, I don't know, Ethernet, in past times maybe 100 megabit or something like that, that means everything that has to go through the network will dominate your throughput and your latency. So whenever we're talking about a distributed setup, well, let's optimize for this. And again, this is not necessarily true anymore. There's technology like InfiniBand, and that's been around for a long time. So it's always been around, but it's gotten more focus recently, and it's gotten faster also. And the way you use it has changed, in a way that this network technology can actually get almost as fast, throughput-wise or bandwidth-wise, as memory.
And that means it doesn't really,
I mean, the latency will still be higher,
but the throughput and the bandwidth that you get
will be in the order of memory.
And that means if you're not latency bound necessarily,
it doesn't really matter if your data is stored
on your device or on your node where your processing is done,
or it's stored on a different node, like in different memory again.
We're not talking about storage necessarily here.
And for this, there's a special type of networking that you can do that's called remote direct memory access (RDMA). So that means, rather than doing this packet-based communication, where the CPU is always involved, and not just usually, always: everything you do with IP, et cetera, will go through your CPU.
So you know the classical IP stack, right?
So lots and lots of layers.
And all of this basically has to be done on the CPU.
So your network card will receive something,
will go through all the layers, unpack your packets,
and then look at the actual data.
Every packet is packed into other packets again.
Then if you still want to send something,
you have to do the same thing
again, go to the other side and there it also has to be done on the CPU. So the CPU actually spends
quite a bit of work just handling the networking. And RDMA is different, because with RDMA, you basically just assign a memory region. You say, this is the memory region you can use. And the NICs, the network interface cards on both sides, will know about this memory region and will just directly access it. They won't bother the CPU. I mean, of course the network cards need to communicate, but the CPU is not involved. That means the CPU can do something else, and you can issue either two-sided or one-sided communication, just saying: please write this to the memory of the other node.
And then when the other node is ready, it will just be able to read this from memory.
Or you can say, please read this part from memory.
So I'm just creating some kind of buffer, for example, in my memory.
And a remote node can read from this buffer directly without accessing,
without communicating with the CPU.
Of course, the network cards need to communicate.
But there you get very low latencies also, and you can get very high throughputs.
So no OS, no CPU, et cetera, just directly on the hardware.
But, of course, you need to rewrite your application.
You need to rewrite your system to use this properly.
We'll also talk about this.
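To make the one-sided idea concrete, here is a hedged sketch of posting an RDMA read with libibverbs. It only shows the work request itself; the queue pair setup and the out-of-band exchange of the remote address and rkey are omitted, so this illustrates the shape of the API rather than being a complete program.

```cpp
// Hedged sketch: posting a one-sided RDMA READ with libibverbs.
// Assumes the queue pair (qp), protection domain, and the exchange of the
// remote address/rkey have already been set up out of band (omitted here).
#include <infiniband/verbs.h>
#include <cstdint>

// Read `len` bytes from the remote node's registered memory into `local_buf`.
// The remote CPU is never involved; its NIC serves the request directly.
bool rdma_read(ibv_qp* qp, ibv_mr* local_mr, void* local_buf,
               uint64_t remote_addr, uint32_t rkey, uint32_t len) {
    ibv_sge sge{};                       // scatter/gather entry: local target
    sge.addr   = reinterpret_cast<uintptr_t>(local_buf);
    sge.length = len;
    sge.lkey   = local_mr->lkey;

    ibv_send_wr wr{};                    // the work request itself
    wr.opcode              = IBV_WR_RDMA_READ;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;   // ask for a completion event
    wr.wr.rdma.remote_addr = remote_addr;         // where to read remotely
    wr.wr.rdma.rkey        = rkey;                // remote region's access key

    ibv_send_wr* bad = nullptr;
    return ibv_post_send(qp, &wr, &bad) == 0;     // hand the request to the NIC
}
```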
And this is, I mean, once you're there, your network doesn't have to be the bottleneck anymore
and all of a sudden, well, you have to think about how you deal with the rest of your system. If the network's not the bottleneck, where is the bottleneck now? How do I use my system efficiently, or build my system efficiently?
Okay, so then, of course, as all of you know, GPUs are everywhere, just because machine learning is everywhere. And if there's a GPU anyway, why not use it for data processing as well? If we're doing lots of data processing, why would we have the GPU idle around, not using it? So we can use it, and it's actually fast.
It's not in all cases super efficient, meaning, well, do we get the same for the amount of money that we're paying for the GPU?
Usually we can, but we need to have really large problems or we need to have really large amounts of data to actually be able to saturate this.
So data processing in contrast to machine learning is usually data heavy.
So we need to move a lot of data around.
In machine learning, we do a lot of computation, so that's better fitted to these kinds of accelerators: you basically just put some data into the GPU, let the GPU run, run, run, and then get a little data back. In data processing, that's different. We basically move a lot of data there and get a lot of data back, so it's all about the interconnects. But still, there are ways of using a GPU efficiently and getting performance speedups
and even getting cost-efficient performance speedups if we have enough data.
If we're talking about small data, it's going to be hard.
But the GPU is nice because the GPU is highly parallel.
I told you so already. So if you look at a very basic setup of a CPU, we're talking about a multi-core CPU, even a single core could look like this: we have a cache, we have some control units, and then we have multiple arithmetic logical units, and they can do stuff in parallel. So even a single core will have this. But of these cores, we might have 4, 8, 16, you know, 24, something like this on a CPU.
A GPU will have hundreds to thousands of these. And all of them, more or less, I mean, it depends on the version of the GPU, but in the baseline model all of them will have to do the same thing, in a lockstep setup. So all of them will do the same thing, unless you can partition it somehow. But still, you need larger regions of the GPU with many threads that do the same thing all the time. So you need to think differently. You cannot just say: OK, let this thread do this, and I have this management thread that does that, they work on different amounts of data, and I'll just schedule this somehow and let the scheduling even everything out. Here, everything needs to be nicely partitioned for this to work out well.
So all of the workload needs to be adapted, and the memory accesses are expensive. But again, there are lots and lots of things going on that let us actually be cheaper with the memory access. We'll go into more depth on this later in the lecture, close towards the end.
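As a small illustration of the lockstep mindset, here is a plain C++ sketch (not GPU code) contrasting a branchy filter with a predicated one; on a GPU, the branchy variant would make threads in a warp diverge, while the predicated variant has every "lane" execute identical instructions.

```cpp
// Branchy vs. predicated filtering: the same idea applies to GPU threads
// in a warp executing in lockstep.
#include <cstdint>
#include <cstdio>
#include <vector>

// Branchy filter: fine on a CPU, bad in lockstep (lanes diverge on `if`).
int64_t sum_if_branchy(const std::vector<int32_t>& v, int32_t limit) {
    int64_t sum = 0;
    for (int32_t x : v)
        if (x < limit) sum += x;   // data-dependent branch -> divergence
    return sum;
}

// Predicated filter: every iteration does identical work; the comparison
// result is used as a 0/1 multiplier instead of a branch.
int64_t sum_if_predicated(const std::vector<int32_t>& v, int32_t limit) {
    int64_t sum = 0;
    for (int32_t x : v)
        sum += static_cast<int64_t>(x) * (x < limit);  // no branch
    return sum;
}

int main() {
    std::vector<int32_t> v = {1, 7, 3, 9, 2, 8};
    std::printf("%lld %lld\n",
                (long long)sum_if_branchy(v, 5),
                (long long)sum_if_predicated(v, 5));   // prints: 6 6
}
```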
Another really cool technology are field-programmable gate arrays (FPGAs), also something that we see increasingly in servers. Say the Azure cloud, the Microsoft cloud: they have FPGAs in every server, here mostly for networking. And the FPGA is a piece of hardware.
So it looks something like this.
So this is an Intel board.
So you can see PCI Express board will be connected to a PCI
Express slot on your server.
And then this chip has a special design,
which basically means you can reprogram it.
So the reprogramming itself actually doesn't take that much time, but finding good programs, or layouts, for the chip takes a lot of time. So basically saying: I want to reprogram this for this new application, that will take hours to days to compute. But if you already have some programs that you want to place onto this, that actually goes faster; this can be a sub-second delay. And then you have, in a way, a specially designed chip just for the
application that you want to run. For example, it's often used for
some kind of network protocols, for example. So I can put a special kind of network protocol,
or I could say I want to have a PCI controller of the next generation on here, or I want to do
special kind of filtering on here, everything like this. I can basically have it in hardware. And then there's different ways of designing this.
And the chip itself is not clocked as high as a usual CPU; we're talking about megahertz here, like hundreds of megahertz, not the gigahertz you would have in an Intel or AMD CPU. But the layout that you're putting there is very efficient: you're not putting a lot of code there, you're designing it in hardware, basically. You're designing data flows in hardware, what you want to execute. And this is basically done by logic tables, so lookup tables saying: okay, this will be an AND gate, here I have an OR gate, whatever. And in this way, the processing can be super fast, especially for simple things.
But these have like a lot of lookup tables on there. So the chips are actually quite
large. So you can do a lot of processing on there super fast at basically line rate.
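To illustrate the lookup-table idea, here is a tiny C++ sketch of a 2-input LUT: a 4-entry truth table whose contents decide which gate it implements, which is essentially what reprogramming fills in.

```cpp
// Tiny sketch of the lookup-table (LUT) idea: a 2-input LUT is just a
// 4-entry truth table, and "reprogramming" means filling in different bits.
#include <bitset>
#include <cstdio>

struct Lut2 {
    std::bitset<4> table;             // one output bit per input combination
    bool eval(bool a, bool b) const { return table[(a << 1) | b]; }
};

int main() {
    Lut2 and_gate{0b1000};            // output 1 only for a=1, b=1
    Lut2 or_gate{0b1110};             // output 1 unless a=0, b=0
    std::printf("AND(1,1)=%d OR(0,1)=%d\n",
                and_gate.eval(true, true), or_gate.eval(false, true));
}
```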
And of course, there's some memory on there. We'll also talk about this. And this, again, is very specific: it's highly specific in how you use it.
But there are applications where you can use this.
I mean, of course, a lot of prototyping.
Also some financial applications where you would say, oh, this is very nice because you can be much faster than somebody else.
And with this, well, you get high speeds, but it's very complicated to program. We'll talk
a bit about this. Unfortunately, we don't have an exercise with this, but maybe I can show you
something. And if you're curious, we have hardware like this that you can try out. And on top of it,
it's also energy efficient. So it's more energy efficient than regular CPUs.
However, of course, you have to pay for reprogramming it. And the reprogramming, as I said, is going to take hours to days, where you will have a server just running to compute your new program.
But cool stuff.
And the next level after this would be an integrated circuit, so a specialized chip just designed for some kind of application. And you will have these, for example, in routers, et cetera. And this you can get very fast, with actual hardware just designed for this. And again, for everything that's kind of evil, like blockchain stuff or Bitcoin mining, you're going to find a lot of specialized circuits or chips just for this, just because otherwise it doesn't really make sense anymore, you wouldn't be fast enough. But we're not going to look at ASICs so much. I mean, the idea is basically the same, with the difference that you take away all of the reconfigurability and just have it in hardware.
With this, I'm basically going to stop here with the introduction. This is what we will see and why it's actually necessary. I hope I got across that if you want to be efficient on the hardware, you cannot use your old Postgres. Postgres is really a great system, right? So I don't want to bash Postgres. It's a really nice database system, lots and lots of cool features, but it's not going to be able to utilize something like this, not only because it's not implemented there, but because the whole design doesn't allow being efficient on such hardware.
Okay, questions so far?
Let me find this.
Just have to switch.
Find my pointer and up there. Yeah. If not, then let's go to the basics. We're here, right? So the introduction's over. The basics: all about traditional database terminology, let's say. So we're still in the first week. Next time we're going to talk about performance management.
You will see all this. So what I want to cover very, very briefly,
the relational model in SQL, because all of the processing
that we're talking about is really this.
Then the classical database architecture.
And this is, of course, super fast. This is basically Database Systems 1, and this is Database Systems 2. I mean, this is just the very core, two slides. So, a quick rundown just to remind you of the terminology. So if I say join, you will say: a join, I know this.
So that's the idea. If you have questions, feel free to ask, of course.
If you've never heard this, it's also not a big problem
because we will, like, whatever we're going to change
or whatever we want to implement, we will discuss it in the lecture.
But just so you know where to find this kind of stuff,
this is basically why I'm repeating it.
And if you want to learn more about it, there's this nice book, which is like a huge book where you can
read all about it. And the lectures of Database Systems 1 and Database Systems 2 are completely
based on this book. Okay. So relational database management system. I don't need to motivate this because you got this in database systems one and two.
And they are in every company that you can think of.
So any kind of industry that works with data
will have a database management system.
And while they will have multiple database systems, at least most of them will be relational database management systems, period. And I mean, it just was a huge success: a way of storing data in a structured fashion and in a sort of standardized fashion, in a
way that everybody understands how to use the data.
And this is because of the way that we think about data and organize data. Not necessarily naturally, but if you think about it: if you write something up, if you want to organize some of your financials or whatever, you usually will do this in tables, at least in two columns or something like that, but often even more. It's just a very natural way of thinking about it, in a two-dimensional way. Not necessarily nested, because then it becomes unwieldy; you cannot really organize stuff in 3D easily. So you want to have this two-dimensional table layout.
And this is what the relational model is all about.
And that makes organization easy.
It makes understanding the data easy.
And while we're often teaching such a nice model, and it's easy to understand, usually in most systems you're not going to deal with this on the low level. What is actually stored in these tables, and how it's organized, is often completely hidden under the layers and layers of the system. But the way you enter this is easy.
This has kind of worked out for many people, for many companies.
It's the de facto standard for data organization as soon as we're talking about structured data.
So all of the data that we're talking about will be represented in relations or tables.
So this is what this looks like. We'll usually have a name, the relation name. So we have an employee table here that has different kinds of attributes. So these are the attributes; they have names: person ID or PID, given name, last name, age, and address would be the attribute names. And then we have the individual tuples, or rows. So, say for example, Peter Müller would be one row here, and all of the last names would be one column. And the Peter Müller row is also a tuple.
And then, if we want more statistics, or characteristics, of this table, we can say its cardinality is four: we have four tuples in here. This is often useful if we talk about the size of a table and how many tuples have to be processed. And especially often we're interested in intermediate results, and for the size of intermediate results we usually also talk about the cardinality.
And then we have the rank, something that's rarely used, which would be the number of columns or attributes. So that's the rank of a relation.
And of course the different columns
can have different data types.
And that's also important, right?
Different data types are very different to handle. So if we have strings or integers or whatever IDs, the way we process them is often very different. And many prototypes, many research prototypes and databases, will just use integers, which is much easier than handling strings, for example. In the real world, a lot of industry applications actually have to deal with lots and lots of strings, and with string conversions, transformations, stuff like that. That will take a lot of computation, and the actual integer handling later on, the joining, stuff like that, might actually not be the most critical part. So that's just good to know.
I might show a slide here or there eventually regarding this. So that's our data; this is the organization of our data. We'll also organize it in different ways underneath; we'll touch on this soon. And then we need to work with the data, and for this we typically use SQL. We're not going to parse SQL in this lecture anywhere, but we're thinking in SQL in many ways.
So we're thinking of these kinds of operations as the operations that we do with the data.
And this is the Structured Query Language; you can pronounce it "sequel" or S-Q-L, you will find both.
Most stuff that you will see will be something like this.
A select-from-where, maybe group-by, having, order-by query.
All of you know this?
Yes, perfect.
OK, so this is kind of what we want to see.
Such a query describes, in a, we say, declarative way, what data we want to see. So we're basically saying: I want from this table these and these kinds of tuples, joined with these and these tables, and please do it for me in the way that you see fit.
So that's the idea.
So we're not saying,
please first join this table and this table
and use this kind of join
and use these kinds of aggregation functions
or this kind of index in your query.
But we're just saying, give me this data.
And then the database system basically does all the magic.
And the syntax, at least if you look at this kind of syntax,
is quite restricted.
I mean, from SQL:99 on, you have Turing completeness, so you can program anything in SQL, and then you're not going to get this kind of nice declarativity anymore.
But in these simple queries, this restriction
basically gives you a lot of flexibility for optimization.
So because we have a constraint set up,
we know how we want to do the data accesses. We know a lot about the data. So we know we're
just going to work on these kind of data types. So that will make stuff much easier.
And then there's other languages, et cetera, not really important. So this is what we look at most of the time.
Sometimes we might also be interested in inserts, updates, or deletes, just because we also need to work with the data or we need to update the data. Of course, somewhere the data needs to come in.
That means we also need to do something there.
But most of the processing that we look at is really just how do we read data.
That's the assumption in most relational databases.
We basically write, but then we want to analyze a lot.
We want to read multiple times.
It doesn't always hold.
If it doesn't hold, again, it makes sense to think about different kinds of architectures.
Regarding architecture, this is the typical classical database
architecture.
So the database, in the classical sense, is structured in five layers, and we have sort of standardized interfaces in between. On the bottom, all the way on the bottom, of course, is the hardware; this is what we're looking at, right? Then we have an operating system. Some very traditional database systems would want to either have their own operating system or get rid of the operating system, because the operating system is not optimized for databases. The operating system is optimized for, let's say, general use, and that means lots and lots of small files, typically.
So this means, if you go to a classical file system in an operating system, even the operating system itself will create many files, lots and lots of very small files, and work with those. If you go to your Linux or Mac and look at the base folders, the system folders, you will find lots and lots of files with basically nothing in there. It's just management stuff, but they will be touched a lot and worked with a lot. And a database doesn't really like this. A database has huge files, still probably many, but not that many, and will work with large amounts of data.
And because of this, we typically also have
like an additional layer on top,
which is the buffer management.
This is also dealing with any kind of remote
or peripheral storage accesses.
So we have our data.
We might have our data in memory.
And I told you, we might have large memory, so everything could fit there, but traditionally we think everything, or most of the data, is eventually on disk. So we need to deal with the disk access, and for this we have the buffer manager. The data for the disk is maintained in larger chunks; we're not dealing with very small files, but larger chunks. Then this data is in memory, and for this we have some kind of memory structures where we can then actually deal with individual tuples. We're not talking about these large data blocks anymore, but about individual tuples already. On top of that, we have this interface where we can access individual tuples rather than larger blocks.
And we have some kind of logical access.
So here we can say, I want to access individual tuples through indexes.
I want to do some sorting. I might have transactions. We're not dealing with
transactions that much in this class, but things like that. So then we'll have the basic record
oriented access and on top we have the data model. So here
is where we're actually talking about relations and we're talking about SQL etc. which would then
be all the all on top of there. So and the important part is below here the system can do what it wants, right? So below here, basically, we don't need to have relational data management.
The data could be organized in any way.
And this is basically up to the system designer.
So eventually, basically, you can say, well, I want my data to be organized in a graph
or I want to have whatever.
So this might make sense for some applications. And I still can have this relational access on top.
So a database course would typically go from the bottom to the top. I mean, this is a hypothetical architecture, so it fits some systems more than others. If you want to be really efficient, you will basically not do all these individual abstractions. You will try to combine everything, because all of the abstractions will give you some kind of additional overhead, like additional communication.
So you might structure this logically in a way,
like your architecture,
but you don't really want to have communication
in the system through interfaces
if you want to be efficient.
Otherwise there's actual communication, right? So something needs to be written which might not necessarily need to be written.
And some stuff you actually cannot logically split up in these ways.
So some like the locking, recovery, some optimizations need to go through multiple levels.
Anyway, so going up from the bottom, very briefly, very quickly, we'll look at the individual levels.
So all the way on the bottom, we have the hardware.
And classically, this would be the hardware.
And as I said, this is basically what systems still are optimized for.
And I don't know if you've ever seen any of these.
I think you've probably seen pictures like this.
But I don't know if you have seen this in reality. When I started, this was still the case: all of the systems basically had something like this, solid platters. So these are really metal platters on a spindle that all rotate together, and then you have an arm that will go in and out and find the right tracks. So it's like a record in a way, but it's not a single spiral track; it's multiple concentric tracks, and the arm goes in and out. And the fun part
is the platters always rotate at the same speed. So they have a certain speed and there's also a physical limit to that.
So this is usually a maximum of 15,000 revolutions per minute. Typically, cheaper drives would be 7,000 revolutions per minute, but 15,000 is the max you can typically get.
And this is physical reasons, right?
So if you make it much faster, the drive will explode.
So basically the small pieces will fly out through your server and will
somewhere stick, hopefully not in your belly.
And that's also what you sometimes see with these systems: if you have a bad crash, or you drop it, the platters might actually break.
So this is all, let's say ancient hardware in a way, like physical rotating stuff, basically.
And the arms go in and out, and like a typical problem would be a so-called head crash.
So if you have some vibration or something, the head touches the platter and will basically scratch the magnetic material off the platter. And you can also see that. I had a head crash at some point on my disk, and of course you open it up and you see there's a scratch, a nice round circle where the head touched the platter. And the way you access this, of course, is like this, right?
So this arm, it's one arm, not multiple independent arms. This means you have one track, you have one sector in a track, but you won't just read one sector. Basically, you're going to read the sector on all of the platters, usually on both sides. And that's what we call a cylinder then: the whole thing together is a cylinder, basically. And then we're going to read all of them in parallel, so there's a tiny bit of parallelism in there. And,
I mean here there's a huge difference between random and sequential IO. And that also makes
sense, right? So if we're having sequential IO, this basically means the disk is rotating at 7,000 revolutions per minute.
And at this speed, we're basically reading these sectors one by one.
If we have random IO, this means we have to move this head somewhere and then find this track and then start reading.
And we might just read like a single block in random IO.
And this means the smaller the data that we're reading,
the worse the performance gets.
And so as soon as we have sequential IO,
so we're reading a couple of megabytes at a time,
multiple megabytes, let's say,
then we're going to get like the maximum performance. This will be something like 100 megabytes per second
that we can get.
If we're reading individual blocks, this will drop below a megabyte per second if we're unlucky. Also because we cannot read just a single byte here; we will have to read one complete piece at a time.
And this is still true for all block-based devices. So an SSD is the same case.
We're not going to read just one sector or just one byte. We're going to read one sector,
whatever. Typically, we read one block. And the block might be something like four kilobytes,
maybe one kilobyte, something like that. And that's our access granularity. So if you're accessing one byte or one bit on this, the disk doesn't care; it will read one kilobyte or more.
And that's true for an SSD as well.
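Here is a quick back-of-the-envelope calculation of that gap, using rough assumed numbers (about 100 MB/s sequential, about 10 ms per random access, 4 KB blocks), not measurements from any particular drive:

```cpp
// Back-of-the-envelope numbers for the sequential vs. random gap, using
// rough assumed values (not measurements).
#include <cstdio>

int main() {
    const double seq_mb_per_s = 100.0;   // assumed sequential rate
    const double seek_ms      = 10.0;    // assumed seek + rotational delay
    const double block_kb     = 4.0;     // one random read fetches one block

    // Random throughput: one 4 KB block per 10 ms access.
    double accesses_per_s = 1000.0 / seek_ms;                 // 100 IOPS
    double rnd_mb_per_s   = accesses_per_s * block_kb / 1024; // ~0.39 MB/s

    std::printf("sequential: %.0f MB/s, random: %.2f MB/s, factor: %.0fx\n",
                seq_mb_per_s, rnd_mb_per_s, seq_mb_per_s / rnd_mb_per_s);
    // -> sequential: 100 MB/s, random: 0.39 MB/s, factor: 256x
}
```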
So we have to differentiate.
And we will address this by going to the cylinder: which one is the right cylinder? Then we're going to find the right platter, the right track, and then the exact sector.
OK, so with this, this is ancient,
but it's still in use.
This is still the cheapest online storage
that you will find.
And this is why Google, for example, still uses this a lot.
And by online, I mean you can randomly access this.
So of course, there's tape, which would still be cheaper.
But tape, it's not like random access.
Tape will be very slow if you do random access because basically you have to go through the full tape.
I don't remember if I have a slide on tape, but I might actually.
Okay. So then on top of this, we need some access methods, right? And this is typically file-based, in a way. I mean, the way we actually organize this is up to us, but typically we say we have different kinds of files, and then different kinds of organization of these files. One typical way would be a so-called sequential file: we're just storing one tuple after the other in a sequential manner. And while this is a nice table, you of course have to think about it as one long array, right? This is how we write stuff anywhere.
Our disk, our SSD, our memory is not two-dimensional, at least from an organizational point of view; it's one-dimensional. So we're writing everything in one long string.
And then, meaning, if we want to find something, we're going to start from the start and just continue to read. Unless everything's ordered. I mean, there I have it.
So how much does it cost us to insert something?
Well, I'm just going to jump to the end of the list.
I remember where I stopped last. So then I jump to the end.
I'm adding something.
Costs me just one access.
If I want to find something, I really
need to go through the whole list.
I need to go through the whole file, read everything,
basically, to find it.
The first one will be easy.
That's simple.
The next one or the last one, for example, will be easy. Or, if I've found one item and I want to find the next one, that will also be easy. But finding something
in the list is expensive. Unless it's ordered. What can we do if it's ordered? Who has an idea? Binary search. Very good. So what we're going to do is,
rather than just scanning from the beginning, we're going to go to the middle and see if it's
larger or less. If it's larger, we're going to go to the middle of the rest, see if it's larger or
less, and so on. Then we have binary search, so we're saving a lot,
because we just have logarithmic cost.
Delete: same. Replace: same. So you get the idea.
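As a minimal sketch, here is the lookup on an ordered sequential file, with the "file" modeled as a sorted vector of records:

```cpp
// Lookup in an ordered sequential file: linear scan is O(n), binary search
// on the sorted "file" (here a vector) is O(log n).
#include <cstdint>
#include <cstdio>
#include <vector>

struct Record { int32_t key; /* ... other attributes ... */ };

// Returns the index of `key`, or -1 if not found.
int64_t binary_search(const std::vector<Record>& file, int32_t key) {
    int64_t lo = 0, hi = static_cast<int64_t>(file.size()) - 1;
    while (lo <= hi) {
        int64_t mid = lo + (hi - lo) / 2;      // middle of the remaining range
        if (file[mid].key == key) return mid;
        if (file[mid].key < key) lo = mid + 1; // search the upper half
        else                     hi = mid - 1; // search the lower half
    }
    return -1;
}

int main() {
    std::vector<Record> file = {{3}, {8}, {15}, {23}, {42}};  // ordered by key
    std::printf("found 23 at index %lld\n", (long long)binary_search(file, 23));
}
```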
And because we cannot order everything,
sometimes we just have to store unordered,
and we cannot order everything by multiple attributes.
So remember, we might have a tuple with many attributes.
And sometimes we want to find something by this attribute,
sometimes by this attribute.
So we cannot order it completely by both attributes.
Doesn't work.
So we have to have one primary order and then a secondary.
Of course, we can do a combination,
but still, this won't help us if we have an access by just one of the attributes. So we might get some speedup, but we can also have an auxiliary structure, an index, that will basically point us to the area, or to the location, in our file where we can actually find our data.
So, and this is basically then an index file,
typically in a classical system,
which could be a tree-like structure, for example.
This would look something like this.
So this is a binary tree: we have the root, and then we can get the same search efficiency as with our binary search on any kind of key, just by using this binary tree. And you've heard of this already. The cost, I'll leave as homework, but you know this anyway because you know binary trees. However, this is not efficient. Especially,
it's not efficient if we're talking about disk. And the problem is these individual
nodes here, they're super small, right? And there will be many of them, because we're just binary, so we have many levels: log base 2 of the number of tuples many levels. And we will have to go through this, and potentially they're in random locations
on disk.
So this means for each of these levels,
we will have a random access to disk.
This is of course, well, super slow.
This might even be much slower
than just scanning through the files sequentially
because, as we said, the difference between sequential and random access will be a factor of 100 or even more in many cases; especially when talking about small data, we might have a factor of 1,000 or more.
So that means if we're not talking about huge, huge files, this will be more expensive.
So what can we do? One thing we can do: we can use a tree. (I don't know what's going on, it's a bit sensitive, I'm not going to touch it.) We can use a tree that's wider. So rather than having this binary tree, we're going to use the so-called B-tree, or Bayer tree. Rudolf Bayer was a computer scientist who used to work in Munich and invented this nice tree. And I took the original picture from the paper; this is why it looks so nice. And the idea is, rather than having this binary tree, we're going to have an n-ary tree, meaning we have many nodes, many child nodes, per parent node. And that basically gives us a much broader and much flatter tree. And each of the nodes will be much larger,
ideally, of course, somehow related to the chunk size that we will have on disk; so typically, the block size on disk.
And that means just having this organization of wide tree nodes, with many pointers to the children, means we will have fewer levels. And we're going to read from disk exactly what we need: we're going to read a complete node at a time, which is all the data that we need on each level, basically.
And typically, we're talking about, as I said, disk blocks. So this would be, say, four kilobytes, for example; each of these nodes would be four kilobytes or even more. And these are just pointers, right? So this means we will have a width, an N per level, of, say, 1,000 or so. So one node will have 1,000 children. And that means, of course, you get an exponential increase per level: 1,000 times 1,000 on the next level.
And that means these trees are really flat. In practical settings, two, three levels, maybe three, four levels, let's say. And that means, to find any kind of key, all you need is to go down four levels, and the first two levels you're probably going to have in memory, so you need just two disk accesses, something like that.
If you have four levels,
you're going to go through the third level
and then the fourth level,
and then you already know where your data is.
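The fanout arithmetic is worth seeing once; a tiny sketch with an assumed fanout of 1,000:

```cpp
// Worked numbers for the "flat tree" argument: with 4 KB nodes and, say,
// 1,000 children per node, the reachable key count grows by 1,000x per level.
#include <cmath>
#include <cstdio>

int main() {
    const double fanout = 1000.0;
    for (int levels = 1; levels <= 4; ++levels)
        std::printf("%d levels -> up to %.0e keys\n",
                    levels, std::pow(fanout, levels));
    // 1 -> 1e3, 2 -> 1e6, 3 -> 1e9, 4 -> 1e12: three to four levels
    // already cover billions of keys, and the top levels stay cached.
}
```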
And ideally, you organize the data accordingly. I mean, in the classical B-tree, you have the data also in your inner nodes, but then the tree of course gets deeper again. So the modern variant is the so-called B*-tree: the data is only in the leaves, but it's already there in the leaves. So you can actually answer a lot of queries just by reading the index levels, rather than accessing the actual data separately.
So this would be looking something like this.
So here I have an example.
Again, I'm going to skip this, like the details.
If you're curious about this, you can look it up.
It's not super important for this lecture, but it's a great data structure.
And it gives you kind of an idea of how to design something for hardware, right?
So this is really, really designed for disk.
So this is, or for block-based storage,
let me put it like this,
doesn't even have to be disk.
Any kind of block-based storage,
this is a cool idea
because it will give you this very fast access, very few levels, and you're not wasting anything.
But it's not cool if we're talking about memory, right?
So that's the problem.
So in memory, this doesn't make sense anymore.
But this is the go-to structure for disk-based access.
And this is why almost every database will use this kind of data structure today.
And then if you're talking about new systems, we'll have to think about something else,
right?
So we'll have to replace this in many cases with another data structure, how we organize
our data.
And one of these would be the hash table.
So that's something that's already existed before.
On disk, so if we're talking about disk accesses, this means we'll have buckets that, again, lie on disk. So we'll have our hash table, where we hash our keys with some kind of hash function, and then we hash them into buckets that will actually be placed on disk. Again, these should have the size of blocks on disk, because otherwise it's going to be inefficient. And we don't want the hash table itself to be all too large, because otherwise we don't have enough space in memory. And this means eventually we will have overflows; we'll have collisions.
And again, in a disk-based setting, it makes sense to have collisions, right? We want to have multiple items, multiple tuples, in a single block, because otherwise we'll just have a single tuple per hash table entry. And that's going to be inefficient, because every block on disk would just contain one tuple. So we want to fill them up. Of course, it's not going to be perfect. Some of them will be empty; if they're empty, we're not going to materialize them. Some will have only a few tuples. And some of them will overflow, and then we'll chain them.
And there are different strategies that we can use. This is not the lecture for hashing; we will touch on some hash table details, but the standard technique would basically be having a chain of overflow buckets. And this of course only works for one attribute; the access also only works for that one attribute. And this will be completely unordered. So if I want to search for something else, this is not going to be great; I'll definitely have to scan through everything.
And if I want to scan, that's also not great. The cool thing about the B-tree is that on the lower level everything is sorted, so there I can also have good range accesses and stuff like that. This is something the hash table doesn't give me. But it's super fast, so it's actually nice.
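As a minimal sketch of the bucket-plus-overflow-chain idea, here is an in-memory stand-in where the bucket capacity plays the role of "tuples per disk block" (shrunk to 4 so the chaining is visible):

```cpp
// Disk-style hash buckets with overflow chaining, shrunk to fit in memory.
#include <cstdint>
#include <cstdio>
#include <memory>
#include <vector>

constexpr size_t kBucketCap = 4;            // stand-in for "tuples per block"

struct Bucket {
    std::vector<int64_t> keys;              // up to kBucketCap keys
    std::unique_ptr<Bucket> overflow;       // chain of overflow buckets
    void insert(int64_t key) {
        if (keys.size() < kBucketCap) { keys.push_back(key); return; }
        if (!overflow) overflow = std::make_unique<Bucket>();
        overflow->insert(key);              // spill into the overflow chain
    }
    bool contains(int64_t key) const {
        for (int64_t k : keys) if (k == key) return true;
        return overflow && overflow->contains(key);
    }
};

struct HashTable {
    std::vector<Bucket> dir;                // one "block" per directory slot
    explicit HashTable(size_t n) : dir(n) {}
    Bucket& bucket(int64_t k) { return dir[static_cast<size_t>(k) % dir.size()]; }
    void insert(int64_t k) { bucket(k).insert(k); }
    bool contains(int64_t k) { return bucket(k).contains(k); }
};

int main() {
    HashTable ht(8);
    for (int64_t k = 0; k < 40; k += 3) ht.insert(k);
    std::printf("contains 27: %d, contains 28: %d\n",
                ht.contains(27), ht.contains(28));   // prints: 1, 0
}
```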
And then, well, these data structures, as I said, these individual parts will be on disk.
So now we have to think about how do we put them into memory.
And this will be done by a buffer manager.
So the buffer manager basically says, well, conceptually, everything is on disk.
But the system on top, our database system or our access system, so the record access, they say, well, they think everything's in memory.
So they basically on top, we don't want to deal with disk accesses.
We don't want to make the differentiation.
What is on disk?
What is in memory?
Whatever.
So this is basically where the buffer manager comes into place and says, everything's in memory.
Don't worry.
And if something is requested, some block is requested, which is actually not in memory, well, then we'll have to take them from disk.
So we'll say, well, it takes a little time.
It's in memory.
I'll find it somewhere. And if it's in memory, it will just be directly accessed.
If it's not in memory, it will be taken from disk and put into memory. And then basically something will be replaced. And there's lots of replacement strategies.
We'll not cover all of them here today.
I mean, the most simple one would be first in, first out.
So I read this first.
Once everything is full, this will be kicked out again.
Something much smarter would be LRU, least recently used. So basically, I'm always ordering the pages based on what I've most recently accessed, and the one that was least recently accessed will be kicked out.
This is conceptually great, but has one big problem.
It doesn't save you from scans, because a scan will basically flush your complete buffer: the scan will go through a huge table and override the whole memory, basically. And this is why we want something like least recently used two (LRU-2), which means we're not looking at the last access but at the second-to-last access. Because the scan will access everything only once, we basically weight what's been accessed twice, and order by the second-to-last access, not the last one.
And that will basically save you from the scan. The scan will just go through the data and not replace everything that we're frequently using, but only what we just used once. And later on, if the data that's been scanned is reused, it will stay in memory. But we're not going to keep stuff in memory that's never been accessed again, or not accessed for a very long time; we'll replace it with stuff that has been accessed multiple times.
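A minimal sketch of plain LRU eviction, with a list ordered by recency and a map for lookups; LRU-2 would order by the second-to-last access instead, which is what makes it scan-resistant:

```cpp
// LRU buffer pool: pages live in a list ordered by recency; a hash map
// finds them; the least recently used page is evicted on a miss.
#include <cstdio>
#include <list>
#include <unordered_map>

class LruBufferPool {
    size_t capacity_;
    std::list<int> pages_;                                   // front = most recent
    std::unordered_map<int, std::list<int>::iterator> pos_;  // page id -> list slot
public:
    explicit LruBufferPool(size_t cap) : capacity_(cap) {}
    // Returns true on a hit; on a miss, "reads from disk" and maybe evicts.
    bool access(int page_id) {
        auto it = pos_.find(page_id);
        if (it != pos_.end()) {                    // hit: move to front
            pages_.splice(pages_.begin(), pages_, it->second);
            return true;
        }
        if (pages_.size() == capacity_) {          // miss: evict the LRU page
            std::printf("evict page %d\n", pages_.back());
            pos_.erase(pages_.back());
            pages_.pop_back();
        }
        pages_.push_front(page_id);                // fetch as most recent
        pos_[page_id] = pages_.begin();
        return false;
    }
};

int main() {
    LruBufferPool pool(2);
    pool.access(1); pool.access(2);
    pool.access(1);           // hit: 1 becomes most recent
    pool.access(3);           // evicts page 2, the least recently used
}
```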
Okay.
So then we said this is going to be blocks.
So not individual records.
Now we have to find these individual records.
So the individual records, we need some kind of addressing.
And this is typically like a record ID.
This can be absolute, in terms of saying: on this page, in this exact place. Or we say: on this page, and please find it somewhere on the page, via some kind of additional slot ID there. Both are possible; there are many different ways of how to implement this. We can also have completely arbitrary addressing and then some kind of mapping. But
typically we would have something like this. So we want to find, we need to find the right page.
And then either we store the offset on the page or we store the offset
already in the record ID. Then I know exactly where to go.
If I store the offset directly, this gives me less flexibility: if I have to move something, I need to set up a tombstone, like a redirection, or I have to update every place where I've used this record ID. And there might be lots and lots of indexes that use this address.
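A minimal sketch of the page-plus-slot variant of record IDs, where the slot directory provides the indirection that lets a record move within its page:

```cpp
// Record IDs (RIDs): page number plus a slot on that page. With a slot
// indirection, a record can move within its page without invalidating the
// RIDs stored in indexes; only the slot entry is updated.
#include <cstdint>
#include <cstdio>

struct RecordId {
    uint32_t page_id;   // which page in the file
    uint16_t slot;      // which entry in the page's slot directory
};

struct Page {
    // Slot directory: slot -> byte offset of the record inside the page.
    // A special offset could serve as a tombstone / redirection marker.
    uint16_t slot_offsets[64];
    uint8_t  data[4096];
};

int main() {
    Page page{};
    page.slot_offsets[3] = 128;        // record for slot 3 starts at byte 128
    RecordId rid{42, 3};               // "page 42, slot 3"
    std::printf("record bytes start at offset %u on page %u\n",
                (unsigned)page.slot_offsets[rid.slot], rid.page_id);
}
```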
So addressing is also tricky; nothing is easy, essentially. Okay, so very briefly, and then we're going to do a two-, three-minute break. Very briefly: from a data model view, as I already told you, on the top we have this relational model. And we also divide this into three levels.
So we have the conceptual schema.
This is what you'd build in a modeling class where you, I don't know, structure some kind of company, or the university with students and lectures and whatnot. So this would be your conceptual model. This is kind of the world model that's inside the database: this is what we store in the database and how we structure it. Then we say different kinds of users will see different parts of the data; that's the external schema. And then internally, we as system designers,
we do what we want.
So we can basically say: well, up top you have your relational model or whatnot, and we organize this on our disk like this and that. And that's then the internal schema.
And there we also have indexes; we have different ways of storing. Up here, we might have lots of different SQL data types, and internally we might just merge some of the data types, say: okay, we'll store all of this as strings or something. Or we might be more diverse and say: well, you say it's an integer, but actually it's a tiny integer, so we're just going to store this in a reduced format, and we're going to be more flexible and more efficient here. Okay, so that was a run through the
architecture. With this, I need a short break to catch some breath.
So we're going to do a five-minute break and then talk briefly about the query processing after this. Or do we have questions right away? Here. Everything known already, like this? Okay. Well then, still as a reminder, maybe. Okay.
So let's continue
as I figured out
the problem with my pointer.
So, query processing. For the last couple of minutes we're going to talk about query processing; this is also what we are interested in in the course. You've seen the architecture, now let's have a query and do something with it. So how does this work in a system? We have a declarative query. This would be: give me the name, the address, the checking account, and the balance of customers and accounts. So, a customer table and an account table. We're going to join these two tables based on the account number. And we only want the customers with the name Bond.
This is declarative because we don't tell the system how to do this.
We just tell the system we want this kind of data.
Well, we also tell the system we want to select specific subsets of the tables and we want to have the two tables join.
But this is already implicitly set here.
So we're joining this, or at least we have the Cartesian product, and we're selecting a subset of the Cartesian product according to these two restrictions. And then we're not interested in all attributes, but only in the ones up there.
So now we have to translate this into something
that will be executed in a way.
And there's many different ways on how to execute.
This will also talk about this in the lecture.
So we need some kind of query execution plan.
And this can be as low as this, right?
So we just have a loop that says: for each c in customer, so for each individual tuple in customer, if the name is Bond, then another loop: for each a in account, if the account number in a equals the account number in c, then we output this tuple, right? So this is just two loops, basically, where we do the join and we do the selection.
And this is called a query execution plan.
This would already be quite low level.
It's a procedural specification
and it needs to be semantically equivalent to the query.
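That plan, written out as straight C++ (the table and attribute names are made up to match the example):

```cpp
// The query execution plan from the slide as two nested loops, with a
// selection on the customer name and a join predicate on the account number.
#include <cstdio>
#include <string>
#include <vector>

struct Customer { std::string name; int account; };
struct Account  { int account; double balance; };

int main() {
    std::vector<Customer> customers = {{"Bond", 7}, {"Smith", 9}};
    std::vector<Account>  accounts  = {{7, 1000.0}, {9, 250.0}};

    for (const Customer& c : customers)        // for each c in customer
        if (c.name == "Bond")                  //   if c.name = 'Bond'
            for (const Account& a : accounts)  //     for each a in account
                if (a.account == c.account)    //       join predicate
                    std::printf("%s %d %.2f\n",
                                c.name.c_str(), a.account, a.balance);
}
```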
And because this is high level, this is declarative,
we can generate many of these query plans.
So there's many different ways of executing this.
And well, our system needs to find a way
how to do this efficiently.
And a lot of database research is just about this transition, right? From here to here in an efficient way. And this part, what we're doing, is also about making this efficient on the hardware, in different versions. In this class it's really more about: if we have something like this, how do we execute it efficiently? How can we restructure it? What are the ways? How do we write something like this? We're not so much about how we get from the query to such a plan, but we'll briefly touch on how we
get there. What we do first is the query translation and optimization. In the class, we'll mostly think about already optimized plans, so we know what exactly should be executed, something like this, at least from a high-level view; all of the optimizations are done. So we're not so much about index selection, et cetera. We know we want to use this index; we know we want to execute in this order.
So, but to get there, what we need to do is first,
we need to parse the query.
So we get this kind of SQL text here.
I mean, this is just text, right?
What this system gets.
Now it needs to do something,
create something that it can work with.
And first, of course, it needs to check semantics, et cetera.
Do the tables exist, et cetera.
Is the syntax correct in the first place?
Then it will be rewritten generically. So if we're using views, for example, which don't physically exist, we're expanding them. If we have common sub-expressions, they will be eliminated and replaced.
We try to find an optimal, not necessarily optimal,
but good enough query plan.
An optimal plan is really hard to find.
So we try to find something that works well in the average case.
And there's two ways that we optimize. First, we apply a couple of
rules. And then we just apply one after the other. I'll show you some. And then we use a cost-based
optimizer. So based on an estimation, how much this query will cost or different execution plans
will cost, we'll choose one where we think this will be the cheapest.
And I mean, traditionally, we say we generate many candidate plans, and then we find the one,
we cost them all, and we find the one that's cheapest. But often we have more like a,
let's say, more greedy approach. So based on certain generation rules,
we're going to find something that we hope will be cheap in the end.
We're not so much generating a huge search space and then moving through it somehow to find the best plan in there.
And in order to do this cost-based optimization, we need to know something about the data; otherwise, we cannot really say anything about the cost. Say, for example, the cardinality of the table: we need to know how much data we actually have to process. On a disk-based system: do my intermediates fit into memory? If yes, then I don't have disk accesses after the first scan anymore. If no, I will have disk accesses during intermediate generation, which means additional work, for example. The same kinds of things are true in general query processing, also in memory. And then we execute the query.
And of course, we can also, if we notice,
well, our plan wasn't optimal here,
or some of the data patterns change,
then we might do some dynamic refinement there.
So think about sorting, for example.
I mean, if you use a basic sorting algorithm, we'll touch on this next time, say the standard sort in a C library, for example, that's going to change the sorting method based on the data that it sees. It's called an introsort, an introspective sort. It will check what the data looks like. If there are only a few different values that we're sorting, it's going to basically do a radix sort, or just directly insert the data into the right position. Otherwise, it will switch to a quicksort, for example, or other functions. I don't remember the exact details of which order, but it basically changes the sort function based on the data it sees. And the same we can of course do while executing query plans.
So going from a logical plan to a physical plan would look something like this.
So our logical plan already rewritten into relational algebra.
So we have two tables.
We're joining the two tables.
We're selecting based on attribute x in table R, where all x's are smaller than five. And we're projecting to x and z. So our tables here would have an attribute a for the join: r at least has to have a and x, and s at least has to have a and z.
And then we can translate this.
And a direct translation of this would be, let me, yeah: we're just doing a Cartesian product, so we're basically combining every r and every s. Then we're selecting the ones that are equal in the a attribute.
Then we're selecting the ones where x is smaller than 5.
And then we're projecting to x and z.
So that's the direct translation.
That's exactly what the query would say.
But of course, we want to do something more.
So here, we already have specified a join.
So we can actually use something like a join. A very simple join is a nested loops join.
Here we're already using something more specific. So nested loops is what we saw earlier.
Two loops nested in each other, one going through one relation, the other going through the other relation. Already here, the order makes a difference, depending on whether we can keep one of the relations in memory or not, or whether we basically have to read data from disk in every iteration. And if we have an index on one of the columns, we can use an index nested loops join,
meaning we're just scanning one of the tables and then using the index on the other side.
Again, whether using the index here makes sense really depends on how many accesses we will have. If we basically end up reading the whole relation through the index, or if that relation would already fit into memory, then this will be much less efficient than just scanning one table once and scanning the other table once, more or less. So if we can fit one of the tables into memory, then the index nested loop means we're more or less just scanning once on the other side. But if we have large tables, for example, even though nothing is ordered in any way, we can directly find through the index the exact tuples
that we want.
And here, the scan already gives us the right tuples on the left side, because in the scan we can typically already filter. And then we just return. Strictly speaking, we would also need a projection on top here. But projection is often implemented in all of the operators anyway, so every operator can basically say: I'm just going to hand out this subset of the attributes. Again, it depends on the implementation. So that would be one implementation.
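As a sketch of that index-based implementation, again in C++: the index is modeled here as a hypothetical hash multimap from the join key to the matching S tuples, which is an assumption for illustration, not a claim about any real system's index structure.

```cpp
#include <unordered_map>
#include <utility>
#include <vector>

struct RTuple { int a; int x; };
struct STuple { int a; int z; };

// Index nested loops join: scan R (filtering x < 5 during the scan) and
// probe a prebuilt index on S.a for each outer tuple.
std::vector<std::pair<int, int>>
index_nlj(const std::vector<RTuple>& R,
          const std::unordered_multimap<int, STuple>& index_on_s_a) {
    std::vector<std::pair<int, int>> out;
    for (const auto& r : R) {
        if (r.x >= 5) continue;                        // filter pushed into the scan
        auto [lo, hi] = index_on_s_a.equal_range(r.a); // index lookup on S.a
        for (auto it = lo; it != hi; ++it)
            out.push_back({r.x, it->second.z});        // project to (x, z)
    }
    return out;
}
```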
Another implementation could look like this. Rather than using a nested loop, we can do sorting on both sides, so we do a sorted scan. Maybe the data is already sorted on the join key: perfect, then we don't need to sort. Then we go into the filtering. And then we only need to merge, rather than doing this looping through the data.
We know that what we see is what we need right now, because it's sorted on both sides by the join key. If we want to join, we just need to check whether we are off somehow: we basically start from the smallest items and move towards the largest. So if on one side I currently have ID 5 and on the other side ID 4, I move forward on the side with the smaller item. Once I'm larger than 5 there, I move on the other side. That way I can basically scan through both sides once in this merge phase. And if, as I said, the data is already sorted, perfect: this will actually typically be the fastest way of joining.
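Here is a minimal sketch of that merge phase in C++, assuming both inputs are already sorted by the join key a; the tuple types are the same hypothetical ones as above.

```cpp
#include <utility>
#include <vector>

struct RTuple { int a; int x; };
struct STuple { int a; int z; };

// Merge phase of a sort-merge join: both inputs sorted by join key a.
// Advance whichever side currently has the smaller key; on equal keys,
// emit all combinations with the matching run on the other side.
std::vector<std::pair<int, int>>
merge_join(const std::vector<RTuple>& R, const std::vector<STuple>& S) {
    std::vector<std::pair<int, int>> out;
    size_t i = 0, j = 0;
    while (i < R.size() && j < S.size()) {
        if (R[i].a < S[j].a) {
            ++i;                                 // R side is behind: move it
        } else if (R[i].a > S[j].a) {
            ++j;                                 // S side is behind: move it
        } else {
            for (size_t k = j; k < S.size() && S[k].a == R[i].a; ++k)
                out.push_back({R[i].x, S[k].z}); // join R[i] with the S run
            ++i;                                 // next R tuple may hit the same run
        }
    }
    return out;
}
```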
If we have to sort first, that's a different case; then other kinds of joins might be better. Our nested loops join, so forget about the index for a moment, is the fallback whenever we're doing arbitrary kinds of joins. If we're not doing equi-joins, so not comparing by equality, but doing something like larger-than, or some string-matching join, whatever, then nested loops is kind of the fallback that can handle everything, because here we're going to compare every tuple with every tuple anyway.
OK, so if you look at these relational algebra operators, then there's more or less a mapping; not quite a direct mapping, but a mapping. Each relation will basically be a scan, meaning we have to scan the table. We can optimize this away in certain cases, say if we have an index or something. A selection would be a filter or, say, an index access. For a projection, if we allow duplicates, we don't have to do much; we just maybe cut off attributes from the tuples. A cross product would be a nested loops join, as I told you. For a regular join or an equi-join, we have many different ways of implementing it: hash join, sort-merge join, index nested loops join, nested loops join, et cetera. Grouping would be an aggregation, basically: hash aggregation or sorted aggregation. Duplicate elimination would be a special kind of aggregation, because we're just aggregating all of the duplicates into individual values. And the set operations, so union, intersection, and difference: the union is just appending the two inputs, the intersection would be a special case of the join, and the set difference is basically an anti-join, so we keep only the tuples that have no join partners.
Okay.
So then, rather than just taking the direct translation, we do the optimization, the cost-based optimization, and we're using statistics. The table sizes: if something fits in memory, we're going to keep it in memory; if it doesn't fit in memory, we're going to deal with it differently. The cardinalities: how many different values do we have, so basically the active domain of an attribute. I mean, the domain can be large, but how many different values do we actually use? Say, for example, we have an integer attribute that's an age. The age will be from zero to, say, 120 or something like that, and nothing else, and that means we're not going to have very many different values in this column. And based on this: if we say, for example, we do hashing, and we build a hash table that can hold, I don't know, 100 million different values, we're never going to be able to use this properly, because we will only have about 120 different values. And if we look at demographics, most of them will be in the range of 40 to 60 or something like that. So this affects our choice of operators; this affects the way we use the data, basically. And we can know the highest and the lowest key, and we can use histograms for this.
If we have very frequent values, then just all of the instances of this one value might already explode and be larger than our memory. If we don't deal with this, we might not be able to answer the query. And of course, the statistics are usually flawed, especially for joins. If you have many joins on different kinds of attributes, the underlying assumptions are typically wrong to some degree. And this is actually where machine learning helps a lot, because you can learn these kinds of distributions better than just using basic thresholds. Classical databases, for many of these things, will just use thresholds: if the data is like this, then we do this; at the next level, we do that, et cetera. Often you can also do some sampling, or you sample while you're reading the data, while you're executing the queries.
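One classic way to draw such a sample in a single scan is reservoir sampling. A minimal sketch in C++ follows; the stream-as-vector interface is a simplification for illustration.

```cpp
#include <random>
#include <vector>

// Reservoir sampling: maintain a uniform random sample of fixed size k
// over a stream of values, e.g. while scanning a table exactly once.
std::vector<int> reservoir_sample(const std::vector<int>& stream, size_t k,
                                  std::mt19937& rng) {
    std::vector<int> sample;
    sample.reserve(k);
    for (size_t i = 0; i < stream.size(); ++i) {
        if (i < k) {
            sample.push_back(stream[i]);        // fill the reservoir first
        } else {
            std::uniform_int_distribution<size_t> d(0, i);
            size_t j = d(rng);                  // keep with probability k/(i+1)
            if (j < k) sample[j] = stream[i];
        }
    }
    return sample;
}
```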
Rule-based optimization is a different way of optimizing, with the same goal. Here we have certain kinds of rules that we apply. So, for example, selections: we're going to apply them as soon as we can, because they will reduce the data size. Projections: same thing. Any kind of Cartesian product that we can turn into a join, we will, because it's going to be cheaper to execute. As soon as we can reduce the intermediate results, at least we hope, we will be faster. There are corner cases, if we're talking about very small data sets, where this actually doesn't apply. In certain cases it makes sense to just use brute-force algorithms rather than doing some optimization, because the optimizations will cost us more, or instantiating certain data structures costs us more, for example. But in the average case, these kinds of rules will apply, and then we're just going to use them. So we're pushing projections, we're pushing selections, and this will help for many kinds of executions, though not for all of them.
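To see why pushing the selection down pays off, here is a tiny sketch contrasting the two plan shapes under a hypothetical nested-loops cost model that just counts tuple comparisons.

```cpp
#include <cstddef>
#include <vector>

struct RTuple { int a; int x; };
struct STuple { int a; int z; };

// Plan A, no pushdown:  sigma_{x<5}( R join_a S )  -> join sees all of R.
size_t comparisons_without_pushdown(const std::vector<RTuple>& R,
                                    const std::vector<STuple>& S) {
    return R.size() * S.size();       // nested loops over the full inputs
}

// Plan B, pushdown:  sigma_{x<5}(R) join_a S  -> join sees the filtered R.
size_t comparisons_with_pushdown(const std::vector<RTuple>& R,
                                 const std::vector<STuple>& S) {
    size_t kept = 0;
    for (const auto& r : R)
        if (r.x < 5) ++kept;          // sigma_{x<5} applied during the scan
    return kept * S.size();           // the join input has shrunk
}
```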
So we want to have small intermediate results, few materializations, and to access secondary storage as little as possible. Or, as I said, classically: don't use the network if you're in a distributed setup. Again, that doesn't necessarily apply if our network is faster than we thought. And we kind of have to keep this trade-off in mind. In the past, we would basically say the disk is so much more expensive that I don't have to think about memory accesses at all. All of a sudden, if my persistent memory or my SSD is getting closer, closer meaning a factor of 100 maybe, then if I do 100 or 200 memory accesses instead of one SSD access, all of a sudden I'm actually slower. In the past, this would rarely happen; you would have to be super inefficient. But something like accessing memory 200 times, you can quickly generate something like that. It's just something to keep in mind.
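To put rough, assumed numbers on this, and these are ballpark figures, not measurements from this lecture: a DRAM access that misses all caches costs on the order of 100 nanoseconds, while a read on a fast NVMe SSD is on the order of 10 microseconds, so roughly that factor of 100. Then 200 memory accesses already cost about 20 microseconds, twice the single SSD access they were supposed to replace.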
OK, then we have, of course, the different kinds of operators. I've mentioned these already, and since we're running out of time: there are different ways of joining, the sort-merge join, the hash join, or the fallback solution of the nested loops join. Rather than going through the Grace hash join now, this is something you can look at yourselves: how does the Grace hash join, or the hybrid hash join, work?
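If you do look it up, the core idea is partitioning. Here is a minimal sketch of a Grace-style partitioned hash join in C++, under the simplifying assumption that the partitions live in memory; in the real algorithm they would be written to and re-read from disk so that each build side fits in memory.

```cpp
#include <unordered_map>
#include <utility>
#include <vector>

struct RTuple { int a; int x; };
struct STuple { int a; int z; };

// Grace-style hash join sketch: hash-partition both inputs on the join
// key, then build and probe one in-memory hash table per partition pair.
std::vector<std::pair<int, int>>
grace_hash_join(const std::vector<RTuple>& R, const std::vector<STuple>& S,
                size_t partitions) {
    std::vector<std::vector<RTuple>> rp(partitions);
    std::vector<std::vector<STuple>> sp(partitions);
    for (const auto& r : R) rp[size_t(r.a) % partitions].push_back(r);
    for (const auto& s : S) sp[size_t(s.a) % partitions].push_back(s);

    std::vector<std::pair<int, int>> out;
    for (size_t p = 0; p < partitions; ++p) {
        std::unordered_multimap<int, int> ht;  // build side: a -> x
        for (const auto& r : rp[p]) ht.insert({r.a, r.x});
        for (const auto& s : sp[p]) {          // probe side
            auto [lo, hi] = ht.equal_range(s.a);
            for (auto it = lo; it != hi; ++it)
                out.push_back({it->second, s.z});
        }
    }
    return out;
}
```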
For the other parts, I want to stop here and ask you if there are questions. The rest is basically not super relevant for this lecture. We talked about the database model, so the relational model and SQL. You will see SQL every now and then; I hope you can read SQL. It won't even be as complicated as the whole HAVING, GROUP BY, whatever, that we had in the beginning, more like small parts, but you need to understand it. So if you need a refresher there, take a refresher. Then the classical database architecture: it's good to know, and it's good to understand how a system does this. Because only if you know how systems access and work with the data, how systems work internally, can you actually use them efficiently. So even if you're not up to programming the next fast database system, but you only want to do your, I don't know, Python notebook, Pandas executions, whatnot: there are also joins in there. And if you know how this kind of stuff works, and the different kinds of design choices you can make, then you can program this faster and more efficiently than if you don't know it. Because if you're, not in a bad way, ignorant about this lower-level internal execution, you can come up with very bad execution plans. And if you're doing Python, et cetera, you will have to write your execution plan yourself, basically, and that might be arbitrarily bad. So it's actually good to understand how this works internally.
And yeah, this is kind of what we talked about today. Next time, we will talk about performance analysis. So, having these kinds of systems, or generally having an idea of how you want to execute something: how do you analyze this? And how do you get an idea, a model, of how fast your execution would be? That's what we're going to talk about next time.
So questions?
No questions.
Perfect.
So thanks a lot.
See you next week on Tuesday.