Hardware-Conscious Data Processing (ST 2023) - tele-TASK - CPU and Caching
Episode Date: April 26, 2023...
Transcript
Today we're going to start with CPU architecture, but we'll first finish off the benchmarking part. I still have one announcement, which you heard yesterday: today we'll have the bi-weekly research seminar, together with Darmstadt. You're welcome to join if you're interested. And the rest I will show you later.
That's my only announcement today.
Now let's switch to benchmarks.
Okay, working.
So this is where we left off yesterday.
So we discussed some statistics, some terminology.
Now we're gonna talk about benchmarks in more detail.
So what kind of benchmarks are out there?
And yeah, that's mostly it.
And then, if you compare different systems,
that's the later part: how you do this,
and what are some of the traps you can run into.
So there are different types of benchmarks.
One type is micro benchmarks, and this is basically the stuff that we do here all the time, right?
So the stuff that I show you in the lecture, these are micro benchmarks. We're basically trying to figure out how the caching behaves, or, like yesterday, how fast sorting is by looking at branch mispredictions and things like that.
So we're trying to evaluate a specific low-level system operation.
Could be a CPU operation, could be something in a single system.
Say minimum, maximum, average on a single laptop.
The caching stuff, et cetera, which I will show you later.
Super useful, but it won't necessarily tell you anything about end-to-end performance.
Then there's component benchmarks.
So these would be more high-level functions.
So say, for example, if we think about like whole large systems like MapReduce or something like that, then sorting would not necessarily be a micro benchmark, right?
Because this means we have to spin up the whole cluster.
We have like everything in the system is involved,
but we're just testing one specific component
or one specific function.
And so, or basic SQL operations might be that.
So they can be on different levels.
If we're implementing, like hard coding, a very specific SQL function,
it's more like a micro benchmark.
If we actually go through the whole system, we're doing a functional benchmark.
Then we also have the application level benchmark.
And this is more: okay, let's see how this system behaves in a real-world application. Typically in a confined environment, because the real world is super messy, meaning we're going to get results that we don't expect, that we cannot necessarily interpret or explain, because there are outliers, something's happening. So we want a clean-room environment, but still close to a realistic setup. Meaning we try to cover all the aspects that we want to see in reality or that we want to benchmark for real workloads, but we don't want effects that we cannot explain. Of course, we also want to test for things like that. Say, for example, in the real world you will have lots of issues with outages, with disk crashes, etc.
In a benchmark, of course, you need to be aware that this happens,
and you need to talk about availability.
But these kind of
effects, if you have them in one benchmark run, not in the other, you're going to have a hard
time comparing systems. So we're talking about given data and a given workload that, at least structurally, is known from a benchmark point of view. It doesn't necessarily have to be all known to the system, so the system should still be challenged like in the real world, but we as researchers, or we as the people benchmarking, know all the details.
And finally, of course, there are real applications, meaning we work with real data, which makes a lot of sense
but has this kind of messiness.
We don't necessarily know what's going on in there unless we deeply analyze everything.
And for benchmarks, we can also look at another dimension that's orthogonal to these kinds of types or levels: standardized benchmarks.
So the TPC and SPEC are the two standardization organizations
that are relevant for database benchmarks or hardware benchmarks.
SPEC is really all about hardware, and of course also different kinds of application levels. But these are usually suites of benchmarks that try to cover a broad range, of micro benchmarks in many cases, or of realistic applications. So it's usually a suite of many different kernels that will be run.
While the TPC kind of tries to emulate a real scenario, say like
a complete
data warehousing application with all
the aspects that you would have concerning the database.
And these are standardized benchmarks.
So there's a lot of rules on how they need to be run and you can publish those on the official websites.
But of course you have to pay, and there's a lot of rules around it:
how to execute the benchmarks and what you need to disclose.
And a lot of these things we as researchers actually cannot disclose,
because we don't know, say, for example, the exact price and maintenance cost. Usually
we don't pay for premium enterprise service, and this is what you would need in order to get
the level of service required for one of these benchmarks.
Anyway, there's also other benchmarks, which I would call widely used benchmarks.
And this is more of a soft criterion, but it's something where I would say,
if you read a lot of database papers or architecture papers, stuff like that,
you will come across the same kind of benchmarks over and over again.
So a lot of people are using them for their publications, for their research.
And this is a good thing.
And this means a lot of people will also understand the numbers.
So, while we have the standardized benchmarks that most people in the field should know and be able to talk about, there are also other benchmarks which are not standardized but which are still super useful, because most people will still understand what you're talking about.
Or there's already a lot of work that you can compare to.
And then there's other benchmarks which also make sense. So very
specific benchmarks for certain applications. This is perfectly fine. Also, everybody will come up
with new benchmarks for their work just to really show what their system, their application, whatever
can do in the exact scenario that they're targeting.
But these numbers are usually not comparable to something else because
nobody else runs these benchmarks or they're not available, etc.
So this is useful for certain applications,
but not for comparability across different systems.
So one micro benchmark example, and I'm not going to go into details here because this is something that I actually have in the next lecture, which we'll also cover today: say, for example, we're looking at caching effects, so how efficient is cache line access. This would be a micro benchmark. Depending on whether we do a row-wise or a column-wise traversal through an array, or in this case a single array, whether we access memory consecutively or jump around in the array, we get different performance.
And we'll see the results later on, but the difference you can see is actually quite significant. So we're doing two nested loops over a single array. It could also be a two-dimensional array, it doesn't really matter, but here we basically enforce that we either jump in the data or walk through the data consecutively by using a single array. In a two-dimensional array, it really depends on the architecture or the programming language how the array is laid out in memory, so whether it's laid out column-wise or row-wise. Here we know it will be laid out row-wise.
If we do a column-wise traversal, we will jump in the array
in basically every individual iteration.
And this means we're not going to benefit from the caches much.
And this can lead to very different effects.
We'll see the details later.
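To make this concrete, here is a minimal sketch of such a micro benchmark (my own illustration, not the exact code from the slides): a single array is treated as an n x n matrix stored row-wise and summed once row by row and once column by column.

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const size_t n = 4096;                     // n x n matrix, stored row-wise in one array
    std::vector<int> data(n * n, 1);
    long long sum = 0;

    auto t0 = std::chrono::steady_clock::now();
    for (size_t row = 0; row < n; ++row)       // row-wise: consecutive memory accesses
        for (size_t col = 0; col < n; ++col)
            sum += data[row * n + col];
    auto t1 = std::chrono::steady_clock::now();

    for (size_t col = 0; col < n; ++col)       // column-wise: jump of n elements per access
        for (size_t row = 0; row < n; ++row)
            sum += data[row * n + col];
    auto t2 = std::chrono::steady_clock::now();

    // Print the sum so the compiler cannot optimize the loops away.
    std::printf("sum=%lld row-wise=%lld ms column-wise=%lld ms\n", sum,
        (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count(),
        (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count());
    return 0;
}
```

On most machines the column-wise loop is several times slower, because every access touches a new cache line, while the row-wise loop reuses each loaded line for many consecutive elements.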
I also have this as a
practical example which I will show you how it works on my laptop. So yeah, more
details this lecture. And micro benchmarks are good because we can
really focus on something small in a very high detail. We have a very
controllable workload and data characteristics,
typically synthetic or maybe even real data. Everything is kind of clearly as we expected,
at least it should be, right? So we have very much control of what we're doing. And because
they're small and they're fast, we can test a lot.
So a typical application benchmark will run for hours. This will run within seconds.
Right, so this really depends. Some of this stuff you can run in under a second, if your variance is not too high, because you're going to get stable numbers quite quickly, since the operations are simple.
You cover all of the variance quickly.
For larger experiments, well, if your experiments say you're training a huge machine learning
deep network, this will take hours.
You cannot do this a million times. Small operations like
cache access, I can do millions to billions of times without spending hours and hours, right?
So it's really useful for in-depth analysis and has a low threshold and it's easy to run. And
it's also something really nice to get these kind of numbers that later on we can use
in our back of the envelope calculations. So if, for example, you're not sure how fast your laptop is on branch mispredictions, or how fast the accesses on different cache levels are, then you write a small program. This won't take you days, this will take you an hour or so, and then you can actually evaluate and know for sure what to expect. You can use my latency numbers for a lot of the stuff, but they might be outdated, they might not hold for your architecture. If you're trying something on a very specific nice new architecture, say high bandwidth memory, you're going to get drastically different latency numbers
than what I showed you.
So you can test it.
And it won't take long.
And this is really good.
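If you want to produce such latency numbers for your own machine, a classic trick is pointer chasing: link the elements of an array into one random cycle, so every load depends on the previous one, and measure the average time per access for different working-set sizes. The following is only a rough sketch under those assumptions, not a polished measurement tool:

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Average latency of dependent loads over a working set of `n` slots (in ns).
static double chase(size_t n, size_t steps) {
    std::vector<size_t> order(n);
    std::iota(order.begin(), order.end(), 0);
    std::shuffle(order.begin(), order.end(), std::mt19937_64{42});

    std::vector<size_t> next(n);
    for (size_t i = 0; i < n; ++i)             // link all slots into one random cycle
        next[order[i]] = order[(i + 1) % n];

    size_t pos = order[0];
    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < steps; ++i)
        pos = next[pos];                       // each load depends on the previous one
    auto t1 = std::chrono::steady_clock::now();
    std::printf("(pos=%zu) ", pos);            // use the result so the loop isn't optimized away
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / double(steps);
}

int main() {
    for (size_t kb : {16, 256, 4096, 65536})   // working sets around L1, L2, L3, DRAM sizes
        std::printf("%6zu KiB: %.1f ns per access\n", kb,
                    chase(kb * 1024 / sizeof(size_t), 20'000'000));
    return 0;
}
```

Working sets that fit into L1 or L2 should come out at a few nanoseconds per access, while DRAM-sized sets should end up around the memory latency of the machine.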
However, it really neglects the larger picture.
So just benchmarking caches won't tell you how fast your application will run
because you might not be cache bound, right?
So if you're compute bound,
caches don't really matter that much.
So you can optimize away as much as you want.
We're not going to get like any details
about the performance of your application in the end.
Also, just by, I mean, of course,
you can still do the back of the
envelope calculation, you get a rough number, but just from the performance results of your
micro benchmark, you don't know, right? So, it won't give you the larger picture. So, you need
to embed this into the context. And so, the generalization is difficult, but as I said, it's super useful for your basic estimations.
And you can incrementally grow this, right?
So now you know the individual cache performance.
If you want to know about a more complicated memory access pattern, you can combine these individual accesses, your individual benchmarks, into more complex structures, and that might already give you an indication of how fast your data structures will be.
So this is useful, but it won't give you the complete picture.
Even if you know your data structure, your data structure might not be the bottleneck.
Then again, you need this larger picture, basically.
And on the application level, we have these standardized benchmarks.
And I have a few examples here.
This is basically what you should know if you do database systems work.
So there's the Transaction Processing Performance Council, the TPC.
That's the standardization organization for database benchmarks. It's non-profit, it's vendor neutral, well, there's a lot of vendors basically running this, so you can see Actian, Alibaba, AMD, Intel, where's Oracle? Yeah, IBM, Oracle, Microsoft, all of the companies that somehow produce, or used to produce, these large-scale database systems are part of it.
And they work together to standardize benchmarks
that they can compare the systems.
So the idea is, okay.
And I mean, the reason why they did this is basically
initially everybody would just come up
with their own benchmarks or their own comparisons
and would just do benchmarketing, what we call it.
Like somehow produce numbers and show
that their system is 100 times faster than the other system. And the comparison would never be
fair. So here the idea was how can we basically build something so we get fair comparisons.
And there's a lot of specification, and a lot of it is really based on having auditors come to your company and check that you're actually doing the right thing.
And of course, in research, we won't do this.
We won't pay an auditor to basically check that our code is executed in the right way.
But these benchmarks are well understood.
And that's the cool thing about it. So there are OLAP and OLTP benchmarks. TPC-C and TPC-E are the OLTP benchmarks, and I would say every database systems researcher will know them and will basically have an understanding of what to expect, like how hard these are to execute. Same for OLAP: TPC-H, TPC-DS. There are a lot of people out there who can basically tell you the performance for each individual query in these benchmarks.
TPC-H and TPC-C are the older ones and are a bit simpler. TPC-H, for example, has 22 queries, which are large OLAP queries.
TPC-DS already has 99 queries.
Much more complicated, much more complicated schema.
And TPC-C, again, is simpler, with few operations.
TPC-E has few operations as well, but more complex ones.
And these, so TPC-H, TPC-DS, TPC-C and TPC-E, are just specified.
So for those, you have a data generator, and you might have a query generator.
So TPC-C doesn't even provide a query generator.
You have to implement this yourself.
There are some open source implementations out there, but in general you just get the data and the queries, and then you have to do it yourself.
But then there are also other benchmarks like TPCx-BB and TPCx-AI.
These are the express benchmarks.
This is something the TPC came up with when they noticed, well, not too many people are actually still using our benchmarks,
at least from the industry side, because it's just so expensive to implement this.
So, I mean, for a company, of course, you can imagine if you want to run this, you want to have a very good number, right?
So you want to have a very good result out of this benchmark.
And this means you need a really beefy system.
What they usually do is if they have a huge customer,
I don't know, some US agency wants to have like a new supercomputer
or a new super database, they will say,
well, it's going to be
here in a couple of weeks and we're going to just test it before and then they will use exactly this
system to benchmark because that system of course is super beefy and they don't have to pay to build
it just for the benchmarking. Still, running this and having people working on it will often cost in the order of millions of dollars to get set up.
And again, this is not something we can afford.
So we're just going to do our stuff, just going to try to understand the queries, try
to understand the data and see how we could optimize something like that.
And for the express benchmarks, these have a complete kit.
This means there is something,
so say for example, XBB is for big data.
There's a Hadoop based implementation that uses Spark
and you just need to basically install Hadoop and Spark.
You don't have to build a complete driver and everything.
That's all included in the kit. So it's like you have all the queries, you have all the data, you have the data generator,
you have the query generator, and you have the driver that will run everything,
and then some scripts that will compute all the metrics, the final results, and even print out
some kind of report that later on you can give to the TPC. And all of that you would have to implement yourself for TPC-H or TPC-C.
And same is true for TPC-XAI.
So that's an artificial intelligence benchmark.
So machine learning workloads,
that's also all completely implemented
in this case in Python and in Spark.
Okay, so however, these are not the only ones that are relevant.
So, say, if we're talking about fast hardware, everybody in an OLTP setup will use TPC-C in one way or the other.
But it's not the only benchmark.
So at a certain point, people, especially Patrick O'Neill was upset that TPC-H doesn't
really have a clean star schema.
And some of the queries are really more from a business side rather than from a functional side.
So the queries basically specify certain business applications, the same with all of the TPC
benchmarks, but they're not really targeting certain functions necessarily of the database.
They try, but this was not the first goal, let's say.
So Patrick O'Neill and colleagues started to build the star schema benchmark, which
has this clean classical star schema.
So you have one fact table, you have a couple of dimension tables, and then you have like
different query flights that do different kind of like tests, different kind of functionality.
So some kind of drill down.
So finer granularity of the tables, more joins, et cetera.
And within each query flight, the complexity per query will be higher.
And you will have like more complicated queries in there.
And that's also used a lot.
So a lot of papers you will see basically take the star schema benchmark and benchmark the system with it, and people will know what you're talking about if you're using the star schema benchmark.
Something that the TPC doesn't have yet and which is a big topic in industry and research, is
HTAP.
The idea is you have a hybrid transactional and analytical workload, so hybrid transaction
analytic processing.
So rather than saying, oh, I have one system for OLTP, so for my daily business, where people order something and the orders, the payments, et cetera, are handled in one database, and then I have my analytical database that's separate from it, where I do all my daily or yearly analysis to say, okay, how can I improve my business, how can I improve my depots, etc.
A lot of systems, say for example SAP HANA, they try to do this in one system.
So they try to have everything in a single system.
And of course for this you also need a benchmark.
And Florian Funke and colleagues out of a seminar at some point,
came up with a combination of TPC-C and TPC-H,
where then you can do both in a single benchmark.
So this is TU Munich and other German universities,
or other universities, I don't remember.
I think it's more international
so a lot of people worked together to build this benchmark, which has both aspects. So you have these fast, small accesses and updates to the database, which is how OLTP is characterized: you're just looking at tiny pieces of data.
And you're updating a lot.
You're inserting a lot.
You're not doing like these huge scans and analyzing.
And this would be the huge scans and analyzing and maybe large updates or large inserts rather than many small updates.
This would be a typical OLAP workload, so where you're basically
trying to do some bigger data analysis for forecasting,
whatever.
And so this benchmark basically does both,
and is targeted at systems that try to cover
both at the same time.
Then for very simple interfaces, like key value interfaces, there's the Yahoo
cloud serving benchmark. You will also find this in many publications where Yahoo at some
point came up with this very simple setup where you just have the typical CRUD operations, so create, read, update, delete. You're inserting data, you're updating data, you're reading data, and you're deleting data, all with just a key-value interface. So you have a little bit of a schema, but it doesn't really matter; all that's required is a key and a value, and you're just going to work with this. You're not going to scan through the data on values.
You're not going to select or filter something.
You're just going to access the keys.
You might even do scans, but just over the keys.
And there's a single key.
There's no different kind of relations
or something like that.
But it's super popular also because it's super easy to use.
So there's a lot of
different interfaces already given for a lot of different key value stores. So if you want to
benchmark any kind of key value store, this is actually one of the go-to benchmarks.
And if you would want to build like a key value store system, then for sure you want to be good
in these benchmarks. And this benchmark also comes with different configurations,
like different workloads,
where it basically changes in terms of access patterns,
saying how many inserts, how many deletes,
how many updates and reads,
and let's say 100% reads, 100% inserts,
and anything in between with different data
characteristics.
And finally, the last one that I want to tell you about, because I also think it's quite
interesting, completely different level is the join order benchmark.
So this is based on the IMDB database or dataset.
So IMDB is the Internet Movie Database,
also used a lot in database research
because it's a real dataset.
There's a lot of data and you can access it
and you can do a lot of queries on that.
And they basically came up with
113 different join queries, and it's really just about join complexity. So there are many queries. I don't remember what the largest number of joins in a single query is, but it's in the tens.
So they are quite complex and this mainly
tests join optimization,
meaning how well can your system optimize join orders.
So this won't necessarily be relevant for us, for example,
because we're not talking about query optimization in this lecture.
We're talking about low-level processing.
So query optimization would be on top of that. I'm basically expecting a good query plan in the first step, and then I will start optimizing this on hardware. If the query plan is really crappy, optimizing on hardware doesn't really make that much sense; a bad query plan won't become perfect just by using better hardware, right?
So we will still be inefficient even if we're less inefficient than not optimizing for the
hardware.
And you can get like very different gains.
So having a bad join order will give you orders of magnitude less performance.
And many orders of magnitude are hard to get out of hardware.
You can get actually some orders of magnitude,
but say if your intermediate results explode,
you will have to write to disk.
Then, well, being efficient in memory
won't help you anymore because still you're disk bound.
And of course we also can take real applications. So we take a real application, we put it on top
of our system and we see how it runs. And this is great because there's many different applications. You can pick one that's geared towards your
system, and you will see some of the characteristics that you might not necessarily see in a benchmark.
So it's not as simplified, not just an academic view on the problem. But at the same time this is usually a problem,
because we might not necessarily know what's going on in the system.
So unless we have a very good understanding of the workload,
there might be some problems that we don't understand
or some intricacies, which, of course, is also good
because, well, this will also happen in real world.
And the problem is basically, well, there are so many different applications that it's hard to choose, unless you really have an application that your research is geared to.
But from a system point of view, we're often trying to be generic, at least somewhat more generic than just using one application. And then it's often hard: if the data sets are proprietary or confidential, then you cannot share them, nobody else can really validate what you're doing, and this is a huge problem in research. I don't really want to publish numbers on something that nobody else
can verify. And this is getting increasingly important. So while in the past you might have
gotten away with this in a paper, today people will basically ask you, can you please share this?
I mean, you still might say, well, it doesn't work: this is our application, but we have a company that we're working with, say, for example, and it's really confidential or private information that we cannot share because the users' confidentiality needs to be protected.
So you can say that, but of course it would be nicer for other researchers if you can
share everything.
And of course it's not scalable.
So if we're talking about performance, we somehow need to be able to also scale up,
scale down in something.
Usually a data set comes in a certain size.
We might be able to sample from it and still have meaningful data, but increasing it often is difficult, especially if we're talking about data characteristics that we want to see.
So then the data set all of a sudden becomes much more artificial.
So what we often do is we analyze applications for benchmarks and generate a synthetic benchmark out of it.
So BigBench, for example, and the same for TPC-DS, uses distributions from the US Census Bureau; the benchmarks use these to specify and design the data set, or an Amazon data set, etc.
And then we're using a data generator to synthesize the data
and make sure that the data somehow is reproducible.
Or you can even do this yourself.
So there's data generators out there.
If the data generators are publicly available, other people will be able to also use this.
And hopefully, at least if you're taking some care, the data generator or the synthetic
data doesn't have any privacy, security or scalability issues.
So that's actually useful.
You're losing some of the intricacies of the realistic data, but at least you can
experiment and you can share. So that's quite useful.
And
again, a good thesis, if you're talking about your research,
eventually you will have to do your thesis.
Right. And if it's in any way about a system, it will in some way be about performance. And if it's about performance,
then you should have micro benchmarks, application level benchmarks and some real data.
So this is typically what makes a paper or a thesis, a great thesis or a paper,
by just making sure that on the one hand, you compare it to other systems.
So this would be the application level benchmark.
You show that it works in real world.
So this is you're using real data.
And you're actually evaluating what's going on inside.
Where does my time go?
This would be the micro benchmarks, right?
And then people that read your thesis will understand, OK,
well, this is better than what was there before, or at least as good
as the related work in the situations that we're benchmarking
or even not as good.
That's also fine, right?
Theses can also have like a negative result.
We're basically just evaluating different aspects.
But then we also want to know why.
And this is why we want to have these micro benchmarks.
We really want to see where does time go in there.
And this is why I'm kind of stressing this.
If you do this, you're going to do really nice research.
If you don't do this, if you're just producing numbers
that compare something, you might just be off, right?
You might just be wrong.
I'm going to show you some examples in a few minutes.
Okay, if you're comparing to other systems or other work,
especially to other researchers' work,
you may want to make sure that all systems have equal functionality
and you're able to reproduce original numbers.
Often this is hard because people are gaming everywhere, right?
Even if you talk about other people's research,
well, they might not try all the conditions.
Your hardware might be slightly different.
They really optimize everything for their hardware.
Then you're not gonna get the same, exact same performance.
But hopefully they've done their job well.
And if then you're off by a lot in terms of performance,
then well, maybe ask the original authors, right?
So maybe, or ask the people that provide the system,
that build the system for some guidance in how to tune,
because you might just miss some obvious stuff.
And all of a sudden, the performance is really bad.
And most people are helpful and are happy to help you
if you're interested in their work.
So if you want to test something, just ask, basically.
And with that, I want to give you some examples.
This is based on a presentation from Mark Raasveldt
and colleagues from CWI.
And this was at DBTest.
So DBTest is a database testing workshop at SIGMOD, held every second year.
And they did a nice presentation and some nice examples. This is why I want to show this to you.
So it's all about fair benchmarking. So if you're comparing two systems, it's more about the traps that you can fall in. And these are easy things that you can get wrong.
So, well, there's many, I mean, of course, they take a negative approach to it.
So they basically start with some example.
I won't bring those examples because it was a concrete paper of a colleague that we know, where they basically showed, okay, the benchmarking here is just completely unfair. So, there are a lot of problems
in database benchmarking in industry. We basically don't see many standardized benchmarks anymore,
unfortunately. I mean, we hope it's going to get up again,
but right now it's mostly white papers
with some numbers that are hard to compare.
And in academia, and this is 2018,
so it's a bit less, and there's been a huge push
towards making this better, right?
So there's a lot of initiatives in conferences, et cetera, in making sure that numbers are getting more reproducible again.
And so but typically you get something, you get some numbers,
you don't know the details, you don't have access to the code,
you don't know the data exactly.
So the numbers are completely unreproducible.
Often you don't even know the data set sizes, for example.
If you just give me the number of tuples, I don't know the size of an individual tuple.
I have no idea what the performance should be. If a single tuple is one byte, it's a
huge difference from one kilobyte, right? So one might be cache bound, one might be memory bound, etc. So it's
very hard. And well, a lot of results are published, but, as I said, even though it's getting better, still few of them are useful. And the problem typically is benchmarking games. I wouldn't necessarily say all people are malicious here, doing this out of malicious reasons just to get better numbers; a lot of this actually happens just out of not knowing about performance, not really digging deep enough into
what would be the difference in performance
if we're doing different kind of configurations
or different kind of setups.
So typical games, and as I said,
doesn't necessarily have to be games,
could just be out of not knowing.
Would be different configurations, hardwired optimizations,
specification that is biased really just to one system. This, I would say, will happen a lot
anyway if you're building a targeted system where there's not a lot going on. So there's no
other specialized system for your configuration, for your workload, etc.
Then, well, the specification automatically gravitates towards that workload.
But you might even tweak it even more, right?
You might just meddle with the data or change the data or the workload
in a way that it perfectly fits your system.
And you're leaving out some stuff that would not work with your system. Say, for example, you're assuming a read-only workload in an application where you definitely will have a few writes, and that again will change a lot, right? As soon as you have to deal with updates, things will change a lot in comparison to a completely read-only workload.
Or say you have a synchronized workload queue.
This is something that I also often see.
Basically, then you're typically limiting your driver.
Either you're just benchmarking your driver or you're limiting your driver and making
sure that you don't really see the effects in an open world setup.
So the open world setup would be, I cannot control my workload, at least from a system
point of view, the workload is just coming in.
In a closed-world setup... I mean, the open-world setup is you sitting at your computer typing your latest tweets, or clicking on something on Amazon.
So Amazon or Twitter cannot control you.
I mean, let's say only to a small extent.
At least we hope you still make the decisions about when to type something or not. And assuming this is the case,
then they have to deal with the workload as it comes,
rather than they can say,
now we're ready to get more workload.
And this is the real difference
between a synchronized workload queue
and an unsynchronized workload queue.
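As a small illustration of that difference, here is a sketch (my own example, with a dummy send_request standing in for the system under test): a closed-loop driver issues the next request only after the previous response arrived, so the load silently adapts to the system's speed, while an open-loop driver issues requests on a fixed schedule no matter how fast the system answers.

```cpp
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// Stand-in for the system under test: pretend each request takes 2 ms to answer.
void send_request(int id) {
    std::this_thread::sleep_for(std::chrono::milliseconds(2));
    std::printf("done %d\n", id);
}

// Closed loop: the next request is only issued after the previous one finished,
// so the offered load implicitly adapts to how fast the system is.
void closed_loop(int n) {
    for (int i = 0; i < n; ++i)
        send_request(i);
}

// Open loop: requests arrive at a fixed rate, no matter how long responses take.
// If the system falls behind, requests pile up, just as with real users.
void open_loop(int n, std::chrono::milliseconds interval) {
    auto next = std::chrono::steady_clock::now();
    std::vector<std::thread> inflight;
    for (int i = 0; i < n; ++i) {
        inflight.emplace_back(send_request, i);  // issue without waiting for the reply
        next += interval;
        std::this_thread::sleep_until(next);
    }
    for (auto& t : inflight) t.join();
}

int main() {
    closed_loop(5);
    open_loop(5, std::chrono::milliseconds(1));
    return 0;
}
```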
And that means if you are doing that kind of synchronization in a benchmark, then you are basically not working with a realistic setup. Then of course arbitrary workloads are a problem. Very small benchmarks are a problem, because all of a sudden you are just going to hit the caches rather than going to memory.
Running something very shortly, say in Java,
you won't get hit by garbage collection. Or your system is a bit more efficient in garbage
collection and you won't get hit, but the other system will see garbage collection right away in
your short benchmark. Then your performance is way better than the other system's performance. Or you basically manually translate something, again, hard-coding stuff
for performance. And why does this happen? Because this is the kind of chart you want to see in your thesis or in a paper. Although, I would say, in your thesis you're still very free; there's no way of somebody stopping you from publishing your master's thesis if the numbers are not perfect, right? So you should not really strive for this kind of chart.
Especially, a lot of it is really about framing, right?
And this is something, if you're doing your thesis,
you can basically frame anything,
and I'm going to evaluate if this or that is better,
rather than, I have this great idea, I think it's better,
and my results show it's actually not better
because my assumptions were wrong.
So this is often what we see, right?
So we come up with a cool idea.
We just think about my paper recipe.
We come up with a cool idea.
We try it out.
It doesn't really work.
And we're already way too far in the process to basically start from scratch again.
So then we just frame it differently, right? Just looking at the problem from a different angle: if I don't claim to know what's better or worse, if I just evaluate different things, I basically can't have a negative result, because an evaluation is always a good result, right? Unless you make very poor assumptions about the evaluation in the beginning,
you're kind of trying to compare apples and oranges.
Okay, but unfortunately, in conferences, if you build a new system,
you need something like this.
You need to show that your system actually makes sense to use.
Otherwise, it won't get accepted or your product won't get sold.
So people will try to see something where,
okay, the difference here means money in a product.
And then I have an exponential number of customers.
So I will get exponentially rich very quickly.
This is what investors want to see.
And if you cannot produce this in one way or the other, your transaction or your system
won't be sold.
Okay, so this is the problem.
We know this, and now let's look at what can happen in benchmarking so that people get to this point without playing fairly. And again, it might not necessarily be that you're maliciously trying to game the system; you might just not know how to compare two systems. Okay, so common pitfalls in benchmarks and general
system benchmarks, and we'll see some of them. Four of them I'll present in detail; three more are things that you might also see and that, to some degree, we will see in the micro benchmarks that I show you during the lecture. So there's non-reproducibility, failure to optimize, apples versus oranges, and incorrect results. And then things that we will also see a lot, and for your small benchmarks you can really see this very well.
It's this hot versus cold runs, meaning is data in cache or not?
And the difference is huge, especially for small benchmarks,
especially for this micro stuff.
Data pre-processing can be a huge difference
and overly specific tuning can be a difference, right? If I know very well
about the data set, I can tune the hell out of it and this will never be generic enough for other
kind of applications. Okay, but let's go for the first four as a start. And well,
non-reproducibility is a frequent problem, this is basically,
you're writing some code, you're hiding it somewhere, you don't make it publicly
available, you don't make it available to your supervisors, etc. For papers, this frequently has few consequences.
Although ACM, so SIGMOD and VLDB, they now ask you to give a clear statement why you cannot publish the code.
And if you don't have a good answer to that, this might lead to paper rejection.
In the past, there was nothing.
So in 2018, there was nothing like that yet, though there was already the SIGMOD reproducibility effort. So you can say there are artifacts available, there are
results reproduced or results replicated. So I don't remember which one is which. So one is basically I'm using the same kind of setup as the authors, like say even their server, to try this in their code and everything.
And do I get the same numbers if I run this?
That would be results replicated, for example.
And as I said, they have different terminology.
Unfortunately, also VLDB and SIGMOD mixed up their terminology.
So it has different meanings.
And then another one, however, would be like the even stronger one would be another group basically takes the paper,
takes the architecture, whatever, re-implements them and gets the same numbers.
That would be even stronger.
Of course, for a system that doesn't necessarily make sense
because it would be super complex,
makes more sense for smaller,
like for an algorithm or for smaller parts of a system.
Say I do this kind of buffer management,
somebody else reproduces the numbers,
then the results would be replicated
or reproduced completely.
And artifacts available just means, okay, you give access to the code and data so that
somebody else might be able to try this as well.
So that's good.
And you will actually see this on papers.
So if you go to ACM, to the digital library, you download the papers. The papers that have been verified in one way or the other,
or where the artifacts are available, will have these badges on them.
And, well, I mean, let's look at some examples how you can get to non-reproducibility,
or how you can basically game your results.
So this is an example of TPC-H, scale factor one, so one gigabyte of data, query one, it's
a simple scan mainly with the filtering on a single table. And you can see you have MariaDB,
which is basically a fork of MySQL,
and you have Postgres.
And like in the first comparison,
you see it will run 12.18 seconds on MariaDB,
then 9.73 on Postgres.
Then in a second comparison,
we're testing Postgres versus SQLite.
You can see the SQLite is even a bit better.
And finally, we're testing SQLite versus MariaDB.
And MariaDB all of a sudden
almost has a three times better performance.
Another question is,
what do you think the authors changed here?
Like, what would be an assumption in terms of what we need to change in MariaDB? Anything, right? So we're not changing the code,
but any kind of tuning that we do such that this database could be
that much faster.
Just shout out some ideas.
So what can we do for databases to tune them?
Simple stuff.
Add an index?
An index would be something, yeah.
An index would be an idea.
They didn't do this. What else?
Tuning the configuration, right?
So that's something.
We could give it more or less buffer space for different kind of buffers.
So say, join buffer could be larger or smaller. We could
artificially limit the memory size for MariaDB and then increase it or
limit it for the others. They didn't do that.
Okay, so same configuration parameters. Something else we can change is compilation flags.
And this is something that you can easily overlook, right?
So basically, just by compiling your C++ program with -O3, so all optimizations, it will be much faster than with -O2 or nothing, right?
Or even in debug mode.
So if you run your code in debug mode, it will be much slower.
But it's the same, right?
So they didn't change that.
Same version number, right?
So you might just go to an older version of the database
and that might be slower or some other change in the database.
But it's still the same.
However, the difference is it's a different schema.
And the difference in schema is really just,
rather than using decimals, here they're using doubles.
And that's perfectly fine even in terms of configuration.
Even for the benchmark, it's fine.
So a decimal, you know, a decimal is basically a number which adheres to decimal arithmetic up to a certain position, right? So say, for example, we're talking about euros and cents and, I don't know, hundredths of cents or something like that.
And we'll have correct decimal arithmetics.
And, of course, this is not efficient on a CPU because a CPU thinks in floats or in integers.
And in floats or doubles, you cannot, for example, represent 0.1 exactly.
You cannot store this exactly; 0.1 is not a number that you can represent exactly in binary, unlike in our decimal notation.
And this means for decimals there needs to be some extra work. The results will still be correct according to TPC-H, but using doubles gives you an almost factor-three performance improvement here. If you do this for all the systems, it might be fair, right? But if you do it just for one system, all of a sudden things are not comparable at all anymore.
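Just to illustrate the trade-off (a sketch of the general idea, not of any system's actual decimal implementation): doubles accumulate binary rounding error on values like 0.1, whereas a decimal type typically pays some extra work for exactness, for example by operating on scaled integers.

```cpp
#include <cstdio>

int main() {
    // Doubles: 0.1 has no exact binary representation, so errors accumulate.
    double d = 0.0;
    for (int i = 0; i < 10; ++i) d += 0.1;
    std::printf("double : %.17f\n", d);        // typically 0.99999999999999989, not 1

    // One way a decimal type can stay exact: store cents as integers.
    long long cents = 0;
    for (int i = 0; i < 10; ++i) cents += 10;  // ten times 0.10 EUR
    std::printf("decimal: %lld.%02lld\n", cents / 100, cents % 100);  // exactly 1.00
    return 0;
}
```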
I always find this quite interesting. Then of course, failure to optimize. So I already
mentioned this, right? So, well, if you're optimizing your system, you also have to optimize
competition. And a simple one would be, how do you compile this? Or how do you
configure the buffers? And well, an easy way to fix this, talk to the other people, right?
If you're working, say with a PhD student on your thesis, and you're supposed to improve
their algorithm, make sure that you know how their
algorithm is used most efficiently, otherwise the comparison won't be good. And well, this is
lots of work, but many people will be happy to help you, especially
if you tell them, you know,
I want to compare your system to my system.
And the numbers that I get are actually like my system is way faster.
How can I improve your system to make a fair comparison?
And there are actually some people out there who, if you publish a benchmark number for their system that they don't think is good, will hassle you a lot; you will get lots and lots of emails, etc. So sometimes it actually makes sense to talk to them. For some systems, well, I'm not going to get into rant mode here. So involving your competition helps a lot;
make sure that you optimize in the same way.
And of course, optimize all of the systems in the same way.
Then, well, here you can see like different results
for different kind of configuration, right?
So here, if we change the buffers,
you can see that we can get like
almost a factor two of speed up. If we use different compilation flags, so debug mode versus
optimized, you can get a factor of two almost in terms of performance. We will see this in a bit as well.
Then of course
comparing a standalone small implementation versus a full
system again is not a fair comparison. So say you implement your new join operator
and you compare it to Oracle, something like that. Oracle has to do a lot
of additional stuff to make sure things work well.
And you can optimize the hell out of your standalone implementation. But you would have to think about all the corner cases, the rights management, et cetera, fault tolerance; all of that is built into the full system. If you don't consider this in your setup, then you're doing an apples-to-oranges comparison.
So say like a simple feature would be overflow checking.
So if you're checking if some buffers can overflow or just some numbers can overflow,
if you're missing this in your implementation,
then it won't work.
Or if you don't have transactions and you're doing some OLTP workload,
well, you're going to do a lot less work than the actual system.
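To make the overflow-checking example concrete, here is a minimal sketch (assuming the GCC/Clang __builtin_mul_overflow builtin): the unchecked arithmetic that a hand-rolled prototype might use silently produces a wrong answer, while a full system has to detect the overflow and report an error.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    int64_t price = 3'000'000'000;   // e.g. an aggregated amount
    int64_t qty   = 4'000'000'000;

    // Unchecked: wraps around (done via unsigned math here, since signed
    // overflow is undefined behavior) and silently gives a wrong answer.
    int64_t wrong = (int64_t)((uint64_t)price * (uint64_t)qty);

    // Checked: what a full system does so it can raise an error instead.
    int64_t right = 0;
    bool overflow = __builtin_mul_overflow(price, qty, &right);

    std::printf("unchecked result: %lld\n", (long long)wrong);
    std::printf("checked: %s\n", overflow ? "overflow reported" : "ok");
    return 0;
}
```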
Fixing this means integrating your algorithm into a system, for example.
If you want to compare to a full system, you have to use a full system.
A simple way for many database workloads would be, say for example, integrate your optimization
into Postgres.
Then you still have the same features, and in many cases this works.
It's still a lot of work, but you can directly show if it works in the system or not.
And you will get all of the other parts that the database actually needs to do.
And you can also see if it's really like a problem, like in a real system, because often things that we optimize for, again, this comes to the micro benchmark, right?
I might have a good idea for optimizing, I don't know, some kind of alignment of the data, etc.
But if it's not performance critical in an actual system, it won't give like an end-to-end performance improvement.
And then,
well, if you're hand tuning or hard coding your solution, again,
this is something where you can get lots and lots of benefits or better performance
and
you're not comparable.
So this is basically TIMDB,
which is a hard coded implementation of query one of TPC-H.
It makes sense to do these things,
to basically see what performance,
like what's the upper bound for performance
that I can think of.
And then I can think about where,
like in a realistic setup, how could I get there?
But it's not a full system, right?
Saying that a hard-coded C program is a full system is just not true.
So comparing these two doesn't really make sense.
A hard-coded C program is not a database end-to-end.
If you have some query compilation,
you might get close to this,
but then also, again, like for a complete workload,
the compilation part needs to be factored in as well.
And, well, even worse, what often happens, so you also have to check your results, right?
And you might think this is not a problem, but it often is: many, many systems will produce incorrect results.
And with incorrect results, you can actually be arbitrarily fast, right?
I mean, you can also be slow if you have bugs, but sometimes buggy implementations are super fast.
I can actually show you later another example
in the practical example.
So you always need to check the results.
So always make sure that the results are correct
in one way or the other.
Otherwise, you might just do nothing.
Like even if in your code, for example,
you're not checking the result,
then the compiler might even optimize your code away.
So then in the micro benchmark, you might see,
oh, this is super fast this way I implemented this,
but it's super fast just because it's not executed at all.
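Here is a small sketch of that pitfall (my own example; the exact behavior depends on the compiler and optimization level): if the result of the measured loop is never consumed, an optimizing compiler is allowed to delete the whole loop, and the "benchmark" reports almost zero time.

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> data(1 << 26, 1);

    // Pitfall: `sum` is never used, so with -O2/-O3 the compiler may remove
    // the entire loop, and this "benchmark" reports close to zero time.
    auto t0 = std::chrono::steady_clock::now();
    long long sum = 0;
    for (int x : data) sum += x;
    auto t1 = std::chrono::steady_clock::now();
    (void)sum;  // deliberately never really used

    // Fix: consume the result and compare it against a known-correct answer,
    // so the work cannot be optimized away and correctness is checked too.
    auto t2 = std::chrono::steady_clock::now();
    long long sum2 = 0;
    for (int x : data) sum2 += x;
    auto t3 = std::chrono::steady_clock::now();
    std::printf("sum2=%lld (expected %d)\n", sum2, 1 << 26);

    std::printf("unused-result loop: %lld us, checked loop: %lld us\n",
        (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count(),
        (long long)std::chrono::duration_cast<std::chrono::microseconds>(t3 - t2).count());
    return 0;
}
```

Consuming the result, ideally by comparing it against a known-correct answer, both prevents the elimination and doubles as a correctness check.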
And well, then it makes sense to basically make sure
that you compare the results with a real system.
So that's an easy way.
Say for example, in the database, run Postgres with the same query,
you get the result, compare it with that and you should be fine.
But, say, this type of hard-coded implementation, well, hardly any real implementation will be able to beat it.
So the authors have a whole rubric of things that you can check. It's not complete, but it's something you can go through. So you can basically go
to the paper. You can also look on the slide later on and see if you're doing the same thing, then you're somewhat safer. In the end, the best thing is you understand the performance
of your system because then you know, okay, where does time go? Do I actually factor everything in?
You should understand not only your system but also the application, because then you know whether you are really comparing against the real thing here.
Okay, so summarizing this part, which is important to me, and this is why I also take a lot of time on it: these are things that you will to some degree need in the tasks here. Basically, if you're doing optimizations, make sure that you know what's going on.
And for this back of the envelope calculations help you a lot, right?
So with back of the envelope calculations, on the one hand, you can figure out if your
idea on how to improve something actually makes sense.
And also you can figure out if the results that you see in any way make sense.
So often we get some performance numbers.
And if I don't have any clue why the performance is like it is, then I should do a back of the envelope calculation.
I should try to figure out why this number is as it is.
So often, like in a classical database,
often I can just think about disk accesses, right?
So, or in the main memory database,
a lot of things are mainly memory bound.
So then I can think about how much data do I need to read for something?
What would be, well, what would I expect how long this takes?
If I read multiple times, I do multiple processing,
I can still factor this in.
If this is completely off of what I want to see,
so say, for example, I need to read at least one gigabyte of data
and I know how long this will take.
It will take, well, what did we say?
Well, one millisecond or something like that.
Don't quote me on the exact number.
Look up the latencies.
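As a hedged worked example with my own numbers: assuming roughly 20 GB/s of effective sequential memory bandwidth, scanning 1 GB takes about 1 GB / 20 GB/s = 50 ms, so on the order of tens of milliseconds rather than microseconds.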
If I then see something drastically faster, for example, then I know something is off, right?
So then I know this can't be; there's probably some mistake in my code, and this helps a lot. And then of course, if you compare
to something, make sure that you're trying your best to make it as fair as possible.
Okay, so for this, we're going to switch to CPU architecture. We're going to start this in five minutes, but before we do the break,
do we have questions so far? Yes?
You mentioned a lot of database benchmarks earlier.
Yes.
And it makes sense because we are on a database, but I wonder whether there are some more general benchmarks that, let's say, test the computational power of your distributed
system. The only thing I can think of is let's sort a terabyte or petabyte of data.
And often I saw like authors come up with calculating the biggest prime number or multiplying matrices.
Are there? Yes. So the question is do we have other benchmarks for distributed systems, not necessarily database benchmarks?
And yes, I mean, as you said, sorting would be one.
There is, for example, the Gray Sort; there's a sort benchmark website with different kinds of sorting challenges for distributed systems, also for energy efficiency, etc. That would be an example. And then there are many SPEC benchmarks. SPEC, the Standard Performance Evaluation Corporation, is a standardization organization that has different kinds of benchmarks for distributed systems, and for CPUs, so SPEC CPU would be a package of benchmarks
particularly targeted at getting CPU performance.
Then for high performance computing, there are different kind of kernels.
So the Berkeley dwarfs would be one,
and then there's, I forgot the name,
Jack Dongarra recently, I think, got a Turing Award
for this kind of performance evaluation around this.
So there is other kinds of benchmarks out there, say for distributed systems, for
high performance computing. These often don't look like this application level, but they
are more like individual programs. So they call them kernels typically, so a small program that would do some integer computation,
some floating point operations.
So typically, if you think about like a high performance, like a new supercomputer,
they will say this has this and that many flops, right?
And this is benchmarked exactly through a standardized set of these kind of kernels
that will then give
you this number.
So there's stuff out there for like targeted, like distributed systems, it's a broad field.
For Hadoop kind of stuff, BigBench would be something, there is HiBench, there is TeraSort.
Then for other kind of distributed systems,
I can't think of anything off the top of my head,
but there's a lot.
Okay, the CPU. So we've left the intro, and we'll now start with CPU architecture and caching.
So in this session, well, in the remaining minutes,
I will give you a bit of an overview of CPU architecture.
And then we'll especially look at the different caching layers, how these work,
and what this also means for performance.
And then later on, we'll look at the instruction execution,
so how individual instructions are broken down and executed,
and what is performance critical there,
before going into other structures on the CPU,
and how to use them efficiently.
So this part of the lecture, which is also a major part of the overall lecture, will all be within the CPU package here. So this is the CPU package, and first we're going to mainly look at this top level here, also look at the RAM, and then go for multiple CPUs as well. So we're already a bit behind, but it's going to be
fine, right? We have enough time. So first I want to give you an overall overview and this is
probably as much as I can do today and then we'll go more into the memory access, virtual memory caches, data layout and alignment.
And this is really core CPU architecture or computer architecture.
If you want to know more about it, there's two nice books.
These are also massive books where you can read a lot about the internals of CPU architecture.
And this is really complicated stuff, but it's also really, really
interesting stuff, at least from my point of view.
So I'm actually enjoying reading these books whenever I have time to read books.
Okay, so computer architecture.
High level, I already showed this to you.
This is what on a high level, typical CPU and computer architecture looks like.
And you will also find this on the motherboard.
So if you open up a server, you can find these different components.
You will find the CPUs, so individual CPU packages, which then we would assume everything
is on a single die.
This might also not be true for certain architectures,
but say an Intel CPU, you have multiple cores on a single CPU, you have multiple levels of caches
on a single CPU, then you have DRAM directly connected to the CPUs through memory controllers,
and you have some kind of PCI controller that connects to all the peripherals.
So anything that's not on the motherboard typically would be connected to PCI Express actually.
And then you might have multiple CPUs and these are connected through proprietary interconnects.
For Intel, this would be UPI, the Ultra Path Interconnect; for AMD this would be Infinity Fabric; and for Power it would be, what's it called, the X-something, compute path or something like that. I might have it on a slide later on.
So this is really how things are connected. And of course, the further away we go from the CPU, the higher the latency and the lower the bandwidth, typically. And the closer we are to the CPU and
the more we are directly on the die, the more expensive,
because this is getting smaller and smaller.
And so space on the die actually means, well, I have to invest into this.
And of course, we have a certain die size, and the chip producers, well, they won't
say this size costs me this and that many euros or dollars.
But in the end, the chip basically needs to be performant.
Customers won't buy the chip if it's not highly efficient or performant, mainly performant
in the past, to some degree also efficient nowadays.
So if the performance is not good, they won't buy it.
So the manufacturers really have to think, how do I actually put stuff on there so it
works well?
And there's a lot of trade-offs, of course, that always need to be made.
And this is what we see right now.
So of course, over the years, there are different trends.
Some chips will
try to be just simpler. So you might have RISC architectures where you have
a quite simple chip design, but then you can maybe put more compute on there. Intel
chips, they have many different instructions, which means, well, for the instruction decoding, etc., you need more
space on the chip to deal with that, right? So this basically costs you more chip space.
At the same time, the chip becomes more universal. Some of the more complex operations might be faster because they can be done in hardware rather than having them being decoded into multiple
operations. And well, some of the stuff here is sort of standardized, say
here PCI Express. These things are more or less standardized, and so we have many
different vendors speaking the same language here. This other kind of stuff will be highly proprietary, so
this is really dependent on the individual chip. While, say, even a Mac would understand PCI Express, how the chips are connected internally, this is proprietary.
And you know already, you've seen some of the processor trends or the hardware trends already.
One other trend that I want to show to you is the performance gap in terms of processor versus
memory. And this gap, I mean, it's not necessarily increasing that much anymore
because the processors don't get that much faster anymore.
However, still we have some improvement, especially through parallelization,
but we don't have this kind of improvement on the DRAM,
at least not at the same level.
And well, this means that the CPU will spend a lot of time waiting on memory loads and stores.
So, same as this problem we had with disk and, let's say, the CPU package and everything up there,
where we would have to go to disk,
we have kind of a similar problem,
not as big as with disk, but similar, between memory and the CPU.
So the CPU, and we've already had this, right?
So reading one item from memory,
the CPU will have to wait probably hundreds, maybe thousands, of cycles.
So it's not going to be like instantaneous as it would be reading out of the registers.
So this is something that we need to optimize for.
Of course we will get hit by this, right?
So we have the data in memory, and we have to read it from memory. But the question
is how often do we need to read from memory? Does every individual instruction mean we
have to go completely to memory or are we going to be more efficient than that? Do we
somehow make sure that while we're computing on something, already other stuff is loaded or when we load something,
it will already be good for multiple cycles on the CPU.
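To make that question a bit more tangible, here is a rough sketch (the sizes are made-up assumptions) that contrasts a sequential sum, where the next loads are independent and can already be in flight while we compute, with a pointer chase, where every load depends on the previous one and so the full memory latency is exposed on every access:

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    const size_t n = 1ull << 25;          // ~32M entries, assumed to be much larger than the caches
    std::vector<size_t> next(n);

    // Build one long random cycle, so each load depends on the previous result.
    std::vector<size_t> order(n);
    std::iota(order.begin(), order.end(), 0);
    std::shuffle(order.begin(), order.end(), std::mt19937_64{42});
    for (size_t i = 0; i < n; ++i)
        next[order[i]] = order[(i + 1) % n];

    auto time = [](auto&& f) {
        auto s = std::chrono::steady_clock::now();
        f();
        return std::chrono::duration<double>(std::chrono::steady_clock::now() - s).count();
    };

    size_t sum = 0, pos = 0;
    double t_seq = time([&] { for (size_t v : next) sum += v; });                     // independent loads
    double t_chase = time([&] { for (size_t i = 0; i < n; ++i) pos = next[pos]; });   // dependent loads

    std::printf("sequential: %.2f ns/element, pointer chase: %.2f ns/element (sum=%zu pos=%zu)\n",
                t_seq / n * 1e9, t_chase / n * 1e9, sum, pos);
}
```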
So a bit more details about memory.
So there's different types of memory and DRAM, dynamic RAM.
This has a very simple structure.
So this is basically a capacitor where the state is stored.
You know what a capacitor is. Basically, you load it.
It will keep the energy for some time that you put into it.
But this basically also has leakage.
Any kind of capacitor will, after some time, basically lose the power that you put into it.
And this means we need to refresh it.
And this is basically the cycle,
basically the frequency
that the DRAM works at, right?
So you know your DRAM; what's the typical DRAM frequency right now?
2400 megahertz, right?
It's slower, like a factor of 2 or something below what the CPU would do.
But it's super small.
And so, it's cheap also.
Because it's small, it's cheap, you can pack it densely.
Cheap in relation to faster types of memory, not cheap in relation to SSD,
right?
SSD or disk would be even much cheaper than that. But something like SRAM, so static RAM, or a stable RAM
basically, is much more expensive because it's more complex. You don't really need to
understand what this means, but this would be a diagram of how static RAM is actually implemented on the chip. And that means, well, this can store zero or one,
just like the capacitor on the upper part.
But it's stable.
You don't need to refresh it.
And it's faster.
It needs more space.
And this will basically be used on the CPU. And it's kind of counterintuitive,
right? You would think on the CPU we want to have something that's super small, but effectively
we want something that's fast and stable, and we will pay for a bit more space there
rather than having these capacitors that are slower and that we cannot really
drive at the same kind of frequency as the CPU.
So we pack the small, slower dynamic RAM that needs to be refreshed into the DIMMs, and
we pack the static RAM that needs a bit more space into smaller caches on the CPU.
And we're going to have multiple levels of caches, so we can organize the data in different ways.
And so, DRAM is comparably slow.
I mean, slow in contrast to the static RAM, right?
The static RAM is basically as fast as the CPU.
So the registers in the CPU can be read and written
in a single clock cycle.
Of course, for larger caches, we need to do more.
So we want to have the page addresses and whatnot,
all this information we want to have there.
Calculating this will take us a couple of cycles,
plus the reading and writing back.
But in the register, this is instantaneous, right?
So we can read these values instantaneously.
But for the DRAM, well, this needs refreshing.
We have to refresh periodically.
So I don't even know if we need to do this in every clock cycle, but we need to refresh.
Then if we refresh, well, we have to first discharge and then recharge the cell.
And that takes time.
So it's not something that I can say, well, in a single clock cycle, I will be able to
do this.
And then we need to address the right cell.
So we basically need to somehow build a circuit to get to the right address, through some XOR connection, for example, and find the right cell.
And then we get this output, and the output needs to be amplified.
The output is somewhat stochastic here:
if we read something that's somewhere in between zero and one,
we're in this kind of not clean state,
so we need to boost it up to have a clear signal.
And, well, overall we're talking about, depending on the access, 200 up to 1,000 CPU cycles for a single access.
Right.
But, well, on the other hand, we don't have to address a single cell in there, right? We're not going to address a single bit, but we will address multiple bits, even multiple bytes at a time. Also, this is
not really a single array that we're addressing, but it's set up as a two-dimensional
array, and it might even be three-dimensional depending on the technology.
And this can then be done not on a single bit, but for complete rows.
We can actually get the whole data in one row, or in one cache line.
And this basically means we're reading all of this at once.
And then we can also use this at once.
And rather than doing this for one row at a time,
we can even do this for multiple rows at a time,
in parallel.
So this will give us additional performance,
and the CPU also knows this and can use it
when we access data like this.
So if we're doing a large sequential access, the CPU will load all of the rows in parallel
and we get all of them at the same time.
And so then the CPU basically has a lot of data to work on rather than a single byte,
single bit or something.
And if we do this efficiently enough,
then we can basically completely, or almost completely, hide the latency
that we would otherwise have to pay.
So we can do this in a pipeline fashion while we're working on this data,
we're loading the next data in the caches.
And so we're just being more or less cache bound
rather than being memory bound.
If we're not doing much on the data, so if we're just reading through it, we will probably
still be memory bound, but at least to some degree we can hide this latency rather than
paying the complete round trip for every individual access.
Otherwise, if we're doing this byte by byte,
we would have to pay the complete memory access
for every individual access.
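As a small sketch of that effect (the 1 GiB array size and the 64-byte line size are assumptions about the machine): touching only one byte per cache line is nowhere near 64 times faster than touching every byte, because memory always delivers whole lines anyway.

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

// Time how long it takes to touch the array with a given stride (in bytes).
static double touch(const std::vector<char>& data, size_t stride) {
    auto start = std::chrono::steady_clock::now();
    unsigned long sum = 0;
    for (size_t i = 0; i < data.size(); i += stride)
        sum += static_cast<unsigned char>(data[i]);
    auto end = std::chrono::steady_clock::now();
    std::printf("stride %3zu: sum=%lu\n", stride, sum);   // keep the result alive
    return std::chrono::duration<double>(end - start).count();
}

int main() {
    std::vector<char> data(1ull << 30, 1);   // 1 GiB, assumed to be far larger than the caches

    double every_byte = touch(data, 1);      // reads every byte of every cache line
    double one_per_line = touch(data, 64);   // reads one byte per (assumed) 64-byte line

    std::printf("every byte: %.2f s, one byte per line: %.2f s\n", every_byte, one_per_line);
}
```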
Okay, so very briefly, SRAM is super fast.
So this is transistors, right?
They are directly on the chip.
This is basically almost instantaneous.
This works at the speed of the processor, within the clock cycle,
essentially. But it's more expensive, expensive in terms of what we put on the chip.
Producing a larger chip will of course be more expensive, but chip space also means,
well, what do we put there? If we put more SRAM, we can put fewer compute units.
For example, we have less space for decoders or other operations.
So this is a huge trade off.
So we can not put that much or we have to trade it off with other things.
And different chips use different trade offs.
And we will see this.
And in order not to spend too much money on just a single cache,
we organize the caches and the memory in a hierarchy and try to,
let's say, reduce the effect of the memory accesses
as soon as we run out of cache.
We have some cache that is super fast that we can access directly; we have the next level of cache,
which is a bit slower, but still reasonably fast, and larger than the previous level, etc.
So we can actually do multiple levels there.
And in the end, this will always be beneficial
as long as some of the workload stays either in the upper-level caches
or just marginally runs out of them.
So there we basically get a huge effect.
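One way to see these hierarchy levels yourself is a sketch like the following, which scans working sets of growing size and reports the time per access; the chosen sizes and the total amount of work are arbitrary assumptions. Because a sequential scan is very prefetch-friendly, the steps show up mostly as bandwidth differences; a random pointer chase over the same working sets (as in the earlier sketch) would make the latency jumps much sharper.

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    // Working-set sizes from 16 KiB up to 256 MiB (assumed to span L1 up to beyond the last-level cache).
    for (size_t bytes = 16 * 1024; bytes <= (1ull << 28); bytes *= 2) {
        std::vector<size_t> data(bytes / sizeof(size_t), 1);
        const size_t passes = (1ull << 30) / bytes;   // keep the total work roughly constant

        size_t sum = 0;
        auto start = std::chrono::steady_clock::now();
        for (size_t p = 0; p < passes; ++p)
            for (size_t v : data) sum += v;
        auto end = std::chrono::steady_clock::now();

        double ns = std::chrono::duration<double>(end - start).count() * 1e9;
        std::printf("%9zu bytes: %.3f ns per access (sum=%zu)\n",
                    bytes, ns / (passes * data.size()), sum);
    }
}
```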
And if we have kind of larger granularities on the lower levels,
meaning say I'm loading a larger block from memory into my caches,
then I can work on this data.
I'm just going to pay the cache latencies
and I don't have to go to memory right away again.
So then this makes sense.
If I'm just running through the data once,
so I do a complete scan of the data,
every item will just be touched once.
The caches don't help me at all.
However, they will help me to some degree
because the code will also be in the caches,
the code that I'm running.
And if the code is in the caches, at least I don't have to go back out to memory and read parts of the code again
somewhere else. Okay, with that, we're actually out of time. So next time we will continue on this
on the CPU architecture. Yeah, especially on virtual memory. So with that, any questions? Yes?
So the question is basically, do we need more power
if I use more ones or more zeros?
Good question.
I don't know.
But practically, I think it doesn't really
make a huge difference, because I still refresh everything
in the cycles, basically.
OK.
To that question: so there is a difference here. They saw that multiplications that had more 1s in them
consumed more power.
OK, interesting.
So there is a difference here.
Also, at a certain point, you will
have overflows, which will be, again, more costly, et cetera.
But there might be slight differences,
but I'm assuming it doesn't make that much of a difference.
RAM will
always consume power while you're running something basically because it
just will be refreshed all the time. There was another question?
Why does the CPU need a stable kind of memory for a schedule?
So why does the CPU need a stable memory? I mean, technically it probably doesn't, but it wants to have like this fast memory.
So if you would say, well, you're rewriting the registers all the time anyway, then you could probably also use an unstable memory.
But the refreshing and this organization take additional time.
The static RAM doesn't need this time.
So it's just much faster.
Other questions? No more questions?
Well, then thank you very much.
See you all next week.