Storage Developer Conference - #127: Object Storage Workload Testing Tools

Episode Date: June 9, 2020

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, episode 127. Hello, everybody. My name is John Harrigan. I work for Red Hat, and I work on a team called Performance and Scale Engineering, where we try to get a bit closer and expand testing activities within Red Hat beyond functionality into actual usability and into what I would consider day-two operations. So things beyond just looking at a capability from a raw performance standpoint. But as we move into some of the technologies,
Starting point is 00:01:26 particularly in the software-defined storage area, I work with Ceph. It's a fairly complicated system. There's a number of background activities, and those can affect the quality of service at the client level. So I'm basically presenting today to talk about some of the tooling I've done, some of the tooling that is available out on GitHub for more thoroughly evaluating object storage
Starting point is 00:01:55 with customer representative workloads. So as I was putting this presentation together, I thought a little bit about the takeaways and what kind of things I was trying to impart to attendees. And I came up with these three bullets here, trying to increase awareness and knowledge around designing the workload. When I say workload, as you'll see as we go through the slides, in the performance and scale team we not only look at a workload from the standpoint of a description of a specific benchmark exerciser, but we also look at it from the standpoint of how can we approximate
Starting point is 00:02:43 some of the activities that happen at an operational level in a production environment. Then the next bullet talks about the specific storage workload tools that I've posted out on GitHub. And then finally some information about analyzing the results that come back from these runs and from these tools. So I'm going to start off by talking about the motivation, how you go about effectively designing workloads, the tools themselves, and then we'll have a demo and open it up for questions.
Starting point is 00:03:27 The slide deck also does include some backup slides that have some additional detail on eye charts, particularly when it comes to analyzing the results. So on the topic of motivation: developing these tools was basically driven out of initial customer experience with the Red Hat Ceph Storage product and identification of a need to start to incorporate workload-driven requirements into the overall development process, as well as the documentation aspects and the test and release processes. And we also wanted to be able to effectively and efficiently reapply this test methodology in a continuous fashion through future releases. So we didn't look at this as a one-off exercise.
Starting point is 00:04:21 So we knew that investing in some automation and knowledge transfer with other organizations would be time well spent. The premise behind this was to, as closely as possible, given our hardware footprints, simulate customer production environments. And by that I mean not do benchmarking, right? We were not looking to create publishable numbers here. What we're looking to do is identify deficiencies or issues within the product around unexpected changes in performance and unexpected increases, particularly in client latencies, that would be found in production environments. We worked with a number of key customers to develop the workloads
Starting point is 00:05:17 and then run those on scale-out footprints. And in addition to looking at the results that were coming back from the workload generator, we also wanted to capture things like system resource utilization statistics, so we have an idea of what kind of memory and CPU were being used by the various Ceph processes and background services. So a few things here on this testing methodology slide, some of which may be redundant to the previous one, but the idea is to develop a comprehensive workload profile. And again, as closely as possible, simulate production environments. By that I mean, right off the bat,
Starting point is 00:06:03 pre-filling the clusters. So rather than running tests on brand-new deployments with no capacity taken from pools, we run tests where we actually have specific activities around filling the cluster, aging the cluster, creating these kinds of situations, and applying events like failures to see how much the client performance is impacted. And the results of these tests have led to a steady progression of improvement in Ceph
Starting point is 00:06:44 relative to the impact of these background services and the disruption to the client performance. So the workload generation that we use for this testing is an open source project that Intel sponsored called CosBench, and I dropped the URL in there. We chose that because it's fairly full-featured in terms of the statistics it gathers relative to the client experience, and also because it has a very rich definition language around workloads that allowed us to put together workloads that, again, come closer to simulating production environments. And I'll talk a little bit more about that.
Starting point is 00:07:40 So the topic of designing workloads, I tried to kind of break this up into a couple of different modeling areas. The first one for consideration is the layout, specifically the number of buckets or containers, as well as the number of objects per bucket. Some customers were actually using Ceph with extremely high numbers of objects per bucket and with unexpectedly small object sizes. So some of the performance testing that had been done previously left exposure points around some of the use cases that actually some of our largest customers were deploying with. The other thing around modeling, effectively modeling workloads, is the types of operations that are performed, and in particular, thinking about a mixture of operations rather than segmenting things off and testing them individually. We saw a dramatically different reaction from the software-defined storage Ceph as we started turning knobs here, applying mixtures of operations
Starting point is 00:09:03 and applying histograms of object sizes. And then the final point there is the throughput per day, mapping that back to something that's realistic in a production environment. And by that, I mean separating what's being accessed from what's being modified, which in turn gives you your operation mix. So let's talk a little bit about those final two considerations. First up, the sizing for capacity, which is one approach, and then the sizing for performance. So from a capacity standpoint, the metric we want to get to
Starting point is 00:09:49 is thinking about the number of objects. And in this aspect, we're looking for the test conditions that are listed here in terms of available capacity. What's the desired percent cluster fill, the object size, which again may or may not be a constant or a histogram, and then the number of buckets. This then gets us to a point where if we factor in the replication type, we can predetermine how full the cluster will be. For most of our testing, we run the clusters at least 30% full and in most cases 50% full.
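To make the capacity-sizing arithmetic just described concrete, here is a minimal sketch of the calculation; every number in it is an illustrative assumption rather than a value from the talk.

```bash
#!/usr/bin/env bash
# Illustrative capacity sizing: how many objects a pre-fill needs to hit a target fill level.
raw_capacity_gb=512000   # raw cluster capacity (assumed: 500 TB expressed in GB)
fill_pct=50              # desired percent cluster fill
replication=3            # 3-way replication (substitute the EC overhead factor for EC pools)
avg_obj_kb=64            # assumed average object size implied by the histogram, in KB
num_buckets=100          # number of buckets/containers

# Client data required to reach the fill target, after replication overhead.
client_data_gb=$(( raw_capacity_gb * fill_pct / 100 / replication ))
# Objects the pre-fill stage must write, and how they spread across buckets.
total_objects=$(( client_data_gb * 1024 * 1024 / avg_obj_kb ))
objects_per_bucket=$(( total_objects / num_buckets ))

echo "pre-fill target: ${client_data_gb} GB of client data"
echo "objects: ${total_objects} total, ${objects_per_bucket} per bucket"
```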
Starting point is 00:10:31 And then the sizing for performance. This part of defining a workload really needs to be done so you can identify a reasonable saturation point on the object storage itself. And the way we usually approach this is we have a target latency in terms of quality of service. And a lot of this is set by having discussions with customers around their production environments.
Starting point is 00:10:57 And then what we're looking for is extreme variations off of that latency. Working with CosBench, we size the cluster depending on the footprint and then the target latency, and we usually run a handful of these what I'll call sizing runs to determine that we're using the correct number of workers or think of them as number of drivers from a workload generation standpoint.
Starting point is 00:11:31 So I mentioned on an earlier slide about production simulation versus micro benchmarks. And a fair amount of work has been done with Ceph, you know, in addition to other storage systems with what I would label as micro-benchmarks. They tend to have a single operation type typically and most likely a single object size. And what we found with Ceph
Starting point is 00:12:02 was that a fair amount of unexpected performance happened as you expanded beyond that kind of footprint. There's most definitely rules of thumb relative to object sizes and what kind of use cases those fit, which are listed on this slide. And then there are definitely operation types that are typically evaluated individually as opposed to in a mixed fashion. And again, those are listed here as well. And then the final point about a microbenchmark is, as a characteristic, it likely has a rather short duration time, a short run time. And it's run in kind of what I would consider a lab environment
Starting point is 00:12:52 in terms of a freshly deployed cluster without any kind of expected interruptions from background services or, for that matter, any kind of infrastructure failures. This is what is frequently available for storage, and this is what frequently sets expectations and can lead to rather upset customers because as you bring in the reality of the production environments, you can see dramatically different results and some unexpected consequences. And that leads to identifying issues that need to be resolved
Starting point is 00:13:37 in the implementation of the architecture, and that eventually leads to a more robust and more definable quality of service for your clients. So I've mentioned the production simulation. We directly took feedback from a couple of our key customers. And some of the more challenging ones I mentioned before were using Ceph with what we would consider to be extremely small object sizes.
Starting point is 00:14:08 Here's one such example. Based on a 48 hour collection period we saw 50% of the object sizes were at 1K. And we came up with a histogram here that we've stuck to quite a bit
Starting point is 00:14:29 as we've done our release-to-release testing here looking for regressions within Ceph. And you can see how that's laid out here in terms of small object sizes. Then the 15% distribution amongst the 64K, 8 meg, and 64 meg, and then finally a small 5% footprint around the large object sizes. And that workload, the definition of the object sizes on the top,
Starting point is 00:14:59 the definition of the types and mixture of operations is on the bottom. So again, this was observed in a customer production environment and used as a baseline for a lot of our testing relative to identifying quality of service. Yes? Do you use the same operation type and mix for each one of the object sizes? Yes. So in CosBench, you're allowed to specify,
Starting point is 00:15:35 literally give it a histogram of object sizes. And so that histogram is randomly applied to each of the operation types. Now, some of the operation types don't require an object size in your workload specification, right? If you're doing a list on a specific container, with a specific bucket, with a specific object, you're not handing that operation an object size. But certainly with the writes, you are. From understanding your customer base, is the 60% read in a real-world environment
Starting point is 00:16:16 as applicable to the 1K object as it is to the large objects? Yes. Yep. So when we do a pre-fill of the cluster, we use that object size histogram to lay it out. Okay. The question is, when you collected the information from the 48 hours of production monitoring,
Starting point is 00:16:39 and you observed the 60% read, was that 60% read equally distributed across all of those object sizes? And the answer is yes. Yes. Sure. Yeah. Yep.
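To show roughly what a definition like that looks like on disk, here is a hedged sketch of a CosBench workstage. The 50/15/15/15/5 size histogram and the 60% read share come from the talk; the h(min|max|weight)KB size syntax and the ratio attributes follow my reading of the CosBench workload format, while the remaining operation split, worker count, container/object ranges, and the 256 MB stand-in for the unspecified "large" objects are placeholders, not the actual templates used by the tooling.

```bash
# Hypothetical mixed-operations workstage, written out via a heredoc; illustrative only.
cat > mixed-ops-sketch.xml <<'EOF'
<workstage name="mixed_ops">
  <work name="hybrid" workers="64" runtime="3600">
    <!-- 60% reads per the production trace; the other ratios are assumed -->
    <operation type="read"   ratio="60"
      config="cprefix=mybuckets;containers=u(1,50);objects=u(1,5000)"/>
    <operation type="write"  ratio="20"
      config="cprefix=mybuckets;containers=u(1,50);objects=u(5001,10000);sizes=h(1|1|50,64|64|15,8192|8192|15,65536|65536|15,262144|262144|5)KB"/>
    <operation type="list"   ratio="10"
      config="cprefix=mybuckets;containers=u(1,50);objects=u(1,5000)"/>
    <operation type="delete" ratio="10"
      config="cprefix=mybuckets;containers=u(1,50);objects=u(1,5000)"/>
  </work>
</workstage>
EOF
```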
Starting point is 00:17:01 The tooling I'll show you gives you that flexibility to be able to put together workload specifications with that degree of specificity. Yep. Now, one topic I want to cover briefly here is around workload generators. As I mentioned, we're using CosBench. It's a Java-based application. It's separated into a client-server kind of model, or what's considered a controller along with drivers. It supports multiple drivers.
Starting point is 00:17:36 The drivers are fairly heavyweight Java processes that require fairly significant resources on your workload generation side. So you need to be aware of the driver processing overhead, and fortunately, CosBench provides something called the mock storage driver, which basically is equivalent to kind of a /dev/null storage driver, so you can specify the actual delay period. By default, it's 10 milliseconds.
Starting point is 00:18:09 I did some testing where I set that to zero. And my objective there was to try to understand how fast could I drive CosBench, or how fast and hard could CosBench drive infinitely quick storage, right? And the observations I came up with are listed here. So your biggest bottleneck from a CosBench throughput standpoint is write operations. And I eventually was able to look at this enough where I came up with an understanding of what I would consider.
Starting point is 00:18:44 I use the word optimal here for ratio. Maybe it's better as a suggested ratio. But you really want to be careful about throwing too many drivers per hardware node and then getting very misleading results back from CosBench. So a rule of thumb here is to ensure that the workload generation side is not a bottleneck. And for CosBench, that ratio works out to be four drivers per client node, and each of those drivers themselves can in parallel support three worker threads.
Starting point is 00:19:24 If you try to push too much further than that, you're going to be introducing some artificial delays from the workload generation side. This is with /dev/null? Do you have the target device for... It avoids the communication to the storage device itself. So it's not exactly /dev/null. It's skipping quite a bit of code inside of the driver.
Starting point is 00:19:52 Any of the code that's inside of CosBench specific to S3 or Swift or whatever is supported. So this is the driver processing overhead independent of having to touch storage. And when you say multiple drivers and three workers, I'm thinking more processes and then three threads? I'll show you how it's specified, but literally what you do is you configure CosBench for how many drivers you want it to be configured with and on what systems those drivers should be running.
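As a preview of the setup walked through next, here is a hedged sketch of a two-driver, single-host layout; it assumes CosBench's default driver port of 18088 and its conf/controller.conf location, and the paths are placeholders rather than anything stated in the talk.

```bash
# Sketch of a controller.conf for two drivers co-located on the controller host.
cat > conf/controller.conf <<'EOF'
[controller]
drivers = 2
log_level = INFO
log_file = log/system.log
archive_dir = archive

[driver1]
name = driver1
url = http://127.0.0.1:18088/driver

[driver2]
name = driver2
url = http://127.0.0.1:18188/driver
EOF

# Start the matching number of driver processes (each additional driver's port is offset by 100),
# then the controller.
sh start-driver.sh 2
sh start-controller.sh
```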
Starting point is 00:20:35 So let's just go through some of the semantics of setting that up. This is another way of stating the system resource utilization on a per-driver basis. And you can run CosBench in any number of fashions, relative to either starting with co-location of the controller and driver, or you can have one controller and have it communicating out to multiple drivers. And this provides basically the syntax of a key configuration file in CosBench called controller.conf. This example here shows that if you were to have two drivers and they were both located on the localhost, you list them out inside of this controller file, and you'll notice that the port offset
Starting point is 00:21:31 is by 100. So when you use CosBench to basically say start-driver.sh with two, it starts two separate driver processes and offsets the port count by 100. So you could do 3, 4, whatever in there. You'd need to manually edit the controller.conf
Starting point is 00:21:58 so that the controller is aware of the IP addresses and the port numbers where each of those drivers reside. And then a quick visual of kind of a testbed configuration and components. In our configurations, we tend to use dedicated networks for the back end of the storage, which is at the bottom there, the Ceph cluster network. And then we have kind of the Ceph public network. And the CosBench workload generators are at the client level, driving workload into the RADOS gateways, which are denoted with RGWs.
Starting point is 00:22:48 So each of these nodes in the middle here are Ceph nodes. We usually run with at least 10 gig Ethernet, more commonly 25 or 40. Both? Both. Yeah. How many racks, and what's in a rack? So we have a scale lab that has, at this point, I think there's probably about 20 racks of equipment in there. Most of the testing we do at the scale is within one rack, if possible,
Starting point is 00:23:40 because the scale lab is a shared resource, so we don't run tests at the full footprint. Most of the tests that I would be doing would probably involve between 12 and 16 storage nodes, each with 24 to 36 drives, be they NVMe or hard drives or a combination of both. So this is the inventory of the actual tools themselves and the URL out to the GitHub. The first tool merely generates the workload files.
Starting point is 00:24:21 So it's kind of a standalone tool that helps you generate workload files for CosBench. Generating workload files for CosBench can be a bit error prone or fat finger prone. It's XML syntax. So this is just a way of being able to specify the various characteristics of the workload and have it broken down into different work stages. And then there's RGW test, which is kind of the real workhorse. That runs the workloads and also does monitoring and captures logs with all of the system resource utilization. And it also has polling capabilities that can monitor things like garbage collection. And then the final two I just put in there for people's knowledge.
Starting point is 00:25:18 This software can run on Kubernetes, right? So there's a GitHub repo for that. And then there's a separate repo that focuses on injecting failures into Ceph OSD nodes and allows us to isolate recovery periods for whether the scenario is an individual disk or device going away as an OSD, or in some situations we take an entire OSD system offline while we're applying a workload. So GenXMLs, again, it's a very kind of simple and straightforward shell script. There's a series of template files for the XML workloads, and based on the variables and the way the variables are set inside the script,
Starting point is 00:26:09 it creates the workload files for you. It also includes a handy utility, something called CB Parser, which is a Python script that allows you to look at and produce CosBench reports outside of the CosBench controller GUI. So that's somewhat handy. So pretty straightforward usage. You edit the script. The key variables are listed here, right?
Starting point is 00:26:43 The access key and the secret key and the endpoint. You give things a test name so you have some way of identifying them afterwards, and then they're time-stamped. And then the key elements for the workload itself, the runtime, the object sizes, the number of containers, and the number of objects. And this produces the following workload files. But, I mean, it's just very straightforward bash code that's easily modifiable if you wanted to create alternative workload files.
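For orientation, here is a hedged sketch of the kind of variable block being described; the variable names and values are made up for illustration and may not match the actual genxml.sh.

```bash
# Illustrative settings of the sort edited at the top of the workload-generator script.
TESTNAME="test1"                              # prefix used to identify the generated files
S3_ENDPOINT="http://rgw.example.com:8080"     # RADOS gateway endpoint (placeholder)
ACCESS_KEY="replace-me"
SECRET_KEY="replace-me"
RUNTIME=3600                                  # seconds per measurement workstage
NUM_CONTAINERS=100                            # buckets
NUM_OBJECTS=10000                             # objects per bucket
OBJECT_SIZES="h(1|1|50,64|64|15,8192|8192|15,65536|65536|15,262144|262144|5)KB"
```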
Starting point is 00:27:18 But these create a sequence of workloads that we frequently use where we're doing the cluster pre-fill. Then we can run any of the sequential or random ops, which run in isolated mode as two separate workstages. But the workload that we use most frequently is called mixed ops, which performs this pre-defined mixture of operations that I mentioned before, all simultaneously. And then finally, there's an empty cluster that you typically run as the last workload.
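If you drive these by hand rather than through the harness described next, one plausible way to run that sequence is the cli.sh submit interface bundled with CosBench; the file names below just follow the talk's description of the generated workloads and are otherwise guesses.

```bash
# Submit the generated workload sequence from the command line (paths and names illustrative).
cd /opt/cosbench
sh cli.sh submit /root/genXMLs/test1-fill-workload.xml      # pre-fill the cluster
sh cli.sh submit /root/genXMLs/test1-mixedOps-workload.xml  # measured mixture of operations
sh cli.sh submit /root/genXMLs/test1-empty-workload.xml     # empty the cluster afterwards
```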
Starting point is 00:28:01 Now, I mentioned RGW test. This is the harness that will run the workloads. And it actually also has capabilities for generating workload files as well. But you can take workload files from GenXMLs and run those here. The key thing that RGW test buys you is some additional logging over and above the results that CosBench provides. The monitoring that happens at the system resource level
Starting point is 00:28:37 is all done inside of a polling script and the frequency of that polling is user configurable. There's also heavily commented code around adding additional polling capabilities in there. So again,
Starting point is 00:28:58 yes, question? Does the test script always create a new pool? When you run the reset script, it does. But you don't have to necessarily do that. And, in fact, we try to resist doing that because we're trying to simulate production environments, so we want certain aging aspects to come through. A usage step here in terms of write XML, which produces the XML files for RGW test.
Starting point is 00:29:29 Reset RGW, which is what recreates the pools. And then a series of these run IO workloads where you go through the scenario of pre-filling, aging, and measuring the cluster performance. So these are the three workload files that RGW test produces, but again, you can pull in workloads that were created with the GenXML tooling as well. I have two slides here that talk about the key variables relative to RGW test configuration. Since it reaches out to the Ceph systems to identify system resource utilization, it needs to know their system names, right? So there's a Mon host name, an RGW host name.
Starting point is 00:30:34 It needs to know where CosBench has been installed because it issues the command to start the workload to CosBench itself. And then there's authentication information in there, endpoint information, whether you're using three-way replication or some form of erasure coding. And then, for Ceph purposes, there's PG num information that's applied to the pools as they're created. And the second set of key variables here has to do with the workload definition, things like object sizes, number of containers, number of objects, et cetera. And then I spoke before about sizing up the cluster for identifying a performance threshold,
Starting point is 00:31:21 and that's where these worker counts come in. There's support for individually specifying the number of workers that you want to be used for the cluster fill versus the actual measurement workload. I just wanted a quick slide in here to let people know that this is extendable, so while it supports certain workloads based on pre-existing templates, you can define new workload specifications by either modifying existing templates or creating new ones. So let's take a quick look at a log file.
Starting point is 00:32:18 I mentioned that RGW test squirrels away system resource information by polling periodically in a background process. And by default, it collects those statistics on a one-minute interval. And the types of things it grabs are listed here. And all of the entries are time-stamped. So as an example excerpt from a log file run, this would be something where we'd be looking at the start and the end, and specifically the information that was parsed out from the results of running ceph df.
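The flavor of that background polling can be sketched in a few lines; this is a minimal stand-in for the polling script described later in the demo, not the actual code, though the commands themselves are standard Ceph and Linux tools.

```bash
#!/usr/bin/env bash
# Minimal sketch of one-minute polling: time-stamped cluster fill and per-process usage.
LOGFILE="poll-$(date +%Y%m%d-%H%M%S).log"
while true; do
    {
        echo "===== $(date -Is) ====="
        ceph df        # raw used, percent used, per-pool statistics
        ceph -s        # overall cluster health and activity
        ps -C ceph-osd,ceph-mon,radosgw -o pid,rss,%cpu,%mem,comm   # Ceph process footprint
        uptime         # load average
    } >> "$LOGFILE"
    sleep 60
done
```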
Starting point is 00:33:00 Now, this testing harness was used to do a performance evaluation a while back on the transition from FileStore to BlueStore support within Ceph. And this table here shows the performance advantage that was measured using this workload approach for BlueStore. So in addition to the CosBench results, which are seen in the table up above, the log file statistics provided a deeper level of insight because they recorded things like load average
Starting point is 00:33:35 as well as CPU usage for the various Ceph processes. Here's a view of what we call the hybrid workload. So the previous one was for cluster fill. This is for the mixture of operations. And again, you can see the percent change with BlueStore.
Starting point is 00:34:02 And again, the log file statistics. So let's see how I do on time here. I'll put the spectacles on. All right, so what I've got running on my laptop is a containerized version of Ceph with a fairly small footprint. And what I thought I would do real quickly, just generate some workload files and then run them through the scripting. So I've got a collection of, so this is the repo from GenXMLs, and you can see the CB parser script there as well as genxml.sh.
Starting point is 00:35:08 So that then created these workload files that we see here, prefixed with test1. So if we just wanted to take a look at the fill workload: again, all the script does is reach into the XML templates, which have a template file for each of these types of workloads, and it fills them in with information that came out of the shell script. So it's got the access key and the secret key and all the authentication information in there.
Starting point is 00:35:41 And in this case, we're running on the laptop, so we're keeping things pretty small. We've got two workers specified, and then using a total of five buckets, each with 500 objects, and at a constant object size for demo purposes. So let me bring up... This is the CosBench aspect of the demo that's running.
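For reference, a pre-fill workstage like the one just described (two workers, five buckets, 500 objects each, constant object size) might look roughly like this; the prepare work type and the c()/r() syntax follow my reading of the CosBench format, and the 4 KB constant is a stand-in since the demo's actual object size isn't stated.

```bash
# Hypothetical fill workstage for the laptop demo parameters; illustrative only.
cat > fill-sketch.xml <<'EOF'
<workstage name="fill">
  <work type="prepare" workers="2"
    config="cprefix=test1;containers=r(1,5);objects=r(1,500);sizes=c(4)KB"/>
</workstage>
EOF
```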
Starting point is 00:36:06 And for those who haven't seen it, there's a CLI way to submit the workloads, but there's also a way to do that through this GUI, this controller GUI. So if we press Submit here, this should take about a minute to run, this particular workload. This controller GUI in CosBench does show you what's called a performance graph that just refreshed here. And it shows that in terms of throughput as well as response time and bandwidth. All of the information that CosBench is collecting here, it puts in an archive directory. And so we can use that CB parser Python script to basically get a textual version of this
Starting point is 00:37:02 after the run is over. Let's see. I think I put it up here. Yeah. And this would be in the archive directory and should be... All right, so that job completed, and if we go back to the index, we'll see that listed in our inventory now as w1,
Starting point is 00:37:41 and we can see that, again, this Python script just goes through that archive information and creates something that CosBench calls the general report. It shows information about average response time, processing time, and throughput. And then we frequently look not only at average response time, but we also pay very close attention to things like 99th percentile response time to make sure that there aren't dramatic spikes in some of the higher latency counts. John?
Starting point is 00:38:22 Yeah. The report that you showed, is it part of CosBench? Correct. Okay. Yep, yep. And if we go over to, right, so let me go. So I have this little shell script that just basically does an S3 command listing and then does a ceph df.
Starting point is 00:38:48 So we can see that the cluster fill did hit, you know, use up some amount of the storage on the containerized Ceph. If we take a look at RGW test, what we see here is that there are several of what are called results directories; these are time-stamped. The utils directory contains the probe, or polling, scripts. So there's a script in there called poll.sh, which is what grabs the various Ceph statistics. So we can go and run.
Starting point is 00:39:48 Let's run an empty. So I'm going to reuse the workloads that were created in that GenXMLs directory here. And so this is the syntax that you can hand to run IO workload. And as this runs, it'll do some monitoring in the background. So everything you see going across the screen right now is going into a results file that's time-stamped with the job ID associated. And we can see here some of the information that's being captured by the polling. It's got the percent raw used. It's got information about memory footprint as well as processor activity going on for
Starting point is 00:40:37 the various Ceph processes. And when this wraps up, it'll tell us the name of the log file that captured all the results here. So we frequently do runs that may run for days in order to simulate aging, and a lot of what we're looking for is, okay, let's take a fresh deployment. Let's do what we consider a measurement run to get a feel for what the performance is. Then we'll do some amount of aging, and that may include an infrastructure failure. In addition to just running workloads for an extended period, there may be other operations that are injected into the cluster.
Starting point is 00:41:25 And then we'll rerun a measurement run, which typically we run for about an hour or two, looking for any significant deltas off of what we started with as an initial installation to help identify expectations around degradation as the cluster ages. And this analysis and work has, I think, as I mentioned earlier, led to a number of improvements, particularly in RGW garbage collection and some of the efficiencies for some of the Ceph background services that have to do with operational events. So at this point, that completed.
Starting point is 00:42:11 You can see the name of the log file there listed. And you can also see, if we go back up and refresh this CosBench GUI, we can see that there has now been a fill workload run as well as an empty workload. And if we go in, the statistics are provided there within that GUI, but they could also be seen using that CB parser utility as well. So that's about it. Questions? The output files we were looking at
Starting point is 00:42:50 seem human-readable. Do you have something else that goes behind and pulls them into data frames or something to make them more easily analyzed? Yeah, so we have...
Starting point is 00:43:05 We typically do things with Prometheus and Grafana. So... But that whole infrastructure isn't integrated into what's on GitHub, right? But it could certainly be adapted for that. One other question. When you went through the customer view of their data set and what they have, I don't know if Cosbench can do it or not,
Starting point is 00:43:36 but there was no mention of the compressibility or dedupability of the data set. Yeah, actually it's something I've started working with recently. By default, CosBench supports two streams of what I would consider pre-fill content for the objects. This is set with a flag called content=. By default, content= is what CosBench considers to be randomized. But if you really look carefully at the code,
Starting point is 00:44:11 what you find out is that it's a repeatable randomized pattern. And it's only generated from a small subset. It's generated from the lowercase characters of A through Z. So that's been extended by the storage provider Nexenta. They have a GitHub repo with a version of CosBench that does truly randomized input with non-repeatable chunks. And so we've been starting to utilize that now as we're taking closer looks at dedupe and compression.
Starting point is 00:44:44 Yes? That's a good segue to the concern I have. So the CosBench that's out there, you can tell, it seems like it hasn't been updated. Yes, yeah. I even tried to contact the maintainers. Yeah. Is there an industry-wide kind of consensus on maybe having a common tool that people start following? Yeah, I wish there were.
Starting point is 00:45:13 A number of people have taken it in different directions, like I mentioned. Unfortunately, there's no single place where all of this is being rolled up. So it is stale. I've worked through the specifics of how to rebuild it, and that takes a fair amount of heavy lifting. And it's something that I plan to put out on GitHub, basically the steps that I was successful with. There was one fellow who basically submitted a series of pull requests specific to the latest release, which is 0.4.2, which out of the box will not work, right? And it will not build. So he submitted a series of pull requests to get that to the point where you could build it.
Starting point is 00:46:02 And I verified certainly that is correct. But you have to go to several different places to get the information in order to rebuild this. So I put together a brief document, a not-so-brief document, that talks about using those pull requests as well as how to rebuild it, you know, with success using the 0.4.2 code base. So that's something I'll be putting up on GitHub as well. And isn't performance a sensitive topic when it's published? And I'm like, okay, whatever, if you can't even agree on the tool
Starting point is 00:46:40 that generated those numbers or whatever, versus, you know, which tool generated them. Right. Yep. Any other questions? All right, thank you. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org. Here you can ask questions and discuss this topic further with your peers in the storage developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.
