Storage Developer Conference - #127: Object Storage Workload Testing Tools
Episode Date: June 9, 2020...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org
slash podcasts. You are listening to SDC Podcast, episode 127. Hello, everybody. My name is John
Harrigan. I work for Red Hat, on a team called Performance and Scale Engineering. Part of what we do is expand testing activities within Red Hat beyond functionality into actual
usability and into what I would consider day-two operations. So, things beyond just
looking at a capability from a raw performance standpoint. But as we move into some of the technologies,
particularly in the software-defined storage area,
I work with Ceph.
It's a fairly complicated system.
There's a number of background activities,
and those can affect the quality of service at the client level.
So I'm basically presenting today to talk about some of the tooling I've done,
some of the tooling that is available out on GitHub
for more thoroughly evaluating object storage
with customer representative workloads.
So as I was putting this presentation together,
I thought a little bit about the takeaways and what kind of things I was trying to impart on attendees.
And I came up with these three bullets here,
trying to increase awareness and knowledge around designing the workload.
When I say workload, as you see as we go through the slides, in the performance
and scale team, we not only look at a workload from a standpoint of a description of a specific
benchmark exerciser, but we also look at it from the standpoint of how can we approximate
some of the activities that happen at an operational
level in a production environment.
Then the next bullet talks about the specific storage workload tools that I've posted out
on GitHub.
And then finally some information about analyzing the results that come back from these runs and from these tools.
So I'm going to start off by talking about the motivation,
how you go about effectively designing workloads, the tools themselves,
and then we'll have a demo and open it up for questions.
The slide deck also does include some backup slides that have some additional detail on
eye charts, particularly when it comes to analyzing the results.
So on the topic of motivation to develop this, it was basically driven out of initial customer experience with the Red Hat Ceph Storage product
and identification of a need to start to incorporate workload-driven requirements
into the overall development process, as well as the documentation, test, and release processes.
And we also wanted to be able to effectively and efficiently reapply this test methodology
in a continuous fashion through future releases.
So we didn't look at this as a one-off exercise.
So we knew that automating and investing in some automation and knowledge transfer with
other organizations would be time well spent. The premise behind this was to, as closely as
possible, given our hardware footprints, simulate customer production environments. And by that I mean not do benchmarking, right?
We were not looking to create publishable numbers here.
What we're looking to do is identify deficiencies or issues within the product
around unexpected changes in performance and unexpected increases,
particularly in client latencies that would be found in production environments.
We worked with a number of key customers to develop the workloads
and then run those on scale-out footprints.
And in addition to looking at the results that were coming back from the workload generator,
we also wanted to capture things like system resource utilization statistics,
so have an idea of what kind of memory and CPU were being used by the various Ceph processes and background services.
So a few things here on this testing methodology slide, some of which may be redundant to the previous one, but the
idea is to develop a comprehensive workload
profile. And again, as closely as possible,
simulate production environments. By that I mean, right off the bat,
pre-filling the clusters.
So rather than running tests on brand-new deployments with no capacity taken from pools,
we run tests where we actually have specific activities around filling the cluster,
aging the cluster, creating those kinds of conditions,
and applying events like failures
to see how much the client performance is impacted.
And the results of these tests
have led to a steady progression of improvement in Ceph
relative to the impact of these background services and the disruption to the client performance.
So the workload generation that we use for this testing is an open source project that Intel sponsored
called CosBench, and I dropped the URL in there.
We chose that because it's fairly full-featured
in terms of the statistics it gathers relative to the client experience,
and also because it has a very rich definition language around workloads
that allowed us to put together workloads that, again, come closer to simulating production environments.
And I'll talk a little bit more about that.
So the topic of designing workloads, I tried to kind of break this up into a couple of different modeling areas.
The first one for consideration is the layout, specifically the number of buckets or containers, as well as the number of objects per bucket. Some of our customers were actually using Ceph with extremely high numbers of objects per bucket
and with unexpectedly small object sizes.
So some of the performance testing that had been done previously left exposure points around some of the use cases that actually some of our
largest customers were deploying with. The other thing around modeling, effectively modeling
workloads, is the types of operations that are performed, and in particular, thinking about a mixture of operations rather than segmenting things off and testing them individually.
We saw a dramatically different reaction from the software-defined storage Ceph
as we started turning knobs here, applying mixtures of operations
and applying histograms of object sizes.
And then the final point there is the throughput per day,
mapping that back to something that's realistic in a production environment.
And by that I mean separating what's being accessed
from what's being modified, which then in turn gives you your operation mix.
So let's talk a little bit about those final two considerations.
First up, the sizing for capacity, which is one approach, and then the sizing for performance.
So from a capacity standpoint, the metric we want to get to
is thinking about the number of objects.
And in this aspect, we're looking for the test conditions that are listed here
in terms of available capacity.
What's the desired percent cluster fill, the object
size, which again may or may not be a constant or a histogram, and then the number of buckets.
This then gets us to a point where if we factor in the replication type, we can predetermine
how full the cluster will be. For most of our testing, we run the clusters at least 30% full
and in most cases 50% full.
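As a rough illustration of that capacity-sizing arithmetic, here is a minimal sketch; it is not part of the published tooling, and the helper name and example node counts are assumptions chosen for illustration:

```python
def objects_for_fill(raw_capacity_tb, fill_pct, replication_factor, mean_object_kb):
    """Approximate object count for a pre-fill stage (hypothetical helper)."""
    raw_bytes = raw_capacity_tb * 1024**4
    usable_bytes = raw_bytes / replication_factor      # 3-way replication divides raw capacity by 3
    target_bytes = usable_bytes * (fill_pct / 100.0)   # e.g. 30% or 50% cluster fill
    return int(target_bytes / (mean_object_kb * 1024))

# Example: 12 nodes x 24 drives x 4 TB, 3-way replication, 50% fill, 64 KB mean object size
print(objects_for_fill(12 * 24 * 4, 50, 3, 64))
```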
And then the sizing for performance.
This part of defining a workload,
really it needs to be done for you to identify
a reasonable saturation point on the object storage itself.
And the way we usually approach this is we have a target latency
in terms of quality of service.
And a lot of this is set by having discussions with customers
around their production environments.
And then what we're looking for is extreme variations off of that latency.
Working with CosBench,
we size the cluster depending on the footprint and then the target latency,
and we usually run a handful of these
what I'll call sizing runs
to determine that we're using the correct number of workers
or think of them as number of drivers
from a workload generation standpoint.
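A minimal sketch of such a sizing run, assuming a hypothetical run_workload callable that submits a short CosBench job at a given worker count and returns the measured 99th percentile latency, might look like:

```python
TARGET_P99_MS = 100   # quality-of-service latency target agreed with the customer (example value)

def find_worker_count(candidate_counts, run_workload):
    """Step through worker counts until the p99 latency exceeds the target.

    run_workload is a hypothetical callable: it submits a short CosBench job with
    the given worker count and returns the measured p99 latency in milliseconds.
    """
    best = None
    for workers in candidate_counts:          # e.g. [8, 16, 32, 64]
        if run_workload(workers) > TARGET_P99_MS:
            break                             # saturation point reached; stop scaling up
        best = workers
    return best
```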
So I mentioned on an earlier slide
about production simulation versus micro benchmarks.
And a fair amount of work has been done with Ceph,
you know, in addition to other storage systems with what I would label as micro-benchmarks.
They tend to have a single operation type
typically
and most likely a single object size.
And what we found with Ceph
was that a fair amount of unexpected performance happened as you expanded beyond that kind of footprint.
There's most definitely rules of thumb relative to object sizes and what kind of use cases those fit, which are listed on this slide. And then there are definitely operation types that are
typically evaluated individually
as opposed to in a mixed fashion. And again, those are
listed here as well. And then the final point about
a microbenchmark is, as a characteristic,
it likely has a rather short duration time, a short run time.
And it's run in kind of what I would consider a lab environment
in terms of a freshly deployed cluster
without any kind of expected interruptions from background services
or, for that matter, any kind of infrastructure failures.
This is what is frequently available for storage,
and this is what frequently sets expectations and can lead to rather upset customers
because as you bring in the reality of the production environments,
you can see dramatically different results and some unexpected consequences.
And that leads to identifying issues that need to be resolved
in the implementation of the architecture,
and that eventually leads to a more robust
and more definable quality of service for your clients.
So I've mentioned the production simulation.
We directly took feedback from a couple of our key customers.
And some of the more challenging ones I mentioned before were using Ceph
with what we would consider to be
extremely small object sizes.
Here's one such example.
Based on a 48 hour
collection period
we saw 50%
of the object sizes
were at 1K.
And we came up with a histogram here
that we've stuck to quite a bit
as we've done our release-to-release testing here
looking for regressions within Ceph.
And you can see how that's laid out here
in terms of small object sizes.
Then the 15% distribution
amongst the 64K, 8 meg, and 64 meg,
and then finally a small 5% footprint around the large object sizes.
And that workload, the definition of the object sizes on the top,
the definition of the types and mixture of operations is on the bottom.
So again, this was observed in a customer production environment
and used as a baseline for a lot of our testing
relative to identifying quality of service.
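A rough encoding of that histogram for experimentation purposes might look like the following; reading the middle buckets as 15% each and using 512 MB for the "large" bucket are assumptions on my part, not details from the talk:

```python
import random

# Object-size histogram modeled on the 48-hour customer collection described above.
SIZE_HISTOGRAM = [
    (1 * 1024,          0.50),   # 1 KB   - half of all objects
    (64 * 1024,         0.15),   # 64 KB
    (8 * 1024 ** 2,     0.15),   # 8 MB
    (64 * 1024 ** 2,    0.15),   # 64 MB
    (512 * 1024 ** 2,   0.05),   # "large" objects (placeholder size)
]

def sample_object_size():
    sizes, weights = zip(*SIZE_HISTOGRAM)
    return random.choices(sizes, weights=weights, k=1)[0]
```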
Yes?
Do you use the same operation type and mix
for each one of the object sizes? Yes.
So in CosBench, you're allowed to specify,
literally give it a histogram of object sizes.
And so that histogram is randomly applied to each of the operation types. Now, some of the operation types don't require,
in your workload specification, an object size, right?
If you're doing a list on a specific container with a specific bucket,
with a specific object, you're not handing that operation an object size.
But certainly with the writes, you are.
From understanding your customer base,
is the 60% read in a real-world environment
as applicable to the smallest objects as it is to the largest objects?
Yes.
Yep.
So when we do a pre-fill of the cluster,
we use that object size histogram to lay it out.
Okay.
The question is, when you collected the information
from the production monitoring 48 hours,
and you observed the 60% read,
was the 60% read equally distributed across all of those object sizes?
And the answer is yes.
Yes.
Yes.
Sure.
Yeah.
Yep.
The tooling I'll show you gives you that flexibility
to be able to put together workload specifications with that degree of specifics.
Yep.
Now, one topic I want to cover briefly here is around workload generators.
As I mentioned, we're using Cosbench. It's a Java-based application.
It's separated into a client-server kind of model
or what's considered a controller along with drivers.
It supports multiple drivers.
The drivers are fairly heavyweight Java processes
that require fairly significant resources
from your workload generation side.
So you need to be aware of the driver processing overhead,
and fortunately, CosBench provides something called the mock storage driver,
which basically is equivalent to kind of like a dev null storage driver,
so you can specify the actual delay period.
By default, it's 10 milliseconds.
I did some testing where I set that to zero.
And my objective there was to try to understand how fast could I drive Cosbench
or how fast and hard could Cosbench drive infinitely quick storage.
And the observations I came up with are listed here.
So your biggest bottleneck from a CosBench throughput standpoint are write operations.
And I eventually was able to look at this enough
that I came up with an understanding of what I would consider,
I use the word optimal here, a ratio.
Maybe it's better called a suggested ratio.
But you really want to be careful about throwing too many drivers per hardware node
and then getting very misleading results back from CosBench.
So a rule of thumb here is to ensure that the workload generation side
is not a bottleneck.
And for Cosbench, that ratio works out to be four drivers per client node,
and each of those drivers themselves can in parallel support three worker threads.
If you try to push too much further than that,
you're going to be introducing some artificial delays from the workload generation side.
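Putting numbers on that suggested ratio, a tiny sketch (the totals here are illustrative):

```python
DRIVERS_PER_NODE = 4     # suggested CosBench drivers per client node
WORKERS_PER_DRIVER = 3   # worker threads each driver can sustain in parallel

def client_nodes_needed(total_workers):
    per_node = DRIVERS_PER_NODE * WORKERS_PER_DRIVER   # 12 concurrent workers per node
    return -(-total_workers // per_node)                # ceiling division

print(client_nodes_needed(96))   # 96 workers -> 8 client nodes
```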
This is with DevNull?
Do you have the target device for...
It avoids the communication to the storage device itself.
So it's not exactly dev null.
It's skipping quite a bit of code inside of the driver.
Any of the code that's inside of Cosbench specific to S3 or Swift or whatever is supported.
So this is the driver processing overhead independent of having to touch storage.
And when you say more clients than three workers,
I'm thinking more processes and then three threads?
You literally, I'll show you how it's specified,
but literally what you do is you tell Cosbench,
you configure Cosbench for how many drivers you want it to be configured with
and on what systems those drivers should be running.
So let's just go through some of the semantics of setting that up.
This is another way of stating the system resource utilization on a per-driver aspect.
And you can run CosBench in any number of fashions relative to either starting with co-location of the controller and driver,
or you can have one controller and have it communicating out to multiple drivers.
And this provides basically the syntax of a key configuration file that's in CosBench called controller.conf.
This example here shows if you were to have two drivers and they were both located on the local host
you list them out inside of this controller file
and you'll notice that the port offset
is by 100.
So when you use CosBench
to basically say start driver.sh
with two
it starts two separate driver processes
and offsets the port count by 100.
So you could do 3, 4, whatever in there.
You'd need to manually edit the controller.conf
so that the controller is aware of the IP addresses
and the port numbers where each of those drivers reside.
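For reference, a controller.conf for that two-driver, local-host case might look something like the following; the port numbers shown are CosBench defaults as I understand them (18088, then offset by 100), so verify against your installation:

```
[controller]
drivers = 2
log_level = INFO
log_file = log/system.log
archive_dir = archive

[driver1]
name = driver1
url = http://127.0.0.1:18088/driver

[driver2]
name = driver2
url = http://127.0.0.1:18188/driver
```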
And then a quick visual of kind of a testbed configuration and components.
In our configurations, we tend to use dedicated networks for the back end of the storage,
which is at the bottom there, the Ceph cluster network.
And then we have kind of the Ceph public network.
And the CosBench workload generators are at the client level,
driving workload into the RADOS gateways, which are denoted with RGWs.
So each of these nodes in the middle here are Ceph nodes.
We usually run with at least 10 gig Ethernet, more commonly 25 or 40.
Both?
Both.
Yeah.
How many racks, and what's in a rack?
So we have a scale lab that has, at this point, I think there's probably about 20 racks of equipment in there.
Most of the testing we do at the scale is within one rack, if possible,
because the scale lab is a shared resource,
so we don't run tests at the full footprint.
Most of the tests that I would be doing
would be probably involving between 12 and 16 storage nodes,
each with 24 to 36 drives,
be they NVMe or hard drives or a combination of both.
So this is the inventory of the actual tools themselves and the URL out to the GitHub.
The first tool merely generates the workload files.
So it's kind of a standalone tool that helps you generate workload files for CosBench. Generating workload files for CosBench can be a bit error prone or fat finger prone.
It's XML syntax. So this is just a way of being able to specify the various characteristics of the workload and have it broken down into different work stages.
And then there's RGW test, which is kind of the real workhorse.
That runs the workloads and also does monitoring
and captures logs with all of the system resource utilization.
And it also has polling capabilities
that can monitor things like garbage collection.
And then the final two I just put in there for people's knowledge.
This software can run on Kubernetes, right?
So there's a GitHub repo for that. And then there's a separate repo that focuses on injecting failures into Ceph OSD nodes
and allows us to isolate recovery periods for whether the scenario is an individual disk
or device going away as an OSD,
or in some situations we take an entire OSD system offline while we're applying a workload.
So GenXMLs, again, it's a very kind of simple and straightforward shell script.
There's a series of template files for the XML workloads,
and based on the variables and the way the variables are set inside the script,
it creates the workload files for you.
It also includes a handy utility, something called CB Parser,
which is a Python script that allows you to look at and produce
CosBench reports outside of the CosBench controller GUI.
So that's somewhat handy.
So pretty straightforward usage.
You edit the script.
The key variables are listed here, right?
The access key and the secret key and the endpoint.
You give things a test name so you have some way of identifying them afterwards,
and then they're time-stamped.
And then the key elements for the workload itself, the runtime, the object sizes,
the number of containers, and the number of objects.
And this produces the following workload files.
But, I mean, it's just very straightforward bash code
that's easily modifiable if you wanted to create alternative workload files.
But these create a sequence of workloads that we frequently use
where we're doing the cluster pre-fill.
Then we can run any of the sequential or random ops,
which run in isolated mode as two separate workstages.
But the workload that we use most frequently
is called mixed ops,
which performs this pre-defined mixture of operations that I mentioned before, all simultaneously.
And then finally, there's an empty-cluster workload that you typically run as the last workload.
Now, I mentioned RGW test.
This is the harness that will run the workloads.
And it actually also has capabilities for generating workload files as well.
But you can take workload files from GenXMLs and run those here. The key thing that RGW test buys you is
some additional logging over and above
the results that CosBench provides.
The monitoring that happens at the
system resource level
is all done inside of a polling
script and the frequency
of that polling is user configurable.
There's also heavily
commented code around
adding additional
polling capabilities in there.
So again,
yes, question?
Does the test script always create
a new pool?
When you run the reset script, it does.
But you don't have to necessarily do that.
And, in fact, we try to resist doing that because we're trying to simulate production environments,
so we want certain aging aspects to come through.
A usage step here in terms of write XML, which produces the XML files for RGW tests.
Reset RGW, which is what recreates the pools.
And then a series of these run IO workloads where you go through the scenario of pre-filling, aging, and measuring the cluster performance.
So these are the three workload files that RGW test produces,
but again, you can pull in workloads that were created with the GenXML tooling as well.
I have two slides here that talk about the key variables relative to RGW test configuration.
Since it reaches out to the Ceph systems to identify system resource utilization,
it needs to know their system names, right?
So there's a Mon host name, an RGW host name.
It needs to know where CosBench has been installed because it issues the command to start the workload to CosBench itself.
And then there's authentication information in there,
endpoint information,
whether you're using three-way replication or some form of erasure coding.
And then there's, for Ceph purposes, there's PGNUM information that's applied to the pools
as they're created. And the second set of key variables here having to do with the workload definition,
things like object sizes, number of containers, number of objects, et cetera.
And then I spoke before about sizing up the cluster for identifying a performance threshold,
and that's where these worker counts come in.
There's support for individually specifying the number of workers
that you want to be used for the cluster fill
versus the actual measurement workload.
I just wanted a quick slide in here to let people know that this is extendable,
so while it supports certain workloads based on pre-existing templates,
you can define new workload specifications by either modifying existing templates or creating new ones.
So let's take a quick look at a log file.
I mentioned that RGW test squirrels away system resource information by polling periodically in a background process.
And by default, it collects those statistics on a one-minute interval.
And the types of things it grabs are listed here.
And all of the entries are time-stamped.
So as an example excerpt from a log file run,
this would be something where we'd be looking at the start
and the end, and specifically the information that was parsed out from the results of running
ceph df.
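The polling described here amounts to something like the following simplified sketch; it is not the actual poll script from the repo, and the file names are placeholders:

```python
import subprocess
import time
from datetime import datetime

POLL_INTERVAL_SEC = 60   # matches the default one-minute interval mentioned above

def poll_cluster(logfile="results/poll.log", iterations=10):
    """Periodically capture `ceph df` output with a timestamp (simplified sketch)."""
    with open(logfile, "a") as log:
        for _ in range(iterations):
            stamp = datetime.now().isoformat()
            df = subprocess.run(["ceph", "df"], capture_output=True, text=True).stdout
            log.write(f"--- {stamp} ---\n{df}\n")
            log.flush()
            time.sleep(POLL_INTERVAL_SEC)
```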
Now, this testing harness was used to do a performance evaluation a while back
on the transition from FileStore to BlueStore support within Ceph.
And this table here shows the performance advantage that was measured
using this workload approach for BlueStore.
So in addition to the COSBench results,
which are seen in the table up above,
the log file statistics provided a deeper level of insight
because they recorded things like load average
as well as CPU usage for the various Ceph processes.
Here's a view of what we call
the hybrid workload. So the previous one
was for cluster fill. This is for
the mixture of operations.
And again, you can see
the percent change
with BlueStore.
And again, the log
file statistics.
So let's see how I do on time here.
I'll put the spectacles on. All right, so what I've got running on my laptop is a containerized version of Ceph
with a fairly small footprint.
And what I thought I would do real quickly, just generate some workload files
and then run them through the scripting. So I've got a collection of, so this is the repo from GenXMLs,
and you can see the CB parser script there as well as genxml.sh.
So that then created these workload files that we see here, prefixed with test1.
So if we just wanted to take a look
at the fill workload. Again
all the script does is reach into the XML templates which
has a template file for each of these types of workloads,
and it fills it in with information that came out of the shell script.
So it's got the access key and the secret key
and all the authentication information in there.
And in this case, we're running on the laptop,
so we're keeping things pretty small.
We've got two workers specified,
and then using a total of five buckets,
each with 500 objects,
and at a constant object size for demo purposes.
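For readers who haven't seen CosBench workload files, a fill stage along those lines might look roughly like this; the credentials, endpoint, and 64 KB constant size are placeholders, and the syntax reflects the standard CosBench samples rather than the GenXMLs templates themselves:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<workload name="test1-fill" config="">
  <!-- credentials and endpoint are placeholders -->
  <storage type="s3"
           config="accesskey=ACCESS_KEY;secretkey=SECRET_KEY;endpoint=http://127.0.0.1:8080" />
  <workflow>
    <!-- create the five buckets/containers -->
    <workstage name="init">
      <work type="init" workers="1" config="cprefix=test1;containers=r(1,5)" />
    </workstage>
    <!-- fill: two workers writing 500 fixed-size objects into each container -->
    <workstage name="fill">
      <work type="prepare" workers="2"
            config="cprefix=test1;containers=r(1,5);objects=r(1,500);sizes=c(64)KB" />
    </workstage>
  </workflow>
</workload>
```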
So let me bring up...
This is the CosBench aspect of the demo that's running.
And for those who haven't seen it, there's a CLI way to submit the workloads,
but there's also a way to do that through this GUI, this controller GUI.
So if we press Submit here, this should take about a minute to run, this particular workload.
This controller GUI in CosBench does show you what's called a performance graph that just refreshed here.
And it shows that in terms of throughput as well as response time and bandwidth. All
of the information that CosBench is collecting here, it puts in an archive directory. And
so we can use that CB parser Python script to basically get a textual version of this
after the run is over.
Let's see.
I think I put it up here.
Yeah.
And this would be in the archive directory.
All right, so that job completed,
and if we go back to the index,
we'll see that listed in our inventory now as w1.
And we can see that, again, this Python script just goes through that
archive information and creates something that Cosbench has called the general report.
It shows information about average response time and processing time throughput.
And then we frequently look at not only at average response time,
but we also pay very close attention to things like 99th percentile response time
to make sure that there aren't dramatic spikes in some of the higher latency counts.
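That kind of percentile check is straightforward once the response times have been pulled out of the archive; here is a minimal sketch, assuming they are already parsed into a list of milliseconds:

```python
def latency_summary(response_times_ms):
    """Average and nearest-rank 99th percentile of a list of response times."""
    xs = sorted(response_times_ms)
    avg = sum(xs) / len(xs)
    p99 = xs[min(len(xs) - 1, int(0.99 * len(xs)))]
    return avg, p99

avg, p99 = latency_summary([12, 15, 14, 13, 220, 16, 18, 17, 15, 14])
print(f"avg={avg:.1f} ms  p99={p99} ms")   # a large avg-to-p99 gap flags latency spikes
```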
John?
Yeah.
The GUI that you showed, is it part of CosBench?
Correct.
Okay.
Yep, yep.
And if we go over to, right, so let me go.
So I have this little shell script that just basically shows some S3 command output
and then does a ceph df.
So we can see that the cluster fill did hit, you know,
use up some amount of the storage on the containerized Ceph.
If we take a look at RGW test, what we see here is that
there's several what are called results
directories; these are time-stamped.
The utils directory contains
the probe, or polling. So there's a script in there called
poll.sh, which is what grabs the various Ceph statistics. So we can go and run.
Let's run an empty.
So I'm going to reuse the workloads that were created in that GenXML's directory here.
And so this is the syntax that you can hand to run IO
workload. And as this runs, it'll do some monitoring in the background. So everything you see going
across the screen right now is going into a results file that's time stamped with the job ID
associated. And we can see here some of the information that's being captured by the polling.
It's got the percent raw used.
It's got information about memory footprint as well as processor activity going on for
the various Ceph processes.
And when this wraps up, it'll tell us the name of the log file that captured all the results here.
So we frequently do runs that may run for days in order to simulate aging,
and a lot of what we're looking for is, okay, let's take a fresh deployment.
Let's do what we consider a
measurement run to get a feel for what the performance is. Then we'll do some amount of
aging, and that may include an infrastructure failure. In addition to just running workloads
for an extended period, there may be other operations that are injected into the cluster.
And then we'll rerun a measurement run, which typically we run for about an hour or two,
looking for any significant deltas off of what we started with as an initial installation
to help identify expectations around degradation as the cluster ages.
And this analysis and work has, I think, as I mentioned earlier,
has led to a number of improvements, particularly in RGW garbage collection
and some of the efficiencies for some of the Ceph background services
that have to do with operational events.
So at this point, that completed.
You can see the name of the log file there listed.
And you can also see, if we go back up and refresh this Cosbench GUI,
we can see that there has now been a fill load run as well as an empty load.
And if we go in, the statistics are provided there within that GUI, but they could also
be seen using that CB parser utility as well.
So that's about it
questions?
the output files we were looking at
seem
human-readable.
Do you have something else that goes behind
and pulls them into data frames
or something interesting to make them
more easily analyzed?
Yeah, so we typically do things with Prometheus and Grafana.
So...
But that whole infrastructure
isn't integrated into what's on GitHub, right?
But it could certainly be adapted for that.
One other question.
When you went through the customer view of their data set and what they have,
I don't know if Cosbench can do it or not,
but there was no mention of the compressibility or dedupability of the data set.
Yeah, actually it's something I've started working with recently.
By default, CosBench supports two streams
of what I would consider pre-fill for the objects.
This is set with a flag called content=.
By default, content is set to what CosBench considers to be randomized.
But if you really look carefully at the code,
what you find out is that it's a repeatable randomized pattern.
And it's only generated from a small subset.
It's generated from the lowercase characters of A through Z.
So that's been extended by the storage provider Nexenta.
They have a GitHub repo with a version of CosBench
that does truly randomized input with non-repeatable chunks.
And so we've been starting to utilize that now
as we're taking closer looks at dedupe and compression.
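To see why the fill content matters, here is a small illustration (this is not the CosBench or Nexenta code, just a comparison of a repeating alphabet pattern against truly random bytes):

```python
import os
import string
import zlib

def repeating_pattern(n):
    """Data built from a small repeating a-z alphabet, similar in spirit to a repeatable fill."""
    alphabet = string.ascii_lowercase.encode()
    return (alphabet * (n // len(alphabet) + 1))[:n]

def compression_ratio(data):
    return len(zlib.compress(data)) / len(data)

n = 1024 * 1024
print("repeating a-z :", round(compression_ratio(repeating_pattern(n)), 3))  # compresses almost completely
print("os.urandom    :", round(compression_ratio(os.urandom(n)), 3))         # stays close to 1.0
```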
Yes?
That's a good segue to the concern I have.
So with CosBench, as you can tell, it seems like it hasn't been updated.
Yes, yeah.
I even tried to contact the distributor.
Yeah.
Is there an industry-wide kind of consensus on maybe having a common version that people start following?
Yeah, I wish there were.
A number of people have taken it in different directions, like I mentioned.
Unfortunately, there's no single place where all of this is being rolled up.
So it is stale. I've worked through the specifics
of how to rebuild it, and that takes a fair amount of heavy lifting. And it's something
that I plan to put out on GitHub, basically the steps that I was successful with. There
was one fellow who basically submitted a series of pull requests specific to the latest release, which is 0.4.2, which out of the box will not work, right?
And it will not build.
So he submitted a series of pull requests to get that to the point where you could build it.
And I verified certainly that is correct.
But you have to go to several different places to get the information in order to rebuild this.
So I put together a brief document, not so brief document,
that talks about using those pull requests as well as how to rebuild it, you know,
with success using the 0.4.2 code base.
So that's something I'll be putting up on GitHub as well.
And performance is a sensitive topic when it's published,
and if you can't even agree on the tool
that generated those numbers or whatever... Right.
Yep.
Any other questions?
All right, thank you.
Thanks for listening.
If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further with your peers in the storage developer community. For additional information about the Storage Developer Conference,
visit www.storagedeveloper.org.