Storage Developer Conference - #129: So, You Want to Build a Storage Performance Testing Lab?
Episode Date: July 13, 2020...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, episode 129.
So welcome to So You Want to Build a Storage Performance Test Lab.
I'm Nick Principe. I've been involved in storage performance my whole career.
How long now? I don't know. Since 2005, I guess.
Right now I'm at iXsystems, the makers of FreeNAS, and I'm the storage performance supervisor.
I've had a lot of fun over the past two years
building a performance organization from the ground up.
So there's a few lessons learned that I can share
from my past life and my current role.
My Twitter, GitHub, and email are up there.
Feel free to ping me anytime.
And I apologize for any of my code on GitHub.
I'm not a developer.
I just play one on TV.
So why would you want to build a performance test lab?
There's a lot of reasons.
If you're a company,
generally the motivation for spending that amount of money
is going to come from marketing purposes.
So that can be anything, you know, sizing can be marketing.
Generally, that's on the more helpful side of marketing.
It can be getting the 10 million IOPS number that, you know, we just got from the SPDK guys here.
It can be for development purposes, too.
And one of the most important things for me in the past two years has been testing new
technologies.
Things like NVDIMMs.
We hadn't really played with that.
In fact, one of our developers wrote a driver for FreeBSD for it.
So we had to test the functionality of it, but then we also wanted to see, okay, is it
really better than an SSD?
It ought to be. It's
memory versus a SAS SSD. And testing these new technologies is something where
generally you're going to be working with a vendor that's going to give you samples.
So you kind of have to do it in-house. So you have to have something in-house that can actually
push that new component to the limit. And if you're a storage vendor or someone that consumes storage
that wants to try out something like an NVDIMM or Optane memory,
you want to bring that in and you want to have a system
that's capable enough to actually push that component to its limit
and see if it will help you.
No matter what your reasons are for building a performance test lab,
you will be converting large amounts of money
into rapidly depreciating capital assets. And if it's for your home lab, you'll be spending
a lot on eBay probably, because that's where all the good stuff is and they can't limit
you to three hard drives per transaction. Another important message that I picked up early in my career,
thank you to Ken and his director at the time,
is you get what you measure.
And this applies to more than just performance.
It applies to pretty much everything.
My favorite example is an HR example.
And the best example I have
is how you measure a team.
So if your measurement of your IT team
is how many tickets they close per quarter,
week, year, whatever,
in the end you're going to get a lot of closed tickets.
But is that really the key performance indicator for how well the job
is getting done? Or, you know, do you need more people on the team? Do you need fewer people on
the team? If that's your only metric by which you're measuring the team, usually the team will
fulfill that metric to the best of their ability. Similarly, if you want to test a storage array,
like if I go to any storage vendor and say,
I want a storage array that can do 4K ops,
as many as possible,
and I only want to select the vendor that gets the biggest number.
Well, you're going to get a lot of really good cache hit 512 byte,
or no, I said 4K, sorry.
If you put a 4K limit on vendors, you'll get really good cache hit 4K reads, because that's the fastest thing that the arrays can do.
So yes, you'll get some differentiation, depending on the architectures, but that's probably not your workload. So it may be part of the metrics
that you want to evaluate the solution,
but if that's the only metric you go by,
you're probably shooting yourself in the foot.
So it's just a cautionary tale to encourage thought
about what are you really measuring
and are you getting a full picture?
So I would say in the realm of performance
testing, carefully consider
two things. Your
benchmark slash your
load generation tool and
the workloads you're using.
And also carefully consider
the configuration of your solution
under test.
And that includes both your physical infrastructure and your virtual infrastructure.
I'm going to focus on the second part of this talk, and like I said before the talk began,
I gave a talk at MeetBSD last year, and the link is here, it's on YouTube, and the PDF
of the slides is available.
So I'd encourage reviewing that if you're at all interested by this talk today.
So let's start off with virtual load generators.
Ever since virtualization became a thing,
I think that performance teams have struggled with,
can we use VMs for load generation?
Because they're so darn convenient.
This probably is true for containers as well, but I just haven't played with containers very much.
It's lightweight, it's easier to manage.
You don't have to re-image a physical host, you just, you have snapshots, you re-clone
stuff.
It's a lot better.
It's a lot easier to manage.
So, yes, we want good performance as a performance engineering team, but we also want to be lazy, because we're humans. That's what we do.
So,
can you use virtual load generators?
Yeah.
You can. Kinda. Maybe.
No. The answer kind of
depends. I think we're trending towards yes.
I know you can.
Depending on what you want to do with the results,
maybe you don't want to.
So for publicly disclosed results,
I don't think you can right now
if you have the same requirements I do.
For internal testing, yes, VMware works great.
But we did a round of testing with a lot of hypervisors.
We found deficiencies in all.
The big deficiency with any hypervisor appears, from my perspective and the way I do performance testing, to be the network stack.
So I found a lot of problems.
Now, I did limited testing, and I'm not an expert at using all of these.
So I may be wrong.
Things may have gotten fixed since I tested them,
because this was over the course of like six months last year.
So let me know if I'm wrong.
Especially chime in when I'm going over that slide,
so people on the recording will know that I was wrong.
I'm okay being wrong, because we can learn the correct answer.
But we probably won't go into a deep discussion
of how to configure stuff if that comes up,
just for time reasons.
So testing various hypervisors.
First, I started with bhyve on FreeBSD, given the heritage of FreeNAS. It made sense to start with a BSD solution. bhyve is actually very elegant. I quite like bhyve.
There are some limitations.
First limitation that is pretty obvious
is the bridge and tap virtual networking.
It's just too slow.
You get limited to about one gigabit per second max.
Now, a lot of people use containers on FreeBSD.
They use bridge and tap,
and it's great for their applications.
And a lot of applications are probably fine with 1 gigabit per second max.
I'm not.
I need somewhere north of 10 gigabits per second
and somewhere south of 25 gigabits per second to really be happy.
So, okay, virtual networking is my problem.
Fine.
Well, we live in the modern world.
Why don't you just use SRIOV
and make virtual functions on your NICs
and pass those through PCI Pass-through?
And I think that would actually work pretty well
for Linux and BSD guests.
But unfortunately for Windows right now,
there's an issue with the base address registers
in Windows, the way it probes them.
So that doesn't really work.
I worked with some BSD people.
We sort of got a hack that worked, but the interface wouldn't link up.
So more work is needed there.
I know work is ongoing, especially for GPU pass-through with bhyve.
So I think there's promise there.
If you only wanted FreeBSD or Linux VMs, probably this is a wonderful solution.
I haven't tested that because I found Windows was my limiting factor, and I need Windows
VMs because I want to test SMB with the Windows SMB client, because most people are going
to be doing that. I also haven't tried epair, which is another way to do a tap, but I suspect I would run into the same limit
with the bridge interface. But I haven't tried that one yet. So moving on, next is KVM. KVM
is obviously very popular, especially on Linux. It's a little less popular on FreeBSD. So
I just, I spun up Debian. This looked very promising to me,
going through the virtual network switches.
I was getting 10 gigabits per second bidirectionally.
I was very happy.
But when I added a second VM on the same switch,
one of the VMs would win the battle and get basically 10 gigabits per second. The other one would get kilobits per second.
So things were being completely unfair.
Maybe there's one quick setting to fix this.
I don't know.
I didn't have the luxury of time at this point.
So I said, eh, not going to work right now.
The nice thing is SRIOV probably would work fine.
And I believe, from reviewing the wonderful interwebs, that SRIOV into Windows guests would work fine on Linux, because I think they actually made a patch for the same problem that bhyve is having right now.
I didn't keep investigating this at this time.
So, KVM on Debian?
Maybe.
Yeah.
I'd like to understand, before we get further into this: what are you actually trying to do with this testing?
Right, so the question is, what am I actually trying to do here with this testing? I probably got a bit ahead of myself. So what I'm trying to do is test
just a storage appliance with,
what was I trying to test at the time?
I think iSCSI LUNs.
So I was just trying to see,
okay, what can I do if I put a VM on a physical host
instead of using the physical host bare metal?
What can I get for performance from that host?
Because in an ideal world,
the VM on top of the host
should be able to push the same amount of traffic with a benchmark
as the bare metal host.
And so I want to assess the difference between that
to see if I can use virtual load generators
instead of physical load generators for ease of management.
Is that why it's high-speed? Mm-hmm. But fundamentally, the problem that I ran into is networking.
So you can just do network tests and see what the results are.
Then your target doesn't matter as much.
Yeah, and I did test all this with just iPerf as well,
because I don't want to run into a limit
where the iSCSI initiator or the NFS client on the OS
was a limit.
I wanted to make sure I knew that the networking
was my first problem, so I wanted to focus on that.
And that turned out to be the key differentiator, really: whether the networking performance was adequate.
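As an aside, if you want to script that kind of network pre-check, here's a minimal sketch in Python. It assumes iperf3 is installed on both ends and an iperf3 server is already running on the target; the hostnames and the 10 gigabit threshold are made-up examples, not the exact procedure from the talk.

```python
#!/usr/bin/env python3
"""Rough sketch: verify raw network throughput from each load generator
before blaming the storage target. Assumes iperf3 is installed on both
ends and an iperf3 server (iperf3 -s) is already running on the target.
Hostnames and the 10 Gbit/s threshold are made-up examples."""
import json
import subprocess

LOAD_GENERATORS = ["loadgen01", "loadgen02"]   # hypothetical hostnames
TARGET = "filer01"                             # hypothetical iperf3 server
MIN_GBITS = 10.0                               # pass/fail threshold

def iperf_gbits(client: str, server: str) -> float:
    """Run iperf3 from `client` to `server` (via ssh) and return Gbit/s."""
    cmd = ["ssh", client, "iperf3", "-c", server, "-P", "4", "-t", "10", "-J"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    result = json.loads(out.stdout)
    bps = result["end"]["sum_received"]["bits_per_second"]
    return bps / 1e9

if __name__ == "__main__":
    for lg in LOAD_GENERATORS:
        gbits = iperf_gbits(lg, TARGET)
        status = "OK" if gbits >= MIN_GBITS else "SUSPECT"
        print(f"{lg} -> {TARGET}: {gbits:.1f} Gbit/s [{status}]")
```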
So what's our next hypervisor on the roulette?
Xen! In this case, actually, XCP-ng. It's
appealing because it's easy, like
ESXi. In the end, probably
KVM is easier, once you learn it, because
it's easier to script and all that.
XCP-ng may be great too.
I just, I don't know that much about it.
In this case, the virtual switch networking was too slow.
I couldn't saturate a 10 gig link.
And also the SRIOV support at the time was not acceptable.
I don't know what it is now.
There were only a few like Intel 10 gig adapters
listed as supported.
And I couldn't seem to get SRIOV to turn on for the network drivers we were using.
So if more cards are added,
then this may be a very promising solution as well.
It was just when I was testing it,
it didn't work out for me.
ESXi is the default answer
for a lot of virtualization needs.
The virtual network performance is actually very good.
I didn't find a need for SRIOV.
Obviously, it supports it.
But being able to do this, do a lot of network traffic without SRIOV is nice
because not all hosts can support SRIOV,
especially if you're at home or you're strapped for cash
and you're buying older servers, they might not support SRIOV.
So it's nice to have the flexibility to just use basic virtual networking. I know that with older ESXi, 5.5 to 6.0, the VMs had very similar network performance to physical hosts, you know, when you set everything up properly and avoid oversubscription. We'll get into that a little
bit later. With 6.5, I did some more testing,
and I noticed a few more degradations where the virtual was a little bit slower
or the response time was a little bit higher.
So I'm not sure if that's still true.
I intend to revisit that at some point,
but I haven't had a chance.
The main issue I have with VMware is the EULA.
I am not a lawyer, so this is not legal advice,
and this is not a legal interpretation of what the EULA says.
But if you read the EULA, you'll come to this, a link to this document.
Actually, this is the link to the document that is referenced here in the EULA. So basically, depending on which VMware product you're using,
any results you disclose to a third party have to be vetted by VMware.
So I don't know if a lot of you have read the EULA for VMware,
but it's something to be aware of.
Unfortunately, knowledge is irrevocable, so you know this now.
But I don't know how easy or hard this process is because I haven't tried it
so I'm not knocking VMware
I understand why they have this clause in here
but I also, I'm lazy
and I don't want to submit every study I do
to another vendor for approval
before I would put it on a marketing blog
or something like that
so this is why I sort of stay away from it for now, in my case.
So I use VMware for a lot of testing, but not for anything I want to publicly disclose. And for me, almost everything I do, I want to publicly disclose or at least disclose to a third party, like a customer. So that's the wrap-up for the hypervisors.
Let's talk about the best practices
for configuring virtualized load generators.
This isn't an exhaustive list,
but it's a good start to get you thinking
in the right direction
about keeping yourself out of danger
and, as I say at the bottom,
away from inconsistent performance.
Basically, what you're doing is you're avoiding oversubscription.
You don't want to oversubscribe your resources if you're doing performance testing
because the goal of performance testing is to push things to the limit.
So oversubscription counts on you not using everything to the limit.
So it's sort of weird, because the whole point of virtualization was to use oversubscription to your advantage, to run more servers on one physical server than you otherwise could, because that server was more capable than any one application needed. Regardless, if you're going to use VMs for load generation, disable hyperthreading. Performance people were disabling hyperthreading before all the security guys made it cool.
A thread is good, but a thread is not a real core.
So just turn it off.
Don't mess with it.
The total number of powered-on vCPUs
that you're using for your testing
needs to be less than or equal
to the number of real CPU cores,
not threads,
in each host.
Pretty clear.
You're just not oversubscribing.
And this is mostly geared towards ESXi.
On ESXi, yes, the hypervisor's gonna use some,
but it's pretty low, so the inconsistency is low.
But you also can't pin your hypervisor
to a single core, I don't think.
So what's the point?
It's going to get you anyway.
Virtual memory, of course,
the total powered on allocated virtual memory
needs to be less than the total physical memory
in the system.
So leave a buffer for the hypervisor.
Yes, VMware will dedupe memory and all that, but just pretend it won't and avoid the headache: if you run into the case where you do use all the memory for caching, you don't want to get to the point where you're swapping, because that will ruin your performance.
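To make those two oversubscription rules concrete, here's a tiny sketch in Python. The host specs, VM sizes, and the 8 GB hypervisor reserve are made-up examples, not figures from the talk.

```python
# Minimal sketch of the "don't oversubscribe" rules from this section.
# All numbers are illustrative; plug in your own host and VM inventory.

def check_host(physical_cores: int, physical_mem_gb: int,
               vm_vcpus: list[int], vm_mem_gb: list[int],
               hypervisor_mem_gb: int = 8) -> None:
    """Warn if powered-on vCPUs exceed real cores (not threads) or if
    allocated guest memory leaves no headroom for the hypervisor."""
    total_vcpus = sum(vm_vcpus)
    total_mem = sum(vm_mem_gb)
    if total_vcpus > physical_cores:
        print(f"WARNING: {total_vcpus} vCPUs > {physical_cores} physical cores")
    if total_mem > physical_mem_gb - hypervisor_mem_gb:
        print(f"WARNING: {total_mem} GB guest RAM leaves less than "
              f"{hypervisor_mem_gb} GB for the hypervisor on a {physical_mem_gb} GB host")

# Example: a 16-core, 128 GB host running four 4-vCPU / 16 GB VMs passes;
# add a fifth VM and the vCPU rule trips.
check_host(16, 128, vm_vcpus=[4, 4, 4, 4], vm_mem_gb=[16, 16, 16, 16])
```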
Another thing on the network side is really,
you don't have to, but I would use one uplink port
on each of your vSwitches
or equivalent in your hypervisor.
Yes, you can do a LAG with LACP,
but then you're counting on the hashing algorithm
to fairly distribute.
So if you want to guarantee everything's
not oversubscribed,
everything's going to distribute properly,
do it by hand.
Make multiple vSwitches and round robin your VMs across them.
It's not that hard, especially with PowerShell now,
so that's really what I would advise you do.
Then you're not counting on LACP to do the right thing.
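For illustration, the round-robin placement itself is trivial to script. The speaker mentions PowerShell for the actual vSwitch work; this Python sketch only computes the VM-to-vSwitch mapping, with hypothetical names, and leaves the hypervisor calls to whatever tooling you use.

```python
# Sketch of the "round robin your VMs across multiple vSwitches" idea.
# This only computes the mapping; actually attaching the virtual NICs is
# left to your hypervisor's tooling (e.g. PowerShell on ESXi, per the talk).

def round_robin(vms: list[str], vswitches: list[str]) -> dict[str, str]:
    """Assign each VM to a vSwitch in turn so no single uplink is oversubscribed."""
    return {vm: vswitches[i % len(vswitches)] for i, vm in enumerate(vms)}

vms = [f"loadgen{i:02d}" for i in range(1, 13)]              # hypothetical VM names
vswitches = ["vSwitch1", "vSwitch2", "vSwitch3", "vSwitch4"]  # one uplink each
for vm, vsw in round_robin(vms, vswitches).items():
    print(f"{vm} -> {vsw}")
```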
And another point, if you're using iSCSI,
just use the software initiator in your guest OS,
especially on Linux.
It's perfectly fine, overhead's very low,
and it's less of a hassle to set up
than like RDMs in VMware
or the equivalent in your hypervisor.
If you're testing Fibre Channel, you don't really have a choice, unless maybe you use SRIOV. You could do that, but I haven't tested it, so I don't want to speak to that.
But basically the key message is avoid oversubscription
because inconsistent performance is worse than bad performance.
Because if it's inconsistent, every time you test it,
you might get a different answer.
If it's just bad, you can test it
and then try to figure out why it's bad.
So that's the whole point here. You really want to be as consistent as you can.
So if you move on to general guidance, whether you're using virtual load generators or physical.
Yeah, question.
So how about just using the processor bare metal and running multiple threads, instead of using VMs? Even if we are not multiplying the resources of the whole system...
Right.
So the question is, why not just use bare metal
and run multiple threads in your load generator?
And that's a very fair point.
I actually, I'm still using bare metal just because it's convenient.
But the one catch there is that operating systems are a lot better than they were, but they're still not perfect. So you can still run into a case where if you have multiple VMs on one host,
you can get better performance
than if you have just a bare metal OS instance on that host
because you can get a hot lock in the OS
or maybe the SMB client.
Maybe it can't scale on a single host
because, again, probably a hot lock or something like that.
So it can be advantageous,
especially if you have a very large load generator.
You may really want to run VMs on it, because you're going to get better performance. So that's something that can catch you out. It's a really good question.
So in general for the network,
this seems obvious,
but the total network bandwidth of all your load generators, it needs to exceed that of the system that you're testing.
I mean, you have to have, it all comes down to where you're putting the bottleneck, and we'll go through this later.
But you have to make sure that the bottleneck is where you want it to be.
Because otherwise you're going to get,
you could either get inconsistent performance or you could go back to just getting bad performance.
Then you have to figure out why.
Avoid LAGs on your load generators, just like with the virtual switches. If you have multiple network ports on a load generator, you're going to have a bad time. That's another reason VMs are great, because then you don't have to deal with routing or setting up multiple subnets or any of that. Otherwise, on a typical default setup, if you have multiple ports on a single OS and they're on the same subnet, your traffic will not be symmetric in both directions. So avoid LAGs on the load generators, but really also avoid having multiple ports if you can, at least for the ports through which you're testing performance.
In your test environment, avoid switch hops.
Keep everything on the same switch.
If you can't do that, minimize switch hops
and ensure you have sufficient bandwidth
between your switches
so you don't run into a bottleneck
going from the switch that your load generators are on
to the switch that your system under test is on.
What else did I want to say about switches?
Yeah, well, okay.
So what I wanted to say was that
networking is probably the most common performance
or most common factor that induces a performance problem in customer sites.
I think most storage vendors will agree.
So don't hang yourself on the network hook in your own testing.
Minimize that so you know what the maximum is.
Because most people are going to run through multiple switches.
But it doesn't really matter because it's going out to thousands of people on a campus or whatever. So each person only expects a
small slice of the performance. But for what you're telling people your system can do,
you want to say, this is the big number that I can get without network limitations.
And of course, MTU can catch you out very easily. If you're going to increase your MTU to 9,000,
which generally helps with CPU quite a bit,
I mean, less so with all the NIC offloads,
but I still find that it helps.
If you're going to change your MTU, make sure it is consistent across all your virtual switches, all of your physical switches, all of your operating systems, all of your filers, everything.
The fun part is with virtual switches on VMware,
at least back in the day, you could leave the vSwitch MTU at 1500, then ping another VM on that same vSwitch with don't-fragment set and a packet size higher than 1500, and it would pass an 8K packet just fine, even though don't-fragment was set.
As soon as something goes into or out of that vSwitch
onto a real switch, then you'd run into a problem.
So you have to be very careful with MTU
if you're changing it from the default.
I've been caught out myself multiple times.
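If you want to automate that check, here's a rough sketch using the Linux iputils ping flags for don't-fragment. The endpoints are hypothetical, and remember the caveat above: test across a real switch, not just between two VMs on the same vSwitch.

```python
# Quick MTU sanity check, in the spirit of the ping trick described above.
# Uses Linux iputils ping flags (-M do = don't fragment, -s = payload size).
# 8972 = 9000-byte MTU minus 20 bytes of IP header and 8 bytes of ICMP header.
import subprocess

HOSTS = ["loadgen01", "filer01", "10.0.0.50"]   # hypothetical endpoints
PAYLOAD = 8972

for host in HOSTS:
    cmd = ["ping", "-M", "do", "-c", "1", "-s", str(PAYLOAD), host]
    ok = subprocess.run(cmd, capture_output=True).returncode == 0
    print(f"{host}: {'jumbo frames OK' if ok else 'MTU mismatch somewhere in the path'}")
```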
For memory, typically this
matters if you're doing VMs, right? You can set whatever memory you want. On Windows,
like 8 to 16 gigs generally is what I would recommend. On Linux, there's no real reason
to go above 4 unless there is. So like I have here, more memory may or may not help you.
It sort of depends on what you're doing.
If you're using workloads that are doing direct IO,
so most of your block testing
is going to use the O_DIRECT flag or the equivalent on your operating system, or if you're doing basic file testing with stuff like fio. Now, by default, fio does not set direct=1. For most simple file storage testing you will want to set O_DIRECT, unless you have a specific reason not to; I'm not going to go over that here.
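As an example of what that looks like in practice, here's a hypothetical fio invocation with direct=1 set, wrapped in Python. The job parameters and the /mnt/test mount point are illustrative, not settings from the talk.

```python
# A minimal, hypothetical fio job with O_DIRECT enabled, as discussed above.
# Assumes fio is installed and /mnt/test is a mount from the system under test.
import subprocess

cmd = [
    "fio",
    "--name=seqread",
    "--directory=/mnt/test",     # hypothetical mount point on the filer
    "--rw=read",
    "--bs=1M",
    "--size=10G",
    "--numjobs=4",
    "--direct=1",                # bypass the client page cache (O_DIRECT)
    "--ioengine=libaio",
    "--time_based", "--runtime=60",
    "--group_reporting",
]
subprocess.run(cmd, check=True)
```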
With more complex file testing, especially with stuff like the SPEC SFS software build or EDA workloads, those do not use O_DIRECT and there's a lot of metadata. So metadata caching tends to simplify the workload presented to the storage system, and Ken is actually going to talk about that in his talk, or our talk, about the AI image processing and genomics workloads in SPEC.
So that's on Thursday.
So come to that if you want to see that.
He actually has a chart about that
that I think is in the final version.
But those workloads, the metadata's gonna be cached.
You can't really stop that from happening.
So that's gonna take memory and processing power
on your load generator.
So you just have to be aware of that.
Also, delegations in NFS v4 and SMB leases,
depending on how the data's structured
and the capabilities of the system you're testing,
it could get a lease that just says,
do whatever you want,
and the client could write a significant amount of data
without ever sending that back to the filer.
There's policies for when it would.
If it did a flush, I know it would write back.
Again, I haven't played with those very much. But there is the capability in newer versions of the protocol to actually write data
to a network share that does not immediately go back to that filer. So you have to be aware of
that. You probably don't want that if you're trying to test the performance of the filer at the end.
But if your customer is going to do that,
and that's perfectly valid,
you may actually want to test the performance of that.
But that's probably less likely.
So you have to be aware of that.
So more memory may or may not be good,
depending on what you're trying to do.
And the better performance that comes with it
may or may not be what you want to see in your test.
So, no one ever said performance testing was easy; that's why people like me and Ken and Ryan have jobs.
Question?
VVOLs?
So the question is,
do I have any opinion on VVOL testing?
I don't.
I haven't played with it very much.
I suspect, well, you'd have to be using VMs.
I suspect the best thing to do there would be to attach a separate volume
from the boot volume to your VMs and test that directly.
I believe that is possible with vVols, but I'm by no means an expert.
I'm very interested in them because I think they're cool,
but I haven't played with them.
You were saying that on a Linux host, all it needs is one gigabyte of memory?
If you're setting up a storage performance testing lab and you're running synthetic benchmarks with load generation tools, yes.
Would you ever do it in your data center running a real app?
Probably not anymore, right?
Yeah.
Yeah.
Just the minimum.
More or less may or may not help you.
But if you're going to run mostly like FIO or VDBench,
and you're generating load and you can use VMs for it,
then you shouldn't need more than that. Alright, last best practices slide
for interconnects.
You know,
I don't know that this one is too helpful
because I say, oh,
avoid unintended bottlenecks.
Great, well that's easy to say, right?
I'll show you a little bit more about that.
What I would recommend,
and this comes in waves of being important and not,
is check your servers,
check the block diagrams for your servers
if you can get them from your vendors
for PCIe switches.
And also check all your PCIe ports
because a lot of them are x16 physically, but they're electrically less. Usually it's silkscreened on the board; I think all vendors do that now. So, you know, don't put an x16 card in an x8 slot, because you're instantly going to limit your performance there. Same for an x8 card in an x4 slot. It matters; if you have a PCIe 4.0 board and card, then it's going to matter a lot less.
But it's something to watch out for. Just make sure you have enough bandwidth
throughout your whole solution.
The other thing is to check your maximum
effective data rates for all the interconnects
in your data path.
So all the way from the disks,
all the way to the load generator.
Make sure you don't have any unintended bottlenecks there.
And we're gonna go through a few scenarios with that.
The other thing to check are data sheets
for the various controllers, like your NICs,
and probably more often your HBAs.
Because most HBAs have way more capability
on the front end and back end of them
than the controller can actually handle.
So that's another bottleneck to be aware of.
So here's a logical but strange-looking picture of a bottleneck analysis in a solution under test for performance.
So your first thought is probably,
storage solutions make strangely-shaped bottles. And I would agree.
But you could probably get a lot of money for that at a flea market.
So the most amount of bandwidth and performance capability you have
should really be your load generator memory and CPU.
Because you have a lot of load generators,
in aggregate they have a lot of memory bandwidth.
And the CPUs are fast and match the memory bandwidth.
In this solution, we have the least amount of bandwidth with our hard drives.
But you can see there's a lot of bumps along the way.
So first of all, we met one of the rules, but barely, by my drawing.
The load generator network is just barely more than the network on the filer.
So that's good.
We're not limited there.
But you can see, in the end, this solution, this fictional solution,
it's designed to be bottlenecked on the performance of its hard drives,
and you can see in this case it is.
But you really have to check all the components,
going from all the way from your disks at the end
all the way up here to where that load is being generated.
And make sure you don't have any bottlenecks
where you don't intend them.
Maximum effective data rates.
I have no idea if this is a standard term or not,
but I made it up and I like it, so sorry if it doesn't match the meaning that you expect. I consider this the rate of effective data transmission over an interconnect. So it's lower than the raw signaling rate; depending on which interconnect we're talking about, that number may match or it may not.
It might be Ethernet.
It might be SAS.
So I'm going to make fun of SAS, because it's 8b/10b, so there's a very high encoding overhead there. So, oh, yay, 6 gigabit SAS, but you're not going to get 6 gigabits through that. There's also other overheads, protocol overheads.
I don't count those because the percent overhead that that's going to induce is variable depending on the payload size.
And I'm just going to assume, if I'm pushing performance, generally the payload size will be large.
Not always, but I'm just going to assume the payload size will be large, so the protocol overhead will be a percent or two.
So we'll just call that even.
It's just variable.
I can't put it on a table.
I can't use it for a rule of thumb guess.
But it is there.
So I've only tried to exclude the physical encoding overheads where I could, where I could figure out what they are.
Thank you, Wikipedia.
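As a sketch of how those encoding overheads turn into a maximum effective data rate, here's some back-of-the-envelope Python. The encoding efficiencies are standard figures (8b/10b for SAS, 128b/130b for PCIe 3.0, 64b/66b for 10/40/100 GbE); real-world numbers, like the roughly 7,500 MB/s used later in this talk for PCIe 3.0 x8, come out a bit lower once packet and protocol overheads are included.

```python
# Back-of-the-envelope "maximum effective data rate" helper, using only the
# physical encoding overhead (protocol overhead is ignored, as discussed).

ENCODING_EFFICIENCY = {
    "8b/10b": 8 / 10,       # SAS, older PCIe generations
    "64b/66b": 64 / 66,     # 10/40/100 GbE
    "128b/130b": 128 / 130, # PCIe 3.0 and later
}

def effective_mb_per_s(raw_gbit_per_lane: float, lanes: int, encoding: str) -> float:
    """Raw signaling rate per lane (Gbit/s) -> effective MB/s across all lanes."""
    gbit = raw_gbit_per_lane * lanes * ENCODING_EFFICIENCY[encoding]
    return gbit * 1000 / 8          # Gbit/s -> MB/s (decimal megabytes)

# Examples: 6 Gb SAS x4 and PCIe 3.0 x8
print(f"6Gb SAS x4 : {effective_mb_per_s(6, 4, '8b/10b'):,.0f} MB/s")      # ~2,400
print(f"PCIe 3.0 x8: {effective_mb_per_s(8, 8, '128b/130b'):,.0f} MB/s")   # ~7,880
```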
We have a table that I try to keep updated on the iXsystems blog with the maximum effective data rates. We have more than just SAS and PCIe, but this is all I can fit on the slide. I've listed it in both megabytes and mebibytes, depending on your persuasion, whichever one you like; some people prefer one and some the other, so fair enough.
I also have gigabits because a lot of times for,
if you're thinking about media,
you generally think in bit rates.
So sometimes it's helpful to know,
okay, how many 4K streams can I stream
over a 40 gig ethernet connection?
So these are just helpful rules of thumb.
Again, it's all on the website.
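For example, the 4K-streams rule of thumb is just division. The 4,700 MB/s single-port 40 GbE figure is the one used later in this talk; the 25 Mbit/s per-stream bitrate is purely an assumed number for a compressed 4K stream, not something from the talk.

```python
# Rule-of-thumb example: how many video streams fit through one 40 GbE port?
effective_40gbe_mb_s = 4700                         # effective MB/s for one 40 GbE port
effective_40gbe_mbit_s = effective_40gbe_mb_s * 8   # ~37,600 Mbit/s
stream_mbit_s = 25                                  # assumed bitrate of one compressed 4K stream
print(f"~{effective_40gbe_mbit_s // stream_mbit_s} streams per 40 GbE port")
```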
SAS is fun because of that, as you can see. I don't really have the example math here, and I'm not going to try to do it live. The other cool thing is the NVMe M.2 interfaces, because sometimes they're x2 and sometimes they're x4. That can catch you out. But you can use these as you walk through; remember that weird bottle shape. You can analyze, at every step in your data path, what the maximum speed you can get is. So let's look at a generic storage server. I'm going to assume that the CPU and RAM are fast enough; in my diagram, that was the biggest part. So we're going to ignore those for now. But let's walk through this storage solution. We have a SAS HBA right here, connected to the CPU via PCIe 3.0 x8. So that's the maximum bandwidth we're going to get in a single direction out of a PCIe 3.0 x8 slot.
Now, the next step, after we get from the CPU down here,
is our actual controller chip.
Well, the data sheet says that can only do 5700 megabytes per second.
Okay, so we've already moved down in terms of the maximum performance we can get.
Now, in this case, we have a JBOD here with 24 SAS HDDs, 7.2K RPM.
So each of those drives, according to its specs, you can do 192 megabytes per second.
That adds up to 4,600, an even lower number. So we're riding the wave down. We're trying to find that bottleneck. That JBOD is only connected to this SAS HBA with a single x4 SAS cable. So our theoretical
maximum there, maximum effective, is about 4,500. So we've hit a new low in terms of performance.
So theoretically, that's our bottleneck. But what about the front end? Right?
This is a storage server.
So it's not generating any data.
The data's coming from outside.
So we have to look at the front end.
So the front end's going to come from our NIC.
Our NIC, again, is connected at PCIe 3.0 x8, so 7,500 megabytes per second is possible there. And we're only using one of the 40 gig NIC ports, so at this point, we can exclude the NIC controller.
In this case, we know that that single port
can do about 4,700 megabytes per second
because it's 40 gig Ethernet.
So again, theoretically, this is our bottleneck,
this SAS connection right here.
That has the lowest maximum effective data rate
in this solution.
Right?
So if I sell you this, I'll tell you,
hey, you can do 4,500 megabytes per second
when you're writing to it.
You're good, right?
No!
We're writing to a pool of mirrors.
So all the incoming data
will have to be duplicated
before it's written to the drives.
So, let's go through that.
What is our slowest component
in the path to the drives?
That's that SAS port right there.
This is the maximum amount of data
that we can send to the drives.
But we know we have to send twice that amount of data. So, divide it by two: 2,288 megabytes per second.
Now, if we do the math, the hard drives are fine, because even though we're still writing twice that amount to the disks, the hard drives are capable of a bit more than that. The HBA is fine. The PCIe to the HBA is fine.
What about the front end? Well, now we can only accept a maximum of 2288 megabytes per second,
and our 40 gig NIC and its connection to the PCIe bus are more than adequate for that.
So it's not just your physical components that limit you.
It's also logically what's happening.
Make sense?
All right.
Let the record show there were some nods.
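Here's that same walk-through expressed as a sketch in Python: take the minimum over every hop, then apply the logical write penalty of the mirrors. The numbers are the approximate figures from this example, with the x4 SAS link written as 4,576 MB/s so the halved value matches the 2,288 quoted above.

```python
# A sketch of the walk-through above: take the min() over every hop in the
# data path, then apply the logical write penalty of a pool of mirrors.
# Numbers are the approximate figures used in the talk's example.

back_end = {
    "PCIe 3.0 x8 to HBA":       7500,       # MB/s
    "HBA controller datasheet": 5700,
    "24 x 7.2K SAS HDDs":       24 * 192,   # ~4,600
    "single x4 SAS cable":      4576,       # "about 4,500" in the talk
}
front_end = {
    "PCIe 3.0 x8 to NIC":       7500,
    "one 40 GbE port":          4700,
}

raw_back_end = min(back_end.values())
mirror_write_penalty = 2                    # every write lands on two drives
usable_write = raw_back_end / mirror_write_penalty
print(f"Back-end physical bottleneck : {raw_back_end} MB/s "
      f"({min(back_end, key=back_end.get)})")
print(f"Front-end physical bottleneck: {min(front_end.values())} MB/s")
print(f"Effective write ingest       : {usable_write:.0f} MB/s "
      f"after the mirror write penalty")
```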
So I recently went on a search for the perfect load generation hardware.
Perfect. No problems whatsoever.
Think I succeeded?
You're right. I failed.
But I think I did pretty good.
So, let me walk you through this.
First of all, does it matter what hardware you use for your load generators?
To some extent, yes. As always with performance, it depends.
It matters a lot less for synthetic testing. So this is getting back to your question, right?
When I'm running my synthetic benchmarks, those pieces of software are designed to push
a lot of load with very little impact to the system
that's generating the load.
So for synthetic testing, it doesn't matter that much.
It still matters, but not as much.
Except for single thread testing,
that's where if you need one thread to go faster,
your CPU clock's gonna matter.
So there are some exceptions.
For real application testing,
obviously you're
going to need more performance there. Let's talk about synthetic first a little bit more.
So like I said, synthetic benchmark load generation tools are designed to generate load.
And a lot of it without using a lot of resources on the host. So that's why it matters less with
synthetic workloads.
But again, like I said before,
some of the workloads actually do involve the OS more than others.
If there's more metadata, if you're not using ODirect,
then the client, the VFS layers in the client,
the OS in the client is going to be more involved in that data flow to the server or from the server.
So if you have very metadata-heavy workloads,
you're going to have to have enough oomph
on your load generators to absorb that.
And you have to keep in mind all the previous stuff
about making sure there's no bottlenecks in your data path,
physical bottlenecks in your data path as well.
For real applications, they're very annoying.
They actually want to do real work.
I don't know why.
And we can't tell them,
just don't do that stupid real work stuff
and generate the load that you would
if you were doing real work.
It's almost like they're designed to achieve something.
So when I'm running my video editor,
it actually wants to decompress the video.
I just want it to read the video from my storage array.
I don't care what it does with it,
but it actually has to decode it and display it.
Fair enough.
So in this case, there's gonna be compute time.
The resources matter, your CPU matters, your GPU matters,
the amount of memory you have matters,
the speed of the memory matters.
So we have a separate lab with a lot fewer machines in it,
but they're much higher spec.
So for stuff like synthetic single threaded testing,
that's where we do that testing,
because we have high clock rates.
When we have to do a real application testing,
run DaVinci Resolve or something like that,
well, then we fire up the iMac Pro
or the high-power Windows workstation
with Xeons in it.
But it beats having to buy 12 iMac Pros to cover all the cases, right?
12 iMac Pros is a lot.
So I began my search for new load generators.
I wanted new load generators just for at-scale synthetic testing.
So typically I wanted about 12 of them.
That's typically what one of our testbeds runs.
I wanted to balance cost, obviously,
convenience, and compatibility.
I want to be able to test whatever I want to test in the future.
And I want the same performance or better than my current ones,
which are running E3 1270 V5s,
which are, you know, decent little E3 chips,
pretty high clock rate.
I wanted to be able to use 25 gigabit Ethernet because I have 100 gig switches.
Thank you.
And you get much better switch utilization per channel
if you can run them at 25 gig.
Obviously, I want the performance too.
And I also wanted out-of-band management and IPMI
because rebooting the systems is a lot easier
when you can serial into them.
Here are my candidates.
So we started with, this is our baseline system
with the E3s in them, obviously, 3.6 gigahertz,
turbo up to four, that's pretty cool.
I tested an older Xeon D, two different atoms,
an eight core atom and a four core atom,
both at 2.2 gigahertz.
A Ryzen 5 that actually was standing in
for the intended Ryzen 3 2200. You can see
the Ryzen obviously is great on the clock rate, but in all cases I just wanted 16 gigs of memory.
Again, I just found I didn't need more than that, and a 16 gig DIMM is about the lowest I can buy. So I ran some iSCSI tests.
I ran some SMB tests.
In all cases, it was an older all-flash FreeNAS,
and I just configured it to maximize performance.
So I did horrible things that I would never tell a customer to do to make it go faster.
Because I wanted the bottleneck, unlike all my other testing,
I wanted the bottleneck at the load generator.
I wanted to shift it up.
I wanted to see how fast this load generator can do and which one of these I should buy so that then my bottleneck will
firmly remain at the storage system for all my future testing. So I used a small active
data set size. All the reads are going to be cached on the free NAS just by nature of
how it works. For iSCSI, I ran CentOS 7.
So what did we see in the tests?
This is all above zero.
It's better than the E3.
Below, it's worse than the E3, the existing load generators.
We can see Ryzen's not looking bad for both reads and a 50% read mix, but for writes, we're, what, 15% down.
For some reason, the atoms were quite good at reading,
but they were very bad at a mixed workload,
but they were okay for writing.
It's weird, but it is consistent,
because I ran the test multiple times.
Very strange, didn't deep dive into it.
Oh, and sorry.
This was for a sequential 1 meg workload.
Now we're going to go to 32K random I/O.
Again, over iSCSI.
We can see, again, the atoms sort of fall apart here.
They're not too bad at the reads,
but when writes are involved,
I'm seeing significantly less.
The Xeon D was holding its own.
I would prefer plus or minus 5%,
but within 10% is still okay,
especially when the mixes seem quite worse on the atoms.
On the Ryzen, everything was fine.
You'll see that the Ryzen 3 results
generally are pretty low with the writes.
I believe that's my SSDs dying
as I'm running these tests more and more.
So I think that's actually a test artifact,
so we should probably ignore that,
because that doesn't repeat
if I run it on a different system.
Otherwise the results are fairly consistent.
For random 4K, things are looking very positive
for the Ryzen system.
The Xeon D looks quite good.
I mean, minus seven for reads,
but it's pretty close to five.
And the atoms, well, they're a little slow on the reads,
and it seems like that's bleeding into the read mix as well.
So I wasn't really that happy with any of those results. It's kind of a mixed
bag, right? Sort of no clear winner. I hate that. So I switched to SMB. Maybe Windows
10 would tell me something different, give me a good reason to go with one of them over
another or at least eliminate some of them. So I switched over to Windows 10. And let's start with the sequential
1 meg I/O size testing. It's a different set of tests; I didn't run all the tests again, I just tested the four-core Atom. And that told me that, yeah, you probably don't want to do sequential reads on the Atom. It's actually pretty interesting. What turned out
to be the case here was it was actually core-bound
on a single core doing these
reads.
So the Xeon D looks
okay here, but when I actually
dove a little bit deeper, I believe it
was starting to get core-bound too.
So that made me very nervous about those platforms.
Because SMB performance is a very important thing.
And I know these are at-scale tests,
but I still, when I see a limit like that
with a single core, I start to get very nervous
about that's a pretty low limit.
Again, I want to exceed 10 gigabits per second.
And I'm definitely not doing it there.
In this case, both the Ryzens I tested
were within the 5% variance.
I mean, if we exclude that blip right there,
the Ryzen 5 was fine.
And again, the Ryzen 3, that I believe is a test artifact.
What about random?
Well, the Atom looks decent on this,
but look at the scale of the chart.
Everything's within plus or minus 5%.
So really, no real difference between any of the platforms.
Same for 4K.
It's even tighter.
We're between plus and minus 3%.
So no real differentiator there.
The sequential reads for SMB turned out to be the real big differentiator here.
So I know the iSCSI data wasn't perfect, but we just decided to go with the Ryzen 3 2200G.
Annoyingly, AMD decided to release a new line of processors
that replaced these in the middle of this,
so we actually switched to the 3200G.
It should be better anyway.
You know, better architecture.
And, you know, it's just an equivalent part, right?
Didn't have to change anything but the CPU.
Same motherboard worked.
So we ruled out the Atom just because of the core-boundedness.
It was under 900 megabytes per second
on that sequential read test.
Had concerns about the Xeon-D in the same scenario.
That was an older Xeon-D.
Maybe a newer Xeon-D with a higher clock would be fine,
but I just didn't have any of those in a form
factor that met our needs.
And that Ryzen-based load generator
that we wound up with, with the 25
gigabit Ethernet, it can do about
1600 megabytes per second
in the same test. So
it's almost double what the four-core atom could do.
So that's pretty cool. And it's more than what we were getting
with our old load generators.
So just for an example,
I mean, so once we got those results,
we spun it up.
This is what one of our test beds looks like.
You know, 12 load generators,
a single 100 gig switch,
and then a TrueNAS on the other side.
So we actually make servers.
So we manufactured a set of 12 of these load generators,
set them up in our test lab,
and then we compared them to our older load generators.
Now unfortunately, we changed two things at once.
Never change two things at once.
We had a new OS image,
just because it was a new Ryzen chipset
and I wanted to make sure the drivers were up to date.
So I made a new image with a newer Windows 10
and a different and newer Linux distribution.
So how did we do?
This is just, you know, this is some quick testing.
Thank you Ryan for doing the testing.
With 12 of these load generators at scale,
how do we compare on different workloads?
So we have random 4K, 16K, 32K, and sequential 1 meg.
So everything's plus or minus 10%. You know, I really want these
to be five. So I'm probably going to go back and look at these workloads and see, are there new
driver updates that I need to do? Is there a new BIOS? Again, there's a lot of change in the Ryzen
BIOS space, a lot of patches, a lot of stuff changing. So we need to make sure we're on the
latest, revalidate that. And then I don't know how sequential 1 meg mix could be much worse
when reads and writes are the same.
So.
That's 50-50, right?
Yeah, it is.
It is 50-50 for sequential, yeah, yeah.
Oh, it's 50% for random.
Yeah, it's 70% for random, actually,
and then 50% for sequential.
Oh, okay.
So I don't know how this can be
when these are like that, but again, something to look into.
So we didn't get a perfect result,
but we got a decent result.
So...
Right, yeah, we haven't done any real tuning
on the client side.
We tend not to do anything.
I mean, the normal stuff, like turning off the firewall
and turning off antivirus, stuff like that.
All the great stuff security people love.
So in conclusion,
choose your equipment carefully,
configure your equipment carefully,
avoid unintended bottlenecks,
and avoid things that can cause inconsistent performance.
Pay attention to the maximum effective data rate
for everything in the data path.
Virtualized load generators can work, probably, but more care and attention is needed.
Load generator hardware, it varies.
How much the hardware matters varies, mostly depending on what you're doing.
For synthetic testing, like I said, you don't need to break the bank.
Our load generators are fairly inexpensive.
The network card is over half the cost.
And then, of course, if you're going to change something,
if you can, always do an A-B test
because you can never be sure what little change
may actually make a really big difference.
And don't change more than one thing at once.
And I think that's everything, so I think I have a question in the back.
Yeah. So what do you suggest about the server settings for CPU performance?
Yeah.
Like, for optimal performance, you want to set the frequency, right?
Right, right.
And then it keeps changing the frequency, so it may or may not be consistent.
That's a good point.
So we have a good point on the turbo settings.
Obviously, turbo is going to cause inconsistent performance.
And it probably will.
I haven't gotten to the point where I ever turn it off,
and that may be to my detriment.
So I would say it's something to monitor.
Very rarely will anyone actually turn turbo off
in the real world.
So you kind of got to roll with it.
But it is something I would be interested to see
the difference if I did turn it off.
You know, maybe that's what's causing some of the wobbles.
If you set the frequency at the same time,
then you know that that's the maximum.
Right.
Yeah.
And we do set the load generators to maximum performance, so they shouldn't be throttling down aggressively, because that will impact your response time. I've seen that be pretty intense.
Yeah.
Good point.
Any other questions?
Yeah.
Yeah. Along the same lines, what do you do about C-states?
Yeah, I just sort of put everything on max performance and hope for the best.
Again, I try not to tune things too aggressively.
And we do get very consistent results in our lab,
just with max performance in BIOS and in the OS.
We've seen temperature differences between the top and the bottom of the rack.
Yes, temperature differences can significantly impact you as well.
It's true.
Yeah, we've had debates on benchmarking committees if we need to disclose whether you have the dry ice
sitting next to the server or not.
Yeah.
Yeah. Yeah, temperature definitely can cause an effect. So ideally you have a nice
ASHRAE certified data center that maintains consistent temperatures. Any other questions?
Okay. I think we're good here. Thank you very much for taking the time to attend.
Thanks for listening. If you have questions about the material presented in this podcast,
be sure and join our developers mailing list by sending an email to developers-subscribe@snia.org.
Here you can ask questions and discuss this topic further with your peers in the Storage Developer community.
For additional information about the Storage Developer Conference,
visit www.storagedeveloper.org.