Storage Developer Conference - #129: So, You Want to Build a Storage Performance Testing Lab?
Episode Date: July 13, 2020...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, episode 129.
So welcome to So You Want to Build a Storage Performance Test Lab.
I'm Nick Principe. I've been involved in storage performance my whole career.
How long now? I don't know. Since 2005, I guess.
Right now I'm at iXsystems, the makers of FreeNAS, and I'm the storage performance supervisor.
I've had a lot of fun over the past two years
building a performance organization from the ground up.
So there's a few lessons learned that I can share
from my past life and my current role.
My Twitter, GitHub, and email are up there.
Feel free to ping me anytime.
And I apologize for any of my code on GitHub.
I'm not a developer.
I just play one on TV.
So why would you want to build a performance test lab?
There's a lot of reasons.
If you're a company,
generally the motivation for spending that amount of money
is going to come from marketing purposes.
So that can be anything, you know, sizing can be marketing.
Generally, that's on the more helpful side of marketing.
It can be getting the 10 million IOPS number that, you know, we just got from the SPDK guys here.
It can be for development purposes, too.
And one of the most important things for me in the past two years has been testing new
technologies.
Things like NVDIMMs.
We hadn't really played with that.
In fact, one of our developers wrote a driver for FreeBSD for it.
So we had to test the functionality of it, but then we also wanted to see, okay, is it
really better than an SSD?
It ought to be. It's
memory versus a SAS SSD. And testing these new technologies is something where
generally you're going to be working with a vendor that's going to give you samples.
So you kind of have to do it in-house. So you have to have something in-house that can actually
push that new component to the limit. And if you're a storage vendor or someone that consumes storage
that wants to try out something like an NVDIMM or Optane memory,
you want to bring that in and you want to have a system
that's capable enough to actually push that component to its limit
and see if it will help you.
No matter what your reasons are for building a performance test lab,
you will be converting large amounts of money
into rapidly depreciating capital assets. And if it's for your home lab, you'll be spending
a lot on eBay probably, because that's where all the good stuff is and they can't limit
you to three hard drives per transaction. Another important message that I picked up early in my career,
thank you to Ken and his director at the time,
is you get what you measure.
And this applies to more than just performance.
It applies to pretty much everything.
My favorite example is an HR example.
And the best example I have
is how you measure a team.
So if your measurement of your IT team
is how many tickets they close per quarter,
week, year, whatever,
in the end you're going to get a lot of closed tickets.
But is that really the key performance indicator for how well the job
is getting done? Or, you know, do you need more people on the team? Do you need fewer people on
the team? If that's your only metric by which you're measuring the team, usually the team will
fulfill that metric to the best of their ability. Similarly, if you want to test a storage array,
like if I go to any storage vendor and say,
I want a storage array that can do 4K ops,
as many as possible,
and I only want to select the vendor that gets the biggest number.
Well, you're going to get a lot of really good cache hit 512 byte,
or no, I said 4K, sorry.
If you put a 4K limit on vendors, you'll get really good cache hit 4K reads, because that's the fastest thing that the arrays can do.
So yes, you'll get some differentiation, depending on the architectures, but that's probably not your workload. So it may be part of the metrics
that you want to evaluate the solution,
but if that's the only metric you go by,
you're probably shooting yourself in the foot.
So it's just a cautionary tale to encourage thought
about what are you really measuring
and are you getting a full picture?
So I would say in the realm of performance
testing, carefully consider
two things. Your
benchmark slash your
load generation tool and
the workloads you're using.
And also carefully consider
the configuration of your solution
under test.
And that includes both your physical infrastructure and your virtual infrastructure.
I'm going to focus on the second part of this talk, and like I said before the talk began,
I gave a talk at MeetBSD last year, and the link is here, it's on YouTube, and the PDF
of the slides is available.
So I'd encourage reviewing that if you're at all interested by this talk today.
So let's start off with virtual load generators.
Ever since virtualization became a thing,
I think that performance teams have struggled with,
can we use VMs for load generation?
Because they're so darn convenient.
This probably is true for containers as well, but I just haven't played with containers very much.
It's lightweight, it's easier to manage.
You don't have to re-image a physical host, you just, you have snapshots, you re-clone
stuff.
It's a lot better.
It's a lot easier to manage.
So, yes, we want good performance as a performance engineering team, but we also want to be lazy, because we're humans. That's what we do.
So,
can you use virtual load generators?
Yeah.
You can. Kinda. Maybe.
No. The answer kind of
depends. I think we're trending towards yes.
I know you can.
Depending on what you want to do with the results,
maybe you don't want to.
So for publicly disclosed results,
I don't think you can right now
if you have the same requirements I do.
For internal testing, yes, VMware works great.
But we did a round of testing with a lot of hypervisors.
We found deficiencies in all.
The big deficiency with any hypervisor appears, from my perspective and the way I do performance testing, to be the network stack.
So I found a lot of problems.
Now, I did limited testing, and I'm not an expert at using all of these.
So I may be wrong.
Things may have gotten fixed since I tested them,
because this was over the course of like six months last year.
So let me know if I'm wrong.
Especially chime in when I'm going over that slide,
so people on the recording will know that I was wrong.
I'm okay being wrong, because we can learn the correct answer.
But we probably won't go into a deep discussion
of how to configure stuff if that comes up,
just for time reasons.
So testing various hypervisors.
First, I started with bhyve on FreeBSD, given the heritage of FreeNAS. It made sense to start with a BSD solution. bhyve is actually very elegant. I quite like bhyve.
There are some limitations.
First limitation that is pretty obvious
is the bridge and tap virtual networking.
It's just too slow.
You get limited to about one gigabit per second max.
Now, a lot of people use containers on FreeBSD.
They use bridge and tap,
and it's great for their applications.
And a lot of applications are probably fine with 1 gigabit per second max.
I'm not.
I need somewhere north of 10 gigabits per second
and somewhere south of 25 gigabits per second to really be happy.
So, okay, virtual networking is my problem.
Fine.
Well, we live in the modern world.
Why don't you just use SRIOV
and make virtual functions on your NICs
and pass those through PCI Pass-through?
And I think that would actually work pretty well
for Linux and BSD guests.
But unfortunately for Windows right now,
there's an issue with the base address registers
in Windows, the way it probes them.
So that doesn't really work.
I worked with some BSD people.
We sort of got a hack that worked, but the interface wouldn't link up.
So more work is needed there.
I know work is ongoing, especially for GPU pass-through with bhyve.
So I think there's promise there.
If you only wanted FreeBSD or Linux VMs, probably this is a wonderful solution.
I haven't tested that because I found Windows was my limiting factor, and I need Windows
VMs because I want to test SMB with the Windows SMB client, because most people are going
to be doing that. I also haven't tried epair, which is another way to do a tap, but I suspect I would run into the same limit
with the bridge interface. But I haven't tried that one yet. So moving on, next is KVM. KVM
is obviously very popular, especially on Linux. It's a little less popular on FreeBSD. So
I just, I spun up Debian. This looked very promising to me,
going through the virtual network switches.
I was getting 10 gigabits per second bidirectionally.
I was very happy.
But when I added a second VM on the same switch,
one of the VMs would win the battle and get basically 10 gigabits per second. The other one would get kilobits per second.
So things were being completely unfair.
Maybe there's one quick setting to fix this.
I don't know.
I didn't have the luxury of time at this point.
So I said, eh, not going to work right now.
The nice thing is SRIOV probably would work fine.
And I believe, from reviewing the wonderful interwebs, that SRIOV into Windows guests would work fine on Linux, because I think they actually made a patch for the same problem that bhyve is having right now.
I didn't keep investigating this at this time.
So, KVM on Debian?
Maybe.
Yeah.
I'd like to understand, before we get further into this: what are you actually trying to do with this testing?
Right, so the question is, what am I actually trying to do here with this testing? I probably got a bit ahead of myself. So what I'm trying to do is test
just a storage appliance with,
what was I trying to test at the time?
I think iSCSI LUNs.
So I was just trying to see,
okay, what can I do if I put a VM on a physical host
instead of using the physical host bare metal?
What can I get for performance from that host?
Because in an ideal world,
the VM on top of the host
should be able to push the same amount of traffic with a benchmark
as the bare metal host.
And so I want to assess the difference between that
to see if I can use virtual load generators
instead of physical load generators for ease of management.
Is that why it's high-speed? Mm-hmm. But fundamentally, the problem that I ran into is networking.
So you can just do network tests and see what the results are.
Then your target doesn't matter as much.
Yeah, and I did test all this with just iPerf as well,
because I don't want to run into a limit
where the iSCSI initiator or the NFS client on the OS
was a limit.
I wanted to make sure I knew that the networking
was my first problem, so I wanted to focus on that.
And that turned out to be the key differentiator, really: whether the networking performance was adequate.
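As an aside, if you want to script that kind of network pre-check, here's a minimal sketch in Python. It assumes iperf3 is installed on both ends and an iperf3 server is already running on the target; the hostnames and the 10 gigabit threshold are made-up examples, not the exact procedure from the talk.

```python
#!/usr/bin/env python3
"""Rough sketch: verify raw network throughput from each load generator
before blaming the storage target. Assumes iperf3 is installed on both
ends and an iperf3 server (iperf3 -s) is already running on the target.
Hostnames and the 10 Gbit/s threshold are made-up examples."""
import json
import subprocess

LOAD_GENERATORS = ["loadgen01", "loadgen02"]   # hypothetical hostnames
TARGET = "filer01"                             # hypothetical iperf3 server
MIN_GBITS = 10.0                               # pass/fail threshold

def iperf_gbits(client: str, server: str) -> float:
    """Run iperf3 from `client` to `server` (via ssh) and return Gbit/s."""
    cmd = ["ssh", client, "iperf3", "-c", server, "-P", "4", "-t", "10", "-J"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    result = json.loads(out.stdout)
    bps = result["end"]["sum_received"]["bits_per_second"]
    return bps / 1e9

if __name__ == "__main__":
    for lg in LOAD_GENERATORS:
        gbits = iperf_gbits(lg, TARGET)
        status = "OK" if gbits >= MIN_GBITS else "SUSPECT"
        print(f"{lg} -> {TARGET}: {gbits:.1f} Gbit/s [{status}]")
```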
So what's our next hypervisor on the roulette?
Xen! In this case, actually, XCP-ng. It's
appealing because it's easy, like
ESXi. In the end, probably
KVM is easier, once you learn it, because
it's easier to script and all that.
XCP-ng may be great too.
I just, I don't know that much about it.
In this case, the virtual switch networking was too slow.
I couldn't saturate a 10 gig link.
And also the SRIOV support at the time was not acceptable.
I don't know what it is now.
There were only a few like Intel 10 gig adapters
listed as supported.
And I couldn't seem to get SRIOV to turn on for the network drivers we were using.
So if more cards are added,
then this may be a very promising solution as well.
It was just when I was testing it,
it didn't work out for me.
ESXi is the default answer
for a lot of virtualization needs.
The virtual network performance is actually very good.
I didn't find a need for SRIOV.
Obviously, it supports it.
But being able to do this, do a lot of network traffic without SRIOV is nice
because not all hosts can support SRIOV,
especially if you're at home or you're strapped for cash
and you're buying older servers, they might not support SRIOV.
So it's nice to have the flexibility to just use basic virtual networking. I know that with older ESXi, 5.5 to 6.0, the VMs had very similar network performance to physical hosts, you know, when you set everything up properly and avoid oversubscription. We'll get into that a little
bit later. With 6.5, I did some more testing,
and I noticed a few more degradations where the virtual was a little bit slower
or the response time was a little bit higher.
So I'm not sure if that's still true.
I intend to revisit that at some point,
but I haven't had a chance.
The main issue I have with VMware is the EULA.
I am not a lawyer, so this is not legal advice,
and this is not a legal interpretation of what the EULA says.
But if you read the EULA, you'll come to this, a link to this document.
Actually, this is the link to the document that is referenced here in the EULA. So basically, depending on which VMware product you're using,
any results you disclose to a third party have to be vetted by VMware.
So I don't know if a lot of you have read the EULA for VMware,
but it's something to be aware of.
Unfortunately, knowledge is irrevocable, so you know this now.
But I don't know how easy or hard this process is because I haven't tried it
so I'm not knocking VMware
I understand why they have this clause in here
but I also, I'm lazy
and I don't want to submit every study I do
to another vendor for approval
before I would put it on a marketing blog
or something like that
so this is why I sort of stay away from it for now, in my case.
So I use VMware for a lot of testing, but not for anything I want to publicly disclose. And for me, almost everything I do, I want to publicly disclose or at least disclose to a third party, like a customer. So that's the wrap-up for the hypervisors.
Let's talk about the best practices
for configuring virtualized load generators.
This isn't an exhaustive list,
but it's a good start to get you thinking
in the right direction
about keeping yourself out of danger
and, as I say at the bottom,
away from inconsistent performance.
Basically, what you're doing is you're avoiding oversubscription.
You don't want to oversubscribe your resources if you're doing performance testing
because the goal of performance testing is to push things to the limit.
So oversubscription counts on you not using everything to the limit.
So it's sort of weird, because the whole point of virtualization was to use oversubscription to your advantage, to run more servers on one physical server than you otherwise could, because that server was more capable than any one application needed. Regardless, if you're going to use VMs for load generation, disable hyperthreading. Performance people were disabling hyperthreading before all the security guys made it cool.
A thread is good, but a thread is not a real core.
So just turn it off.
Don't mess with it.
The total number of powered-on vCPUs
that you're using for your testing
needs to be less than or equal
to the number of real CPU cores,
not threads,
in each host.
Pretty clear.
You're just not oversubscribing.
And this is mostly geared towards ESXi.
On ESXi, yes, the hypervisor's gonna use some,
but it's pretty low, so the inconsistency is low.
But you also can't pin your hypervisor
to a single core, I don't think.
So what's the point?
It's going to get you anyway.
Virtual memory, of course,
the total powered on allocated virtual memory
needs to be less than the total physical memory
in the system.
So leave a buffer for the hypervisor.
Yes, VMware will dedupe memory and all that, but just pretend it won't and avoid the headache: if you run into the case where you do use all the memory for caching, you don't want to get to the point where you're swapping, because that will ruin your performance.
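To make those two oversubscription rules concrete, here's a tiny sketch in Python. The host specs, VM sizes, and the 8 GB hypervisor reserve are made-up examples, not figures from the talk.

```python
# Minimal sketch of the "don't oversubscribe" rules from this section.
# All numbers are illustrative; plug in your own host and VM inventory.

def check_host(physical_cores: int, physical_mem_gb: int,
               vm_vcpus: list[int], vm_mem_gb: list[int],
               hypervisor_mem_gb: int = 8) -> None:
    """Warn if powered-on vCPUs exceed real cores (not threads) or if
    allocated guest memory leaves no headroom for the hypervisor."""
    total_vcpus = sum(vm_vcpus)
    total_mem = sum(vm_mem_gb)
    if total_vcpus > physical_cores:
        print(f"WARNING: {total_vcpus} vCPUs > {physical_cores} physical cores")
    if total_mem > physical_mem_gb - hypervisor_mem_gb:
        print(f"WARNING: {total_mem} GB guest RAM leaves less than "
              f"{hypervisor_mem_gb} GB for the hypervisor on a {physical_mem_gb} GB host")

# Example: a 16-core, 128 GB host running four 4-vCPU / 16 GB VMs passes;
# add a fifth VM and the vCPU rule trips.
check_host(16, 128, vm_vcpus=[4, 4, 4, 4], vm_mem_gb=[16, 16, 16, 16])
```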
Another thing on the network side is really,
you don't have to, but I would use one uplink port
on each of your vSwitches
or equivalent in your hypervisor.
Yes, you can do a LAG with LACP,
but then you're counting on the hashing algorithm
to fairly distribute.
So if you want to guarantee everything's
not oversubscribed,
everything's going to distribute properly,
do it by hand.
Make multiple vSwitches and round robin your VMs across them.
It's not that hard, especially with PowerShell now,
so that's really what I would advise you do.
Then you're not counting on LACP to do the right thing.
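For illustration, the round-robin placement itself is trivial to script. The speaker mentions PowerShell for the actual vSwitch work; this Python sketch only computes the VM-to-vSwitch mapping, with hypothetical names, and leaves the hypervisor calls to whatever tooling you use.

```python
# Sketch of the "round robin your VMs across multiple vSwitches" idea.
# This only computes the mapping; actually attaching the virtual NICs is
# left to your hypervisor's tooling (e.g. PowerShell on ESXi, per the talk).

def round_robin(vms: list[str], vswitches: list[str]) -> dict[str, str]:
    """Assign each VM to a vSwitch in turn so no single uplink is oversubscribed."""
    return {vm: vswitches[i % len(vswitches)] for i, vm in enumerate(vms)}

vms = [f"loadgen{i:02d}" for i in range(1, 13)]              # hypothetical VM names
vswitches = ["vSwitch1", "vSwitch2", "vSwitch3", "vSwitch4"]  # one uplink each
for vm, vsw in round_robin(vms, vswitches).items():
    print(f"{vm} -> {vsw}")
```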
And another point, if you're using iSCSI,
just use the software initiator in your guest OS,
especially on Linux.
It's perfectly fine, overhead's very low,
and it's less of a hassle to set up
than like RDMs in VMware
or the equivalent in your hypervisor.
If you're testing Fibre Channel, you don't really have a choice, unless maybe you use SRIOV. You could do that, but I haven't tested it, so I don't want to speak to that.
But basically the key message is avoid oversubscription
because inconsistent performance is worse than bad performance.
Because if it's inconsistent, every time you test it,
you might get a different answer.
If it's just bad, you can test it
and then try to figure out why it's bad.
So that's the whole point here. You really want to be as consistent as you can.
So if you move on to general guidance, whether you're using virtual load generators or physical.
Yeah, question.
So how about just using the processor bare metal and running multiple threads, instead of using VMs? Even if we are not multiplying the resources of the whole system...
Right.
So the question is, why not just use bare metal
and run multiple threads in your load generator?
And that's a very fair point.
I actually, I'm still using bare metal just because it's convenient.
But the one catch there is that operating systems are a lot better than they were, but they're still not perfect. So you can still run into a case where if you have multiple VMs on one host,
you can get better performance
than if you have just a bare metal OS instance on that host
because you can get a hot lock in the OS
or maybe the SMB client.
Maybe it can't scale on a single host
because, again, probably a hot lock or something like that.
So it can be advantageous,
especially if you have a very large load generator.
You may really want to run VMs on it, because you're going to get better performance. So that's something that can catch you out. It's a really good question.
So in general for the network,
this seems obvious,
but the total network bandwidth of all your load generators, it needs to exceed that of the system that you're testing.
I mean, you have to have, it all comes down to where you're putting the bottleneck, and we'll go through this later.
But you have to make sure that the bottleneck is where you want it to be.
Because otherwise you're going to get,
you could either get inconsistent performance or you could go back to just getting bad performance.
Then you have to figure out why.
Avoid LAGs on your load generators, just like with the virtual switches. If you have multiple network ports on a load generator, you're going to have a bad time. That's another reason VMs are great, because then you don't have to deal with routing or setting up multiple subnets or any of that. Otherwise, on a typical default setup, if you have multiple ports on a single OS and they're on the same subnet, your traffic will not be symmetric in both directions. So avoid LAGs on the load generators, but really also avoid having multiple ports if you can, at least for the ports through which you're testing performance.
In your test environment, avoid switch hops.
Keep everything on the same switch.
If you can't do that, minimize switch hops
and ensure you have sufficient bandwidth
between your switches
so you don't run into a bottleneck
going from the switch that your load generators are on
to the switch that your system under test is on.
What else did I want to say about switches?
Yeah, well, okay.
So what I wanted to say was that
networking is probably the most common performance
or most common factor that induces a performance problem in customer sites.
I think most storage vendors will agree.
So don't hang yourself on the network hook in your own testing.
Minimize that so you know what the maximum is.
Because most people are going to run through multiple switches.
But it doesn't really matter because it's going out to thousands of people on a campus or whatever. So each person only expects a
small slice of the performance. But for what you're telling people your system can do,
you want to say, this is the big number that I can get without network limitations.
And of course, MTU can catch you out very easily. If you're going to increase your MTU to 9,000,
which generally helps with CPU quite a bit,
I mean, less so with all the NIC offloads,
but I still find that it helps.
If you're going to change your MTU, make sure it is consistent across all your virtual switches, all of your physical switches, all of your operating systems, all of your filers, everything.
The fun part is with virtual switches on VMware,
at least back in the day, you could leave the vSwitch MTU at 1500, then ping another VM on that same vSwitch with don't-fragment set and a packet size higher than 1500, and it would pass an 8K packet just fine, even though don't-fragment was set.
As soon as something goes into or out of that vSwitch
onto a real switch, then you'd run into a problem.
So you have to be very careful with MTU
if you're changing it from the default.
I've been caught out myself multiple times.
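If you want to automate that check, here's a rough sketch using the Linux iputils ping flags for don't-fragment. The endpoints are hypothetical, and remember the caveat above: test across a real switch, not just between two VMs on the same vSwitch.

```python
# Quick MTU sanity check, in the spirit of the ping trick described above.
# Uses Linux iputils ping flags (-M do = don't fragment, -s = payload size).
# 8972 = 9000-byte MTU minus 20 bytes of IP header and 8 bytes of ICMP header.
import subprocess

HOSTS = ["loadgen01", "filer01", "10.0.0.50"]   # hypothetical endpoints
PAYLOAD = 8972

for host in HOSTS:
    cmd = ["ping", "-M", "do", "-c", "1", "-s", str(PAYLOAD), host]
    ok = subprocess.run(cmd, capture_output=True).returncode == 0
    print(f"{host}: {'jumbo frames OK' if ok else 'MTU mismatch somewhere in the path'}")
```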
For memory, typically this
matters if you're doing VMs, right? You can set whatever memory you want. On Windows,
like 8 to 16 gigs generally is what I would recommend. On Linux, there's no real reason
to go above 4 unless there is. So like I have here, more memory may or may not help you.
It sort of depends on what you're doing.
If you're using workloads that are doing direct IO,
so most of your block testing
is going to use the O_DIRECT flag or the equivalent on your operating system, or if you're doing basic file testing with stuff like fio. Now, by default, fio does not set direct=1. For most simple file storage testing you will want to set O_DIRECT, unless you have a specific reason not to; I'm not going to go over that here.
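As an example of what that looks like in practice, here's a hypothetical fio invocation with direct=1 set, wrapped in Python. The job parameters and the /mnt/test mount point are illustrative, not settings from the talk.

```python
# A minimal, hypothetical fio job with O_DIRECT enabled, as discussed above.
# Assumes fio is installed and /mnt/test is a mount from the system under test.
import subprocess

cmd = [
    "fio",
    "--name=seqread",
    "--directory=/mnt/test",     # hypothetical mount point on the filer
    "--rw=read",
    "--bs=1M",
    "--size=10G",
    "--numjobs=4",
    "--direct=1",                # bypass the client page cache (O_DIRECT)
    "--ioengine=libaio",
    "--time_based", "--runtime=60",
    "--group_reporting",
]
subprocess.run(cmd, check=True)
```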
With more complex file testing, especially with stuff like the SPEC SFS software build or EDA workloads, those do not use O_DIRECT and there's a lot of metadata. So metadata caching tends to simplify the workload presented to the storage system, and Ken is actually going to talk about that in his talk, or our talk, about the AI image processing and genomics workloads in SPEC.
So that's on Thursday.
So come to that if you want to see that.
He actually has a chart about that
that I think is in the final version.
But those workloads, the metadata's gonna be cached.
You can't really stop that from happening.
So that's gonna take memory and processing power
on your load generator.
So you just have to be aware of that.
Also, delegations in NFS v4 and SMB leases,
depending on how the data's structured
and the capabilities of the system you're testing,
it could get a lease that just says,
do whatever you want,
and the client could write a significant amount of data
without ever sending that back to the filer.
There's policies for when it would.
If it did a flush, I know it would write back.
Again, I haven't played with those very much. But there is the capability in newer versions of the protocol to actually write data
to a network share that does not immediately go back to that filer. So you have to be aware of
that. You probably don't want that if you're trying to test the performance of the filer at the end.
But if your customer is going to do that,
and that's perfectly valid,
you may actually want to test the performance of that.
But that's probably less likely.
So you have to be aware of that.
So more memory may or may not be good,
depending on what you're trying to do.
And the better performance that comes with it
may or may not be what you want to see in your test.
So, no one ever said performance testing was easy; that's why people like me and Ken and Ryan have jobs.
Question?
VVOLs?
So the question is,
do I have any opinion on VVOL testing?
I don't.
I haven't played with it very much.
I suspect, well, you'd have to be using VMs.
I suspect the best thing to do there would be to attach a separate volume
from the boot volume to your VMs and test that directly.
I believe that is possible with vVols, but I'm by no means an expert.
I'm very interested in them because I think they're cool,
but I haven't played with them.
You were saying that on a Linux host, all it needs is one gigabyte of memory?
If you're setting up a storage performance testing lab and you're running synthetic benchmarks with load generation tools, yes.
Would you ever do it in your data center running a real app?
Probably not anymore, right?
Yeah.
Yeah.
Just the minimum.
More or less may or may not help you.
But if you're going to run mostly like FIO or VDBench,
and you're generating load and you can use VMs for it,
then you shouldn't need more than that. Alright, last best practices slide
for interconnects.
You know,
I don't know that this one is too helpful
because I say, oh,
avoid unintended bottlenecks.
Great, well that's easy to say, right?
I'll show you a little bit more about that.
What I would recommend,
and this comes in waves of being important and not,
is check your servers,
check the block diagrams for your servers
if you can get them from your vendors
for PCIe switches.
And also check all your PCIe ports
because a lot of them are x16 physically, but they're electrically less. Usually it's silkscreened on the board; I think all vendors do that now. So, you know, don't put an x16 card in an x8 slot, because you're instantly going to limit your performance there. Same for an x8 card in an x4 slot. It matters; if you have a PCIe 4.0 board and card, then it's going to matter a lot less.
But it's something to watch out for. Just make sure you have enough bandwidth
throughout your whole solution.
The other thing is to check your maximum
effective data rates for all the interconnects
in your data path.
So all the way from the disks,
all the way to the load generator.
Make sure you don't have any unintended bottlenecks there.
And we're gonna go through a few scenarios with that.
The other thing to check are data sheets
for the various controllers, like your NICs,
and probably more often your HBAs.
Because most HBAs have way more capability
on the front end and back end of them
than the controller can actually handle.
So that's another bottleneck to be aware of.
So here's a logical but strange-looking picture of a bottleneck analysis in a solution under test for performance.
So your first thought is probably,
storage solutions make strangely-shaped bottles. And I would agree.
But you could probably get a lot of money for that at a flea market.
So the most amount of bandwidth and performance capability you have
should really be your load generator memory and CPU.
Because you have a lot of load generators,
in aggregate they have a lot of memory bandwidth.
And the CPUs are fast and match the memory bandwidth.
In this solution, we have the least amount of bandwidth with our hard drives.
But you can see there's a lot of bumps along the way.
So first of all, we met one of the rules, but barely, by my drawing.
The load generator network is just barely more than the network on the filer.
So that's good.
We're not limited there.
But you can see, in the end, this solution, this fictional solution,
it's designed to be bottlenecked on the performance of its hard drives,
and you can see in this case it is.
But you really have to check all the components,
going from all the way from your disks at the end
all the way up here to where that load is being generated.
And make sure you don't have any bottlenecks
where you don't intend them.
Maximum effective data rates.
I have no idea if this is a standard term or not,
but I made it up and I like it, so sorry if it doesn't match the meaning that you expect. I consider this the rate of effective data transmission over an interconnect. So it's lower than the raw signaling rate; depending on which interconnect we're talking about, that number may match or it may not.
It might be Ethernet.
It might be SAS.
So I'm going to make fun of SAS, because it's 8b/10b, so there's a very high encoding overhead there. So, oh, yay, 6 gigabit SAS, but you're not going to get 6 gigabits through that. There's also other overheads, protocol overheads.
I don't count those because the percent overhead that that's going to induce is variable depending on the payload size.
And I'm just going to assume, if I'm pushing performance, generally the payload size will be large.
Not always, but I'm just going to assume the payload size will be large, so the protocol overhead will be a percent or two.
So we'll just call that even.
It's just variable.
I can't put it on a table.
I can't use it for a rule of thumb guess.
But it is there.
So I've only tried to exclude the physical encoding overheads where I could, where I could figure out what they are.
Thank you, Wikipedia.
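As a sketch of how those encoding overheads turn into a maximum effective data rate, here's some back-of-the-envelope Python. The encoding efficiencies are standard figures (8b/10b for SAS, 128b/130b for PCIe 3.0, 64b/66b for 10/40/100 GbE); real-world numbers, like the roughly 7,500 MB/s used later in this talk for PCIe 3.0 x8, come out a bit lower once packet and protocol overheads are included.

```python
# Back-of-the-envelope "maximum effective data rate" helper, using only the
# physical encoding overhead (protocol overhead is ignored, as discussed).

ENCODING_EFFICIENCY = {
    "8b/10b": 8 / 10,       # SAS, older PCIe generations
    "64b/66b": 64 / 66,     # 10/40/100 GbE
    "128b/130b": 128 / 130, # PCIe 3.0 and later
}

def effective_mb_per_s(raw_gbit_per_lane: float, lanes: int, encoding: str) -> float:
    """Raw signaling rate per lane (Gbit/s) -> effective MB/s across all lanes."""
    gbit = raw_gbit_per_lane * lanes * ENCODING_EFFICIENCY[encoding]
    return gbit * 1000 / 8          # Gbit/s -> MB/s (decimal megabytes)

# Examples: 6 Gb SAS x4 and PCIe 3.0 x8
print(f"6Gb SAS x4 : {effective_mb_per_s(6, 4, '8b/10b'):,.0f} MB/s")      # ~2,400
print(f"PCIe 3.0 x8: {effective_mb_per_s(8, 8, '128b/130b'):,.0f} MB/s")   # ~7,880
```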
We have a table that I try to keep updated on the iXsystems blog with the maximum effective data rates. We have more than just SAS and PCIe, but this is all I can fit on the slide. I've listed it in both megabytes and mebibytes, depending on your persuasion, whichever one you like; some people prefer one and some the other, so fair enough.
I also have gigabits because a lot of times for,
if you're thinking about media,
you generally think in bit rates.
So sometimes it's helpful to know,
okay, how many 4K streams can I stream
over a 40 gig ethernet connection?
So these are just helpful rules of thumb.
Again, it's all on the website.
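For example, the 4K-streams rule of thumb is just division. The 4,700 MB/s single-port 40 GbE figure is the one used later in this talk; the 25 Mbit/s per-stream bitrate is purely an assumed number for a compressed 4K stream, not something from the talk.

```python
# Rule-of-thumb example: how many video streams fit through one 40 GbE port?
effective_40gbe_mb_s = 4700                         # effective MB/s for one 40 GbE port
effective_40gbe_mbit_s = effective_40gbe_mb_s * 8   # ~37,600 Mbit/s
stream_mbit_s = 25                                  # assumed bitrate of one compressed 4K stream
print(f"~{effective_40gbe_mbit_s // stream_mbit_s} streams per 40 GbE port")
```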
SAS is fun because of that, as you can see. I don't really have the example math here, and I'm not going to try to do it live. The other cool thing is the NVMe M.2 interfaces, because sometimes they're x2 and sometimes they're x4. That can catch you out. But you can use these as you walk through; remember that weird bottle shape. You can analyze, at every step in your data path, what the maximum speed you can get is. So let's look at a generic storage server. I'm going to assume that the CPU and RAM are fast enough; in my diagram, that was the biggest part. So we're going to ignore those for now. But let's walk through this storage solution. We have a SAS HBA right here, connected to the CPU via PCIe 3.0 x8. So that's the maximum bandwidth we're going to get in a single direction out of a PCIe 3.0 x8 slot.
Now, the next step, after we get from the CPU down here,
is our actual controller chip.
Well, the data sheet says that can only do 5700 megabytes per second.
Okay, so we've already moved down in terms of the maximum performance we can get.
Now, in this case, we have a JBOD here with 24 SAS HDDs, 7.2K RPM.
So each of those drives, according to its specs, you can do 192 megabytes per second.
That adds up to 4,600, an even lower number. So we're riding the wave down. We're trying to find that bottleneck. That JBOD is only connected to this SAS HBA with a single x4 SAS cable. So our theoretical
maximum there, maximum effective, is about 4,500. So we've hit a new low in terms of performance.
So theoretically, that's our bottleneck. But what about the front end? Right?
This is a storage server.
So it's not generating any data.
The data's coming from outside.
So we have to look at the front end.
So the front end's going to come from our NIC.
Our NIC, again, is connected at PCIe 3.0 x8, so 7,500 megabytes per second is possible there. And we're only using one of the 40 gig NIC ports, so at this point, we can exclude the NIC controller.
In this case, we know that that single port
can do about 4,700 megabytes per second
because it's 40 gig Ethernet.
So again, theoretically, this is our bottleneck,
this SAS connection right here.
That has the lowest maximum effective data rate
in this solution.
Right?
So if I sell you this, I'll tell you,
hey, you can do 4,500 megabytes per second
when you're writing to it.
You're good, right?
No!
We're writing to a pool of mirrors.
So all the incoming data
will have to be duplicated
before it's written to the drives.
So, let's go through that.
What is our slowest component
in the path to the drives?
That's that SAS port right there.
This is the maximum amount of data
that we can send to the drives.
But we know we have to send twice that amount of data. So, divide it by two: 2,288 megabytes per second.
Now, if we do the math, the hard drives are fine, because even though we're still writing twice that amount to the disks, the hard drives are capable of a bit more than that. The HBA is fine. The PCIe to the HBA is fine.
What about the front end? Well, now we can only accept a maximum of 2288 megabytes per second,
and our 40 gig NIC and its connection to the PCIe bus are more than adequate for that.
So it's not just your physical components that limit you.
It's also logically what's happening.
Make sense?
All right.
Let the record show there were some nods.
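Here's that same walk-through expressed as a sketch in Python: take the minimum over every hop, then apply the logical write penalty of the mirrors. The numbers are the approximate figures from this example, with the x4 SAS link written as 4,576 MB/s so the halved value matches the 2,288 quoted above.

```python
# A sketch of the walk-through above: take the min() over every hop in the
# data path, then apply the logical write penalty of a pool of mirrors.
# Numbers are the approximate figures used in the talk's example.

back_end = {
    "PCIe 3.0 x8 to HBA":       7500,       # MB/s
    "HBA controller datasheet": 5700,
    "24 x 7.2K SAS HDDs":       24 * 192,   # ~4,600
    "single x4 SAS cable":      4576,       # "about 4,500" in the talk
}
front_end = {
    "PCIe 3.0 x8 to NIC":       7500,
    "one 40 GbE port":          4700,
}

raw_back_end = min(back_end.values())
mirror_write_penalty = 2                    # every write lands on two drives
usable_write = raw_back_end / mirror_write_penalty
print(f"Back-end physical bottleneck : {raw_back_end} MB/s "
      f"({min(back_end, key=back_end.get)})")
print(f"Front-end physical bottleneck: {min(front_end.values())} MB/s")
print(f"Effective write ingest       : {usable_write:.0f} MB/s "
      f"after the mirror write penalty")
```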
So I recently went on a search for the perfect load generation hardware.
Perfect. No problems whatsoever.
Think I succeeded?
You're right. I failed.
But I think I did pretty good.
So, let me walk you through this.
First of all, does it matter what hardware you use for your load generators?
To some extent, yes. As always with performance, it depends.
It matters a lot less for synthetic testing. So this is getting back to your question, right?
When I'm running my synthetic benchmarks, those pieces of software are designed to push
a lot of load with very little impact to the system
that's generating the load.
So for synthetic testing, it doesn't matter that much.
It still matters, but not as much.
Except for single thread testing,
that's where if you need one thread to go faster,
your CPU clock's gonna matter.
So there are some exceptions.
For real application testing,
obviously you're
going to need more performance there. Let's talk about synthetic first a little bit more.
So like I said, synthetic benchmark load generation tools are designed to generate load.
And a lot of it without using a lot of resources on the host. So that's why it matters less with
synthetic workloads.
But again, like I said before,
some of the workloads actually do involve the OS more than others.
If there's more metadata, if you're not using ODirect,
then the client, the VFS layers in the client,
the OS in the client is going to be more involved in that data flow to the server or from the server.
So if you have very metadata-heavy workloads,
you're going to have to have enough oomph
on your load generators to absorb that.
And you have to keep in mind all the previous stuff
about making sure there's no bottlenecks in your data path,
physical bottlenecks in your data path as well.
For real applications, they're very annoying.
They actually want to do real work.
I don't know why.
And we can't tell them,
just don't do that stupid real work stuff
and generate the load that you would
if you were doing real work.
It's almost like they're designed to achieve something.
So when I'm running my video editor,
it actually wants to decompress the video.
I just want it to read the video from my storage array.
I don't care what it does with it,
but it actually has to decode it and display it.
Fair enough.
So in this case, there's gonna be compute time.
The resources matter, your CPU matters, your GPU matters,
the amount of memory you have matters,
the speed of the memory matters.
So we have a separate lab with a lot fewer machines in it,
but they're much higher spec.
So for stuff like synthetic single threaded testing,
that's where we do that testing,
because we have high clock rates.
When we have to do a real application testing,
run DaVinci Resolve or something like that,
well, then we fire up the iMac Pro
or the high-power Windows workstation
with Xeons in it.
But it beats having to buy 12 iMac Pros to cover all the cases, right?
12 iMac Pros is a lot.
So I began my search for new load generators.
I wanted new load generators just for at-scale synthetic testing.
So typically I wanted about 12 of them.
That's typically what one of our testbeds runs.
I wanted to balance cost, obviously,
convenience, and compatibility.
I want to be able to test whatever I want to test in the future.
And I want the same performance or better than my current ones,
which are running E3 1270 V5s,
which are, you know, decent little E3 chips,
pretty high clock rate.
I wanted to be able to use 25 gigabit Ethernet because I have 100 gig switches.
Thank you.
And you get much better switch utilization per channel
if you can run them at 25 gig.
Obviously, I want the performance too.
And I also wanted out-of-band management and IPMI
because rebooting the systems is a lot easier
when you can serial into them.
Here are my candidates.
So we started with, this is our baseline system
with the E3s in them, obviously, 3.6 gigahertz,
turbo up to four, that's pretty cool.
I tested an older Xeon D, two different atoms,
an eight core atom and a four core atom,
both at 2.2 gigahertz.
A Ryzen 5 that actually was standing in
for the intended Ryzen 3 2200. You can see
the Ryzen obviously is great on the clock rate, but in all cases I just wanted 16 gigs of memory.
Again, I just found I didn't need more than that, and a 16 gig DIMM is about the lowest I can buy. So I ran some iSCSI tests.
I ran some SMB tests.
In all cases, it was an older all-flash FreeNAS,
and I just configured it to maximize performance.
So I did horrible things that I would never tell a customer to do to make it go faster.
Because I wanted the bottleneck, unlike all my other testing,
I wanted the bottleneck at the load generator.
I wanted to shift it up.
I wanted to see how fast this load generator can do and which one of these I should buy so that then my bottleneck will
firmly remain at the storage system for all my future testing. So I used a small active
data set size. All the reads are going to be cached on the free NAS just by nature of
how it works. For iSCSI, I ran CentOS 7.
So what did we see in the tests?
This is all above zero.
It's better than the E3.
Below, it's worse than the E3, the existing load generators.
We can see Ryzen's not looking bad for both reads and a 50% read mix, but for writes, we're, what, 15% down.
For some reason, the atoms were quite good at reading,
but they were very bad at a mixed workload,
but they were okay for writing.
It's weird, but it is consistent,
because I ran the test multiple times.
Very strange, didn't deep dive into it.
Oh, and sorry.
This was for a sequential 1 meg workload.
Now we're going to go to 32K random I/O.
Again, over iSCSI.
We can see, again, the atoms sort of fall apart here.
They're not too bad at the reads,
but when writes are involved,
I'm seeing significantly less.
The Xeon D was holding its own.
I would prefer plus or minus 5%,
but within 10% is still okay,
especially when the mixes seem quite worse on the atoms.
On the Ryzen, everything was fine.
You'll see that the Ryzen 3 results
generally are pretty low with the writes.
I believe that's my SSDs dying
as I'm running these tests more and more.
So I think that's actually a test artifact,
so we should probably ignore that,
because that doesn't repeat
if I run it on a different system.
Otherwise the results are fairly consistent.
For random 4K, things are looking very positive
for the Ryzen system.
The Xeon D looks quite good.
I mean, minus seven for reads,
but it's pretty close to five.
And the atoms, well, they're a little slow on the reads,
and it seems like that's bleeding into the read mix as well.
So I wasn't really that happy with any of those results. It's kind of a mixed
bag, right? Sort of no clear winner. I hate that. So I switched to SMB. Maybe Windows
10 would tell me something different, give me a good reason to go with one of them over
another or at least eliminate some of them. So I switched over to Windows 10. And let's start with the sequential
1 meg I/O size testing. It's a different set of tests; I didn't run all the tests again, I just tested the four-core Atom. And that told me that, yeah, you probably don't want to do sequential reads on the Atom. It's actually pretty interesting. What turned out
to be the case here was it was actually core-bound
on a single core doing these
reads.
So the Xeon D looks
okay here, but when I actually
dove a little bit deeper, I believe it
was starting to get core-bound too.
So that made me very nervous about those platforms.
Because SMB performance is a very important thing.
And I know these are at-scale tests,
but I still, when I see a limit like that
with a single core, I start to get very nervous
about that's a pretty low limit.
Again, I want to exceed 10 gigabits per second.
And I'm definitely not doing it there.
In this case, both the Ryzens I tested
were within the 5% variance.
I mean, if we exclude that blip right there,
the Ryzen 5 was fine.
And again, the Ryzen 3, that I believe is a test artifact.
What about random?
Well, the Atom looks decent on this,
but look at the scale of the chart.
Everything's within plus or minus 5%.
So really, no real difference between any of the platforms.
Same for 4K.
It's even tighter.
We're between plus and minus 3%.
So no real differentiator there.
The sequential reads for SMB turned out to be the real big differentiator here.
So I know the iSCSI data wasn't perfect, but we just decided to go with the Ryzen 3 2200G.
Annoyingly, AMD decided to release a new line of processors
that replaced these in the middle of this,
so we actually switched to the 3200G.
It should be better anyway.
You know, better architecture.
And, you know, it's just an equivalent part, right?
Didn't have to change anything but the CPU.
Same motherboard worked.
So we ruled out the Atom just because of the core-boundedness.
It was under 900 megabytes per second
on that sequential read test.
Had concerns about the Xeon-D in the same scenario.
That was an older Xeon-D.
Maybe a newer Xeon-D with a higher clock would be fine,
but I just didn't have any of those in a form
factor that met our needs.
And that Ryzen-based load generator
that we wound up with, with the 25
gigabit Ethernet, it can do about
1600 megabytes per second
in the same test. So
it's almost double what the four-core atom could do.
So that's pretty cool. And it's more than what we were getting
with our old load generators.
So just for an example,
I mean, so once we got those results,
we spun it up.
This is what one of our test beds looks like.
You know, 12 load generators,
a single 100 gig switch,
and then a TrueNAS on the other side.
So we actually make servers.
So we manufactured a set of 12 of these load generators,
set them up in our test lab,
and then we compared them to our older load generators.
Now unfortunately, we changed two things at once.
Never change two things at once.
We had a new OS image,
just because it was a new Ryzen chipset
and I wanted to make sure the drivers were up to date.
So I made a new image with a newer Windows 10
and a different and newer Linux distribution.
So how did we do?
This is just, you know, this is some quick testing.
Thank you Ryan for doing the testing.
With 12 of these load generators at scale,
how do we compare on different workloads?
So we have random 4K, 16K, 32K, and sequential 1 meg.
So everything's plus or minus 10%. You know, I really want these
to be five. So I'm probably going to go back and look at these workloads and see, are there new
driver updates that I need to do? Is there a new BIOS? Again, there's a lot of change in the Ryzen
BIOS space, a lot of patches, a lot of stuff changing. So we need to make sure we're on the
latest, revalidate that. And then I don't know how sequential 1 meg mix could be much worse
when reads and writes are the same.
So.
That's 50-50, right?
Yeah, it is.
It is 50-50 for sequential, yeah, yeah.
Oh, it's 50% for random.
Yeah, it's 70% for random, actually,
and then 50% for sequential.
Oh, okay.
So I don't know how this can be
when these are like that, but again, something to look into.
So we didn't get a perfect result,
but we got a decent result.
So...
Right, yeah, we haven't done any real tuning
on the client side.
We tend not to do anything.
I mean, the normal stuff, like turning off the firewall
and turning off antivirus, stuff like that.
All the great stuff security people love.
So in conclusion,
choose your equipment carefully,
configure your equipment carefully,
avoid unintended bottlenecks,
and avoid things that can cause inconsistent performance.
Pay attention to the maximum effective data rate
for everything in the data path.
Virtualized load generators can work, probably, but more care and attention is needed.
Load generator hardware, it varies.
How much the hardware matters varies, mostly depending on what you're doing.
For synthetic testing, like I said, you don't need to break the bank.
Our load generators are fairly inexpensive.
The network card is over half the cost.
And then, of course, if you're going to change something,
if you can, always do an A-B test
because you can never be sure what little change
may actually make a really big difference.
And don't change more than one thing at once.
And I think that's everything, so I think I have a question in the back.
Yeah. So what do you suggest about the server settings for CPU performance?
Yeah.
Like, for optimal performance, you want to set the frequency, right?
Right, right.
And then it keeps changing the frequency, so it may or may not be consistent.
That's a good point.
So we have a good point on the turbo settings.
Obviously, turbo is going to cause inconsistent performance.
And it probably will.
I haven't gotten to the point where I ever turn it off,
and that may be to my detriment.
So I would say it's something to monitor.
Very rarely will anyone actually turn turbo off
in the real world.
So you kind of got to roll with it.
But it is something I would be interested to see
the difference if I did turn it off.
You know, maybe that's what's causing some of the wobbles.
If you set the frequency at the same time,
then you know that that's the maximum.
Right.
Yeah.
And we do set the load generators to maximum performance, so they shouldn't be throttling down aggressively, because that will impact your response time. I've seen that be pretty intense.
Yeah.
Good point.
Any other questions?
Yeah.
Yeah. Along the same lines, what do you do about C-states?
Yeah, I just sort of put everything on max performance and hope for the best.
Again, I try not to tune things too aggressively.
And we do get very consistent results in our lab,
just with max performance in BIOS and in the OS.
We've seen temperature differences between the top and the bottom of the rack.
Yes, temperature differences can significantly impact you as well.
It's true.
Yeah, we've had debates on benchmarking committees if we need to disclose whether you have the dry ice
sitting next to the server or not.
Yeah.
Yeah. Yeah, temperature definitely can cause an effect. So ideally you have a nice
ASHRAE certified data center that maintains consistent temperatures. Any other questions?
Okay. I think we're good here. Thank you very much for taking the time to attend.
Thanks for listening. If you have questions about the material presented in this podcast,
be sure and join our developers mailing list by sending an email to developers-subscribe@snia.org.
Here you can ask questions and discuss this topic further with your peers in the Storage Developer community.
For additional information about the Storage Developer Conference,
visit www.storagedeveloper.org.