Storage Developer Conference - #50: Introducing the EDA Workload for the SPEC SFS Benchmark
Episode Date: July 10, 2017...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC
Podcast. Every week, the SDC Podcast presents important technical topics to the developer
community. Each episode is hand-selected by the SNIA Technical Council from the presentations
at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast.
You are listening to SDC Podcast Episode 50.
Today we hear from Nick Principe, Principal Software Engineer with EMC,
as he presents Introducing the EDA Workload for the SPEC SFS Benchmark from the 2016 Storage
Developer Conference.
So today I'm going to give a quick intro to SFS 2014 for any of you not familiar.
And then Jig will take over and talk about what exactly EDA is and the characteristics
of EDA workloads.
And he'll walk you through some trace data
from real EDA environments.
And I'll walk through how we defined the EDA workload
in the SFS 2014 benchmark.
And I do have some example data for you.
I wasn't sure I would.
The deck you can download right now
doesn't have the example data,
so we'll get that updated.
And then there are a few more enhancements
that will actually come into the SFS 2014 load generator
as part of the update that will include EDA.
So I'll cover those as well.
What is SPEC?
If you're not familiar, it's the Standard Performance Evaluation Corporation, which
is a nonprofit established to produce industry standard benchmarks and probably most famous
for SPEC CPU, I would say, but we have a whole suite of benchmarks for different products
and industry interests.
Quick disclaimer, we're going to be talking about SFS 2014 SP2 and the EDA workload,
which is part of that.
And right now this is pre-release software,
so everything in here is subject to change
because everything is still under review and test.
So a quick overview of SFS 2014.
It is an industry standard storage solution benchmark.
The whole lineage was an NFS server benchmark.
In 2008 we expanded to CIFS and NFS.
And with 2014 we expanded to testing whole storage solutions.
In 2014 we also branched out and added four workloads instead of just a workload per protocol.
So we started with the database workload, the software build workload, VDA and VDI and
now in SP2 we will be adding the EDA workload.
SFS 2014 measures performance at the application level.
So unlike previous versions, if you're familiar with that, it doesn't generate its own protocol
packets.
It actually goes through your operating system all the way to the final destination.
So it engages the whole storage stack from application to disk,
although if you have enough cache
somewhere in your storage stack,
it won't make it to disk.
That's the beauty of it.
But it does test your storage solution
all the way from what your application would say,
which we feel is much more relevant to customers
because they want to know
what their application performance will be.
And because we have this flexibility
and we're not generating our own packets on the wire,
we can test a much broader range of product configurations.
So you can think of this as regular hard disks,
multi-tier hybrid arrays, and all flash,
but you can also think cloud storage.
Someone just did a test
and actually got it running in Amazon,
believe it or not.
And we have published results now
and my colleague from
IBM has published
results on GPFS.
So this is not just a NAS benchmark
anymore.
You can test any fully featured file
system. And I sit here wondering after this
whole week, I'm like, hmm, so what about these ones that are almost POSIX? I bet some of
the workloads will run fine on that, but I haven't tested that. That would be cool.
So for a more extensive introduction to SFS 2014, we actually do have video recordings
on the spec.org link or the YouTube link,
and the slide decks are also available publicly from SNIA.
So there's one session that Spencer and I did from 2014
and one that Vernon and I did from 2015.
So I encourage you to view those if you weren't able to attend.
So now I'm going to hand it over to Jig,
and he's going to talk about exactly what EDA is.
Can you guys hear me?
So how many have heard of EDA?
Okay, so if you haven't heard of it, it's okay.
And the word or the acronym itself is not that important.
It loosely represents a large number of things, right?
The acronym itself stands for electronic design automation.
But the way the word EDA is used, it's used to characterize a group of companies as well as a number of tools.
So you can think of it as representing semiconductor companies.
You can think of it as representing companies that actually manufacture chips.
Or you can even think of it as companies that develop EDA tools or applications with it. So if we go line by line here, bullet by bullet, it represents software tools and workflows that are used in designing semiconductor chips. There are arguably over a hundred-some
tools, and it will easily take dozens of tools to design a chip from its specification to fabrication.
It's a compute-intensive process,
and the compute grid has evolved from just having
RISC-based processors to having thousands and thousands
of cores that are x86-based, right?
So it's a compute intensive process
and the amount of concurrency required
has increased over time as the technology node
has shrunk to nanometer geometries.
Storage is often the bottleneck as the compute grid
relies heavily on all data sets residing on NAS.
So it's a shared storage infrastructure.
All the nodes on the compute grid are uniformly configured
to access all the data sets such as projects, scratch, tools,
home directories, so on and so forth.
So more often than not, you know, it's the
file system and the protocol stack that becomes the bottleneck. If you were to
look at the data set characteristics, the data set consists of millions of small
files. This is because if you think about the EDA chip design flow, somebody decided that they're going to represent a circuit
as a flat ASCII text file
and represent another circuit as another file,
another circuit as another file, so on and so forth,
and create a POSIX directory hierarchy
to define a block and bigger block
and then ultimately a chip, right?
So it is the unstructured nature of the directory hierarchy
and the way the chip is defined that results in millions to billions of small files.
There is a small percentage of large files.
The large files can be as large as hundreds of gigs, right? The end result is stored in a standard format called GDS, or GDSII, and that GDSII file
represents your image of the chip that you send out to the foundry to be manufactured. And so there is a small percentage of large files.
The characteristics of the I.O. are mixed, random, and sequential, right?
So as you go through various design phases,
you have both sequential I.O. for the larger files
and random I.O. for the overall smaller files.
What we did is we divided up the design phases into two high-level design phases: front-end
design and back-end design. This is how it's actually carried out and known in
the industry as well. And what we're going to do is represent
the workload and the storage characteristics of the workload by
associating them with these high-level design phases as well.
The front-end design phase is where you have millions of small files.
The back-end design phase is where you get the larger files.
The front-end design phase does generate a lot of transient data known as scratch data.
And the scratch data in the EDA space, unlike HPC, is actually stored on a NAS.
So in traditional HPC, what you would find is that the traditional HPC compute grid will
have some kind of interconnect and some kind of a distributed file system across
it. But traditional EDA is, you know, all the data, including the scratch
data, is on a NAS system. You have lots and lots of jobs running concurrently.
I've heard numbers anywhere from hundreds of thousands to millions of
jobs per day. A lot of them are front-end jobs, and when I say front-end and back-end, I don't mean front-end
and back-end in terms of storage, right?
I mean front-end and back-end in terms of the EDA design flow.
In the front-end workflow, you know, the jobs run only for two to three minutes, right?
A single job can run for less than a minute sometimes, maybe a few minutes.
And the back-end jobs can run for hours, right?
On the next slide here, I'll give you some more details
into the workflow itself.
But that's kind of the high-level idea.
Because of the deep and wide directory structure,
the workload tends to be namespace or metadata intensive.
And you'll see some of that factored
in as we define the workflow.
Any questions as I go along?
So here are the details in terms of the workflow.
You know, you will see a similar chart like this if you were
to Google for, you know, EDA chip design workflow.
But I'll step you through it quickly, and then I'll tie it into some of the storage characteristics.
I'll use a simple example of an adder.
Let's say you're trying to design an adder.
So that's your design specification, you know, it's that you want to design an adder.
Design capture: the design is captured in what's known as RTL. RTL stands for
register transfer level, I think, but in any case it's essentially a representation of your circuit
in a hardware description language, right? Either Verilog or VHDL.
It's tiny in terms of size, just as I was telling you on the previous slide.
It's essentially represented in a flat ASCII text file
in a high-level programming language, right?
After that, you know, you say, okay, I have described an adder. Then I'm going to make sure I actually
get that to functionally work as designed. So I would say 2 plus 2 and I anticipate
4; I would say 2 plus 3 and I anticipate 5. So it goes through a functional
verification stage. Then you synthesize the design. Synthesis could be either on the back end or the front end.
But think of synthesis as now bringing the design
to the gate level, right?
You're compiling your design.
And, you know, you're compiling your design against a given set
of standard libraries, standard cell libraries
that the foundries give you, like the TSMCs of the world.
These are the manufacturers of the chip, right?
So once you compile the design,
there is more due diligence to be done,
laws of physics kick in,
and you would need to do timing analysis, right?
If you had, you know, I simplified it and I said adder,
but if you had cascaded blocks in your complex chip,
then it's important that block A feeding into block B is not delayed.
So that's where timing analysis comes in.
Place and route is using a given die size to optimally place all your blocks for, you know, heat
characteristics as well as timing characteristics. Extraction refers to the
fact that even the interconnects matter, right? You know, in the
earlier days, like in the 90s, this wasn't as big of a deal, but as the
chips got smaller and smaller to nanometer geometries, even the wiring
in between the blocks matters, and the resistance or the capacitance that it
may induce could have an ill effect on your overall intended results, right, or expected results.
So there is a workflow known as extraction, or parasitic extraction.
You then do physical verification and sign off,
and then you eventually tape out the chip to the foundry, right?
So these are, at a high level, these are kind of the workflows.
You can see, you know, a lot of loops going back,
right? So it isn't necessarily sequential in nature. There's a lot of going back and forth from one
design team to another. You know, high-level folks like to think of it in a sequential manner, but
it's not necessarily sequential, right? And then if we zoom out even further, again,
we're going to stick with the front-end design
and the back-end design theme.
And from a storage perspective, it works out well
because the front-end design is where you're going to run
hundreds to thousands of jobs concurrently, right?
Because there's no such thing as a fully
verified chip, right? You do the best due diligence you can, which is why,
yes, yes, you are running against the same set of blocks and running different
corners, as they call it, right?
Which essentially just means, just as I was telling you,
if it was an adder, you would say 2 plus 2, 2 plus 3, 2 plus 4.
So those are different iterations that you're going to run through the same block, right?
And so that's, you know, you go through that type of a phase
which is what induces the high levels of concurrency.
And physical design is where, like for example, timing
analysis.
An output file for timing analysis
could be as large as 26 gig.
I've seen examples of output files of,
there's a specific application called
PrimeTime.
This application generates large files and so you, you know, essentially you have a mix
of small and large files.
Of course, the number of jobs that you run is much less in the back end design flows compared to the front end design flows.
You're looking at a magnitude of hundreds to thousands of jobs for front end design.
You're looking at a magnitude of tens to hundreds for the back end design flow to give you kind
of a perspective.
Clear so far in terms of how we're approaching this?
As you look at this, one of the things to keep in mind is
it's rather complex, there's many tools,
there's many workflows.
The way the data is allocated by the storage admin
doesn't take into account any of these things.
They essentially just say, okay, here's a chip underscore FE
for all your front end work.
Or chip underscore timing analysis or, you know, STA,
static timing analysis, right?
So, you know, there isn't a lot of due diligence done
at the time the storage is allocated.
And as a result, what the NAS is seeing is kind of a mix of all these activities.
And combine that with the fact that a house like Broadcom or a large semiconductor house
may be working on tens of chips simultaneously.
So you really have kind of a hodgepodge of things happening against the NAS.
So this is the way to think of it. This captures what I was trying to tell you.
Think of this as the EDA compute grid. The EDA compute grid consisting of tens of thousands of cores today.
Someone like Intel would have 50,000 cores, right?
It wouldn't be uncommon for someone of Intel scale
to have 50,000 cores in their HPC compute grid, right?
So start here, tool one through tool N.
As I alluded to earlier,
you could have a dozen-some tools.
Then each one of those tools,
if you have a certain type of an input to that tool,
you have a job,
and that job will have a certain type of I.O. characteristics.
So you couldn't necessarily even say,
I'm going to take application 1 through application 50 and look at the I.O. profiles for each application
because depending on what you input to the application, your output and your I.O. profile is going to be different.
So you have a large number of I.O. profiles that are generated per each job that you run.
Of course, there are some similarities, right, per tool.
Specifically, for example, I can say that if I look
at PrimeTime cases, PrimeTime tends
to be a read-intensive workflow.
If I look at VCS or NCSim,
it tends to be metadata intensive workflow.
There are a certain amount of similarities,
but the gating factor became that when I was a customer
of storage solutions, I would reach
out to the engineering communities and say,
hey, can you give me a test case?
And they would give me two or three test cases, and what used to happen is that if I based my decisions
on just those test cases, I wouldn't necessarily see any difference in
production, more often than not, even when comparing storage solution A to B to C, or a storage solution in general, because Nick led
in with a slide saying this is not just for NAS, right?
But so, you know, I wasn't really able to differentiate
through synthetic benchmarking or even
through specific test cases, which would just isolate
a profile here or a profile here or a profile here, right?
So I started exploring an alternative.
That alternative was, okay, let me go on the NAS side,
and let me see what the NAS is doing, and what the NAS is doing over time.
And let me try to understand what type of a profile a given NAS is seeing,
and what another NAS is seeing in my environment.
So I come from Broadcom, so I had access to a large number of different NAS systems, right?
So I did exactly that, and I started looking at the consolidated I.O. profile, and I said,
And in that journey, I wrote my own scripts.
I looked at SFS 2008,
and that got me closer than my own scripts.
And then eventually SFS 2014 in terms of some of the client-side implications,
some of the mixing of the workloads,
so on and so forth.
So that's how we got to where we are. And so let's look at some traces to try to come up with
this here, right? This is the goal here is to come up with the consolidated profile.
So let's look at one trace over time. And I actually had even more
complex charts that I took out, so I simplified this. In EDA, most of the
workflows are NFSv3, right? So you have over a dozen-some RPC calls. If you were
to condense that down, you condense it down to reads, writes, and others, right?
And so you can look at this one over time, and you can see, okay,
this particular NAS is doing a certain amount of read operations, a certain amount of write operations,
and a certain amount of metadata operations.
You can certainly see there are bursts of reads and bursts
of writes or bursts of metadata at least.
There are certain bursts of writes but you can't, you know,
in the scale of NFS ops, you can't really see it.
And those bursts are actually important
because that will skew the overall normalized profile that we formulate.
But the idea became that if I strike a line through here,
through the reads, through the writes, and through the middle on the metadata,
then I would have some type of an EDA
sustained average normalized profile over time.
And the larger the sample you take,
the better the results you get.
And that's kind of the whole premise around how the profile was formulated.
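To make the "strike a line through it" idea concrete, here is a minimal sketch of computing a sustained average normalized profile from per-interval op counters. The sample numbers and the three-bucket breakdown are hypothetical stand-ins for the real NAS counter data, not the actual analysis tooling:

```python
# Sketch: derive a normalized op-mix profile from per-interval counters.
# The sample numbers below are hypothetical; the real analysis used
# counters collected over long periods from production NAS systems.

samples = [  # NFS ops observed in each sampling interval
    {"read": 1200, "write": 1500, "metadata": 4300},
    {"read": 900,  "write": 2100, "metadata": 5000},
    {"read": 1500, "write": 1300, "metadata": 4200},
]

totals = {"read": 0, "write": 0, "metadata": 0}
for interval in samples:
    for op, count in interval.items():
        totals[op] += count

grand_total = sum(totals.values())
profile = {op: round(100 * count / grand_total, 1)
           for op, count in totals.items()}
print(profile)  # {'read': 16.4, 'write': 22.3, 'metadata': 61.4}
```

The bursts average out in the totals, and, as the talk notes, the larger the sample window, the closer this average tracks the sustained behavior.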
And if I improve a storage solution,
what it's going to do is it's going to shift this chart up.
It's going to give the ability of the storage system to be able to deliver
a greater amount of these type of operations.
The variation will not change, but the amount of reads, the amount of writes,
and the amount of metadata that you can do will be shifted up
through optimizations that you may perform, right? So with that in mind, let's
look at the largest sample that I collected. I looked at 20 NAS systems
that had been up for over 300 days. In aggregate, these 20 NAS systems had generated 1.84 trillion NFS operations.
And I said, okay, what does the spread of those operations look like?
So if you look at the entire NFS3 stack, you have this type of a spread,
39% getattr, 23% writes, 18% reads, 12% access, so on and so forth. If you were to
simplify that, you can say around 60% metadata, around 20% reads, around 25%
writes, right? That's one EDA customer. Then I said, okay, let me rinse and repeat
this exercise for over a dozen-some EDA customers.
Because essentially, they're all going through the same flow.
Right?
So I did exactly that.
And I said, okay, here's peer two.
Right?
Here's another customer.
41% getattr, 21% writes, 11% access, 8% lookup, so on and so forth. So the other chart was 59% metadata, this one 64%
metadata. 15% reads, 21% writes. Similar, right? And so keep doing this, and so the rest of the
slides are going to look similar, but there are data points from different EDA customers around the globe, right? So some are in Korea, some are in the US, some are in
Japan, so on and so forth, right? So here's another one: seven NAS systems, up 420
days, 188 billion NFS operations. Strikingly similar, right?
I think almost identical.
Of course, metadata is metadata, right?
So you look at getattrs here at 29%, over here it's 41%.
But when you think of it as metadata,
you know, it's 64% metadata. Of course, we factor in
the unique characteristics of the metadata when we formulate the profile. Here's peer four: 16 percent reads, 62 percent metadata, and the remainder writes. Peer five: 65 percent
metadata, 16 percent reads, 19 percent writes, so on and so forth, right? So pie
charts look pretty and they look nice, but what I did eventually is I tabularized
them, which is coming up next. But just to kind of, you know, reiterate the point and
emphasize the point: if you put peer one, which is a US EDA house, and peer five, which
is an EDA house based out in Korea, and you put them side by side and look at
how strikingly similar their profiles are, right? 39% getattr versus 38% getattr, 10% access versus
12% access, 19% writes versus 23% writes, so on and so forth. Now you're getting
into the noise-level difference of 5%, and you can start claiming that
it's 60 to 65 percent metadata, 15 to 20% reads, 20 to 25% writes, right? And so let's tabularize
them. There is also some uniqueness at certain EDA shops, which tend to blend
hardware engineering workflows with software engineering workflows. So what this means is that the chip design flow
that I described to you is explicitly just the hardware
engineering.
It's the chip design process.
Software is becoming more and more
of a necessity along with the hardware.
So a lot of these EDA shops have a lot of software builds as well.
When you factor that in, what it does is it increases the metadata reads.
So a lot of these metadata reads that are at 83%, 91%,
those are shops that have a
blend of hardware and software engineering workflows mixed on their
NAS system but the aim here for this profile that's being defined is
specifically for the hardware engineering only there is a software
build profile already right so what I'm highlighting are the shops that are
purely hardware engineering shops.
And you can kind of see the trend there around close to 60% metadata reads on all of those.
Around 15 to 20% reads and then about 25% writes.
So that's the aim of what we're trying to get to when we try to reproduce this through SFS 2014.
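The simplification from a full NFSv3 op spread down to the three-bucket view can be sketched like this. The op-to-bucket mapping and the counter values (shaped like the "peer 1" spread above) are illustrative assumptions, not the subcommittee's actual classification code:

```python
# Sketch: collapse an NFSv3 op-count spread into the three buckets used
# in the talk (reads / writes / metadata). Counter values are shaped
# like the "peer 1" spread but are illustrative only.

METADATA_OPS = {"getattr", "setattr", "lookup", "access", "readdir",
                "readdirplus", "fsstat", "fsinfo", "create", "remove", "rename"}

counters = {"getattr": 39, "write": 23, "read": 18, "access": 12, "lookup": 8}

buckets = {"read": 0, "write": 0, "metadata": 0}
for op, count in counters.items():
    if op in ("read", "write"):
        buckets[op] += count
    elif op in METADATA_OPS:
        buckets["metadata"] += count
    else:
        raise ValueError(f"unclassified op: {op}")

total = sum(buckets.values())
mix = {name: round(100 * value / total) for name, value in buckets.items()}
print(mix)  # {'read': 18, 'write': 23, 'metadata': 59}
```

The same collapse applied to each peer is what makes the roughly 60/20/25 pattern visible across otherwise different getattr/lookup/access splits.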
Any questions?
I'm going to hand it back to Nick.
Before I hand it back to Nick, any questions for me?
Yes?
What's the average file space that you're getting?
File space.
How much storage is it?
So like some of these samples were like the 1.84 trillion NFS operations
that I talked about on 20 NAS systems.
That was 640 terabytes.
Some of the other ones would have been hundreds of terabytes.
But there was one other analysis that I didn't share with you.
But irrespective of whether you take 200 data points or 2,000
data points, the net normalized average
of the mix of operations comes out to roughly this much.
Any other questions?
We saw a couple of presentations earlier. Go ahead, finish. So EDA is a subset of HPC.
If you look up HPC on Wikipedia,
there will be a nomenclature called embarrassingly parallel.
EDA fits into the embarrassingly parallel HPC category
in the sense that it isn't really truly HPC,
because the onus is left upon the end user to
divide, subdivide, sub-subdivide the task to have a job belong to only a core. So
there isn't really much MPI being done, there isn't a lot of parallel processing,
there aren't a lot of homegrown applications. There are only three
application providers that are the most dominant. Of course, there's a 10% market for others, but Synopsys, Cadence, and
Mentor Graphics are the three large application providers. And the way the
chip design workflow has been done, it's rightfully categorized as
embarrassingly parallel. But now they're running into this issue where
there is a need to actually have parallel computing come into the picture with the nanometer
geometries. So certain flows are starting to fork off into kind of multi-core, multi-host
type applications, but it's limited.
Any other questions?
If you do have questions, of course,
I'm still around after Nick.
Sorry.
Sorry, I'll turn it off.
There we go.
All right.
Hopefully we'll be in better shape then.
There we go.
Okay, no feedback.
Good.
All right.
So Jig has done a great job walking you through what the workload looks like.
I want to talk a little bit about the motivation.
Why do we want to make an EDA workload in SFS 2014?
And Jig touched on
a lot of these points.
The commonly used tools out
there, or the scripts that Jig wrote
himself, aren't doing a good enough
job of representing production EDA
workloads. And there are all these applications out there
these applications out there
even though there's only three vendors, there's
a lot of different applications for all the different job
types. So
there are all of these different workload profiles
out there. All these different
tools generating all these different workloads.
And so, you know,
you really want to produce, like Jig mentioned,
that combined workload that he has all these traces for.
You want to produce that combined workload from all the tools in aggregate.
And SFS 2014, I'm a little biased, but I think it's a pretty darn good tool for generating
workloads. As we implemented the workload definition for EDA in SFS 2014,
we stuck with the front-end versus back-end workflows.
The EDA workload actually just consists of two component workloads.
One is front-end and one is back-end.
This is because, like he mentioned, there's a lot of small files doing the random I.O.
versus much larger files where it's sequential I.O.
It makes sense to keep those two profiles separate.
If you averaged them out, you would get neither.
I'm going to throw two pie charts here at you.
This is the front end workload. You can
see here we have the reads and writes. The reads and writes are done as reading or writing
a whole file. This phase is a lot more like a software build where you would pick up that
file, that part of the chip, and you would incorporate it in something and then write
a different kind of file out.
So it's a lot more like a software build.
So the I.O. operations are all on whole files only.
And we also see all the metadata here.
You can see stat and access clearly stick out
as large components,
but there's a smattering of other types,
create, mkdir, unlink2,
we'll get to what unlink2
means, chmod, etc.
Another interesting point here, you can see the average file size is 8K, so the file size
distribution is centered at 8K, but it's a Gaussian distribution around that, well, fairly
Gaussian around that.
Obviously we're not making incredibly small one byte files. That would be a little
bit strange. Another thing to point out, the geometric is set to 50%. The finer details
of that are a much longer discussion, but that means that 50% of the time we're going
to follow a geometric distribution when selecting what file to use. Basically it means you're
going to get either, call it hotbanding, call it
file selection skew, just call it skew.
There are going to be files that are accessed a lot more than some other files.
There's still going to be a background workload where there's a
random distribution of files that could be chosen as well.
It's a layered approach to generating this sort of
skew, if you will.
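A minimal sketch of that layered selection, assuming a simple geometric draw for the "hot" half and a uniform draw for the background half; the function name and parameter values are illustrative, not the benchmark's actual implementation:

```python
import random
from collections import Counter

def pick_file(num_files, geometric_fraction=0.5, p=0.3, rng=random):
    """Layered file selection: skewed (geometric) half the time, uniform otherwise."""
    if rng.random() < geometric_fraction:
        idx = 0                      # geometric draw: low indices are "hot"
        while rng.random() >= p and idx < num_files - 1:
            idx += 1
        return idx
    return rng.randrange(num_files)  # uniform background selection

hist = Counter(pick_file(100) for _ in range(100_000))
print(hist[0] > hist[50])  # low-index "hot" files dominate
```

With `geometric_fraction` at 0.5, a few files see heavy traffic while every file still gets some background accesses, which is the layered skew described above.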
You can see the I.O. sizes are much smaller for
reads. We're somewhere between, well, one byte and almost 32K. Writes are even smaller
at the one byte to almost 8K range. That's all I'll point out. There's a lot more details
here and this will be in the user's guide when we release this workload.
And it's in the slides you have, of course.
The back-end workload, you can see here we're all sequential reads and writes because that's what these tools do.
It really is either sequentially reading in something and then sequentially writing out a result.
Here, again, we do have that same skew parameter set to the same parameter as the frontend
workload.
I thought I'd point that out.
You can see the I.O. sizes are much larger.
Reads here between 32 and 64K and the writes are between 32 and 128.
The files are much larger.
Average file size here is centered around 10 megabytes.
I will say everything is subject to change.
So the only thing I would add here is that
for both of the profiles, the read and write
transfer size distribution is actually based on traces,
actual traces from the NAS systems. Yeah, there were even more details
than we could show you in the time
we have.
Everything is based on traces
and we've been replaying and we're still testing
to make sure that the workloads that it has
produced matches what we see in the
traces we have.
I wasn't sure if I'd be able to do this, but I did.
Actually, this is not an
official SFS 2014 result by any means.
It's just an experiment.
You can see it's a generic Linux NAS server, and it's just example data.
But I took a generic Linux NAS server, CentOS 7.2, and some clients,
and I ran the EDA workload as we've defined it in the previous slides.
And you can see we get a fairly standard operations per second versus latency curve.
In this case we're SFS 2014 so we talk in terms of business metrics. You can see even
this generic Linux NAS server was managing a respectable 31,500 operations per second
at a decent latency. Obviously we can see we have a standard knee of the curve.
But things are looking well-behaved,
sort of reasonable NAS-type performance,
like you would expect.
And this was NFS v3.
So this is your typical EDA shop.
And just to be weird and confuse people,
I thought, well, hey, I have a Windows environment too
with a Windows NAS server.
Let's run it on that.
EDA with SMB3. What do you think, Jig?
Are we going to see that?
Regardless, I ran it
and you can see, overall, it achieved
a little less. It did have fewer disks
on the back end, but we got
a somewhat normal curve. You can see, obviously,
we have a very steep, very sharp knee,
but, again, we did wind up with
27,000
operations per second, again with a respectable latency.
So even with the workload in this pre-release state, we are seeing reasonable results coming
out of the benchmark.
So there may be a little fine tuning, but we're well on the way.
Currently, EDA is the anchor feature for the SP2 release of SFS 2014,
which is what we're working on right now
in the subcommittee.
This is going to be a performance-neutral release
for all the existing workloads.
This means that all the currently published
SFS 2014 results are going to remain valid
and comparable to any future results.
We're not up-revving the benchmark
in the major version.
Everything stays the same. Everything's performance
neutral. Otherwise, we would have to up-rev
it. We don't have a choice on that. It's a
standard benchmark.
Because of this, there's going to be no distinction
on the results page between SP1 and 2
results because there's no need.
The upgrade
is expected to be free for all current license holders.
In addition to EDA, we're also adding some other features.
We're constantly adding things to the workload generator because vendors use it for custom
workloads and their own regression testing.
We're adding some features that we thought were interesting and didn't affect the performance of the official workload.
Among those is an unlink2 op type, which is just an unlink that removes files that have data.
Previously, unlink only removed empty files.
We are also adding dedupable dataset options,
so the ability to generate a dataset that is dedupable.
Previously, all files were non-dedupable,
though you could set the compression ratio, and we did.
And because dedup is one of those things
where the implementations vary greatly by vendor,
we are adding granule size options so you can set that dedup granule size based on your
platform.
We're still working out all the details on what that means for standard workloads, but
the options will be there in this release to at least play with internally for engineering
purposes.
Because we gave you the option to set a dedup granule size, we obviously are giving you
the option to set a compression granule size, because if your compression granule size is
bigger than your dedup granule size, you will probably not get what you expect, as you may
be able to figure out.
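A toy model makes the granule-size interaction easy to see. The sketch below is plain Python, not SFS 2014 code; the generator and its names are illustrative assumptions. It builds a buffer where a target fraction of fixed-size granules duplicate earlier ones, then shows that a dedup engine scanning at a mismatched (coarser) granule finds far fewer duplicates.

```python
import random

def make_dedupable(n_granules, granule_size, dup_fraction, seed=0):
    """Build a dataset buffer where roughly `dup_fraction` of the
    granule-size chunks are copies of earlier chunks, so a dedup engine
    aligned to the same granule can reclaim about that fraction."""
    rng = random.Random(seed)
    chunks = []
    for _ in range(n_granules):
        if chunks and rng.random() < dup_fraction:
            chunks.append(rng.choice(chunks))           # duplicate an earlier granule
        else:
            chunks.append(rng.randbytes(granule_size))  # unique random data
    return b"".join(chunks)

def dedup_ratio(buf, granule_size):
    """Fraction of granules a fixed-granule dedup engine could reclaim."""
    chunks = [buf[i:i + granule_size] for i in range(0, len(buf), granule_size)]
    return 1 - len(set(chunks)) / len(chunks)

data = make_dedupable(n_granules=2000, granule_size=4096, dup_fraction=0.4)
aligned = dedup_ratio(data, 4096)  # near 0.4: granule sizes match
coarser = dedup_ratio(data, 8192)  # much lower: the granule mismatch hides duplicates
```

The same logic explains the compression caveat above: if the unit you compress over is bigger than the unit you dedup over, the byte patterns the dedup engine is looking for no longer line up.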
Another thing we added actually for the EDA workload was the ability to have different
numbers of files, numbers of directories, and file size distributions per component
workload.
We needed this because like we've been saying, the front-end and back-end workloads are very
different in terms of the files that they access.
And previously we didn't have an option to set these per workload, per component workload.
Now we do and we take advantage of that in EDA.
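To illustrate what "per component workload" means here, consider a sketch like the following. The parameter names and values are hypothetical, not the actual SFS 2014 configuration keys; the point is only that each component now carries its own dataset-shape parameters.

```python
# Hypothetical parameters, purely illustrative of the new capability;
# these are not the real SFS 2014 configuration keys or values.
EDA_COMPONENTS = {
    "frontend": {  # metadata-heavy: many small files
        "file_size_bytes": 8 * 1024,
        "files_per_dir":   200,
        "dir_count":       50,
    },
    "backend": {   # sequential-heavy: fewer, larger files
        "file_size_bytes": 10 * 1024 * 1024,
        "files_per_dir":   20,
        "dir_count":       5,
    },
}

def dataset_bytes(name):
    """Rough dataset footprint implied by one component's parameters."""
    p = EDA_COMPONENTS[name]
    return p["file_size_bytes"] * p["files_per_dir"] * p["dir_count"]

frontend_total = dataset_bytes("frontend")  # many small files, modest total
backend_total = dataset_bytes("backend")    # far fewer files, more bytes
```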
All the existing workloads of course are unaffected because they have to remain the same for comparability.
We are adding a flag so you could have an encrypted data set.
I'm going to leave it at that.
It's sort of nebulous, but it's an interesting encryption algorithm.
I believe it's one that Don Capps, the author of Netmist,
actually came up with on his own.
We're not using that in any workload right now,
and it's not going to be enabled for EDA,
but it is an interesting thing to test.
Also, we've reduced memory consumption. This is especially relevant for the software build workload. We just found optimizations and put them in
because why use memory when we don't have to? It makes testing a lot easier.
The key takeaway is, well, EDA certainly is a unique and interesting workload.
Initially you may think, well, it's just software build.
Who cares?
But EDA has a little bit more metadata than software build, maybe a lot more metadata
depending on your scale, and software build also does not have those large files with
that sequential workload that the back-end workload has.
Basically, the ability to mix workloads is the key here.
Yeah, having that mixed workload.
So it certainly is unique, and we have a very large number of traces,
like you saw with many, many ops from many different vendors across the globe available to us.
And that's a way to make a really nice workload.
We do plan to include the EDA workload in our next release of SFS 2014.
That release is expected to be a free upgrade for existing license holders and it will be
performance neutral, so all results remain valid for the existing workloads.
We're just adding a new one.
So that's all the slides we had.
I would like to take a moment and note that we do have some publications up.
Some vendors have submitted results for some of the SFS 2014 workloads,
so if you haven't had a chance to review those, I encourage you to.
They're
very interesting, especially you can compare GPFS to ZFS. That certainly was the goal when
we were making that benchmark to encourage stuff like that. Because it is the same workload
at the application level, of course. That's the whole point.
The other thing I'd like to mention is just a plug for SPEC.
If you work for a fairly large company,
chances are you're already a SPEC member,
so you are free to participate.
We encourage you to join the SPEC OSG organization,
where you can have input into these workloads,
and we love input.
You can help develop the benchmark,
and you get access to the latest and greatest
stuff for your internal engineering
testing.
The scary part is
we have to help you get it running.
We can do that if you buy it, too,
and I encourage that.
Also, if you're a customer of a large storage
company, I would say
ask them for their SFS 2014 results.
It is an industry standard benchmark for file access.
So the best way to get submissions
is to have customers ask for them.
So sales pitch over.
Any more questions?
BP, what's going on?
So what's the compressibility?
Sorry, how do I say this?
What is the data set compressibility for EDA?
So the question was, what is the data set compressibility for EDA?
And we could roll back.
I'm 99% sure that's set to 50% for both sides
and Jake can speak to the derivation of that.
Yeah, so the dedupability is about 40%
and the compressibility is about 50%.
About?
Yes, correct.
Right.
I hate abouts, by the way.
And you'll note I didn't actually
put any dedupability stuff
on here because that's still under test.
I know that I have the disclaimer
but I really don't want any of that.
We're still qualifying.
When we talk about "about,"
when we talk about reads or writes,
for example, reads are in a range of 15 to 20 percent,
and writes are in a range of 20 to 25 percent.
So similarly, when we talk about compressibility, it's like, hey, it's 49 percent, 52 percent,
53 percent, and so on and so forth, right?
So it's in that range, right?
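As a rough illustration of what "about 50 percent compressible" means, the toy Python sketch below (not benchmark code; the approach is an assumption for illustration) builds a buffer that is half zeros and half random bytes, and measures the achieved ratio with zlib.

```python
import random
import zlib

def make_compressible(size, target, seed=0):
    """Buffer whose compressed size is roughly (1 - target) * size:
    a zero run (highly compressible) followed by random bytes
    (essentially incompressible)."""
    rng = random.Random(seed)
    zero_len = int(size * target)
    return b"\x00" * zero_len + rng.randbytes(size - zero_len)

buf = make_compressible(64 * 1024, 0.5)
# Achieved compressibility lands near the 0.5 target.
achieved = 1 - len(zlib.compress(buf)) / len(buf)
```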
BP's pondering. I haven't seen it, but is somebody going to publish two numbers, one for the front-end workload and one for the back-end workload, or is this a combined workload?
You can only publish EDA as a whole.
The component workloads, you can't separate those out and publish them.
If you wanted to, when you have a copy of the benchmark,
you could separate them yourself and run them separately
if you wanted to.
It's a fun exercise.
But you could only publish EDA as a whole.
So you'd just be running both workloads at the same time?
Exactly.
Right, and that's how the EDA production houses would be exercising the storage.
What are the typical file sizes?
So, typical file sizes: 95-plus percent of the files are less than 100K,
and zero to five percent of the files are greater than 100K.
And I have an entire breakdown of the file size spectrum.
And so what we did is we actually mapped the Gaussian
distribution that SFS 2014 uses, and we figured
out where to center the front end
and where to center the back end to get
a file size distribution similar to
what we're seeing at EDA customers in terms
of the percentage breakdown.
And it actually boiled down to the point
where we needed to drill deeper
within the sub-8K files to say, okay,
exactly what percentage is near 1K
and what percentage is near 8K, and so on and so forth.
But that's factored in as these profiles are generated.
Yeah, it looks very simplistic with average file size 8k.
That's how it's defined, but we did actually look at the distributions of what it produces,
and they were very close.
Right.
So what you would do is you would run NetMist or SFS 2014 with the 8K distribution,
and you would say, okay, how did it lay that out?
And you would also compare that with the 10 meg on the back end,
and you would look at the entire data set,
and then you would compare that to the production data set.
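As a rough sanity check on those numbers, you can ask what spread a distribution with an 8 KiB median needs for about 95% of files to fall below 100 KiB. The lognormal shape and the sigma value below are assumptions for illustration; the benchmark uses its own Gaussian mapping, as described above.

```python
import math

def lognormal_cdf(x, median, sigma):
    """P(X <= x) for a lognormal with the given median and log-space sigma."""
    return 0.5 * (1 + math.erf(math.log(x / median) / (sigma * math.sqrt(2))))

# With a median of 8 KiB and a log-space spread of about 1.5, roughly 95%
# of files land below 100 KiB, consistent with the trace-derived breakdown.
frac_small = lognormal_cdf(100 * 1024, 8 * 1024, 1.5)
```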
Any other questions? Yeah. The workload goes through a file system, and the file system affects results. It does some caching, some additional operations, some reorganization. Doesn't that affect the result?
Will the result be comparable?
So the question is, since
SFS 2014 generates workload
at the application level,
it could go through different file systems
and won't that affect the results?
And will they be comparable?
And the answer
is yes, to all of those.
The main point we wanted to make, well, there are a couple things.
First of all, it's very difficult to get the development resources to write a piece of code that will generate every single protocol
and version out there right now,
especially as SMB has gotten so mature in the past several years.
You can barely keep up.
The Samba team can barely keep up,
and they're huge.
Well, not huge, but it's a lot of brain power
behind that team for sure.
And the other thing is, you know,
we really, it's sort of more interesting to us
to do what the applications would do
and let you
see, okay, if I'm running ZFS
this is going to happen. If I'm running
on GPFS, this is going to happen.
Because we could never have done GPFS
if we were trying to write a protocol
directly out onto the wire.
We would have had to write something that speaks InfiniBand.
I'm sure it's possible.
So
the end goal
is that the result you measure is the result you would get if you ran
that sort of application with that whole storage stack.
And so is it comparable from one vendor
to the other? Well, if I ran with the CentOS 6
on my array and then NetApp ran
with CentOS 7.2, there actually could be some differences there, and that is an annoying
case where it's like, well, okay, did Linux change, or is it a Dell EMC versus NetApp thing,
or something like that? We had to take the bad with the good of
testing that whole
application stack and getting people
the answer that they would really get
if they ran that application in that
whole environment.
It's a balancing act. It is a bit of a compromise.
I think from the testing I've done, I think
it's a pretty good compromise.
Just a comment.
I don't actually worry about it;
I just observe that, more than NFS client stack differences,
say, if you're running over NFS,
comparing elephants to oranges on different file system types,
the amount of memory in the client has a remarkably large effect,
where it had zero effect, for all intents and purposes, on SFS 2008,
which was unrealistic.
So lots of memory will basically provide massive caching,
when it can cache, for the applications,
and give you a better predictor of what an application might see
due to an investment in memory, or whatever.
So the observation was that, you know,
having tons and tons of memory in your client
versus very little memory in your client
will affect the results and inflate them
because of all the caching that's possible.
And yes, that's certainly true.
And we can't get around that.
But nobody in the real world can get around that.
Exactly. Nobody in the real world can get around that.
I think a good note to make
everyone feel a little bit better about that is that we do have
a disclosure form. The examples I showed were
this. This is the first... This
is like your first browser page. It goes on for many, many more. You have to disclose
what you used to generate the load, the characteristics of it, the amount of memory. The amount of
memory is actually totaled for the whole storage solution. If you have a terabyte of memory
in all your clients and you have
256 gigs in your storage array and that's what you're trying to sell, we're relying
on the customers to be smart and review these carefully or the media to do it or marketing
departments to point out what the other guy did. So your total memory would be some massive
amount of terabytes versus someone who used a more normal config or a smaller config anyway where they would have like a terabyte or something like that.
So I'll also put a little bit of a customer-side spin on it, right, an EDA end-user consumption spin on it. I actually used to model the EDA workload using SFS 2008.
And that gave me exactly what Brian's talking about in terms
of not having the client-side caching effect.
And I presented that to many EDA customers.
And they actually argued against that, saying that,
well, no, I do want my client-side caching behavior
to be factored in as I try to judge the performance of a storage system.
Because that helps me determine how I would size it.
So as you think about this workload specifically,
what you would want to do is size the clients
in your EDA storage solution similarly
to how EDA customers size the client memory
in their compute grids.
I think we're being politely told our time is up.
I appreciate you coming out, and all the questions, and be sure to grab your SPEC pen if you haven't
already.
Vern and I, at least, will be around all week, so feel free to ask any more questions.
We have one more session about Emerald,
so feel free to stop by or catch us before or after that.
Thank you.
Thank you.
Thanks for listening.
If you have questions about the material presented in this podcast, be sure and join our developers mailing list
by sending an email to developers-subscribe at snia.org. Thank you.