Storage Developer Conference - #50: Introducing the EDA Workload for the SPEC SFS Benchmark
Episode Date: July 10, 2017...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC
Podcast. Every week, the SDC Podcast presents important technical topics to the developer
community. Each episode is hand-selected by the SNIA Technical Council from the presentations
at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast.
You are listening to SDC Podcast Episode 50.
Today we hear from Nick Principe, Principal Software Engineer with EMC,
as he presents Introducing the EDA Workload for the SPEC SFS Benchmark from the 2016 Storage
Developer Conference.
So today I'm going to give a quick intro to SFS 2014 for any of you not familiar.
And then Jig will take over and talk about what exactly EDA is and the characteristics
of EDA workloads.
And he'll walk you through some trace data
from real EDA environments.
And I'll walk through how we defined the EDA workload
in the SFS 2014 benchmark.
And I do have some example data for you.
I wasn't sure I would.
The deck you can download right now
doesn't have the example data,
so we'll get that updated.
And then there are a few more enhancements
that will actually come into the SFS 2014 load generator
as part of the update that will include EDA.
So I'll cover those as well.
What is SPEC?
If you're not familiar, it's the Standard Performance Evaluation Corporation, which
is a nonprofit established to produce industry standard benchmarks and probably most famous
for SPEC CPU, I would say, but we have a whole suite of benchmarks for different products
and industry interests.
Quick disclaimer, we're going to be talking about SFS 2014 SP2 and the EDA workload,
which is part of that.
And right now this is pre-release software,
so everything in here is subject to change
because everything is still under review and test.
So a quick overview of SFS 2014.
It is an industry standard storage solution benchmark.
The whole lineage was an NFS server benchmark.
In 2008 we expanded to CIFS and NFS.
And with 2014 we expanded to testing whole storage solutions.
In 2014 we also branched out and added four workloads instead of just a workload per protocol.
So we started with the database workload, the software build workload, VDA and VDI and
now in SP2 we will be adding the EDA workload.
SFS 2014 measures performance at the application level.
So unlike previous versions, if you're familiar with that, it doesn't generate its own protocol
packets.
It actually goes through your operating system all the way to the final destination.
So it engages the whole storage stack from application to disk,
although if you have enough cache
somewhere in your storage stack,
it won't make it to disk.
That's the beauty of it.
But it does test your storage solution
all the way from what your application would say,
which we feel is much more relevant to customers
because they want to know
what their application performance will be.
And because we have this flexibility
and we're not generating our own packets on the wire,
we can test a much broader range of product configurations.
So you can think of this as regular hard disks,
multi-tier hybrid arrays, and all flash,
but you can also think cloud storage.
Someone just did a test
and actually got it running in Amazon,
believe it or not.
And we have published results now
and my colleague from
IBM has published
results on GPFS.
So this is not just a NAS benchmark
anymore.
You can test any fully featured file
system. And I sit here wondering after this
whole week, I'm like, hmm, so what about these ones that are almost POSIX? I bet some of
the workloads will run fine on that, but I haven't tested that. That would be cool.
So for a more extensive introduction to SFS 2014, we actually do have video recordings
on the spec.org link or the YouTube link,
and the slide decks are also available publicly from SNIA.
So there's one session that Spencer and I did from 2014
and one that Vernon and I did from 2015.
So I encourage you to view those if you weren't able to attend.
So now I'm going to hand it over to Jig,
and he's going to talk about exactly what EDA is.
Can you guys hear me?
So how many have heard of EDA?
Okay, so if you haven't heard of it, it's okay.
And the word or the acronym itself is not that important.
It loosely represents a large number of things, right?
The acronym itself stands for electronic design automation.
But the way the word EDA is used, it's used to characterize a group of companies as well as a number of tools.
So you can think of it as representing semiconductor companies.
You can think of it as representing companies that actually manufacture chips.
Or you can even think of it as companies that develop EDA tools or applications with it. So if we go line by line here, bullet by bullet, it represents software tools and workflows that are used in designing semiconductor chips. There are arguably over a hundred-some
tools, and it will easily take dozens of tools to design a chip from its specification to fabrication.
It's a compute-intensive process,
and the compute grid has evolved from just having
RISC-based processors to having thousands and thousands
of cores that are x86-based, right?
So it's a compute intensive process
and the amount of concurrency required
has increased over time as the technology node
has shrunk to nanometer geometries.
Storage is often the bottleneck as the compute grid
relies heavily on all data sets residing on NAS.
So it's a shared storage infrastructure.
All the nodes on the compute grid are uniformly configured
to access all the data sets such as projects, scratch, tools,
home directories, so on and so forth.
So more often than not, you know, it's the
file system and the protocol stack that becomes the bottleneck. If you were to
look at the data set characteristics, the data set consists of millions of small
files. This is because if you think about the EDA chip design flow, somebody decided that they're going to represent a circuit
as a flat ASCII text file
and represent another circuit as another file,
another circuit as another file, so on and so forth,
and create a POSIX directory hierarchy
to define a block and bigger block
and then ultimately a chip, right?
So it is the unstructured nature of the directory hierarchy
and the way the chip is defined that results in millions to billions of small files.
There is a small percentage of large files.
The large files can be as large as hundreds of gigs, right? The end result is stored in a standard format called GDS, or GDSII, and that GDSII file
represents your image of the chip that you send out to the foundry to be manufactured. And so there is a small percentage of large files.
The characteristics of the I.O. are mixed, random, and sequential, right?
So as you go through various design phases,
you have both sequential I.O. for the larger files
and random I.O. for the overall smaller files.
What we did is we divided up the design phases into two high-level design phases: front-end
design and back-end design. This is how it's actually carried out and known in
the industry as well. And what we're going to do is represent
the workload and the storage characteristics of the workload by
associating them with these high-level design phases as well.
The front-end design phase is where you have millions of small files.
The back-end design phase is where you get the larger files.
The front-end design phase does generate a lot of transient data known as scratch data.
And the scratch data in the EDA space, unlike HPC, is actually stored on a NAS.
So in traditional HPC, what you would find is that the traditional HPC compute grid will
have some kind of interconnect and some kind of a distributed file system across
it. But traditional EDA is, you know, all the data, including the scratch
data, is on a NAS system. You have lots and lots of jobs running concurrently.
I've heard numbers anywhere from hundreds of thousands to millions of
jobs per day. A lot of them are front-end jobs, and when I say front-end and back-end, I don't mean front-end
and back-end in terms of storage, right?
I mean front-end and back-end in terms of the EDA design flow.
In the front-end workflow, you know, the jobs run only for two to three minutes, right?
A single job can run for less than a minute sometimes, maybe a few minutes.
And the back-end jobs can run for hours, right?
On the next slide here, I'll give you some more details
into the workflow itself.
But that's kind of the high-level idea.
Because of the deep and wide directory structure,
the workload tends to be namespace or metadata intensive.
And you'll see some of that factored
in as we define the workflow.
Any questions as I go along?
So here are the details in terms of the workflow.
You know, you will see a similar chart like this if you were
to Google for, you know, EDA chip design workflow.
But I'll step you through it quickly, and then I'll tie it into some of the storage characteristics.
I'll use a simple example of an adder.
Let's say you're trying to design an adder.
So that's your design specification, you know, it's that you want to design an adder.
Design capture: the design is captured in what's known as RTL. RTL stands for
register transfer level, I think, but in any case it's essentially a representation of your circuit
in a hardware description language, right? Either Verilog or VHDL.
It's tiny in terms of size, just as I was telling you on the previous slide.
It's essentially represented in a flat ASCII text file
in a high-level programming language, right?
After that, you know, you say, okay, I have described an adder. Then I'm going to make sure I actually
get that to functionally work as designed. So I would say 2 plus 2 and I anticipate
4; I would say 2 plus 3 and I anticipate 5. So it goes through a functional
verification stage. Then you synthesize the design. Synthesis could be either on the back end or the front end.
But think of synthesis as now bringing the design
to the gate level, right?
You're compiling your design.
And, you know, you're compiling your design against a given set
of standard libraries, standard cell libraries
that the foundries give you, like the TSMCs of the world.
These are the manufacturers of the chip, right?
So once you compile the design,
there is more due diligence to be done,
laws of physics kick in,
and you would need to do timing analysis, right?
If you had, you know, I simplified it and I said adder,
but if you had cascaded blocks in your complex chip,
then it's important that block A feeding into block B is not delayed.
So that's where timing analysis comes in.
Place and route is using a given die size to optimally place all your blocks for, you know, heat
characteristics as well as timing characteristics. Extraction refers to the
fact that even the interconnects matter, right? You know, in the
earlier days, like in the 90s, this wasn't as big of a deal, but as the
chips got smaller and smaller to nanometer geometries, even the wiring
in between the blocks matters, and the resistance or the capacitance that it
may induce could have an ill effect on your overall intended results, right, or expected results.
So there is a workflow known as extraction, or parasitic extraction.
You then do physical verification and sign off,
and then you eventually tape out the chip to the foundry, right?
So these are, at a high level, these are kind of the workflows.
You can see, you know, a lot of loops going back,
right? So it isn't necessarily sequential in nature. There's a lot of going back and forth from one
design team to another. You know, high-level folks like to think of it in a sequential manner, but
it's not necessarily sequential, right? And then if we zoom out even further, again,
we're going to stick with the front-end design
and the back-end design theme.
And from a storage perspective, it works out well
because the front-end design is where you're going to run
hundreds to thousands of jobs concurrently, right?
Because there's no such thing as a fully
verified chip, right? You do the best due diligence you can, which is why,
yes, yes, you are running against the same set of blocks and running different
corners, as they call it, right?
Which essentially just means, just as I was telling you,
if it was an adder, you would say 2 plus 2, 2 plus 3, 2 plus 4.
So those are different iterations that you're going to run through the same block, right?
And so that's, you know, you go through that type of a phase
which is what induces the high levels of concurrency.
And physical design is where, like for example, timing
analysis.
An output file for timing analysis
could be as large as 26 gig.
I've seen examples of output files of,
there's a specific application called
PrimeTime.
This application generates large files and so you, you know, essentially you have a mix
of small and large files.
Of course, the number of jobs that you run is much less in the back end design flows compared to the front end design flows.
You're looking at a magnitude of hundreds to thousands of jobs for front end design.
You're looking at a magnitude of tens to hundreds for the back end design flow to give you kind
of a perspective.
Clear so far in terms of how we're approaching this?
As you look at this, one of the things to keep in mind is
it's rather complex, there's many tools,
there's many workflows.
The way the data is allocated by the storage admin
doesn't take into account any of these things.
They essentially just say, okay, here's a chip underscore FE
for all your front end work.
Or chip underscore timing analysis or, you know, STA,
static timing analysis, right?
So, you know, there isn't a lot of due diligence done
at the time the storage is allocated.
And as a result, what the NAS is seeing is kind of a mix of all these activities.
And combine that with the fact that a house like Broadcom or a large semiconductor house
may be working on tens of chips simultaneously.
So you really have kind of a hodgepodge of things happening against the NAS.
So this is the way to think of it. This captures what I was trying to tell you.
Think of this as the EDA compute grid. The EDA compute grid consisting of tens of thousands of cores today.
Someone like Intel would have 50,000 cores, right?
It wouldn't be uncommon for someone of Intel scale
to have 50,000 cores in their HPC compute grid, right?
So start here, tool one through tool N.
As I alluded to earlier,
you could have a dozen-some tools.
Then each one of those tools,
if you have a certain type of an input to that tool,
you have a job,
and that job will have a certain type of I.O. characteristics.
So you couldn't necessarily even say,
I'm going to take application 1 through application 50 and look at the I.O. profiles for each application
because depending on what you input to the application, your output and your I.O. profile is going to be different.
So you have a large number of I.O. profiles that are generated per each job that you run.
Of course, there are some similarities, right, per tool.
Specifically, for example, I can say that if I look
at PrimeTime cases, PrimeTime tends
to be a read-intensive workflow.
If I look at VCS or NCSim,
it tends to be metadata intensive workflow.
There are a certain amount of similarities,
but the gating factor became that when I was a customer
of storage solutions, I would reach
out to the engineering communities and say,
hey, can you give me a test case?
And they would give me two or three test cases, and what used to happen is that if I based my decisions
on just those test cases, I wouldn't necessarily see any difference in
production, more often than not, even when comparing storage solution A to B to C, or a storage solution in general, because Nick led
in with a slide saying this is not just for NAS, right?
But so, you know, I wasn't really able to differentiate
through synthetic benchmarking or even
through specific test cases, which would just isolate
a profile here or a profile here or a profile here, right?
So I started exploring an alternative.
That alternative was, okay, let me go on the NAS side,
and let me see what the NAS is doing, and what the NAS is doing over time.
And let me try to understand what type of a profile a given NAS is seeing,
and what another NAS is seeing in my environment.
So I come from Broadcom, so I had access to a large number of different NAS systems, right?
So I did exactly that, and I started looking at the consolidated I.O. profile, and I said,
And in that journey, I wrote my own scripts.
I looked at SFS 2008,
and that got me closer than my own scripts.
And then eventually SFS 2014 in terms of some of the client-side implications,
some of the mixing of the workloads,
so on and so forth.
So that's how we got to where we are. And so let's look at some traces to try to come up with
this here, right? This is the goal here is to come up with the consolidated profile.
So let's look at one trace over time. And I actually had even more
complex charts that I took out, so I simplified this. In EDA, most of the
workflows are NFSv3, right? So you have over a dozen-some RPC calls. If you were
to condense that down, you condense it down to reads, writes, and others, right?
And so you can look at this one over time, and you can see, okay,
this particular NAS is doing a certain amount of read operations, a certain amount of write operations,
and a certain amount of metadata operations.
You can certainly see there are bursts of reads and bursts
of writes or bursts of metadata at least.
There are certain bursts of writes but you can't, you know,
in the scale of NFS ops, you can't really see it.
And those bursts are actually important
because that will skew the overall normalized profile that we formulate.
But the idea became that if I strike a line through here,
through the reads, through the writes, and through the middle on the metadata,
then I would have some type of an EDA
sustained average normalized profile over time.
And the larger the sample you take,
the better the results you get.
And that's kind of the whole premise around how the profile was formulated.
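To make the "strike a line through it" idea concrete, here is a minimal sketch of computing a sustained average normalized profile from per-interval op counters. The sample numbers and the three-bucket breakdown are hypothetical stand-ins for the real NAS counter data, not the actual analysis tooling:

```python
# Sketch: derive a normalized op-mix profile from per-interval counters.
# The sample numbers below are hypothetical; the real analysis used
# counters collected over long periods from production NAS systems.

samples = [  # NFS ops observed in each sampling interval
    {"read": 1200, "write": 1500, "metadata": 4300},
    {"read": 900,  "write": 2100, "metadata": 5000},
    {"read": 1500, "write": 1300, "metadata": 4200},
]

totals = {"read": 0, "write": 0, "metadata": 0}
for interval in samples:
    for op, count in interval.items():
        totals[op] += count

grand_total = sum(totals.values())
profile = {op: round(100 * count / grand_total, 1)
           for op, count in totals.items()}
print(profile)  # {'read': 16.4, 'write': 22.3, 'metadata': 61.4}
```

The bursts average out in the totals, and, as the talk notes, the larger the sample window, the closer this average tracks the sustained behavior.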
And if I improve a storage solution,
what it's going to do is it's going to shift this chart up.
It's going to give the ability of the storage system to be able to deliver
a greater amount of these type of operations.
The variation will not change, but the amount of reads, the amount of writes,
and the amount of metadata that you can do will be shifted up
through optimizations that you may perform, right? So with that in mind, let's
look at the largest sample that I collected. I looked at 20 NAS systems
that had been up for over 300 days. In aggregate, these 20 NAS systems had generated 1.84 trillion NFS operations.
And I said, okay, what does the spread of those operations look like?
So if you look at the entire NFS3 stack, you have this type of a spread,
39% getattr, 23% writes, 18% reads, 12% access, so on and so forth. If you were to
simplify that, you can say around 60% metadata, around 20% reads, around 25%
writes, right? That's one EDA customer. Then I said, okay, let me rinse and repeat
this exercise for over a dozen-some EDA customers.
Because essentially, they're all going through the same flow.
Right?
So I did exactly that.
And I said, okay, here's peer two.
Right?
Here's another customer.
41% getattr, 21% writes, 11% access, 8% lookup, so on and so forth. So the other chart was 59% metadata, this one 64%
metadata. 15% reads, 21% writes. Similar, right? And so keep doing this, and so the rest of the
slides are going to look similar, but there are data points from different EDA customers around the globe, right? So some are in Korea, some are in the US, some are in
Japan, so on and so forth, right? So here's another one: seven NAS systems, up 420
days, 188 billion NFS operations. Strikingly similar, right?
I think almost identical.
Of course, metadata is metadata, right?
So you look at getattrs here at 29%, over here it's 41%.
But when you think of it as metadata,
you know, it's 64% metadata. Of course, we factor in
the unique characteristics of the metadata when we formulate the profile. Here's peer four: 16 percent reads, 62 percent metadata, and the remainder writes. Peer five: 65 percent
metadata, 16 percent reads, 19 percent writes, so on and so forth, right? So pie
charts look pretty and they look nice, but what I did eventually is I tabularized
them, which is coming up next. But just to kind of, you know, reiterate the point and
emphasize the point: if you put peer one, which is a US EDA house, and peer five, which
is an EDA house based out in Korea, and you put them side by side and look at
how strikingly similar their profiles are, right? 39% getattr versus 38% getattr, 10% access versus
12% access, 19% writes versus 23% writes, so on and so forth. Now you're getting
into the noise-level difference of 5%, and you can start claiming that
it's 60 to 65 percent metadata, 15 to 20% reads, 20 to 25% writes, right? And so let's tabularize
them. There is also some uniqueness at certain EDA shops, which tend to blend
hardware engineering workflows with software engineering workflows. So what this means is that the chip design flow
that I described to you is explicitly just the hardware
engineering.
It's the chip design process.
Software is becoming more and more
of a necessity along with the hardware.
So a lot of these EDA shops have a lot of software builds as well.
When you factor that in, what it does is it increases the metadata reads.
So a lot of these metadata reads that are at 83%, 91%,
those are shops that have a
blend of hardware and software engineering workflows mixed on their
NAS system but the aim here for this profile that's being defined is
specifically for the hardware engineering only there is a software
build profile already right so what I'm highlighting are the shops that are
purely hardware engineering shops.
And you can kind of see the trend there around close to 60% metadata reads on all of those.
Around 15 to 20% reads and then about 25% writes.
So that's the aim of what we're trying to get to when we try to reproduce this through SFS 2014.
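The simplification from a full NFSv3 op spread down to the three-bucket view can be sketched like this. The op-to-bucket mapping and the counter values (shaped like the "peer 1" spread above) are illustrative assumptions, not the subcommittee's actual classification code:

```python
# Sketch: collapse an NFSv3 op-count spread into the three buckets used
# in the talk (reads / writes / metadata). Counter values are shaped
# like the "peer 1" spread but are illustrative only.

METADATA_OPS = {"getattr", "setattr", "lookup", "access", "readdir",
                "readdirplus", "fsstat", "fsinfo", "create", "remove", "rename"}

counters = {"getattr": 39, "write": 23, "read": 18, "access": 12, "lookup": 8}

buckets = {"read": 0, "write": 0, "metadata": 0}
for op, count in counters.items():
    if op in ("read", "write"):
        buckets[op] += count
    elif op in METADATA_OPS:
        buckets["metadata"] += count
    else:
        raise ValueError(f"unclassified op: {op}")

total = sum(buckets.values())
mix = {name: round(100 * value / total) for name, value in buckets.items()}
print(mix)  # {'read': 18, 'write': 23, 'metadata': 59}
```

The same collapse applied to each peer is what makes the roughly 60/20/25 pattern visible across otherwise different getattr/lookup/access splits.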
Any questions?
I'm going to hand it back to Nick.
Before I hand it back to Nick, any questions for me?
Yes?
What's the average file space that you're getting?
File space.
How much storage is it?
So like some of these samples were like the 1.84 trillion NFS operations
that I talked about on 20 NAS systems.
That was 640 terabytes.
Some of the other ones would have been hundreds of terabytes.
But there was one other analysis that I didn't share with you.
But irrespective of whether you take 200 data points or 2,000
data points, the net normalized average
of the mix of operations comes out to roughly this much.
Any other questions?
We saw a couple of presentations earlier. Go ahead, finish. So EDA is a subset of HPC.
If you look up HPC on Wikipedia,
there will be a nomenclature called embarrassingly parallel.
EDA fits into the embarrassingly parallel HPC category
in the sense that it isn't really truly HPC,
because the onus is left upon the end user to
divide, subdivide, sub-subdivide the task to have a job belong to only a core. So
there isn't really much MPI being done, there isn't a lot of parallel processing,
there aren't a lot of homegrown applications. There are only three
application providers that are the most dominant. Of course, there's a 10% market for others, but Synopsys, Cadence, and
Mentor Graphics are the three large application providers. And the way the
chip design workflow has been done, it's rightfully categorized as
embarrassingly parallel. But now they're running into this issue where
there is a need to actually have parallel computing come into the picture with the nanometer
geometries. So certain flows are starting to fork off into kind of multi-core, multi-host
type applications, but it's limited.
Any other questions?
If you do have questions, of course,
I'm still around after Nick.
Sorry.
Sorry, I'll turn it off.
There we go.
All right.
Hopefully we'll be in better shape then.
There we go.
Okay, no feedback.
Good.
All right.
So Jig has done a great job walking you through what the workload looks like.
I want to talk a little bit about the motivation.
Why do we want to make an EDA workload in SFS 2014?
And Jig touched on
a lot of these points.
The commonly used tools out
there, or the scripts that Jig wrote
himself, aren't doing a good enough
job of representing production EDA
workloads. And there are all these applications out there
these applications out there
even though there's only three vendors, there's
a lot of different applications for all the different job
types. So
there are all of these different workload profiles
out there. All these different
tools generating all these different workloads.
And so, you know,
you really want to produce, like Jig mentioned,
that combined workload that he has all these traces for.
You want to produce that combined workload from all the tools in aggregate.
And SFS 2014, I'm a little biased, but I think it's a pretty darn good tool for generating
workloads. As we implemented the workload definition for EDA in SFS 2014,
we stuck with the front-end versus back-end workflows.
The EDA workload actually just consists of two component workloads.
One is front-end and one is back-end.
This is because, like he mentioned, there's a lot of small files doing the random I.O.
versus much larger files where it's sequential I.O.
It makes sense to keep those two profiles separate.
If you averaged them out, you would get neither.
I'm going to throw two pie charts here at you.
This is the front end workload. You can
see here we have the reads and writes. The reads and writes are done as reading or writing
a whole file. This phase is a lot more like a software build where you would pick up that
file, that part of the chip, and you would incorporate it in something and then write
a different kind of file out.
So it's a lot more like a software build.
So the I.O. operations are all on whole files only.
And we also see all the metadata here.
You can see stat and access clearly stick out
as large components,
but there's a smattering of other types,
create, mkdir, unlink2,
we'll get to what unlink2
means, chmod, etc.
Another interesting point here, you can see the average file size is 8K, so the file size
distribution is centered at 8K, but it's a Gaussian distribution around that, well, fairly
Gaussian around that.
Obviously we're not making incredibly small one byte files. That would be a little
bit strange. Another thing to point out, the geometric is set to 50%. The finer details
of that are a much longer discussion, but that means that 50% of the time we're going
to follow a geometric distribution when selecting what file to use. Basically it means you're
going to get either, call it hotbanding, call it
file selection skew, just call it skew.
There are going to be files that are accessed a lot more than some other files.
There's still going to be a background workload where there's a
random distribution of files that could be chosen as well.
It's a layered approach to generating this sort of
skew, if you will.
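A minimal sketch of that layered selection, assuming a simple geometric draw for the "hot" half and a uniform draw for the background half; the function name and parameter values are illustrative, not the benchmark's actual implementation:

```python
import random
from collections import Counter

def pick_file(num_files, geometric_fraction=0.5, p=0.3, rng=random):
    """Layered file selection: skewed (geometric) half the time, uniform otherwise."""
    if rng.random() < geometric_fraction:
        idx = 0                      # geometric draw: low indices are "hot"
        while rng.random() >= p and idx < num_files - 1:
            idx += 1
        return idx
    return rng.randrange(num_files)  # uniform background selection

hist = Counter(pick_file(100) for _ in range(100_000))
print(hist[0] > hist[50])  # low-index "hot" files dominate
```

With `geometric_fraction` at 0.5, a few files see heavy traffic while every file still gets some background accesses, which is the layered skew described above.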
You can see the I.O. sizes are much smaller for
reads. We're somewhere between, well, one byte and almost 32K. Writes are even smaller
at the one byte to almost 8K range. That's all I'll point out. There's a lot more details
here and this will be in the user's guide when we release this workload.
And it's in the slides you have, of course.
The back-end workload, you can see here we're all sequential reads and writes because that's what these tools do.
It really is either sequentially reading in something and then sequentially writing out a result.
Here, again, we do have that same skew parameter set to the same parameter as the frontend
workload.
I thought I'd point that out.
You can see the I.O. sizes are much larger.
Reads here between 32 and 64K and the writes are between 32 and 128.
The files are much larger.
Average file size here is centered around 10 megabytes.
I will say everything is subject to change.
So the only thing I would add here is that
for both of the profiles, the read and write
transfer size distribution is actually based on traces,
actual traces from the NAS systems. Yeah, there were even more details
than we could show you in the time
we have.
Everything is based on traces
and we've been replaying and we're still testing
to make sure that the workloads that it has
produced matches what we see in the
traces we have.
I wasn't sure if I'd be able to do this, but I did.
Actually, this is not an
official SFS 2014 result by any means.
It's just an experiment.
You can see it's a generic Linux NAS server, and it's just example data.
But I took a generic Linux NAS server, CentOS 7.2, and some clients,
and I ran the EDA workload as we've defined it in the previous slides.
And you can see we get a fairly standard operations per second versus latency curve.
In this case we're SFS 2014 so we talk in terms of business metrics. You can see even
this generic Linux NAS server was managing a respectable 31,500 operations per second
at a decent latency. Obviously we can see we have a standard knee of the curve.
But things are looking well-behaved,
sort of reasonable NAS-type performance,
like you would expect.
And this was NFS v3.
So this is your typical EDA shop.
And just to be weird and confuse people,
I thought, well, hey, I have a Windows environment too
with a Windows NAS server.
Let's run it on that.
EDA with SMB3. What do you think, Jig?
Are we going to see that?
Regardless, I ran it
and you can see, overall, it achieved
a little less. It did have fewer disks
on the back end, but we got
a somewhat normal curve. You can see, obviously,
we have a very steep, very sharp knee,
but, again, we did wind up with
27,000
operations per second, again with a respectable latency.
So even with the workload in this pre-release state, we are seeing reasonable results coming
out of the benchmark.
So there may be a little fine tuning, but we're well on the way.
Currently, EDA is the anchor feature for the SP2 release of SFS 2014,
which is what we're working on right now
in the subcommittee.
This is going to be a performance-neutral release
for all the existing workloads.
This means that all the currently published
SFS 2014 results are going to remain valid
and comparable to any future results.
We're not up-revving the benchmark
in the major version.
Everything stays the same. Everything's performance
neutral. Otherwise, we would have to up-rev
it. We don't have a choice on that. It's a
standard benchmark.
Because of this, there's going to be no distinction
on the results page between SP1 and 2
results because there's no need.
The upgrade
is expected to be free for all current license holders.
In addition to EDA, we're also adding some other features.
We're constantly adding things to the workload generator because vendors use it for custom
workloads and their own regression testing.
We're adding some features that we thought were interesting and didn't affect the performance of the official workload.
Among those is an unlink2 op type, which is just an unlink that removes files that have data.
Previously, unlink only removed empty files.
We are also adding dedupable dataset options,
so the ability to generate a dataset that is dedupable.
Previously, all files were non-dedupable,
though you could set the compression ratio, and we did.
And because dedup is one of those things
where the implementations vary greatly by vendor,
we are adding granule size options so you can set that dedup granule size based on your
platform.
We're still working out all the details on what that means for standard workloads, but
the options will be there in this release to at least play with internally for engineering
purposes.
Because we gave you the option to set a dedup granule size, we obviously are giving you
the option to set a compression granule size, because if your compression granule size is
bigger than your dedup granule size, you will probably not get what you expect, as you may
be able to figure out.
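A toy model makes the granule-size interaction easy to see. The sketch below is plain Python, not SFS 2014 code; the generator and its names are illustrative assumptions. It builds a buffer where a target fraction of fixed-size granules duplicate earlier ones, then shows that a dedup engine scanning at a mismatched (coarser) granule finds far fewer duplicates.

```python
import random

def make_dedupable(n_granules, granule_size, dup_fraction, seed=0):
    """Build a dataset buffer where roughly `dup_fraction` of the
    granule-size chunks are copies of earlier chunks, so a dedup engine
    aligned to the same granule can reclaim about that fraction."""
    rng = random.Random(seed)
    chunks = []
    for _ in range(n_granules):
        if chunks and rng.random() < dup_fraction:
            chunks.append(rng.choice(chunks))           # duplicate an earlier granule
        else:
            chunks.append(rng.randbytes(granule_size))  # unique random data
    return b"".join(chunks)

def dedup_ratio(buf, granule_size):
    """Fraction of granules a fixed-granule dedup engine could reclaim."""
    chunks = [buf[i:i + granule_size] for i in range(0, len(buf), granule_size)]
    return 1 - len(set(chunks)) / len(chunks)

data = make_dedupable(n_granules=2000, granule_size=4096, dup_fraction=0.4)
aligned = dedup_ratio(data, 4096)  # near 0.4: granule sizes match
coarser = dedup_ratio(data, 8192)  # much lower: the granule mismatch hides duplicates
```

The same logic explains the compression caveat above: if the unit you compress over is bigger than the unit you dedup over, the byte patterns the dedup engine is looking for no longer line up.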
Another thing we added actually for the EDA workload was the ability to have different
numbers of files, numbers of directories, and file size distributions per component
workload.
We needed this because like we've been saying, the front-end and back-end workloads are very
different in terms of the files that they access.
And previously we didn't have an option to set these per workload, per component workload.
Now we do and we take advantage of that in EDA.
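To illustrate what "per component workload" means here, consider a sketch like the following. The parameter names and values are hypothetical, not the actual SFS 2014 configuration keys; the point is only that each component now carries its own dataset-shape parameters.

```python
# Hypothetical parameters, purely illustrative of the new capability;
# these are not the real SFS 2014 configuration keys or values.
EDA_COMPONENTS = {
    "frontend": {  # metadata-heavy: many small files
        "file_size_bytes": 8 * 1024,
        "files_per_dir":   200,
        "dir_count":       50,
    },
    "backend": {   # sequential-heavy: fewer, larger files
        "file_size_bytes": 10 * 1024 * 1024,
        "files_per_dir":   20,
        "dir_count":       5,
    },
}

def dataset_bytes(name):
    """Rough dataset footprint implied by one component's parameters."""
    p = EDA_COMPONENTS[name]
    return p["file_size_bytes"] * p["files_per_dir"] * p["dir_count"]

frontend_total = dataset_bytes("frontend")  # many small files, modest total
backend_total = dataset_bytes("backend")    # far fewer files, more bytes
```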
All the existing workloads of course are unaffected because they have to remain the same for comparability.
We are adding a flag so you could have an encrypted data set.
I'm going to leave it at that.
It's sort of nebulous, but it's an interesting encryption algorithm.
I believe it's one that Don Capps, the author of Netmist,
actually came up with on his own.
We're not using that in any workload right now,
and it's not going to be enabled for EDA,
but it is an interesting thing to test.
Also, we've reduced memory consumption. This is especially relevant for the software build workload. We just found optimizations and put them in
because why use memory when we don't have to? It makes testing a lot easier.
The key takeaway is, well, EDA certainly is a unique and interesting workload.
Initially you may think, well, it's just software build.
Who cares?
But EDA has a little bit more metadata than software build, maybe a lot more metadata
depending on your scale, and software build also does not have those large files with
that sequential workload that the back-end workload has.
Basically, the ability to mix workloads is the key here.
Yeah, having that mixed workload.
So it certainly is unique, and we have a very large number of traces,
like you saw with many, many ops from many different vendors across the globe available to us.
And that's a way to make a really nice workload.
We do plan to include the EDA workload in our next release of SFS 2014.
That release is expected to be a free upgrade for existing license holders and it will be
performance neutral, so all results remain valid for the existing workloads.
We're just adding a new one.
So that's all the slides we had.
I would like to take a moment and note that we do have some publications up.
Some vendors have submitted results for some of the SFS 2014 workloads,
so if you haven't had a chance to review those, I encourage you to.
They're
very interesting, especially you can compare GPFS to ZFS. That certainly was the goal when
we were making that benchmark to encourage stuff like that. Because it is the same workload
at the application level, of course. That's the whole point.
The other thing I'd like to mention is just a plug for SPEC.
If you work for a fairly large company,
chances are you're already a SPEC member,
so you are free to participate.
We encourage you to join the SPEC OSG organization,
where you can have input into these workloads,
and we love input.
You can help develop the benchmark,
and you get access to the latest and greatest
stuff for your internal engineering
testing.
The scary part is
we have to help you get it running.
We can do that if you buy it, too,
and I encourage that.
Also, if you're a customer of a large storage
company, I would say
ask them for their SFS 2014 results.
It is an industry standard benchmark for file access.
So the best way to get submissions
is to have customers ask for them.
So sales pitch over.
Any more questions?
BP, what's going on?
So what's the compressibility?
Sorry, how do I say this?
What is the data set compressibility for EDA?
So the question was, what is the data set compressibility for EDA?
And we could roll back.
I'm 99% sure that's set to 50% for both sides
and Jake can speak to the derivation of that.
Yeah, so the dedupability is about 40%
and the compressibility is about 50%.
About?
Yes, correct.
Right.
I hate abouts, by the way.
And you'll note I didn't actually
put any dedupability stuff
on here because that's still under test.
I know that I have the disclaimer
but I really don't want any of that.
We're still qualifying.
When we talk about "about,"
when we talk about reads or writes,
for example, reads are in a range of 15 to 20 percent,
and writes are in a range of 20 to 25 percent.
So similarly, when we talk about compressibility, it's like, hey, it's 49 percent, 52 percent,
53 percent, and so on and so forth, right?
So it's in that range, right?
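As a rough illustration of what "about 50 percent compressible" means, the toy Python sketch below (not benchmark code; the approach is an assumption for illustration) builds a buffer that is half zeros and half random bytes, and measures the achieved ratio with zlib.

```python
import random
import zlib

def make_compressible(size, target, seed=0):
    """Buffer whose compressed size is roughly (1 - target) * size:
    a zero run (highly compressible) followed by random bytes
    (essentially incompressible)."""
    rng = random.Random(seed)
    zero_len = int(size * target)
    return b"\x00" * zero_len + rng.randbytes(size - zero_len)

buf = make_compressible(64 * 1024, 0.5)
# Achieved compressibility lands near the 0.5 target.
achieved = 1 - len(zlib.compress(buf)) / len(buf)
```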
BP's pondering. I haven't seen it, but is somebody going to publish two numbers, one for the front-end workload and one for the back-end workload, or is this a combined workload?
You can only publish EDA as a whole.
The component workloads, you can't separate those out and publish them.
If you wanted to, when you have a copy of the benchmark,
you could separate them yourself and run them separately
if you wanted to.
It's a fun exercise.
But you could only publish EDA as a whole.
So you'd just be running both workloads at the same time?
Exactly.
Right, and that's how the EDA production houses would be exercising the storage.
What are the typical file sizes?
So, typical file sizes: 95-plus percent of the files are less than 100K,
and zero to five percent of the files are greater than 100K.
And I have an entire breakdown of the file size spectrum.
And so what we did is we actually mapped the Gaussian
distribution that SFS 2014 uses, and we figured
out where to center the front end
and where to center the back end to get
a file size distribution similar to
what we're seeing at EDA customers in terms
of the percentage breakdown.
And it actually boiled down to the point
where we needed to drill deeper
within the sub-8K files to say, okay,
exactly what percentage is near 1K
and what percentage is near 8K, and so on and so forth.
But that's factored in as these profiles are generated.
Yeah, it looks very simplistic with average file size 8k.
That's how it's defined, but we did actually look at the distributions of what it produces,
and they were very close.
Right.
So what you would do is you would run NetMist or SFS 2014 with the 8K distribution,
and you would say, okay, how did it lay that out?
And you would also compare that with the 10 meg on the back end,
and you would look at the entire data set,
and then you would compare that to the production data set.
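As a rough sanity check on those numbers, you can ask what spread a distribution with an 8 KiB median needs for about 95% of files to fall below 100 KiB. The lognormal shape and the sigma value below are assumptions for illustration; the benchmark uses its own Gaussian mapping, as described above.

```python
import math

def lognormal_cdf(x, median, sigma):
    """P(X <= x) for a lognormal with the given median and log-space sigma."""
    return 0.5 * (1 + math.erf(math.log(x / median) / (sigma * math.sqrt(2))))

# With a median of 8 KiB and a log-space spread of about 1.5, roughly 95%
# of files land below 100 KiB, consistent with the trace-derived breakdown.
frac_small = lognormal_cdf(100 * 1024, 8 * 1024, 1.5)
```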
Any other questions? Yeah. The workload goes through a file system, and the file system affects results. It does some caching, some additional operations, some reorganization. Doesn't that affect the result?
Will the result be comparable?
So the question is, since
SFS 2014 generates workload
at the application level,
it could go through different file systems
and won't that affect the results?
And will they be comparable?
And the answer
is yes, to all of those.
The main point we wanted to make, well, there are a couple things.
First of all, it's very difficult to get the development resources to write a piece of code that will generate every single protocol
and version out there right now,
especially as SMB has gotten so mature in the past several years.
You can barely keep up.
The Samba team can barely keep up,
and they're huge.
Well, not huge, but it's a lot of brain power
behind that team for sure.
And the other thing is, you know,
we really, it's sort of more interesting to us
to do what the applications would do
and let you
see, okay, if I'm running ZFS
this is going to happen. If I'm running
on GPFS, this is going to happen.
Because we could never have done GPFS
if we were trying to write a protocol
directly out onto the wire.
We would have had to write something that speaks InfiniBand.
I'm sure it's possible.
So
the end goal
is that the result you measure is the result you would get if you ran
that sort of application with that whole storage stack.
And so is it comparable from one vendor
to the other? Well, if I ran with the CentOS 6
on my array and then NetApp ran
with CentOS 7.2, there actually could be some differences there, and that is an annoying
case where it's like, well, okay, did Linux change, or is it a Dell EMC versus NetApp thing,
or something like that? We had to take the bad with the good of
testing that whole
application stack and getting people
the answer that they would really get
if they ran that application in that
whole environment.
It's a balancing act. It is a bit of a compromise.
I think from the testing I've done, I think
it's a pretty good compromise.
Just a comment.
I don't actually worry about it;
I just observe that, more than NFS client stack differences,
say, if you're running over NFS,
comparing elephants to oranges on different file system types,
the amount of memory in the client has a remarkably large effect,
where it had zero effect, for all intents and purposes, on SFS 2008,
which was unrealistic.
So lots of memory will basically provide massive caching,
when it can cache, for the applications,
and give you a better predictor of what an application might see
due to an investment in memory, or whatever.
So the observation was that, you know,
having tons and tons of memory in your client
versus very little memory in your client
will affect the results and inflate them
because of all the caching that's possible.
And yes, that's certainly true.
And we can't get around that.
But nobody in the real world can get around that.
Exactly. Nobody in the real world can get around that.
I think a good note to make
everyone feel a little bit better about that is that we do have
a disclosure form. The examples I showed were
this. This is the first... This
is like your first browser page. It goes on for many, many more. You have to disclose
what you used to generate the load, the characteristics of it, the amount of memory. The amount of
memory is actually totaled for the whole storage solution. If you have a terabyte of memory
in all your clients and you have
256 gigs in your storage array and that's what you're trying to sell, we're relying
on the customers to be smart and review these carefully or the media to do it or marketing
departments to point out what the other guy did. So your total memory would be some massive
amount of terabytes versus someone who used a more normal config or a smaller config anyway where they would have like a terabyte or something like that.
So I'll also put a little bit of a customer-side spin on it, right, an EDA end-user consumption spin on it. I actually used to model the EDA workload using SFS 2008.
And that gave me exactly what Brian's talking about in terms
of not having the client-side caching effect.
And I presented that to many EDA customers.
And they actually argued against that, saying that,
well, no, I do want my client-side caching behavior
to be factored in as I try to judge the performance of a storage system.
Because that helps me determine how I would size it.
So as you think about this workload specifically,
what you would want to do is size the clients
in your EDA storage solution similarly
to how EDA customers size the client memory
in their compute grids.
I think we're being politely told our time is up.
I appreciate you coming out, and all the questions, and be sure to grab your SPEC pen if you haven't
already.
Vern and I, at least, will be around all week, so feel free to ask any more questions.
We have one more session about Emerald,
so feel free to stop by or catch us before or after that.
Thank you.
Thank you.
Thanks for listening.
If you have questions about the material presented in this podcast, be sure and join our developers mailing list
by sending an email to developers-subscribe at snia.org. Thank you.