Grey Beards on Systems - 166: Greybeard talks MLperf Storage benchmark with Michael Kade, Sr. Solutions Architect, Hammerspace

Episode Date: September 25, 2024

Sponsored By: This is the first time we have talked with Hammerspace and Michael Kade (Hammerspace on X), Senior Solutions Architect. We have known about Hammerspace for years now and over the last couple of years, as large AI clusters have come into use, Hammerspace's popularity has gone through the roof. Mike's been benchmarking storage …

Transcript
Starting point is 00:00:00 Hey everybody, Ray Lucchesi here. Welcome to another sponsored episode of the Greybeards on Storage podcast, a show where we get Greybeards bloggers together with storage and system vendors to discuss upcoming products, technologies, and trends affecting the data center today. This Greybeards on Storage episode is brought to you today by Hammerspace. And now it's my great pleasure to introduce Michael Kade, Senior Solutions Architect at Hammerspace. So, Mike, why don't you tell us a little bit about yourself and what your role at Hammerspace is? Well, as you said, Ray, I'm a senior solutions architect. I work for business development.
Starting point is 00:00:46 And basically what I do specifically in our group is I develop software solutions for middleware between ourselves and partners. And I also do a lot of the benchmarking, which I think is one of the main topics we're going to talk about today in terms of the MLPerf benchmark and some of what we've gone through and some of the results that we've seen. Yeah, yeah. So the topic du jour is MLPerf storage, and it's a good place to start. So what's new in MLPerf? I'm familiar with the version 0.5, and I believe the new one's 1.0. So what are some of the big changes that have occurred
Starting point is 00:01:27 in storage since the last time? Well, 1.0 is still, I personally believe it's kind of still in its infancy. We've gone through a lot of heartache, if you will, preparing for the benchmark and a lot of the vendors went through the same things. But the big changes, and then we can go into some of the problems that I think all the vendors have seen, and the performance that we've seen, is that the 0.5 had two tests, UNet3D and BERT. And in 1.0, they dropped BERT, and added ResNet-50, which is image classification, and they added Cosmoflow, which is cosmological correlations of the movement of heavens and earth. And both of those, well, UNet3D was PyTorch, and the ResNet-50 and the CosmoFlow were, I can't remember what they are. Excuse me? TensorFlow?
Starting point is 00:02:31 Yes, TensorFlow. Thank you very much. TensorFlow workflows. And so quite different. The other thing is that 0.5 had V100 GPUs and 1.0 used A100s and H100s. And you could pick whether you run H100s or A100s. And in our case, we actually did the benchmark, and we ended up with multiple results
Starting point is 00:02:58 because we did both A100s and H100s across the different workflows. Now, on the old storage benchmark, it was actually a simulation of the GPUs. It's almost like a dead time after an IO operation or something like that. Did that occur also on 1.0? Yes, that's exactly what happened. They're what they call accelerators versus GPUs, where they simulate by calling it an accelerator. They simulate a GPU of the class A100, H100.
Starting point is 00:03:27 They've got full specifications of what those GPUs would do. And because it's a storage benchmark more than a GPU benchmark, the whole point was to do as many IOs as you can within the correlation of what would an A100 do or what would an H100 do? And they put the dead time in between to say, oh, the A100 does, as an example, one gigabyte per second per GPU, and the H100 does five gigabytes per second per GPU. As an example, I'm not sure those are the exact numbers, but you get the idea. And then basically, they said, oh, you've got to, for the benchmarks, make sure that you can keep the GPU 90% busy for both the ResNet-50 and for the UNet3D. And for Cosmoflow,
Starting point is 00:04:16 you had to keep it 70% busy. Oh, that's interesting. The Cosmoflow is kind of a space galaxy simulation. It's kind of an HPC astronomical workload as far as I can tell, right? Yes. It's all about, you know, where would Andromeda be in the winter solstice versus the summer solstice, right? And so you actually track that movement of heavens and then also objects within the heavens. Yeah, and of course we're moving as well, and all this stuff is simultaneously revolving around the galaxy.
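To make the simulated-accelerator idea concrete, here is a minimal, hypothetical Python sketch of what Michael is describing: read a sample from storage, then sleep for a simulated GPU compute time, and track accelerator utilization (AU). It is not the actual MLPerf Storage code, and the per-sample compute times and the reader function are invented placeholders.

import time

# Hypothetical per-sample "think time" in seconds; the real benchmark derives
# these from the published characteristics of the A100 and H100.
SIMULATED_COMPUTE_TIME = {"a100": 0.050, "h100": 0.010}

def run_simulated_accelerator(read_sample, num_samples, accel="a100"):
    """Emulate one accelerator: fetch a sample, then 'compute' by sleeping.

    Returns accelerator utilization (AU) = compute time / total time.
    MLPerf Storage requires AU >= 90% for UNet3D and ResNet-50 and >= 70%
    for Cosmoflow, so the storage has to keep the read time small relative
    to the simulated compute time.
    """
    io_time = 0.0
    compute_time = 0.0
    for i in range(num_samples):
        start = time.perf_counter()
        read_sample(i)                                 # storage under test does this work
        io_time += time.perf_counter() - start
        time.sleep(SIMULATED_COMPUTE_TIME[accel])      # the "dead time" standing in for the GPU
        compute_time += SIMULATED_COMPUTE_TIME[accel]
    return compute_time / (compute_time + io_time)

if __name__ == "__main__":
    # Dummy reader; a real run would read training samples over NFS.
    au = run_simulated_accelerator(lambda i: None, num_samples=100, accel="a100")
    print(f"simulated AU: {au:.1%}")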
Starting point is 00:04:49 It's all bizarre. So as far as Hammerspace's benchmark, did you do all those three or just one of them? I did. Cosmoflow is a very interesting one. I wasn't able to complete that because the problem that ML Commons has with this benchmark is it wasn't very clear on some of the things that you had to do. And they've learned a lot of things by all the vendors running 1.0. And I think it's going to get a lot better for 2.0 because we gathered a lot of information. But in 1.0, ResNet-50 and UNet3D were really clear on how to run, and Cosmoflow was not. So almost no vendor completed Cosmoflow. I think there was only one, maybe two out of the dozen
Starting point is 00:05:42 vendors who submitted numbers. And of the two who submitted, I think one of them couldn't even meet the 70% threshold, but they still submitted numbers. And of course, they'll get tossed out. Yeah, yeah. So from that, I assume that you did the ResNet-50? We did the ResNet-50. We did the UNet3D. Phenomenal numbers. I was very pleased with it. One thing, for the listeners who don't know about Hammerspace, is that we've designed and built a completely parallel distributed file system with a global namespace where you can function anywhere in the world. You could have a location in Phoenix, Arizona, one in Zurich, Switzerland, where I'm currently located, one in Venice, Italy, and anyone could come in and see all the data in all those locations and say, now I want to work on a piece of data, and that data could be in Italy, and it comes to Zurich, or you could be
Starting point is 00:06:43 in Phoenix, Arizona, and it comes to Phoenix, Arizona. And then it moves back again. Because it's also parallel, I can have multiple different storage vendors, or I can build it with our own storage, which we call DSX, in that data center, what we call a site. And then I spread the data or HammerSpace spreads the data across all those back end storage vendors. And as a client comes in, we actually hand out these delegations where we just say, oh, the data you want is over on machine number one, oh, the next file is on machine number five, oh, the next file is on machine number seven. And if you open five or 10 or 15 files simultaneously, which you can do in Linux with NFS 4.2, you actually get all of
Starting point is 00:07:33 these machines cooperating at the same time. And it's been very successful for us as evidenced by probably a lot of people have heard, we won the Meta contract this year for a large sum of money. And basically, we're running all of their backend storage across HammerSpace with probably, you know, I want to say it's thousands of GPUs doing all of their AI work to basically determine, you know, who submitted content, is that content meeting the criteria of the company, et cetera, and so on. Yeah. So getting back to the benchmark, the benchmark, is it a training benchmark or is it an inferencing
Starting point is 00:08:16 benchmark? It's all an inferencing benchmark. So yeah, it's inferencing. But I shouldn't really say that; the UNet3D is training, and ResNet50 and CosmoFlow are inferencing. And like I said, it's all about storage. It really doesn't matter. You know, some of our people got wrapped around the axle as well in terms of, you know, what clients are running and how many GPUs. And it doesn't really matter.
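As a quick reference, here is how the three 1.0 workloads are characterized in this conversation, written as a small Python dictionary; the framework, phase, and utilization thresholds are the ones mentioned in the episode.

# The three MLPerf Storage 1.0 workloads as described in this episode.
WORKLOADS = {
    "unet3d":    {"framework": "PyTorch",    "phase": "training",  "min_au": 0.90},
    "resnet50":  {"framework": "TensorFlow", "phase": "inference", "min_au": 0.90},
    "cosmoflow": {"framework": "TensorFlow", "phase": "inference", "min_au": 0.70},
}

print(WORKLOADS["cosmoflow"]["min_au"])   # 0.7, the lower bar Michael mentions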
Starting point is 00:08:45 The whole MLPerf benchmark is all about storage, not about the clients and not about the GPUs. And really what you want to do is you want to take the idea of, I have so many accelerators, which are really GPUs in the MLPerf parlance, and I can run so many accelerators against the storage based upon the config I've got. So a lot of vendors, what they'll do is they'll run a single client with, say, 20 accelerators of H100 type or 40 accelerators of A100.
Starting point is 00:09:18 And then they'll say, oh, that's across a storage platform that maybe has five storage servers. Now, how can I expand that to be five clients with five times the number of accelerators? And what do I have to do on the back end storage to get that to work? And that's exactly what all of the vendors do to try to show linear or exponential growth across their storage. So you mentioned clients and storage servers. So a client is the compute system that has the GPUs on it. So different size clients could support different levels of compute, or is it all the same compute, regardless of what size of the client?
Starting point is 00:10:04 Yeah, you know, real GPUs, you're only ever going to put eight in a client of any type. That's the most you could do, max, right? But with MLPerf, because they're simulating GPUs, they give you what MLCommons calls the added advantage of saying, like, for instance, in my ResNet-50, in a single client I could run 80 of these GPUs (and I'm doing air quotes here, accelerators) on a single client. But if you allocate a client that's got fewer CPUs and less memory, maybe you could only get 40, and then you run more clients. It has nothing to do with the storage. I could run five clients with 40 each against the same storage where some vendors are doing bigger clients with 80 each.
Starting point is 00:10:52 That shouldn't be the measure, the client. It's really how many accelerators can I get, regardless of how many clients it is, against the storage that I have. Yeah, and the benchmark goal is to attain a certain percentage utilization. You mentioned, I think, 90% for ResNet-50 and UNet3D and 70% for CosmoFlow. That's correct. So as far as the benchmark is concerned, you're reporting that sort of number plus some other numbers as well, some other metrics? So what you're doing is you're reporting, not only can you meet the AU
Starting point is 00:11:31 percentage is what they call it, 90% and 70%, because you have to do that in order to pass. But then you're saying how many samples per second you were able to achieve? And then what was the megabytes per second? Now, what's interesting is if I run 40 accelerators against storage, and I'm, you know, Weka or Hammerspace or NetApp or any of the other vendors, 40 against the storage at 90% is still going to be about the same samples and the same megabytes per second, right? So the benchmark comes out pretty even across all the vendors. It's really how many of those accelerators can I do against the storage on any given configuration. I guess I don't understand what you just said, Mike. So you're saying that 40 accelerator configuration would be roughly the same number of samples per second and gigabytes
Starting point is 00:12:25 per second for all the vendors? That's correct, if they meet the 90 percent, because that's built into the benchmark. You know, 40 says I'm meeting it at 90 percent and I'm able to drive, you know, let's say again an A100 is a gigabyte per second. 90% of that's 900 megabytes per second. And if I've got, say, 10 of those, that's nine gigabytes per second. Again, these are theoretical numbers and not necessarily real. And so if I'm meeting 90%, then every vendor who does 10 times 900 meg would be able to hit 9 gig per second. Yes. It's going to be doing the same data bandwidth.
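Here is the same back-of-the-envelope arithmetic as a small, hypothetical Python helper. The per-accelerator rates are the illustrative one-gigabyte-per-second and five-gigabyte-per-second figures Michael uses (which he stresses are not the exact numbers), and the configurations in the examples are made up.

# Illustrative per-accelerator read rates in GB/s, not official figures.
PER_ACCEL_GBPS = {"a100": 1.0, "h100": 5.0}

def required_storage_bandwidth(clients, accels_per_client, accel="a100", au_target=0.90):
    """Aggregate read bandwidth (GB/s) the storage must sustain so every
    simulated accelerator stays at or above the AU target."""
    return clients * accels_per_client * PER_ACCEL_GBPS[accel] * au_target

print(required_storage_bandwidth(1, 10, "a100"))    # 9.0 GB/s, the example above
print(required_storage_bandwidth(5, 40, "a100"))    # 180.0 GB/s for a bigger config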
Starting point is 00:13:07 Now the question is, can their storage handle 9 gig per second? Right. Where does it fall down, right? And that's really what the measure is. I could get 10 accelerators, and maybe NetApp can get 5 accelerators, and maybe someone else gets 7 accelerators, et cetera. You get the idea. So, you know, in our case, on so many accelerators for UNet3D, we showed a linear growth
Starting point is 00:13:34 from a single client up to five clients we could have gone much much higher it's just a matter of cost at that point you know how big a lab are you going to reserve or how much are you going to reserve in the cloud in order to do this testing? And what we wanted to do was show a performance level that was linear in growth from, you know, say 10 accelerators to 40 to 400. Right. And so we just showed that linear growth and so said that, oh, from 40 to 400, I truly got 10x. I didn't get 9x or 8x where some of it dropped off. We truly got 10x. Some of performance perspective and from a GPU perspective, et cetera, et cetera. That's right. You want to keep above that 90 percent because ML Commons tosses the numbers out. If you if you don't have 90 percent or above, they don't count.
Starting point is 00:14:29 Right. Right. What you're actually publishing is you can do, let's say, 43 of these GPUs at 90 percent or something like that. Right. Yeah. Against the storage configuration of what you really have. And the best vendors are the ones who can show linear increases. You don't want to be able to say, oh, I did 400 at this number. You don't know what the low number is then, right? If someone wants to buy a smaller configuration, they have no idea. All they see is the big number or the opposite, which is even worse. I did 40 on a single client and I got, there are a lot of vendors who did single clients and got five gig per second.
Starting point is 00:15:13 Well, that's great. But what happens if you scale it up to 10X? Can you grow? And a lot of vendors just didn't do that, unfortunately. Right, right, right, right. Well, yeah. And the vendors are, you know, they all have a limited amount of time to submit this. And you're right about the costs and stuff like that.
Starting point is 00:15:32 So you mentioned you could do this in the cloud as well as on-prem. Did you guys do it in the cloud? We did 100% AWS. The next run, we're building our lab right now. We just didn't have the hardware available to our business development group. It was all dedicated to engineering for code development. So we're building the lab right now for business development, and we're going to do the next series in-house. But all of this was done in AWS. And it was very, very successful. I think we could
Starting point is 00:16:07 have gotten better numbers because the cloud kind of limits you a little bit in terms of EBS volumes and how much you can get and what kind of tuning you have to do and all of that rigmarole and nonsense. But once you get past that and you do a lot of trial and error, you can get some pretty good numbers out of AWS. Yeah. I was always concerned about, you know, network noise and things like that. Noisy neighbors.
Starting point is 00:16:31 I mean, I guess you could get some dedicated hardware and get dedicated networking as well to, to, uh, at a cost, I guess, even on AWS.
Starting point is 00:16:41 So, so that's, that's interesting. You mentioned, you know what, if you do a big enough network, it almost doesn't matter. We did the 400 gig network clients. Unlike Azure and Google, where you can specify with a lot of granularity what you want in terms of network, etc.,
Starting point is 00:16:59 you're limited in AWS based upon the instance type. So if you pick an instance type, then you're either going to get 25, 50, 100 gig, 200 gig, 400 gig network. You have no choice. So at that point, you've got to pick all the instance types you want that sit within the 400 gig space, and then you get some phenomenal performance. Yeah, yeah, yeah, yeah, yeah. So does it matter what the storage servers are? I mean, in these sorts of configurations? I mean, could you have like 10 storage servers or 1,000 storage servers? Would it make a difference from, I guess, GPU perspective samples per second? Well, then you really need the clients, right?
Starting point is 00:17:40 So in the storage side, for instance, we did 22 storage servers behind our metadata server. Now, for those listeners who don't understand Hammerspace and what NFS 4.2 really does is it creates two planes. One is a metadata plane, a control plane, if you will, where when a client mounts via NFS 4.2, they're talking to the control plane and the control plane passes them back. A delegation is what the official term is where it says, oh, go on the data plane and go get the data from this server. And then it gets out of the way. So the control plane only gives referrals or delegations and it gets out of the way. And then the client goes direct to the storage servers. So what I did was I built our control plane, what we call an anvil in our parlance, Hammerspace parlance. And I built a pair of those for full redundancy. And then
Starting point is 00:18:40 as we're adding data to the file system, what the anvil does is it stores the metadata and then it allocates across evenly across those 22 storage servers so that it becomes a true parallel file system where the client, in this case, UNet3D or ResNet50 or Cosmoflow, when you go create that data, you think you're writing it into one directory. In fact, it's spreading it across 22 storage servers in my case. And then when you go to read it back, the ANVIL hands back these delegations that says, oh, the file you want is really on 5, or the file you want is really on storage server 7, or storage server 10, and it gets out of the way and lets the client go do the work. And that provides the true parallel high-performance file system that I think in this type of workflow where you're doing training or inference is what customers really want. And so if let's say you had 44 servers, storage servers behind the same metadata, would the performance be twice as good?
Starting point is 00:19:48 Absolutely. What we're seeing is that I did a test with five, and then I did one with 20, and then I added a couple more just to see if the curve would stay linear as I did this. And it was absolutely linear all the way across. In fact, I did an interim one at 10 and it stayed linear. I just didn't want to have too many permutations when I submitted it to ML Commons. So I did a five and a 22 where I did the low and the high end. And it was absolutely linear across the whole thing. It was really a beautiful thing to see, to see that data get spread evenly across. And then to see the clients say, oh, I want to open this file in the same directory that opened the previous file. And it was really on a different backend server to give
Starting point is 00:20:36 you that parallel nature. And all that's associated with the parallel file system functionality that HammerSpace is bringing to bear and NFS 4.2, et cetera. That's correct. And that's one of the things I think is, I wouldn't say it's a secret sauce so much. 4.2 is truly an industry standard. You see Lustre and Panasys, they're implementing kind of the same thing, but they're doing it through custom client software, right? They're not doing 4.2. We were, I think, farsighted or foresighted enough in the beginning to say, you know, this should be an industry standard so that when people buy this, they don't have to modify
Starting point is 00:21:19 their clients. They just buy a client from Dell or, you know, pick a vendor, you know, Supermicro, et cetera. And I'm going to load up a version of Linux. And any version of Linux within the last five years is basically going to have 4.2. And I just do a mount. And all I do is say I want to mount NFS type 4.2. And I mount that anvil, that control plane. And it just works. There's nothing else
Starting point is 00:21:47 I have to do. And it could be our storage on the backend with what we call DSX servers that have storage on them. Or it could be Isilon on the backend. We honor those, a Cumulo on the backend, a NetApp on the backend, et cetera, a Windows storage server on the backend. a NetApp on the backend, etc. A Windows storage server on the backend. And our Anvil is smart enough to say, oh, I'm going to place the data here, here, here, and here, based upon the performance, the cost, whatever you want to set up. So I could say all the isolants are at this cost component, all of the cumulose are at this cost component, all of the NetApps are at this cost component. But guess what NetApps are at this cost component. But guess what?
Starting point is 00:22:26 They also have a higher performance component. So now when a client says, oh, I want to write something in this high performance, it will all go to the NetApp. Or I want to write something in this cost component, it will split between the Isilon and the Qumulo. And we'll load balance all of that in the back end, giving that true parallel performance across multiple vendors, including our own storage. So you ended up submitting two separate configurations for both UNet3D and ResNet-50. And like I said, UNet was a training run, PyTorch, and ResNet-50 was an inferencing run, TensorFlow, right? Yeah, so ResNet is all about image classification. And UNet3D is image classification too, but using what they call volumetric segmentation.
Starting point is 00:23:19 So they're kind of similar in one regard, but one uses PyTorch and one uses TensorFlow. So TensorFlow for the ResNet-50 and PyTorch for the UNet3D. They're both kind of image simulations or image learning, if you will. And then Cosmoflow is that cosmological correlations that it does for movement of the heavens. Right, right, right. So I submitted eight sets of numbers. It gets very convoluted at some point. Eight sets?
Starting point is 00:23:51 If you do ResNet-50, you do UNet3D, you do A100, and then you do H100, you end up with eight different sets of numbers. Okay. Yeah. Yeah. Especially if you do it like I did, with five back-end servers and then 22 back-end servers, right? So it's two sets of numbers. Yeah, yeah, yeah, yeah. In the, uh, in the MLPerf documentation, do you show the number of servers and stuff like that? Or is that, I, I, I
Starting point is 00:24:22 can't recall last time when i looked at version 0.5. Yeah, so what – Go ahead. They've got a little bit more detail than this version. Again, these are some of the things I think all the vendors and ML Commons learned, is they wanted us to do a JSON file where we basically laid out all of the components with the CPU, the gigahertz, the amount of memory, the boot drives, the storage drives, and then how they lined up with the network. So it was a very convoluted JSON file for every run that you did. And then they also
Starting point is 00:25:00 wanted a PDF with the drawing that described it in addition to this JSON file. Now, the assumption is that when they go to parse all these numbers to put them on their website, they're going to go through the JSON file and say, oh, he had five clients. He had five backend servers. He had a metadata server. He had this network. They're all tied in this way, and it would do some type of drawing for you. But what ended up happening is that it just didn't work out.
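For a rough sense of what such a system-description file might look like, here is a purely hypothetical sketch built only from the components Michael lists (CPUs, clock speed, memory, boot and storage drives, network). The real MLCommons schema and field names are different; this is just to illustrate why hand-writing it for every run was tedious.

import json

# Hypothetical layout only; not the actual MLCommons JSON schema.
system_description = {
    "clients": [{"count": 5, "cpu_cores": 64, "clock_ghz": 2.6, "memory_gb": 512,
                 "boot_drive": "NVMe", "network": "400GbE"}],
    "metadata_servers": [{"count": 2, "role": "control plane"}],
    "storage_servers": [{"count": 22, "data_drives_per_server": 8, "network": "400GbE"}],
}

with open("system_description.json", "w") as f:
    json.dump(system_description, f, indent=2)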
Starting point is 00:25:29 Most vendors didn't know how to fill it out. It was, you know, I think that's one of the things they're going to have to change. They're going to rely a lot more on the PDF file with the description from the vendors than they are in the JSON file. Yeah, yeah. It's something they're going to fix in 2.0. I'm pretty positive, plus a number of other things. Yeah, I didn't find V.0.5 had a lot of information on the storage configuration,
Starting point is 00:25:58 but hopefully it'll be more in 1.0 and hopefully more as 2.0 comes out and things of that nature. Yeah, and one of the things I think we petitioned very heavily on a lot of the vendors is there should be a cost component. Right now in 1.0, I could run it and I could have put 400 backend servers and another vendor could have put 300, another vendor could have put 50, but you don't know what the costs were. And so the cost per accelerator or the cost per GPU would be really critical, I think, for an informed decision from a vendor.
Starting point is 00:26:31 Right now, that's kind of left up to the reader more than ML Commons giving you a guideline. But I think that's something that's going to change. They never showed cost, never have showed cost. In fact, there's very few benchmarks really that show cost. In my mind, maybe SPC. I was going to say the spec SFS, what was the last one, 2010 or something like that? 2008, 2008 or something like that. Yeah, that one I think actually does a very good job.
Starting point is 00:27:04 I've run that before in past vendors I've worked for, and it's a lot of admin time on the part of the benchmarker to go and lay out all the costs. But then you get a very informed decision as a potential purchaser of storage. Right, right, right, right, right. And so the training and the inferencing runs were all based on standard files. They weren't like GPU direct or something like that. They were all standard NFS, as you mentioned, 4.2 software accessing the data. In our case, yes. Other vendors did NFS3. Some vendors just had a single machine and it was just direct attached storage, right? So in our case, it was all NFS4.2. And it just worked right out of the box. It was really fantastic. Right, right, right. This is one of the things you asked the question, was
Starting point is 00:28:12 a GPU direct or RDMA? It wasn't in this one. And in fact, that was, again, one of the problems I think that a lot of the vendors saw. I brought it to the forefront, as did other vendors, to the ML Commons team, that there was no O_DIRECT, in other words, allowing us to bypass caching. There was no RDMA.
Starting point is 00:28:34 There was no, you know, GPU direct. That's all something I think for this type of workflow, especially when you're talking AI training or inference, that's really critical. And that should be in the benchmark. And they've got that written down as one of their features they want to add in 2.0. Right, right, right, right, right. There's plenty of different steps to the AI working activity. I mean, obviously, training and inferencing are the two big ones, but there's data preparation and cleansing and data filtering, there's checkpointing and things of that nature. Did you, in the training run, did you have to simulate checkpointing as well as reading the data? Yes.
Starting point is 00:29:18 And in fact, the checkpointing had to occur on the same storage volumes that you were doing your data runs against. So wherever you did your data gen, where you created your data, the checkpoint had to reside on those same volumes. Again, this was something that wasn't really clear in the documentation and some vendors fell apart on it. But we all got a chance to basically do peer review and look at each other's work and make suggestions. And I would say that worked out really well. You know, vendors can get snippy with each other, and it didn't happen with MLPerf. Everyone was very polite and, shall I say, kind to each other in terms of their critiques. And then we were given a two-week period in which to do reruns. And a lot of vendors had to do reruns, and we weren't
Starting point is 00:30:13 exempt from that. The documentation was not kind of clear in some of the regards, and we had to do our own series of reruns. And we got called on it by other vendors, but no one ever got snippy with anyone. And it's, it was really kind of nice. So to your point, I get long winded, but the checkpointing was on the same volumes. And I think that, you know, that added, of course, a performance component to the whole test. Yeah, I mean, so it's, you know, from a sequential read kind of thing to a sequential write of considerable nature, and then the training is that,
Starting point is 00:30:52 and then the inferencing, I guess, is random read kind of thing? Yeah, there's a lot of what they call index files. So if you look at, like, the CosmoFlow, you would generate as, again, as an example, I don't know the exact numbers off the top of my head. So I'm just going to kind of throw a number out in, in, let's say that you had,
Starting point is 00:31:11 you know, 400 clients, in order to do that, the data gen run, there were three runs you did in MLPerf. One was data sizing, where you named how many GPU accelerators that you wanted to run, how many clients they were going to run on, and then it would tell you how many files that you had to create. Then you did a data gen run that would create those files. And then you did the run itself, and you had to do five contiguous runs: you did run one, no more than one or two or three minutes between runs, then run two, and so on. And they all had to pass at 90% or above. Cosmoflow and ResNet-50, because they were TensorFlow, not only had the number of files, let's say Cosmoflow was 4 million for, say, 400 accelerators. You also had 4 million 10-byte index files. So now you had 4 million data files
Starting point is 00:32:09 and 4 million of these 10-byte index files, where you were constantly opening the index file and updating the index as to where you were. And that's where it got very heavy, because those, even though they're sequentially read, they're only 10 bytes. So they look more like a random IO than anything else. Right, right, right. That's bizarre. And it's just because it was both inferencing and TensorFlow that caused the problem. Yeah. That's interesting.
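A rough illustration of the access pattern being described, with hypothetical file names and sizes: a burst of tiny 10-byte index reads, which look like random IO to the storage, followed by large sequential reads of the data files. This is not the benchmark's own code.

import os

def read_dataset(data_dir, num_files):
    """Illustrative two-phase read: tiny index files first, then big data files."""
    offsets = []
    # Phase 1: open every ~10-byte index file; lots of small, random-looking IOs.
    for i in range(num_files):
        with open(os.path.join(data_dir, f"sample_{i}.idx"), "rb") as f:
            offsets.append(f.read(10))
    # Phase 2: large sequential reads of the actual samples.
    for i in range(num_files):
        with open(os.path.join(data_dir, f"sample_{i}.data"), "rb") as f:
            while f.read(8 << 20):        # 8 MiB chunks until end of file
                pass
    return offsets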
Starting point is 00:32:38 I have to dig into that from a storage performance perspective. Yeah, I was very surprised too. It doesn't look pretty, right? Yeah, I was very surprised too. It doesn't look pretty, right? Yeah, I was very surprised too, because when you go into it, you basically had to open all these index files at the beginning to say, where were we going to start from?
Starting point is 00:32:53 Of course, you're always starting at byte zero, but you had to open all of them to read the 10 bytes before you open the data files. So it was a lot of IOs at the very beginning before you ever started the benchmark. And you could see this where the Cosmoflow benchmark would start and you'd be sitting there for a couple of minutes going, what's going on? How come I'm not seeing any output? Because it was constantly outputting where it was in the benchmark and you'd see nothing.
Starting point is 00:33:19 And then all of a sudden I delved into it and said, well, what's really happening in the background? And I start doing some scans through the OS and I found, oh, it's opening all these little index files. Very, very interesting. That is interesting. It's probably an inferencing. I don't know if it's a TensorFlow artifact as much as it's an inferencing benchmark artifact. But in any case, it's certainly, luckily, I guess it doesn't really make a difference in your overall performance because the performance starts clicking in when you're actually doing the
Starting point is 00:33:50 data reading, I guess, right? That's correct. And all these files are, you know, UNet 3D, because it's PyTorch, is all just big, huge sequential files. ResNet 50 and CosmoFlow are, you know, TensorFlow, and they're large sequential files as well, but they both have index files. And so because of those index files, you've got to get those out of the way first before you can do the sequential reads of all the other files, which are fairly large. But it's that start that just kind of throws you for a loop because you go, what's going on? How come it's taking so long? And then all of a sudden, it just starts up and away you go. Earlier, you mentioned samples per second is one of the numbers that you provided. Is a sample a complete file read? Is that? Yeah, pretty much. Okay. So, you
Starting point is 00:34:38 know, what it would do is it would, let's say in UNet 3D, you would run it. And at the very end, of course, like I said, you had to do five runs consecutively. You get a number for the first run, a number for the second, a number for the third, and it averaged across the five. And the averaging across the five is what gave you the above 90%, and then the averaging across the five gave you the number of samples per second and your megabytes per second. So let's say the samples per second were 700. Again, I'm making that number up. And 50 gigabytes per second.
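To tie the reported metrics together, here is a small, hypothetical calculation in the spirit of what Michael just described: average the five consecutive runs, check the 90% AU threshold, and report samples per second and the implied throughput. The run numbers are made up.

def summarize_runs(runs, au_threshold=0.90):
    """runs: list of (au, samples_per_sec, mb_per_sec) tuples, one per run."""
    n = len(runs)
    avg_au = sum(r[0] for r in runs) / n
    return {
        "passed": avg_au >= au_threshold,
        "avg_au": avg_au,
        "samples_per_sec": sum(r[1] for r in runs) / n,
        "mb_per_sec": sum(r[2] for r in runs) / n,
    }

# Five invented runs at roughly the 700 samples/s and 50 GB/s scale mentioned here.
runs = [(0.92, 700, 50_000), (0.91, 690, 49_500), (0.93, 710, 50_400),
        (0.92, 705, 50_100), (0.91, 695, 49_800)]
print(summarize_runs(runs))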
Starting point is 00:35:14 So then you could say, oh, I'm doing 700 file reads where I'm reading the whole thing across all of my clients and all of my simulated GPUs, which were called accelerators. And that's how many I got in any given second. And that gave me 50 gigabytes per second of throughput. Right, right, right, right, right. Huh, interesting. So do you have any idea how many vendors actually submitted on this go-round?
Starting point is 00:35:42 I believe it was, and please, you know, it's in the podcast, but it isn't something quotable. We'll see it when the numbers are actually output. I believe it was 12 were the ones who submitted, but then when we did peer review, there were a number of vendors who just didn't follow the rules, couldn't get the numbers out, maybe lost their benchmarking slot in their internal labs. For whatever reason, there were a number of vendors, I think, three of them completely withdrew their numbers. A couple of them just couldn't do the numbers and get the results above 90% or above 70%. So I think in the end, we're going to see probably, I'm guessing eight is my guess, but I could be wrong. There could be higher or lower by one or two.
Starting point is 00:36:33 And each of those potentially could have submitted multiple runs with one client. Oh, there were some vendors who had 15 different numbers, and then they cut them back a little bit, so then they ended up with maybe 12 or 8 numbers. But those are all going to have to be graphed and charted by ML Commons. It's going to be very interesting to see how they do their pivot tables, if you will, so that you can say, oh, A100, UNet3D, tell me the samples or gigabytes per second or megabytes per second, and tell me what those numbers are. And then you'll see a group of vendors come up, right? Right, right, right, right. Yeah, so I've done blog posts on prior MLPerf and I did a blog post on MLPerf storage in the past.
Starting point is 00:37:19 So I may end up doing another one for this one. I don't know. It's certainly of interest. I think the whole idea behind MLPerf is great. It's a way for the industry to understand how it's advancing over time and for customers to understand what sort of configuration is necessary to drive a certain number of accelerators in this case. Yes. From a storage perspective. I totally agree.
Starting point is 00:37:48 And having done benchmarking, I'm an older guy. I just turned 68 two days ago. And I've done benchmarking for 45 years for vendors starting at NetApp in 1997, all the way up to today. And I have to say that this is a very interesting benchmark in that it isn't entirely what you would call synthetic. Most benchmarks are completely synthetic. This one takes three real-world case workflows and benchmarks them
Starting point is 00:38:22 versus saying, oh, let's just go with FIO and see how much we can push sequentially or how much we can push randomly. And it doesn't pertain to any workflow at all. This one at least pertains to three real world things that people in parallel computing or data scientists are going to want to see. Yeah. Well, anybody doing training and inferencing, these are fairly straightforward workloads that people would normally use for these sorts of things in the field, in reality, in
Starting point is 00:38:51 enterprise. Well, Mike, it's been great. This has been a good session. Is there anything else you'd like to say to our listening audience before we close? No, I appreciate your time, Ray, and I look forward to the results of MLPerf
Starting point is 00:39:10 when they're published and seeing what everyone thinks of the different vendors. And hopefully you'll come to HammerSpace for a conversation. I think that we have something very compelling to show different users in terms of not only performance, but ease of setup, ease of use, and be able to run a global namespace around the world with that parallel processing, that parallel storage system that most people
Starting point is 00:39:39 are looking for. Right, right. And for the AI workloads, it becomes even more impressive. All right. Well, this has been great, Mike. Thanks again for being on our show. And thanks again to HammerSpace for sponsoring this podcast. All right. Thank you, Ray. And have a good evening. That's it for now. Bye, Mike. Thank you. Until next time, we will talk to another system storage technology person. Any questions you want us to ask, please let us know. And if you enjoy our podcast, tell your friends about it. Please review us on Apple Podcasts, Google Play, and Spotify, as this will help get the word out. Thank you.
