Grey Beards on Systems - 33: GreyBeards talk HPC storage with Frederic Van Haren, founder HighFens & former Sr. Director of HPC at Nuance

Episode Date: June 16, 2016

In episode 33 we talk with Frederic Van Haren (@fvha), founder of HighFens, Inc. (@HighFens), a new HPC consultancy and former Senior Director of HPC at Nuance Communications. Howard and I got a chance to talk with Frederic at a recent HPE storage deep dive event, and I met up with him again during SFD10, where he …

Transcript
Starting point is 00:00:00 Hey everybody, Ray Lucchesi here with Howard Marks here. Welcome to the next episode of Greybeards on Storage, a monthly podcast show where we get Greybeards storage and system bloggers to talk with storage system vendors and others to discuss upcoming products, technologies, and trends affecting the data center today. This is our 33rd episode of Greybeards on Storage, which was recorded on June 9, 2016. We have with us here today Frederic Van Haren, founder of HighFens and former Senior Director of HPC at Nuance. Well, Frederic, tell us a little bit about some of your experiences at Nuance.
Starting point is 00:00:46 Well, thank you. Yeah, so I started about 10 years ago in high performance, and it was kind of weird in the sense that I was running text-to-speech for R&D, and since I touched Linux and Windows and other flavors of Unix, I guess I was the perfect candidate to start high-performance computing at Nuance. In those days, high-performance basically meant there were 10 desktops connected to the main network. And that was it, right? That was the great high-performance computing environment. And so my management was telling me in those days that speech recognition needs a lot of data in order to improve the product, right? So speech recognition is statistical application. You're saying something
Starting point is 00:01:30 and then the application is trying to compare that with a subset of data. And then hopefully from a statistical standpoint, the application can guess really what you're saying. So the mainstream idea was let's start collecting as much data as we can afford, because cost was still the driver. And then let's see how far we can push that. So in the beginning, we started out with hiring a company that would provide us guidance on storage, on servers, and high-performance file systems. And once we got all the equipment in, I kind of realized that there were a lot of nuances, if I can say it like that, between raw storage and actual capacity, multi-core, single-core. And I got interested in knowing more on what's behind all of this. One thing I learned
Starting point is 00:02:27 really quickly is there is always a bottleneck. The question is where it is. And you want to provide a platform, if you wish, that you can control as an individual and replace pieces as you go along. Because I believe that if you own the platform, you could replace vendor A with vendor B and improve performance and hide the complexity from users. So one of the things we decided to do is to swap out the original high-performance file system with IBM GPFS. GPFS? Yes, GPFS. They call it Spectrum Scale nowadays, I believe. So that's about eight years ago. We wanted to look for a reliable and a scalable piece of software that could glue our storage devices together and make it look like one.
Starting point is 00:03:19 So a little bit of a fast forward from a user perspective. Our users are seeing the exact same file systems as they saw eight years ago, but we've had to replace the hardware behind it four times, right? So if you own the platform, something like GPFS, you have the power and control to make those decisions. Because I came from R&D, next to GPFS, for some reason I was convinced that I needed three tiers of storage. And don't ask me why, because I really don't know. It was a gut feeling where I needed high-performing, medium, and then low-performing. And so I was asking people, what do you think?
Starting point is 00:03:56 High performance, medium, and average. And at some point, we stumbled over a company that did ATA over Ethernet. Oh, our friends at Coraid, well, our former friends at Coraid. There you go. And so we said, look, why don't we try with, you know, a few shelves, you know, Supermicro, all that good stuff, 25 drives per shelf. So we started and to a certain degree, we kind of said, well, what if we buy 40 of those shelves and how far can we push it? And before we knew it, we had 1,000 drives of Coraid, based on SATA drives, 500 terabytes of SATA drives, and performance was reasonable in the sense, you know, considering SATA drives and the amount of drives we had.
Starting point is 00:04:47 The only problem was having 2,000 drives of that really was not an option. The management of it was a pain. Failure of drives was very, very difficult. At some point, we even decided that we would put the webcams on the shelves just to see when the drives would fail. What, the lights? You're kidding me. This is your management? Well, this was the whole idea with Coraid. It was, you know, we will make these things really stupid. And if your goal is to be really stupid, it's easy to achieve. That's right. And so one thing I learned out of it is that, you know, having a Tier 1 and Tier 2 and Tier 3 wasn't really a necessity.
Starting point is 00:05:20 I did understand that with good basic equipment, you could scale it out if the hardware would allow you to scale out. So at some point we said, you know, Coraid, that's great, but that's not going to work for us. We wanted to go a step up. So we wanted to go really with a JBOD that had relatively good management tools and preferably a CLI, because high-performance computing is all about automation. I mean, nobody's going to log in to a thousand servers manually. So you want to do all of this in an automatic fashion. And that's how we started working with HP and we asked HP, so what kind of JBOD with management tools can we use and get off the ground? And then before we knew it, we started working with the MSAs, version 1, version 2, version 3.
Starting point is 00:06:07 Then they renamed it to P2000. Then we had up to 12,000 drives of MSAs. And wide striping across 12,000 drives solved your performance problem so you didn't need a faster tier? Yeah, so the way you should look at it, the MSA had 96 drives per dual controller. And so we purchased blocks.
Starting point is 00:06:27 On one hand, we had SAS drives. And I think in those days, we used a 300 gig 15K RPM. And then for slower storage, we used the 750 gig SATA drives. So the SAS side, we called it Scratch because it's heavy duty read-write. And then we called the SATA side static. Basically, this is incoming data from users, and we're not going to modify that data. So we're going to write it once, but read many times. And so with the help of GPFS, we would create RAID blocks within an MSA stack. And then we took a bunch of LUNs across multiple MSA stacks, and then we created file systems. So if you want to picture this,
Starting point is 00:07:14 imagine that we have stacks of MSAs in multiple racks, and then you create a file system horizontally across all the different racks. And that's how you achieve your performance, right? If you want more performance, you add more racks and more LUNs to the file system. So 12,000 drives was doable. Okay. And the MSAs were really a bit more than a JBOD because you were using the basic data protection. That's right. That's right. From a price point, we were looking for something as close as possible. So Howard was right. You were wide striping the data across all 12,000 drives? Yes. Yeah. This would be a performer. This would be a screamer.
Starting point is 00:07:53 Yeah. Well, you know, 100 IOPS per drive, 12,000 drives. Yeah, we're talking millions. Millions. We're talking, you know, a lot of IOPS. That's right.
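A rough sketch of the arithmetic behind that "screamer" comment; the per-drive IOPS figure is the usual rule of thumb for a SATA spindle and the sizing below is illustrative, not Nuance's exact configuration:

```python
# Back-of-the-envelope math for a file system wide-striped across every MSA stack.
# All numbers are illustrative assumptions (a SATA spindle is usually good for
# roughly 100 random IOPS).
DRIVES_TOTAL = 12_000
DRIVES_PER_MSA_STACK = 96          # dual-controller MSA stack, as mentioned above
IOPS_PER_SATA_DRIVE = 100

stacks = DRIVES_TOTAL // DRIVES_PER_MSA_STACK
aggregate_iops = DRIVES_TOTAL * IOPS_PER_SATA_DRIVE

print(f"~{stacks} MSA stacks, ~{aggregate_iops:,} aggregate IOPS")
# -> ~125 MSA stacks, ~1,200,000 aggregate IOPS

# Because the file system is striped horizontally across all the stacks,
# adding another rack of shelves grows capacity and IOPS roughly linearly.
def iops_after_growth(extra_drives: int) -> int:
    return (DRIVES_TOTAL + extra_drives) * IOPS_PER_SATA_DRIVE
```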
Starting point is 00:08:17 And so we had great luck with this whole environment, but from a growth perspective, we were pretty much doubling every 18 months. So nobody was looking forward to 24,000 drives, right? It's not because it worked really well at 12,000 that we wanted to double it. So we were back with the same issue as we had with the Coraids: basically, we were questioning ourselves about management, we were questioning ourselves about costs and scalability. So then we started looking at what's the next step? So where do we look? And just for your information, we test a lot of vendors, right? So what we're trying to understand is what the market has to offer, what performance, what's the latency, the costs. And most importantly, at least from what we saw, support after purchasing the equipment is really, really important. And the way you can look at it is when you have that many hard drives, and certainly for the MSA, you are a statistic. So if there is an issue in the firmware, you're going to find it.
Starting point is 00:09:04 Right. This only affects one out of 10,000 drives. Yeah, that would be. Oh, that's me. Yes. So, and it was interesting because there were cases where we started to find stuff before anybody else found it, right? So the moment they had the new firmware, we would wait a little bit. I mean, you don't always want to be the guinea pig. But at some point, you converged to the new firmware and then you would realize, you know, here are the problems. So at
Starting point is 00:09:29 some point we started to decide to share all our logs with HP, the Colorado team, and they would regularly look into our log files to see if there's anything abnormal going on that would point to a bug in the firmware. So when you're talking log files, you're not talking about storage per se, you know, the outboard logs for the drives or the storage controller, but your internal logs from your activity and console logs and those sorts of things. Is that what you're talking about? Yes, that's right. Yes.
Starting point is 00:09:58 So we would work together with the Colorado team and then they would ask us, here are the logs we want from our device, here are the logs we would like to see from your application, and then they would put one and one together and figure out what the issues were. And, you know, they found a decent amount of stuff. I mean, we were like analytics for them, right? So they have a new firmware, they would give it to us, and then we would pretty much tell them where the issues are. Okay, and basically this, what looks to me like five or six petabytes of data, is audio files of voice to use for statistical analysis? Yeah, it's combined.
Starting point is 00:10:36 So typically when a speech recognition happens, there are two files that are being generated. One is the, as you suspect, the WAV file, the binary file, which is the sound. And then there is a text file associated with it, which is a textual representation of what the system thinks was being said. So yes, by nature, it's a lot of small files. You know, what we try to do on the static side is to take a lot of those files together
Starting point is 00:11:04 and tar them up to make them, you know, megabyte or gigabyte files. And sometimes we do that based on the metadata of the file. So for example, we could put automotive data together, or in some cases, we would put that tar file together based on a customer. And then the Scratch space is a little bit more the Wild West, right? Where Scratch is heavy-duty rewriting, and that's where you can expect to see smaller files. And smaller files is a challenge to your metadata. Oh, yeah. It's also a challenge to your overall performance, but it's always a challenge to keep people in line. So every time new data is being added, we wanted to make sure that when they added more data, they would pack it into tar files.
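As a minimal sketch of the kind of packing step being described, grouping many small WAV and transcript pairs into larger tar files by a metadata key, here is one way it could look; the directory layout and the grouping rule are hypothetical:

```python
# Pack many small utterance files (a .wav plus its .txt transcription) into
# larger per-group tar archives. The grouping key (e.g. "automotive" or a
# customer name) is assumed to be encoded in the directory layout.
import tarfile
from collections import defaultdict
from pathlib import Path

def group_key(wav: Path) -> str:
    return wav.parent.name          # hypothetical: parent directory names the domain/customer

def pack_utterances(incoming_dir: str, archive_dir: str) -> None:
    groups = defaultdict(list)
    for wav in Path(incoming_dir).rglob("*.wav"):
        groups[group_key(wav)].append(wav)

    for key, wavs in groups.items():
        with tarfile.open(Path(archive_dir) / f"{key}.tar", "w") as tar:
            for wav in wavs:
                tar.add(wav, arcname=wav.name)
                txt = wav.with_suffix(".txt")   # the textual representation of the audio
                if txt.exists():
                    tar.add(txt, arcname=txt.name)
```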
Starting point is 00:11:46 You found GPFS was up to the scaling, I mean, from thousands of drives to 12,000 drives and beyond that? Yes. And the metadata, you know, I'm not sure what the metadata engine looks like in GPFS, was able to handle literally millions, if not billions of files. Is that right? That's right. I mean, every time I utter something to something like Siri, that would represent a separate file or two. Yes. So the way you can look at it is that per second, GPFS is processing two or three million files. And so the GPFS product, you know, if you separate the backend data where the actual content of the file goes versus the metadata. So the metadata is the metadata of the files, right?
Starting point is 00:12:32 It's pretty much like a database. It's a textual row by row, and each row identifies metadata information for a file or a directory. And it's challenging, right? Because metadata is not a lot of capacity, but you need a tremendous amount of IOPS, right? And that's why... Frederic, when you're talking metadata, you're not talking about like standard NFS file metadata, but rather data that your systems are applying to indicate like it's automobile or somebody talking? No, no, no.
Starting point is 00:13:05 I'm actually talking about file like an NFS metadata. POSIX metadata. Yes. So, I mean, the way you can look at it is anytime a user wants to do something, they're going to hit the metadata, right? So even if they do a simple LS, they're going to hit the metadata. Now, when you talk about the voice metadata, that's where we used to put that into a Hadoop cluster. So there was a Hadoop cluster with 76 servers. And there you could put in a query for files. So, for example, you could say, as a researcher, I want to know where all the American English files are, 8 kilohertz, female collected in automotive,
Starting point is 00:13:46 that query, SQL query, would then go to the Hadoop cluster and return a list of paths to where the data is actually stored. Wait a minute, you had a separate metadata server. A Hadoop cluster is a separate metadata server? Yes. Oh. Well, you have to, right? So the last thing you want when you have 12,000 drives is somebody to go into Unix and say find.
Starting point is 00:14:14 Yeah, you're going to be gone for days, and the file that you were looking for could be there at the time you start your sequence, but it could be deleted in the middle of it, right? So you need something that's a lot more performing. We started out with Hadoop because we believed that Hadoop was a good use case, but then we switched over to HP Vertica, and our queries went from 12 minutes to 53 milliseconds or so. So that was a huge boost.
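The kind of query being described might look roughly like the sketch below; the table name, columns, and connection details are assumptions for illustration, not the actual catalog schema:

```python
# Query the utterance catalog for matching recordings and get back file-system paths.
# Uses the vertica-python client; the schema and credentials here are hypothetical.
import vertica_python

QUERY = """
    SELECT file_path
    FROM   utterance_catalog
    WHERE  language = 'en-US'
      AND  sample_rate_hz = 8000
      AND  speaker_gender = 'female'
      AND  collection_domain = 'automotive'
"""

conn = vertica_python.connect(host="vertica.example.com", port=5433,
                              user="researcher", password="...",
                              database="speech_meta")
try:
    cur = conn.cursor()
    cur.execute(QUERY)
    paths = [row[0] for row in cur.fetchall()]   # list of paths into the file system
finally:
    conn.close()
```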
Starting point is 00:14:51 And Vertica is yet another scalable database service from HP? It's a product from HP and HPE, and it's a columnar database, right? So I'm not sure if you're familiar with columnar databases. Yeah, a little bit. But it's still a SQL database solution. Yes, it is. And we went from 76 servers to 10. Ray?
Starting point is 00:15:11 Yes, sir. Vertica is also the back end that Nimble uses for InfoSight. Ah, that's interesting. I always thought they used Hadoop. Nope. That's a different discussion. Ah, interesting, interesting. So you mentioned that this was all statistically based.
Starting point is 00:15:29 It seems like the world is moving towards, and I'm not sure what the right terms are, but neural net, deep learning, machine learning kinds of things. Is that transition happening for speech recognition as well? Oh, yes. Oh, yes. It's very important. So the software or the algorithm you use to compare whatever you're saying versus the data you're comparing against runs on a single core, so if you want to do, say, a thousand speech recognitions at the same time, you're going to need a thousand CPU cores, right? And that's a lot of equipment once you
Starting point is 00:16:12 want... Wait a minute, wait a minute. Something like Siri that can be handling almost a million, right? I mean... Yes. You would need a million cores? That's right. You can't virtualize this stuff? No. I mean, the problem... Virtualization doesn't give you more, right? So if the CPU runs at three gigahertz, virtualizing is not going to give you six, right?
Starting point is 00:16:32 It might tell you you're going to get six, but you're not going to get more than three. So the reality is, because of the algorithm, because of the single core approach, it was a bottleneck from a scaling perspective. And that's where machine learning and neural networking is coming into play. And when I started working in speech recognition 15, 16 years ago, in those days, you were not going to run your speech recognition algorithm on a CPU. You were going to a DSP. And today, what's the replacement for a DSP? It's a GPU, right? So you're going to go to a company like NVIDIA and they will give you a lot more cores. Granted, the GPU cores are not as fast as the CPU cores, but for speech recognition, it doesn't really matter that much. It's really the amount of cores. So suddenly with neural networking and NVIDIA,
Starting point is 00:17:27 you have 3,000 cores or over 3,000 cores per GPU card, and you can assign a bunch of those cores to a recognizer, and now you can scale way, way beyond what you could do in the past. I'll give you some metrics, right? If you buy a 1U server, like a DL360, two sockets, you end up with, what, 24, 28 cores or so altogether. If you buy a server that can host eight GPU cards, typically they are about 4U, so four times more volume, but you get
Starting point is 00:18:01 eight times 3,000 cores. So that's 24,000 cores. Granted, they're GPU cores, so we have to be careful there. But for speech recognition, this is a huge boost. So in 4U, you have 24,000 cores. So you can scale and you can provide a lot more flexibility than before. Now, it does come with its own problems, right? So if you imagine that each CPU core needs access to a certain amount of IOPS and you go to GPUs, I mean, then the math is going the other direction, right?
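A quick sketch of that density math, using the round numbers from the conversation; the per-core IOPS figure at the end is a made-up placeholder just to show the direction of the problem:

```python
# Core density: a 1U two-socket server versus a 4U server holding eight GPU cards.
CPU_CORES_PER_1U = 28                    # e.g. a DL360, two sockets
GPU_CARDS_PER_4U = 8
CORES_PER_GPU = 3_000

gpu_cores_per_4u = GPU_CARDS_PER_4U * CORES_PER_GPU      # 24,000 cores in 4U
cpu_cores_per_4u = CPU_CORES_PER_1U * 4                  # 112 cores in the same space

print(f"GPU server: {gpu_cores_per_4u:,} cores in 4U")
print(f"CPU servers: {cpu_cores_per_4u} cores in the same 4U")

# The storage-side consequence: if every active recognizer core needs some I/O,
# packing a couple hundred times more cores into the same rack space multiplies
# the IOPS a single server can demand from the storage platform.
IOPS_PER_RECOGNIZER_CORE = 5             # placeholder assumption
print(f"~{gpu_cores_per_4u * IOPS_PER_RECOGNIZER_CORE:,} IOPS demanded by one 4U GPU box")
```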
Starting point is 00:18:38 Yeah, because now you've got two cores per spindle and that's going to be a problem. Yes, and then you're in a whole different ball game. But, you know, that's why we have technology and innovation, to help you solve these problems. And if you want a lot of IOPS in a single point or a single server, there is technology out there that can do that for you. If you prefer a SAN architecture or a distributed architecture, that's going to work for you as well. But yeah, like I said in the beginning, right?
Starting point is 00:19:08 It's all about a bottleneck and the bottleneck moves around and you just need to find ways to improve the bottleneck and make sure that you control the platform in the sense that you can make a decision to go left or right without having to reteach your users on how to use the system. Yeah, it's kind of the first lesson you have to learn as an IT architect is you never solve anything. You just move the bottleneck.
Starting point is 00:19:34 It's way too frequently that I see people, especially in the early days of Flash, people would say, I have a storage performance problem. I only have 1,000 IOPS. Let me go to a million IOPS. And, of course, at 10,000 IOPS, the bottleneck was someplace else, and they had already spent their whole budget for the year. Yes, and I do like it, right?
Starting point is 00:19:53 But as you said, it's very difficult for people, certainly the people who have to pay for the whole thing, is that the bottleneck changes, right? They look at the bottleneck as something negative, while in reality a bottleneck is just where you improve the performance of your environment, and it's just moving to somewhere else. But that doesn't mean that it's worse than before. It could, but not necessarily. You mentioned that the machine learning, deep learning neural net has got a different scalability, I guess, if I call it properly, than the old statistical Markov, hidden Markov
Starting point is 00:20:26 analysis. Can you explain that a little bit, Frederik? Yes, it's all about the ability to parallelize, right? So if you look at a GPU card, for example, a GPU, it's running a single application, but the data is different. I mean, if we go to the basics, what does a GPU card do typically in a laptop, for example? The only application you're running is putting pixels on your screen, but from position to position, the pixels are going to be different. And so you can look at that as a highly parallelized application. And that's the same thing you're doing with the neural networking, where you basically say, I'm going to take
Starting point is 00:21:06 a large task, I'm going to cut it into pieces, I'm going to give a bunch of cores a task or a set of data. They all run the same application, and at the end of the cycle, they all come back together and say, okay, we're going to take the data points with the highest return. From a CPU perspective, it's quite the opposite, right? Where with a CPU, you can run anything on a CPU or a core. So you can have two cores running two totally different applications. But if the only thing you're trying to do is the same thing over and over with different data, then the ability of having the CPU doing two different things at the same time is not gaining you anything, right?
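A tiny sketch of that "same program, different data" pattern: split the work into chunks, run the identical scoring function over each chunk in parallel, then keep the highest-scoring result. The scoring function here is a stand-in, and a real recognizer would evaluate acoustic and language-model scores on GPU cores rather than CPU processes:

```python
# Data-parallel "same application, different data": every worker runs the same
# scoring function on its own chunk, then the partial winners are combined.
from multiprocessing import Pool

def score_chunk(chunk):
    # Stand-in "recognizer": score each hypothesis and return the best (score, text) pair.
    return max((len(h), h) for h in chunk)

def best_hypothesis(hypotheses, workers=8, chunk_size=1_000):
    chunks = [hypotheses[i:i + chunk_size]
              for i in range(0, len(hypotheses), chunk_size)]
    with Pool(workers) as pool:
        partial_bests = pool.map(score_chunk, chunks)   # same code, different data
    return max(partial_bests)[1]                        # take the highest return

if __name__ == "__main__":
    print(best_hypothesis([f"hypothesis {i}" for i in range(10_000)]))
```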
Starting point is 00:21:43 So now they have an architecture where they create a tremendous amount of flexibility you're not going to use. You prefer to use technology where parallelization is built in. So the old hidden Markov approach was a CPU-based approach and the new neural networks are a GPU-based approach. Is that how I read this? Yeah, it's more about how wide you can parallelize. And because GPUs have the ability to go to the thousand cores, it's naturally a better way to use it. And, you know, in general, when I talk about GPU, you can replace the word GPU by accelerator cards, right? So Intel also has the Xeon Phi, there are other companies, you know, FPGAs. There's a lot of
Starting point is 00:22:33 tools you can use. Depending on your use case, you need to understand how useful it's going to be to your use case, and then you choose what you want to do. I think Google actually came out with some new... TPUs, right? Yeah. That's right. So basically what that is is exactly the same thing, except that for their particular use case, they found that one way of doing it was better than GPUs or FPGAs or Xeon Phi. In reality, it's all about use case.
Starting point is 00:23:02 And same thing with storage, right? It all depends on your use case and how to use and how to pick the environment. And things get really interesting in the next couple of years when Intel comes out with the Xeons that include some FPGA built in. Yeah, I think it's a natural thing, right? So when we talk about high-performance computing, it's kind of, you know, people say, oh, that's a niche and it's very specific. But since storage has become so cheap, a lot of people have enough storage and the ability to store a lot more data than they used to. And so they're really entering, to a certain degree, the world of high-performance computing without really wanting to call it high performance computing. Yeah, we're starting to see that the technologies of high performance computing, like, you know, large clustered file systems like GPFS, move their way into the commercial data center.
Starting point is 00:23:54 Yeah. And then there's, you know, GPFS is kind of the old guard, you know, they have been around for long. Intel, although you would look at them as a hardware company, they are spending a lot of money on Lustre, which is a competitor to GPFS, which has some interesting features. You know, it's a fast-moving market. It's not being ignored. You have to keep track of all the different things. And the good thing there, as always, is you have options, right? Options depending on your use case.
Starting point is 00:24:20 Is GPFS open source? No, no, no, no. It's proprietary. It's from IBM. No, no, I know it was IBM, but I didn't know whether it was open source? No, no, no, no. It's proprietary. It's from IBM. No, no, I know it was IBM, but I didn't know whether it was open source. No, no, no. Lustre has an enterprise. Paying version also has open source, so you, because the scale computing guys started with GPFS
Starting point is 00:24:48 before they developed their own object backend. Yeah, but I think that was under license from IBM, though. It's definitely under license, but they made a lot of changes to make it a distributed as in addition to clustered file system. Yeah, there was another one besides Lustre. There was a GFS I think Red Hat has. Yeah, GSF is pretty good as well, but it's not as scalable as the others.
Starting point is 00:25:12 So if you want to use it up to 100 terabytes, that's okay. But I would use Lustre or GPFS if you are going to use more than 100 terabytes. You mean I can almost have my own HPC site in my basement here? Yes. I just need one of these 4U3000 core. That's right. 24,000, sorry.
Starting point is 00:25:31 I got enough hardware. I just need an application. You know, that's the problem: you need to figure out, you know, infrastructure is grand, but it's really about trying to solve a problem that matters. You know, one of the questions that the data world has these days is how many people does it take to administer a 24,000 drive environment with two tiers of scratch and static and a Hadoop cluster? I mean, how many admins do you have in this world?
Starting point is 00:25:57 I guess that's a question. So the first comment is we do a lot with automation. And so we are a 24 by 7 environment, but my people only work 9 to 5. That doesn't mean stuff doesn't happen after 5. It's just that the way the platform is built and the redundancy, there is no need to jump and run to the data center to replace something. So from an admin perspective, we have one guy in the data center to do break fix, as you can suspect, mechanical device break, they do break a lot. So a lot of his time goes to that. I have an operational team that mostly works on application changes with the users. So we don't have a standard application,
Starting point is 00:26:44 any script is considered an application. So there's a lot of work with the users to make sure that they don't destroy the environment. I mean, you can imagine, you know, if you need to write one megabyte, you don't want one million one-byte files. You want a single one-megabyte file. But you would be surprised sometimes, you know, that people don't understand scale. And then on the engineering side, I have one main architect, I have one person who looks at the open source community, because we look a lot at Spark and that kind of stuff and we do a lot of POCs, and then I have another person on the engineering side who works mostly on security,
Starting point is 00:27:28 on automation tools, xCAT, Puppet, and that kind of stuff. So it's a relatively small, fast-moving group. And this is petabytes of storage, right? Five, six petabytes of storage? Yeah, on the block side. And then we have about two on the object side. So yeah, we hear that a lot. A lot of people tell us, you know, that we don't have a lot of people, but you know, you need to own the platform. You need to automate a lot and have a good understanding of high availability, right? So in other words, when you buy a storage device, you have to make sure that your power is well divided. I mean, now with the new device, it's not necessarily an issue. But if you look at the MSAs in the past, you know,
Starting point is 00:28:15 if you wired it wrong, you would end up with two power supplies on the same PDU, the PDU would go down and boom, you pretty much have data loss, right? So yeah, it's making the right choice. A lot of people ask me how we do it. I guess it's difficult to explain. For me personally, when we started, we just had two people, me and my main architect. For some reason, when we design something and look at stuff, I kind of feel where the weak points are, and we try to approach them and handle them. But we had issues in the past where we had bad batches of drives, bad batches of power supplies, and we never got hit by downtime because of the architecture. You also had the advantage of not having 400 unique Snowflake applications that various departments bought that have to get supported.
Starting point is 00:29:08 Oh, yes. That's right. Yes. Yes. The complication in the corporate data center, much of it comes from that. Yes. So back to the, you keep talking about owning the environment. I think that because you went to GPFS, you can effectively replace anything underneath it that you want as long as it talks some sort of storage protocol. Is that kind of how I should understand that? Yeah, it's not necessarily the storage protocol, but let's give you an example. So for example, a lot of the devices out in the market, they will do tiering for you, they will do snapshotting for you, But they all do it within the same device, right?
Starting point is 00:29:47 So imagine, let's take an example. Let's assume I have three 3PAR devices and I want to build a file system across them. So horizontally, I build the file system across, you know, 3PAR one, 3PAR two, and 3PAR three. I want to be able to control the tiering because if I tell 3PAR 1, go ahead and start tiering, that's going to hit my performance, right? Because that device can decide to move data at any point in time, but it doesn't know what 3PAR 2 and 3PAR 3 are doing. And it doesn't know anything about your data or your applications.
Starting point is 00:30:20 That's right. You have so much more context that you can make more intelligent decisions. Yes. And as a result of this, we put the intelligence in GPFS and tell GPFS, this is what you need to do. And that gives us – there's pros and cons, right? If somebody comes up with an incredible new way of doing things that we typically do at the GPFS level, yes, that's going to hurt us. But nowadays, we rely on basic functionality of storage devices. You know, 3PAR uses chunklets,
Starting point is 00:30:52 so that works really well. But from a tiering and all that good perspective, we prefer to do that at the GPFS level for control. And as a result of this, we can say we're going to replace 3PAR with something else or upgrade the 3PAR and so on. And GPFS allows you to take a hardware device while it's in production. So, for example, if I say, you know, the example with the three 3PARs, if I want to upgrade 3PAR number three, I can tell GPFS move all the data from three to one and two. It will do that online. And then when all that's done, I can take 3PAR 3 out of production
Starting point is 00:31:30 and the users don't know anything. And then I can replace 3PAR 3 with, let's say, 3PAR 4 and then tell GPFS to load balance the data that's on 1 and 2 now across 1, 2, and 4. And all you're relying on the 3PAR, the MSA, to do is provide basic resiliency. Yeah, and that's how we went from, you know, Coraid to MSA to 3PAR.
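A minimal sketch of how that online swap can be driven through GPFS's standard disk-administration commands (mmdeldisk, mmadddisk, mmrestripefs); the file-system and NSD names are hypothetical, and in practice these steps would live inside the automation framework rather than be run by hand:

```python
# Retire one array and bring in its replacement while the file system stays online.
import subprocess

FS = "speechfs"                                    # hypothetical GPFS file system name

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Drain the NSDs that live on the array being retired; GPFS migrates their
#    data onto the remaining disks before removing them from the file system.
run(["mmdeldisk", FS, "array3_nsd1;array3_nsd2"])

# 2. Add the NSDs carved out of the replacement array (described in a stanza file).
run(["mmadddisk", FS, "-F", "/etc/gpfs/array4_nsds.stanza"])

# 3. Rebalance existing data evenly across the old and new disks.
run(["mmrestripefs", FS, "-b"])
```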
Starting point is 00:31:53 Our users are clueless on what we're doing. As a matter of fact, when we asked them, they have actually no idea. They still think we're running on the same storage device as we had eight years ago. Okay, well, I mean, moving from MSA to 3PAR, there's a significant cost per gigabyte difference. But having fewer devices managing more drives simplified management enough to justify that? Yeah.
Starting point is 00:32:20 So there was a cost. So if we talk about price, the original 3PAR devices, like the V400, they were a lot more expensive than the models we used afterwards, the 7400s. So I would say we had four V400s that were really expensive, or at least more expensive than we wanted. But the 7400s are price-wise, at least with our discount, much, much closer to the MSA. Not the same price as the MSA, but at least closer to the MSAs, and that made it really worthwhile to move for us. Right. And instead of having 96 drives per pair of controllers, you've got several hundred drives. Oh, yes. And with a lot of cost savings too, because each MSA came with a bunch of fiber
Starting point is 00:33:05 channel ports. And now we had a consolidation of fiber channel ports. I mean, fiber channel is not cheap. So we actually, going from the MSAs to the 3PARs, we actually reduced the amount of fiber channel ports significantly. Wait a minute, you're a fiber channel SAN? I would have thought this would have been IP SAN, iSCSI or something like that. No, no, no, it's fiber channel. Because of throughput, latency, or? Because of the bad taste in their mouth from ATA over Ethernet. Ah! Wait a minute, wait a minute, wait a minute.
Starting point is 00:33:35 The world is, you know, moving, I think. Yes, but remember, this is what we, that's what we started doing 10 years ago. And just full disclosure, I mean, the platform I just described, we're going to replace it soon with a completely new architecture where we do it differently. But for the last, you know, eight, nine years, it was all fiber channel. I mean, it works really well. We have the fiber channel switches. I have clients that are fiber channel suppliers, you know. Biggest problem with fiber channels, paying for it.
Starting point is 00:34:06 That's right. You seem like you guys do a lot of proof of concepts for vendors, more so than others, I would think. I can't tell, really. How often do you do proof of concepts? All the time. You're looking for a new technology that can provide
Starting point is 00:34:22 cost, reliability, or performance advantages? All of the above. So what happened a couple of years ago is every time we tested a device, we would find issues. And then the vendor was really interested in how we did our testing. I mean, imagine we have like 10, 12 racks available to us for testing, right? So who has 12, 13 racks they can deploy against a new device? I mean, Howard is probably the only guy I know that has 10 or 12 racks.
Starting point is 00:34:48 I've only got five. Ah, I see. And so what we noticed is that people like HP and IBM, because we talk to pretty much everybody, they started saying, hey, can you look at our device and see what do you like, what don't you like, and so on. And as we were doing that, we were getting higher and higher into the engineering teams with all these companies. And so we got a lot of information.
Starting point is 00:35:12 And then at some point, we had companies tell us, hey, we're under NDA, but can you guys look at this? And we have this new thing, and can you try it out? And then before I know, I have VCs calling me up and say, hey, we heard from such and such that you tried, you know, product such and such in alpha. And what do you guys think? And is this good technology? And we kept on doing that. I talked to a lot of startups. Some of them don't even have a website. Others have a website and others are in a stage where
Starting point is 00:35:41 they're hitting the market. And it seems like we provide good technical feedback that they can use moving forward. And I came from R&D, so for me, it's all about technology. If I can learn something and I can help people understand what we do and I can learn what they're doing, then I consider that like a win-win. Yep. And Nuance Management supported you in that, which is good. Yes. Most of the time, yes.
Starting point is 00:36:11 Well, and the other thing is you can help guide their technology to something that's more amenable to what you're looking for, too. I mean... That's right. Yes. And I think that we have a tremendous reputation within the company of making things work, right? So, and sometimes it's not a positive where if something really, really goes bad in a different department, you know, they know where you live, right?
Starting point is 00:36:32 So then you're in troubleshooting. But overall, I like technology. I like working with vendors. I like listening to what other people have to say. And, you know, and that's why I pretty much decided to jump ship and say, why don't I focus even more on technology than I used to? And so far, so good. I really like it.
Starting point is 00:36:56 There's no better way to say it. So at Hyphens, you're consulting with these startups and companies about the technology? Oh, yes. Yeah. You're also working with the HPC companies to identify technology that might be applicable to their environments? Yes. It's interesting.
Starting point is 00:37:13 I mean, I left on May 2nd, and I have companies calling me up, asking them to work with them from an HPC perspective, because they believe they have something that should be good or a good fit for high-performance computing. So I get a lot of traction in those areas. We have to see how we can make that all work. But there is clearly at this point no lack of interest. Yeah, that's great. For a consultant just starting out, that's mind-boggling. I wish I was there. So you mentioned Lustre and GPFS as predominantly the two solutions that you would look at in this sort of environment. Yes. You know, like I said, GPFS is kind of the older technologies out there.
Starting point is 00:37:59 It has some things we don't like from a scalability perspective that new technology like Lustre is taking over. And GPFS started out as a hundred percent block, right? So now they added features where there's object, you know, with the Cleversafe acquisition, or they start to tier. Lustre is based on more of a block slash object concept out of the gate, right? So it's more, the architecture is more modern and more targeted, if you ask me. And there are others out there, right? So there's a lot of homegrown file systems. But once you talk about petabytes and scalability,
Starting point is 00:38:39 it comes back to a statistic, right? Do you want to be the person who is the number one or the first customer of a new startup that created the new high-performance file system? We did it before, to be honest. I mean, eight years ago, we did that with a company and we got burned significantly. You know how they say you can tell the pioneers. Yeah. Pioneers are the guys with the arrows in their ass. Yeah. Well, yeah, I've been there, done that.
Starting point is 00:39:07 I understand that completely. And file systems take a while to mature. Oh, yes, a lot of testing, a lot of validation. Howard, do you have any last-minute questions to ask? No, no, I'm getting it. I've enjoyed the conversation. Yeah, this has been great. Frederick, do you have anything you want to say to the
Starting point is 00:39:25 GreyBeards on Storage audience? I would say keep listening. I think it's a great venue, and let's keep it going. And we like it a lot. And any of your listeners out there interested in doing a review on iTunes for the podcast, that would be great as well and get us a little bit more exposure.
Starting point is 00:39:42 I have two. The first, the personal one is FVHA, which is an abbreviation of my name, so FVHA. And I also have one for my company, which is @HighFens.
Starting point is 00:39:58 That's great. And www.highfens.com is your URL? That's right. Yeah. I'm working on it. Like you said, I didn't expect to have so much work and so much interest in the beginning,
Starting point is 00:40:14 so my website isn't really there yet. I think we can all say that. Yeah. And we've been at it for years. Some of us decades. We won't go there anymore. All right. Well, next month, we won't go there anymore. All right. Well, next month we will talk to another startup storage technology person.
Starting point is 00:40:29 Any questions you might want us to ask, please let us know. That's it for now. Bye, Howard. Bye, Ray. Until next time, thanks again, Frederick, for being on our show. Yep. Thank you for having me.
