Grey Beards on Systems - 113: GreyBeards talk storage for next gen. workloads with Liran Zvibel, Co-Founder & CEO WekaIO

Episode Date: February 2, 2021

Sponsored By: I've known Liran Zvibel, Co-founder and CEO of WekaIO for many years now and it's the second time he's been on our show (see Episode 56: GreyBeards talk high performance file storage). In those days, WekaIO was just coming out and hitting the world with this extremely high-performing, scale-out unstructured data storage solution.

Transcript
Starting point is 00:00:00 Hey everybody, Ray Lucchesi here with Keith Townsend. This GreyBeards on Storage episode is brought to you today by Weka.io and was recorded on January 26, 2021. We have with us here today Liran Zvibel, co-founder and CEO of Weka.io. Liran, why don't you tell us a little bit about yourself and what's going on at Weka.io these days? All right, thanks. I'm always happy to be on your podcast. There are few really serious podcasts about storage. So as you know, I'm not here for the first time. I've been doing storage most of my adult life. And we started Weka in late 2013 when we basically identified, hey, storage is not going the right way. Customers couldn't actually get a single system to do what
Starting point is 00:01:02 they wanted. They had to go into buying direct attach, and back then that meant people like Fusion-io (obviously, since then, it was all commoditized by NVMe), or SAN if they needed performance plus higher grade features. The ease of use they got from NAS, but NAS didn't have good performance and didn't scale, so they went for object storage for scalability. So we decided to create a completely new kind of storage system that basically solves everything you would expect from a storage system in a single system. So we've created a system that has the NAS ease of use, shared data that all clients can access, that is faster than what you can get from high-end block or direct attached. It doesn't have scalability issues.
Starting point is 00:02:08 So you can have billions of files in a directory, trillions of files in the file system, and it is completely software. So it runs on any on-premises server vendor you'd like, but we also run on the cloud, which allows you to push data to the cloud. So effectively, you try to answer all these problems with one solution? Exactly. So now if you look at the legacy storage vendors, and these are companies we really appreciate, each one of them has dozens of products. And the reason they have dozens of products is you actually need these different products to solve different kinds of IT issues.
Starting point is 00:02:54 And our vision is to create a single product that is going to solve most of enterprise storage requirements. So it's going to be very easy to leverage and it's going to allow customers to run exactly the same on-premises and on the public cloud. So, Liran, one of the... I was going to say one of the arguments that some of the traditional vendors will make is that, you know, this is a horses-for-courses argument,
Starting point is 00:03:24 the best tool for the proposed job. Is that more just their position in the market versus the reality? What are customers reacting to when you come to them with this all-in-one solution? So customers love it, because of what happens with the traditional vendors, and you know, I've been one. I sold IBM storage for a while, and before they acquired XIV, we sold it as a startup. When you're making a product that you have to pre-size, the customer buys it, and they're not going to put more on that box than what they have to, because they will outgrow it. It forces customers to go and buy tons of little silos. And, you know, it's not efficient, it's not effective. With Weka, since we can scale basically limitless, to, you know, everything has
Starting point is 00:04:22 limits, but our limits are just so big, it's not practical to reach them. Since you can start with a smaller system, grow as you'd like, add more applications, grow your applications, you don't have to go manage so many different boxes, which is one problem. At IBM, we saw the multi-system management box, and customers paid money to be able to manage more boxes in an easier way. That's one side of it, and on the other, you don't have to go into this whole spiel of sizing, because if you have to now go and size, it's so hard.
Starting point is 00:04:59 So, I mean, lots of, you know, and that was great for the old enterprise applications of yore and stuff like that. But nowadays, there's lots of new applications coming to the fore. I mean, machine learning, AI, deep learning, all that stuff. It seems like it's coming up with a new pattern of IO. Would you say, Liran? It does come with a new pattern of IO because what you're mostly getting with machine learning
Starting point is 00:05:25 and a lot of the first applications are around vision. You get tons of small images. So, you know, the traditional filers were somewhat okay with IO over larger files. But when you have to go and read a tremendous amount of 200 to 300K files, and these GPUs can process data at gigabytes per second, you're getting to the point that if you're leveraging NFS or any other legacy protocol, then you just cannot feed these systems fast enough. Is that because of the metadata requirements that are induced by all these small files and stuff?
Starting point is 00:06:09 So on the one hand, it is a metadata problem, and that is the reason, you know, even if you address the older solutions with newer protocols, they will just not have enough metadata capacity to go through all these little files and serve them. But NFS also has deficiencies as a protocol. You know, Sun invented it back in the 80s when Ethernet was 10 megabit.
Starting point is 00:06:37 Ethernet now runs at 100 gigabit. And up until 10 gig, NFS was actually quite capable of scaling. So up until networking was 10 gigabit, NFS was not the bottleneck, but above that, NFS is definitely the bottleneck. And what we're seeing is that if that's what you're using, and if you've designed a system to be able to drive NFS, you just don't have enough metadata and small-file performance capacity to drive these GPU-based systems. So are there new protocols that are emerging? I mean, obviously NVMe and stuff like that, but beyond those, are there new protocols that are starting to take on these machine learning, deep learning workloads? So definitely. We have our own, the Weka protocol, that basically lets you get full POSIX.
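The small-file bottleneck described above comes from per-file overhead: every small file costs an open, a read, and a close, and on a network protocol each of those becomes at least one round-trip. A minimal local sketch (file counts and sizes are arbitrary illustrations, not from the episode) shows the op-count difference between reading many small files and one large file of the same total size:

```python
import os
import tempfile
import time

def write_dataset(root, n_files=200, size=4096):
    """Create n_files small files plus one large file of the same total size."""
    small_dir = os.path.join(root, "small")
    os.makedirs(small_dir, exist_ok=True)
    payload = b"x" * size
    for i in range(n_files):
        with open(os.path.join(small_dir, f"img_{i:04d}.bin"), "wb") as f:
            f.write(payload)
    big_path = os.path.join(root, "big.bin")
    with open(big_path, "wb") as f:
        f.write(payload * n_files)
    return small_dir, big_path

def read_small(small_dir):
    """Read every small file: one open/read/close cycle per file."""
    total = 0
    for name in sorted(os.listdir(small_dir)):
        with open(os.path.join(small_dir, name), "rb") as f:
            total += len(f.read())
    return total

def read_big(big_path):
    """Read the same bytes as a single sequential stream: one open."""
    with open(big_path, "rb") as f:
        return len(f.read())

with tempfile.TemporaryDirectory() as root:
    small_dir, big_path = write_dataset(root)
    t0 = time.perf_counter()
    small_bytes = read_small(small_dir)
    t1 = time.perf_counter()
    big_bytes = read_big(big_path)
    t2 = time.perf_counter()
    assert small_bytes == big_bytes  # same data, very different op counts
    print(f"small files: {t1 - t0:.4f}s, single file: {t2 - t1:.4f}s for {big_bytes} bytes")
```

On a local file system both paths are fast; the point is that the small-file path does hundreds of metadata operations where the large-file path does a handful, and over NFS each of those operations becomes a network round-trip, which is where the gap blows up.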
Starting point is 00:07:34 Another deficiency of NFS is that if you read the man pages and you program to them, that is not exactly what you get with NFS. So many programs, especially databases, require a local file system, and that's the reason customers keep buying SAN storage and they form a local file system over it. So we give the semantics of full POSIX. But our protocol is proprietary. NVIDIA is actually taking fate into their own hands, and they developed GPUDirect Storage. They enable storage vendors to write data directly into the GPU memory in a very effective way. NVIDIA, which also owns a networking company because they acquired Mellanox, has come up with their own protocol that allows customers to very effectively feed these GPU systems. So, Liran, I can easily see how if I'm in a bare metal environment and I need to write an application, especially a machine learning or AI application that takes full advantage
Starting point is 00:08:45 of that bare metal, how WekaIO and even some of the stuff that NVIDIA is doing with dedicated GPUs takes advantage of that. But in this world in which I need portability of workflows and portability of compute capability, how does Weka help solve that portability problem? So this is actually spot on. We're not seeing the smart customers running on bare metal anymore. So a few years ago, they were forced to run bare metal
Starting point is 00:09:16 because virtualization imposed too much of a tax. But if you read what NVIDIA is doing to push their customers, and the other frameworks, everyone's now moving to Kubernetes. And both NVIDIA and us, we have full support for Kubernetes in our POSIX file system. NVIDIA has support for Kubernetes in their GPUDirect Storage. This allows customers to go repackage their applications in a way that allows for very easy running on bigger environments and smaller environments, and sharing of resources in a way that doesn't waste a lot of the hardware resources. Further, it also allows customers to go and run very similar environments in their on-premises environment,
Starting point is 00:10:07 but also on the public cloud. And this... So they're containerizing these AI machine learning, deep learning workloads and stuff like that. I mean... All over. I could see the containerization of the compute, but the data has to be someplace as well, right? I mean,
Starting point is 00:10:32 and how you feed that. So you mentioned that you have POSIX support under Kubernetes? So we have POSIX support under Kubernetes. We have a CSI plugin. And actually what we're showing customers is that we enable them to scale their Kubernetes environments. Because usually what happens, and you know, NFS was never a good protocol for sharing of data. Each one of you probably did an untar of a large file, and you'd see that on the local file system, it finishes right away. On NFS, it can take hours, sometimes days. So NFS was never planned to offer the real metadata performance of a local file system. It was not part of what they were trying to achieve back then. But when you're doing these Kubernetes projects, what we see with customers,
Starting point is 00:11:25 and when you containerize, you still need to access some stateful applications, because applications, at the end of the day, need state. So as long as these applications end up running all of the containers that access the same data on one host, everything runs terrific because they're using the local file system. But then all of these
Starting point is 00:11:47 platforms, when they start scheduling containers over several hosts, and eventually it happens when you start scaling, you cannot run everything on one host. All of these frameworks boil down to NFS, and now the metadata performance drops significantly. What happens with Weka, because we are showing that we have better performance than a local file system on shared environments, we are actually doing the same thing also with Kubernetes. So, we allow you to access our POSIX interface through the CSI plugin. And no matter where you run your containers, you're getting terrific performance. And now scaling is solved.
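The Kubernetes side of what Liran describes boils down to a PersistentVolumeClaim with ReadWriteMany access, so the scheduler can place pods on any host and they all mount the same shared file system. A minimal sketch of such a claim follows; the StorageClass name here ("weka-fs") is a hypothetical placeholder, since the actual class name depends on how the CSI plugin is deployed in a given cluster:

```python
import json

# Hypothetical name: the real StorageClass depends on the CSI plugin deployment.
STORAGE_CLASS = "weka-fs"

def make_pvc(name: str, size_gi: int, storage_class: str = STORAGE_CLASS) -> dict:
    """Build a PersistentVolumeClaim manifest for a shared POSIX volume."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # ReadWriteMany lets pods scheduled on *different* hosts mount
            # the same volume, which is the scaling problem discussed above.
            "accessModes": ["ReadWriteMany"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": f"{size_gi}Gi"}},
        },
    }

pvc = make_pvc("training-data", 500)
print(json.dumps(pvc, indent=2))
```

A pod spec would then reference the claim by name in its `volumes` section; with an RWX-capable CSI driver underneath, containers on every node see the same POSIX namespace instead of falling back to NFS.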
Starting point is 00:12:33 Yeah. And the reason you want to scale, of course, is to get as many GPUs as possible doing the inferencing or the training for your workloads and stuff like that. So I could see where this thing would burst up to some, you know, huge size either on-prem or in the cloud. And then at some point contract again down to something that's a little bit less, you know, less compute intensive and stuff like that. So that's very interesting. Right. One of the,
Starting point is 00:12:58 one of the conversations I'm getting into a lot with customers has been this avoiding of lock-in to legacy static architectures. And one of the advantages to what I'm hearing, if I'm using just regular Kubernetes CSI, is that WekaIO can now kind of abstract itself from this legacy NFS problem in the sense that, you know, it's designed for today. Sure, I can take advantage of 100 gig or 400 gig. But when I get to photonics that can truly take
Starting point is 00:13:33 advantage of large, large data sets, all I have to do is, if I'm using the CSI for Kubernetes or whatever abstraction I'm using, before... Weka is kind of all-in on this ability. Yes, the underlying bits and pieces, we can change that protocol. We can optimize that protocol, but we're not asking you to consume us directly. We're going to feed some other provider, which allows for a better operating experience.
Starting point is 00:14:01 Definitely. And you're going to do the same thing on-premises or, let's say, in AWS. Even in AWS, we're a primary storage competency partner. So AWS tells their customers, hey, you can use Weka and it's as good as our own services. Obviously, on-prem, you can get us from Hewlett Packard Enterprise, Dell, Lenovo, Supermicro, Hitachi Vantara, so a lot of vendors carry us. One feature we didn't mention is our snap-to-object.
Starting point is 00:14:31 So we have the NVMe tier that is incredibly performant. We can let you extend the namespace to object storage, and then you get the object storage economics. But a great feature we have is what we are calling snap-to-object. When you are tiering to an object storage, you can save a snapshot to that object storage in a way that any other Weka system, could be way smaller, could be way bigger, can pick it up from that point in time and keep running. And that system could take a snapshot, save it to the object storage, and the original system will attach to it as if the work happened locally.
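The key idea in snap-to-object is that the snapshot pushed to the object store is self-describing, so a completely different cluster can attach to it. The toy model below illustrates that flow only; it is not Weka's actual on-disk or object format, and the class and key names are invented for the sketch:

```python
import hashlib
import json

class ObjectStore:
    """Stand-in for an S3 bucket: a flat key -> bytes namespace."""
    def __init__(self):
        self.objects = {}
    def put(self, key, data):
        self.objects[key] = data
    def get(self, key):
        return self.objects[key]

class ClusterSketch:
    """Toy model of snap-to-object: push contents plus a manifest,
    so any other cluster (bigger or smaller) can rehydrate from it."""
    def __init__(self, name):
        self.name = name
        self.fs = {}  # path -> bytes

    def snap_to_object(self, store, snap_name):
        # Store file contents keyed by content hash, plus a path manifest.
        manifest = {}
        for path, data in self.fs.items():
            key = hashlib.sha256(data).hexdigest()
            store.put(key, data)
            manifest[path] = key
        store.put(f"snap/{snap_name}", json.dumps(manifest).encode())

    @classmethod
    def attach(cls, store, snap_name, name):
        # A brand-new cluster reads the manifest and pulls the objects.
        c = cls(name)
        manifest = json.loads(store.get(f"snap/{snap_name}"))
        c.fs = {path: store.get(key) for path, key in manifest.items()}
        return c

bucket = ObjectStore()
onprem = ClusterSketch("onprem")
onprem.fs["/data/model.bin"] = b"weights-v1"
onprem.snap_to_object(bucket, "nightly")
cloud = ClusterSketch.attach(bucket, "nightly", "aws-burst")
assert cloud.fs == onprem.fs
```

Because the object store is the hand-off point, the same pattern supports both directions he mentions: DR (attach after the original cluster is gone) and cloud bursting (attach, compute, snapshot back, re-attach on-prem).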
Starting point is 00:15:11 So you can have an on-premises Weka. It could tier to AWS S3. It could save snapshots. And then in case of DR, or if you just need more capacity, you could spin up a cluster and you could leverage us for either DR or burst. And with Kubernetes, the application doesn't care where it runs. So the DevOps and IT, they need to make sure they provide a similar Kubernetes environment from the application perspective, but the actual business owner, the business line, doesn't need to know if it now runs on-premises
Starting point is 00:15:51 or it runs on the cloud. So one of the industries where I've seen this as a unique problem is drug discovery. You know, I can ingest a ton of data at the edge, and it can even be Weka as the front end. But one of the difficulties is getting the data into either the cloud or some locale, some on-premises location, let's say in Germany, where there's a bunch of drug research happening. So if I'm collecting the data in the U.S. and I get past the export laws or whatever, and I need to process the data in Germany, what I'm hearing is I can use object as kind of that transport protocol that's cheap, fast, and easy, to get it replicated from the U.S. or those edge locations back into where I'm going to actually process the data with my GPUs. So in essence, you would snapshot it from the U.S. shores into object, and then you
Starting point is 00:16:54 would instantiate a Weka IO in Germany to access that object? Is that how this would work? This is definitely possible. And we have customers that are doing things that are very similar. We have a lot of pharmaceutical customers. And what they're doing, some of them run the processing on-premises, but they have cycles that just require a tremendous amount of processing. And then they push it to the cloud.
Starting point is 00:17:24 And we have a very good ability to even run over several availability zones, because at some point they just can't provision enough compute in a single availability zone. And this is something that we're seeing a lot in the earlier stages around genomics, computational chemistry, cryo-EM, which is the new kind of microscopy. And also, and this surprised us, the later stages of the FDA approval apparently require a tremendous amount of data analysis. So the FDA is coming with questions about the data and running a lot of statistical models. And the new wave of drug discovery companies are actually running that phase in AWS. And then they're able to shorten what would take weeks on-prem to hours on AWS,
Starting point is 00:18:25 and they're only paying for these hours. Yeah, and I think harkening back to an earlier part of the conversation, what I've seen as a challenge, and just HPC in general, not just drug discovery or AI, has been workload portability, the ability to get the application. And while we're talking about the luxuries of AWS and Google and all of that, that's great. A lot of times we're using niche HPC providers that just don't have the same capabilities as AWS
Starting point is 00:18:55 and these types of solutions enable those niche cloud providers like the University of Chicago or some other university that rents out super compute capability. You mentioned someplace in that discussion there that you support multi-availability zone access to the data.
Starting point is 00:19:18 Could you explain that? Right. So this is something that a lot of customers are coming to us for, because it's not as well supported with the native storage. But you can run Weka in one AZ and actually run clients in other AZs, and it runs very well. You can also choose to run Weka in what on-prem would be considered a Metro cluster way. So you can actually run Weka on three or four AZs. You would protect four plus two or five plus four. And then in the case that a complete AZ comes down, your data is still protected.
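The "four plus two" protection mentioned here is N+M erasure coding: data shards plus parity shards spread across zones, so losing a zone loses only shards that can be rebuilt. As a much-simplified illustration (single XOR parity surviving one lost zone, rather than the multi-parity codes a real system uses), the reconstruction idea looks like this:

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def stripe_with_parity(data: bytes, n_zones: int):
    """Split data into n_zones - 1 data shards plus one XOR parity shard,
    one shard per availability zone."""
    n_data = n_zones - 1
    shard_len = -(-len(data) // n_data)  # ceiling division
    padded = data.ljust(shard_len * n_data, b"\0")
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(n_data)]
    shards.append(xor_blocks(shards))  # parity shard
    return shards

def recover(shards, lost: int, orig_len: int):
    """Rebuild the lost shard from the survivors, then reassemble the data."""
    survivors = [s for i, s in enumerate(shards) if i != lost]
    rebuilt = xor_blocks(survivors)  # XOR of survivors equals the missing shard
    full = shards[:lost] + [rebuilt] + shards[lost + 1:]
    return b"".join(full[:-1])[:orig_len]  # drop parity, strip padding

data = b"genomics dataset block"
shards = stripe_with_parity(data, n_zones=4)  # 3 data zones + 1 parity zone
assert recover(shards, lost=1, orig_len=len(data)) == data
```

Single XOR parity tolerates one failure; schemes like 4+2 use two independent parity shards so any two zones can fail simultaneously, at the cost of more parity capacity.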
Starting point is 00:20:00 And still have access. Yeah, yeah, yeah, yeah. That's it. That's very interesting. And, you know, if you store your data on EBS and the AZ goes down, you're done. So we actually offer higher resiliency to AZ failures than AWS's native storage solutions. Right, right, right. So, I mean, the POSIX stuff is interesting to me, but in an on-prem environment, you would end up making changes to the server software. And I guess I'm trying to understand how that works in the Kubernetes solution. Is it all kind of embedded in the CSI plugin? Yeah. So as long as you run a well-known Kubernetes distribution,
Starting point is 00:20:49 either native or, like, an OpenShift branch or whatever, we've got you covered, we will support you, and it's all baked into the solution. And then what customers get is a very, very similar runtime environment, both on-premises and in the cloud. Another thing they're getting with Weka that they don't get with other solutions is multi-protocol access, so the same data is also available over NFS and SMB. And we would connect to your Active Directory.
Starting point is 00:21:27 We would synchronize ACLs between the NTFS ACLs and the POSIX ACLs. You can access with S3 if you have, like, RESTful stateless clients. Like, you could run Kubernetes on AWS, but you would want some Lambda to access it in a stateless way. Then it could access S3. So we are actually providing, not only do we let you run the bulk of what you're doing with Kubernetes, we let you access the same data through other environments that you may be using as well. I was going to say, this is part of my love-hate relationship with Kubernetes. The concept
Starting point is 00:22:06 of being able to abstract away the underlay... I'm not getting in, Kubernetes isn't getting in, the data path. It isn't the data provider in the sense that it's creating the protocol. If I want to access a Windows server file share, I have to use SMB for the most part. Kubernetes is that shim that allows any protocol to be created between Kubernetes, the CSI, and the OS,
Starting point is 00:22:36 and as an application developer or operator, I just have to worry about the CSI, which is a powerful thing, but then I have to deal with Kubernetes. Yeah, well, Kubernetes has got its own challenges, I'm sure. Though I think Kubernetes is here to stay. And at some point, more and more and more will be transferred.
Starting point is 00:22:59 But what we're seeing, you know, if you're starting a greenfield project, it's easy to say it's all Kubernetes. But if you already own and manage a legacy enterprise application, you're not going to be able to transform all of it to Kubernetes in one go. And this is where our multi-protocol support comes in as a great help, because customers don't have to do it all at once. This process may take months, maybe years. So while they're going through the process, they can still have a portion running the old way and a portion running in Kubernetes. Application changeover, yeah, yeah, yeah. Right. And so this multi-protocol is effectively NFS, SMB, S3 access,
Starting point is 00:23:43 as well as POSIX to the same... With CSI and NVIDIA GPUDirect Storage. So basically, any unstructured data protocol customers use, we support. Right, right, right. There's a whole inter-protocol locking thing, but I'm sure you've got that covered as well, right? I mean, there are challenges here, right? There are challenges, and it can't be perfect. And customers know when they're doing it, because some of the semantics don't map completely.
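The inter-protocol locking issue raised here is real: POSIX, NFS, and SMB each have their own lock semantics, and a multi-protocol system has to map them onto one another. As a small single-protocol taste of advisory locking on Linux, the sketch below uses `flock` (which attaches to the open file description, so two separate opens behave like two independent clients); the cross-protocol mapping a shared file system does is considerably more involved:

```python
import fcntl
import tempfile

# Two independent opens of the same file behave like two clients.
with tempfile.NamedTemporaryFile() as tf:
    client_a = open(tf.name, "rb")
    client_b = open(tf.name, "rb")

    # Client A takes an exclusive advisory lock.
    fcntl.flock(client_a, fcntl.LOCK_EX)

    # Client B's non-blocking attempt is refused while A holds the lock.
    conflicted = False
    try:
        fcntl.flock(client_b, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        conflicted = True
    assert conflicted

    # Once A releases, B can acquire it.
    fcntl.flock(client_a, fcntl.LOCK_UN)
    fcntl.flock(client_b, fcntl.LOCK_EX | fcntl.LOCK_NB)
    client_a.close()
    client_b.close()
```

SMB adds mandatory locks and byte-range semantics on top of this, and NFS historically handled locking in a separate protocol (NLM), which is part of why the mappings never line up perfectly.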
Starting point is 00:24:17 And this is a problem that no other vendor has dealt with for a while. But we are still getting customers to be able to leverage the same data in different kinds of environments. All right. Well, this has been great, Liran. The only thing we didn't talk about, I guess, was the workflow stuff that's starting to come out of Kubernetes. There's been some challenges in order to be able to, I don't know, containerize the workflows that these AI/ML apps are doing these days.
Starting point is 00:24:51 How does Weka support some of that work? So actually, we have a lot of subject matter experts, both around AI and machine learning and Kubernetes. We're working a lot with NVIDIA. NVIDIA is backing containerization. They have their own container platform; they call it the NVIDIA GPU Cloud. But that's basically running all of the different frameworks, you know, TensorFlow and PyTorch, inside containers, or things like DALI or whatever. So we have a lot of experience working with our end customers and with NVIDIA on taking the code and the workflow
Starting point is 00:25:39 that used to run just on bare metal. And again, they started on bare metal because it was more efficient, and now we help transform them to Kubernetes, because on the one hand, you get a similar kind of efficiency. But on the other hand, you're able to go solve bigger problems
Starting point is 00:25:59 because you get the orchestration and the ability to share infrastructure in a very effective manner. And frankly, I've never seen anything as elegant as Kubernetes and the ability to transform IT, I think, in my career. So there are so many good ideas baked into one platform. It's just amazing. Yeah. Somebody mentioned that it was almost like a billion dollars of R&D that was released to the world through Kubernetes. I mean, the ability to scale is just beyond comprehension for most enterprises, right?
Starting point is 00:26:39 And that's what this thing does, you know? Right. And I think, so we're seeing it coming from the really cutting-edge customers we have, or even the startups. It hasn't yet made it to the large corporations. I think these folks are still thinking in standard virtualization. But, you know, these ideas are so fundamentally terrific, I think it's going to make waves throughout the industry. Yeah, I agree.
Starting point is 00:27:16 All right. Well, this has been great. Keith, any last questions for Liran before we close? Nope. This has been extremely informative. Yeah. Yeah. Liran, anything
Starting point is 00:27:25 you'd like to say to our listening audience before we end? No. As always, I've had tons of fun, and
Starting point is 00:27:34 hopefully you're all staying safe. Okay. Well, this has been great. Thank you very much, Liran, for being on
Starting point is 00:27:39 our show today. Thank you for having me. And thanks again to Weka.io for sponsoring this podcast. That's it for now. Bye, Keith. Bye, Ray. And bye, Liran. Bye-bye. Until next time. Next time, we will talk to another system storage technology person. Any questions you want us to ask, please let us know. And if you enjoy our podcast, tell your friends about it. Please
Starting point is 00:28:03 review us on Apple Podcasts, Google Play, and Spotify as this will help get the word out. Thank you.
