In The Arena by TechArena - Delivering a Sustainable Data Platform for the AI Era with Weka, Data Insights Series Sponsored by Solidigm

Episode Date: May 2, 2024

TechArena host Allyson Klein and Solidigm’s Jeniece Wnorowski chat with Weka’s Joel Kaufman, as he tours the Weka data platform and how the company’s innovation provides sustainable data management that scales for the AI era.

Transcript
Starting point is 00:00:00 Welcome to the Tech Arena Data Insight Series. My name is Allyson Klein, and joining me back in studio is my co-host, Jeniece Wnorowski from Solidigm. Welcome to the program, Jeniece. Thank you, Allyson. It's great to be back. We continue to just have fantastic discussions with leaders in the industry, you know, pursuing what's happening with the data pipeline and how it's transforming for organizations as they try to tackle new capabilities with artificial intelligence. Tell me about who you've got lined up for us to talk to today. Yeah, thank you, Allyson.
Starting point is 00:00:37 I am super excited about our guests today. We have joining us Joel Kaufman, who is the Senior Technical Marketing Manager of Weka. And anyone who's been following anything AI, Weka is definitely one of those organizations that are a major bright spot in innovation. And we're just super excited to hear from Joel about the technologies that they're working on and the innovation that they're working toward. Welcome to the program, Joel. Hi, glad to be here.
Starting point is 00:01:08 So, Joel, why don't you just start with an introduction of yourself and, you know, your background and data and how it led to the position that you've got at Weka? Yeah, I started out quite a while back without dating myself too much, but for a while I worked for Silicon Graphics, which should take you way back in the day, doing a variety of things for them around high performance computing, managing systems there, things like that. And then after a while, a few people who had moved on to a different company, which became NetApp, said, hey, you should come over and start working there. And so I wound up at NetApp for a very long time, approximately a little over 20 years, doing everything from introductory setup to managing entire teams of technical marketing engineers, and then handling a lot of our data protection, data management, and replication, data replication programs. And then after a while, again, people you know always pull you around. And so I got this call saying, you should come over and check out this very cool new technology at a company called Weka.
Starting point is 00:02:19 At which point I moved out of the engineering side of technical marketing and more into the marketing side of technical marketing, where I help pull together and explain our technology in a way that is meaningful and makes sense to a lot of our customers, our partners, and sometimes even internally for training our own people at Weka. Now, we recently had Weka on the program, and I wanted to follow up to go a little bit more under the hood with your solutions. The first interview was great, and people should check that out. But there's so many more questions that I had at the end of that interview. So I'm so glad that Janice and I are getting the chance to talk to you. Before that we go there, though, can you just
Starting point is 00:03:05 do an introduction of Weka for those who aren't familiar with your solutions and, you know, put a little context around how Weka works within the industry? Yeah. So the way to think of Weka this day is that we're really this data platform that, you know, partially through a lot of foresight and partially through a little bit of luck, we came to this point where there was this convergence of really high performance compute, really high performance networking, but storage seemed to be left behind in the dust a little bit. And so our founders took a look at it and said, you know, there's got to be a way of solving this storage problem as part of this, I guess you call it kind of a pyramid, right? A triangle of these three core things that go into infrastructure. a incredibly parallelized file system. We're able to deliver really strong software-based data management,
Starting point is 00:04:10 distributed storage functionality across the entire data platform that we provide. And we can do it on-prem or in the cloud for pretty much any workload that's out there. We tend to focus a little bit today on the really performance-intensive workloads, things like AIML, high-performance compute, but all the different variations about what those implementations look like across a large number of industries. That is amazing, Joel. Can you comment a little bit further, though, on any particular customer challenges that Weka is kind of uniquely solving? And why is this a focus for you today? Yeah. So if we take a look kind of at the history of what's
Starting point is 00:05:00 been going on in the industry, we started out with things like HPC. And HPC was sort of this isolated, you know, it was built for large labs, maybe some, you know, universities that were doing this in the public sector space. And there was this trend of ultra high performance. And that moved on for about 10 years. And then you fast forward towards today. And what we're finding is that a lot of the types of applications, the problems that customers really were trying to solve around, you know, even internal things, manufacturing, business and finance types of things, started to require more and more compute power. They started to require a lot more intelligence to what they're doing. And when you couple that with the rise of things like AI ML, generative AI, it began as convergence. And so now we're seeing HPC and converging into AI in the
Starting point is 00:06:00 enterprise space. And so a lot of these challenges that we're seeing is companies that are saying, I'm used to doing traditional enterprise-level IT. Here's a whole new classification of applications that they might be in the cloud, they might be on-prem, they're incredibly high performance, they have scale that they've never seen before, not just from a performance or capacity standpoint, but even little things like the numbers of files that are being used to pull this data
Starting point is 00:06:33 in for processing are at volumes that just literally have never been seen in the history of computing. And so being able to say, take a step back and go, we've simplified this environment. We're able to give you all the scale and performance you need. And by doing that, it really makes it a much more simplified and easy to consume experience for those enterprise customers. Some of the customers that we're dealing with at this point aren't just the traditional top enterprise tier. It's a new generation of data center providers that are doing things like GPU as a service, right? They need to figure out how do they handle tens or hundreds of customers all trying to consume massive farms of GPUs in a somewhat isolated manner and yet maintain consistent performance across all of it to give the best bang for the buck. and not now are looking at the fact that traditional types of storage, you know, that are being represented as data platforms simply don't have the performance density to make them actually sustainable moving forward. So being able to say we can offer all these capabilities,
Starting point is 00:07:59 but reduce our entire carbon footprint, make sure that it is sustainable from a community standpoint is really becoming very, very crucial to a lot of these customers. beautiful story. And, you know, you're talking about how you've simplified what I think folks who have managed data for a really long time know is that that simplification is not an easy thing to do. So can you go a little bit deeper into how Weka has approached data management in a unique way to deliver that simplification to customers across that diverse landscape? Yeah, absolutely. So one of the things that we've noticed, and I heard you at the very beginning talking about data pipeline. We've been talking about data pipelines now for probably close to three years or so.
Starting point is 00:09:02 What we discovered in a lot of customers is that when you start looking at legacy architectures, and I don't mean that in a really disparaging way, what I really mean is architectures that have done fantastic for traditional enterprise IT for years and years and probably will continue to do that for the future. They're not architected to look at different types of IO in a system and manage it in a really appreciable way. So a good example for this is, let's talk about AI in general, or generative AI. If you look at a pipeline for a workflow and a tool chain that's used for all these, you start out by ingesting data, and it could be market data. It could be data sets for things like genomics, protein folding libraries, integration with cryo-EM systems. It could be images for doing manufacturing quality assurance,
Starting point is 00:10:01 right? QA on varieties of components, things like that. So you ingest, there's a certain IO profile for that. Then you turn around and say, the next thing you need to do is normalize this data out, go ahead and transform it, maybe an ETL or an ELT type of function, something that takes that data. And suddenly you go from this big, you know, maybe slow, but lots of streams of writes coming in, to now you have to do this blended IO of reads and writes and back and forth. And as the scale of these files gets larger and larger, now you have to do tons of metadata lookups. And eventually you get around to processing the data. And then the final step, well, not really the final step, but the next step in the pipeline is maybe you send it off to training
Starting point is 00:10:47 in the AI model because you've normalized the data. Now you do the ingest, and you do the retuning, and the training, and the fine-tuning, and so on, and that's a massive type of read function, and then you take the data and you validate it, send it back if someone says the precision's not enough, and you start your automated loop, and this type of blended I.O. across the board
Starting point is 00:11:07 has been a nightmare for most companies to handle. And it was so bad for a long time that even with going from hard drives to flash drives, you simply could not have storage systems that were architected to handle every stage. And so you wound up with, here's my dedicated ingest system. Then I'll copy the data over to a system for doing the ETL. Then I'll copy the data again to pump it into these GPUs for training.
Starting point is 00:11:41 And then I'll take the result back out and then they'll copy it back. And this has caused just massive complexity. So what Weka has done is we have such a, the ability to handle so many different IO profiles at the same time without any real performance deterioration is we've flattened that entire copying architecture out. We've made it to the point where you can just have a single pool or file system, if you want to, of this data and have so much performance across reads, writes, big files, small files, and numbers of files that we've removed that entire copy process.
Starting point is 00:12:23 And ultimately what it does is it helps you feed your compute platform, CPUs and GPOs, faster to keep them massively utilized so they're not sitting there burning power, just idling along, waiting for data to come in. And that simplification has really transformed what a lot of our customers are doing. That is amazing. And just hearing you explain that simplification has really transformed what a lot of our customers are doing. That is amazing. And just hearing you explain that simplification, as many others are kind of looking at the overall data pipeline to kind of understand how to navigate through it.
Starting point is 00:12:54 I think you guys have a really amazing handle on it. But can you take a step back? And you've mentioned a lot of different workloads as well, but I'd love to understand even further, you know, you guys seem to have a really big affinity with working with data in the cloud. And can you speak a little more specifically around how you're solving data movement challenges across distributed systems? Yeah, this is a real interesting one. And I kind of want to clarify a little bit. When we talk about data movement challenges, there's a unique aspect to it that I think is underestimated. And that is that data has an extreme amount of gravity. And so when we talk with customers, one of the biggest things that we do and consult with our customers about is not just, you know, can you move the data, but should you move the data? And so we're finding
Starting point is 00:13:52 this really interesting combination of customers who are saying, nope, everything I'm going to do is on-prem. Some who are entirely 100% cloud native entirely. We actually have one customer in the media and entertainment space, Preymaker. Everything they do, it's 100% cloud focused. They don't want to deal with infrastructure. But more and more,
Starting point is 00:14:14 we're beginning to see this trend of customers who are making these decisions about what data needs to be moved. And it doesn't necessarily have to be all of it. So we get into this very hybrid type of, of cloud play. So what Weka does under the covers a little bit is we have a technology that, that really enables us called Snap to Object. And one of the things when we began this, this, uh, this process of bringing Weka as a product to market, is we took a look at what costs look
Starting point is 00:14:48 like and what a better way of maybe doing replication would look like, or data movement, really, in this case. Actually, let's call it data mobility or data liquidity, if you want to. And so being able to take a complete image of what the data is on a Weka system, move it to an object store where we don't care where it lives. It could be and we produce this every time we take the snapshot and move the data, if you can pass that key along to another Weka system, it can then go access that data as long as they have access to the object store. And so you get this sort of combination of killer third-party witness of data because it's now on an entirely separate object store system. So you have that separation of domains and yet any other weka system can grab it and so we're seeing use cases um a great example there's a pharmaceutical company um in the boston area uh really big work that they're doing around protein folding and virology and things like that to create solutions you know from the health of of their customers And what they do is they do a lot of their reprocessing on-prem.
Starting point is 00:16:30 And then when they do their final model trading and final analysis, they snapshot the data up into the cloud, right? Attach a cloud-based Weka system, and then they can scale up massive amounts of rental compute power to address that data at really high performance levels, again, even in the cloud. And then once they're done, they get a couple of much smaller outputs and they just send it right back down to on-prem
Starting point is 00:16:58 for final archiving, storage, et cetera. And so we're seeing this type of data movement and distribution happening across a lot of our customers now. So, Joel, thank you so much for that. Could you also speak a little bit more about the work you guys recently introduced with the WekaPod as a complement to your WekaReference architecture and tell us a bit about, you know, why that infrastructure? Yeah. So this has been kind of an interesting journey. If you look at Weka from the start point,
Starting point is 00:17:35 we are completely agnostic. We are a software solution, right? In fact, our original coding and our original builds were all in the cloud. We're one of the, I guess you could say, sort of oddball infrastructure companies, where instead of starting on-prem and then saying, we'll figure out how to port it to the cloud, we start in the cloud, and then we had customer demand come in and say, hey, you should be on-prem because we have a real need locally. And one of the benefits of all this, just a little side note,
Starting point is 00:18:16 is that because we're just software, we run the exact same binary, whether it's on-prem or in the cloud. We don't change how we operate our data platform. And because of this, it gives us a certain amount of agnosticism that lets us, you know, lets our customers again, make those decisions to deploy anywhere. So cloud was evolution one. Then we move on to evolution two, which was on-prem and then hybrid and evolution three now is what we're, is essentially the appliance or the ultimate in consumption simplification for on-prem customers. systems and base pod systems, they wanted to have a, a appliance effectively built out for, for use in those, those particular use cases. And so Weka, we, we partnered with one of our, our hardware vendors that we work with, and we've produced effectively a completely wrapped appliance that if a customer wants to purchase it they can
Starting point is 00:19:27 they can buy it as as a complete effectively turnkey bundle as part of a super pod or base pod deployment and we go out the door now and give them that complete reference architecture with with us as the data platform the compute from from NVIDIA, and make it super simple for them to use. Effectively, at that point, it's a plug and play type of solution. Where we're starting to see customers have real strong interest for this is customers who either don't know what their AI solutions or high-performance compute solutions will look like, or they're unsure what the requirements will be moving forward in the future. And so one of the things that we've done with this WekaPod that's kind of interesting is that
Starting point is 00:20:18 it's all Gen 5 hardware under the covers. And what I is pcie gen 5 so you get the latest processors you get the latest uh ssds flash drives and the latest networking in there and the ultimate result is that we give you this performance density that is you can start with a very small environment in fact the smallest one of this is eight servers or eight nodes, I guess you could say. And yet those eight nodes can go anywhere from half a petabyte to one petabyte, but can deliver performance that is absolutely unheard of. 18 million IOPS, latency that rivals a raw fiber channel SAN. In fact, in some cases, even faster than that. And yet this entire stack is seven and a half kilowatts of power consumption completely. And so that type of performance density, yeah, it gives our customers so much flexibility in terms of saying, look, if I put in this really small system, I don't know what my future scaling will look like, but the performance is there. The ability to expand is there.
Starting point is 00:21:34 I don't have to burn a ton of power and cooling for that footprint. You know, to be quite frank, this entire space is moving so fast. I mean, we've seen, let alone the last two years, the last six months or even two months alone of how this industry has moved and the changes in what data structures look like and the hardware. yourself. And yet this is probably as close as you're going to be able to get to something that at least for that data platform component really could help you out when there's an unknown future coming with the rate of change. You know, Joel, I loved how you talked about that because I've been thinking about this quite a bit in terms of trying to forecast how the industry is going to keep delivering balanced platforms when you're looking at the innovation cycles that we're looking at, right? And so as you break down that platform, and you did such a beautiful job of it,
Starting point is 00:22:34 are there areas where you look at balancing performance, efficiency, and scale where you want the industry to really pay attention across logic, storage media, network and IO, where you're thinking, hey, this is going to become a bottleneck pretty soon? Or is there anything in particular that you would call out from an efficiency standpoint that the industry really needs to focus on? Yeah, I think there's going to be a couple of turning points that are going to have to be addressed at some point. From a hardware standpoint, it's only going to go up from here, right? We've gone from 10 years ago, 15 years ago, 10 gig Ethernet to, you know, all the way through 100, 200, 400. And by this time next year, 800 gigabit is on the table. So I don't think we're going to have these huge network bottlenecks for the vast majority of workloads that are out there.
Starting point is 00:23:32 And the same thing, processors will get faster. GPUs will get faster. Storage devices will still be kind of interesting in terms of how they go. I think there's two things that are going to have to be addressed. One is that there's in the storage industry and storage device category, there's really this kind of bifurcation that's happening. And this is kind of a, just to be fair, this is a Joel opinion to a large extent. It's not necessarily a Weka opinion. But what we're seeing is this weird bifurcation. You have traditional flash SSDs, you know, the TLC layer that are relatively high performance, good endurance, and so on and so
Starting point is 00:24:18 forth. But so far they've been somewhat limited in size. And so that's created this secondary type of device out there where you have bigger SSDs, flash devices, but they don't necessarily have as much endurance. But capacities are significantly higher. And so the question becomes, can you go ahead and figure out a way of making either the fast versions, the TLC devices, bigger? Or is there a way of making the bigger QLC devices a little bit faster? And so there's a bit of convergence there that needs to happen. If I had to place a bet, I'd put it on the TLC side,
Starting point is 00:25:05 because that seems to be where the innovation is happening a little bit faster. But that being said, you know, as Janice was saying, Soladyne, they've gone and turned the corner the other direction and said, we're going to make QLC that's getting faster and faster and faster. So I think there's, when those start to collide head to head, that's going to be a real interesting point to see where that goes. Beyond that though, the number one thing that's going to have to happen in the industry, it really is this addressing of sustainability. I came away from the GTC where we had a huge presence there and they were talking about Blackwell, the new GB200 systems from NVIDIA and the Blackwell processor combined with the Grace processors and a stack where it's going to be water-cooled pretty much by default
Starting point is 00:26:05 when you buy the full, huge, disaggregated GPU there. And when Jensen was on stage talking about 120 kilowatts per rack, the question becomes, it may be significantly faster, and you can subdivide that up. I think we're going to reach this interesting point where it's going to be incredibly hard for various companies to actually be able to deploy something like a water-cooled GB200 system. And it will be customers who can only have, you know, do they have the power that they can actually bring in?
Starting point is 00:26:45 And more and more data centers, power is becoming an absolute problem because the utilities that are supplying them are going, we're out of power capacity. We literally cannot provide more amperage and current into your data center because we're out our our major infrastructure we don't have enough power plants to produce the power to feed these things um and so i think initially this is going to be a very sparsely deployed type of system simply not not because the technology doesn't exist but because the technology doesn't exist, but because the power doesn't exist. And that's going to be something that's going to become an ongoing reckoning
Starting point is 00:27:30 through the industry where what is the correct amount of compute power or that you're able to even put against a problem without having to build custom data centers in places where they have surplus power. And that's going to be a very interesting challenge moving forward. Wow. A lot of awesome insight there, Joel. When you started off by saying we're, you know, bifurcation of hardware and who's up to the challenge. And I think we're definitely up to the challenge. And as you talk through, you know, sustainability and some of those elements, we couldn't agree more, which is why we really believe in, you know, not just creating drives that are higher capacity or fast, but how does it kind of help attack this entire solution set?
Starting point is 00:28:27 And your insight and work with Weka has just been admirable from my standpoint personally, and also from our company, SolidIme standpoint. But we do have to ask, been to a lot of shows, seen Weka there, but where else can folks go just to learn more and engage with your team and maybe trial some of this work you're working on? Yeah, you know, as is true, once it where you should go to find any and all information. We have links there for all of our solutions, both on industries, both on technology types. And from there, we can absolutely, you know, you can click a few buttons, chat with people live online and get answers and find out more about how we can help you out. Well, Joel, next time you guys bring out the purple Lamborghinis, I'd like an invitation for a ride. That was really cool at GTC.
Starting point is 00:29:39 And I'm sure there were a lot of folks who are a bit envious of the folks that got a chance to take a look at those. Thank you so much for being on the program today. I've been following Weka since last year when you guys were at Cloud Field Day, and I've just been so impressed with the solutions that you're delivering to the market. We want to keep having you on the program, and today just underscored why. Thanks for being here. More than happy to do it. And absolutely, if the Lamborghinis come out, come find me. I'll make sure you get in for a ride.
Starting point is 00:30:11 Janice, you're going to have to come with me. We're going to have to find a three-seater. All right. And Janice, thank you so much for co-hosting. We will be back with our next episode soon as we explore the data pipeline.
