In The Arena by TechArena - Building Composable Data Systems with Voltron Data CEO Josh Patterson

Episode Date: January 9, 2024

TechArena host Allyson Klein chats with Voltron Data CEO Josh Patterson about the delivery of Theseus, a composable data system framework that unleashes developers with new interoperability and flexibility for AI-era data challenges.

Transcript
Starting point is 00:00:00 Welcome to the Tech Arena, featuring authentic discussions between tech's leading innovators and our host, Allyson Klein. Now, let's step into the arena. Welcome to the Tech Arena. My name is Allyson Klein, and today I'm really delighted to have Josh Patterson, CEO and co-founder of Voltron Data with me. Josh, it's a real pleasure. Welcome to the program. Thanks, Allyson. So you guys are delivering really interesting solutions to the market, and I was so excited to hear that you were going to come on. Why don't you just start with an introduction of Voltron and your background as the co-founder of the company? Well, to start, Voltron Data, we formed the company to really push enterprise data solutions to new bounds. One of the things that we saw coming in the market
Starting point is 00:01:07 for a while now is compute systems weren't really keeping up with downstream AI, machine learning, geospatial, graph analytics systems. Really, all the tools that allow you to derive more insights from your data or make more business-relevant decisions faster were getting faster, running at a much larger scale. And the data pipelines, ETL, feature engineering, were really holding things back. And so Voltron Data, we wanted to address this in two ways. One was through open source and open standards, primarily, really allowing people to build on open standards without the risks of things just purely being open source. For software that's meant to be used across language boundaries, system boundaries,
Starting point is 00:02:09 compute boundaries, you really want to make sure that people know that it's reliable, that there are hot fixes to security vulnerabilities, that the ecosystem is staying as up-to-date as possible. And so that's what we've been doing for the first two years of our company. We'll continue to do that. And we're really proud of all the work that we've been doing across the open source ecosystem, really allowing more efficient data movement, data processing, and bringing all these disparate frameworks and capabilities together more gracefully. The second part of what our company does is around acceleration, making compute faster, essentially. This really kind of goes well with the work that I did at NVIDIA. Several of us were heavily involved with RAPIDS. So RAPIDS is a GPU-accelerated collection of libraries that allow you to do data frame processing, machine learning,
Starting point is 00:03:10 graph analytics, all these different things on NVIDIA GPUs. But RAPIDS in and of itself was a little bit difficult to use. And it was held back by a lot of the existing frameworks at the time. Spark, Dask, and other distributed compute engines were really designed from the ground up for CPUs. And because of that, they weren't accelerator-native. They didn't gracefully take advantage of all the different acceleration possible. And we thought that we could build a system that could fully leverage acceleration and all the open standards that we've been working on and championing to really provide a state-of-the-art compute experience, to really keep up with the demands of machine learning, deep learning, and all these other types of systems.
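As an aside for readers: the RAPIDS dataframe work Josh mentions looks roughly like this with cuDF. This is a minimal sketch, assuming a CUDA-capable GPU and the cudf package are available; the file and column names are hypothetical stand-ins.

```python
# A hedged sketch of GPU dataframe processing with RAPIDS cuDF.
# Assumes a CUDA-capable GPU; "events.parquet" and its columns
# are hypothetical.
import cudf

# Load the data straight into GPU memory
df = cudf.read_parquet("events.parquet")

# Filter and aggregate entirely on the GPU, with a pandas-like API
totals = (
    df[df["amount"] > 0]
    .groupby("user_id")["amount"]
    .sum()
)

print(totals.head())
```

The point of the design is that the API mirrors pandas, so the acceleration comes from the backend rather than from rewriting pipeline logic.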
Starting point is 00:03:50 And that's Theseus. And so we recently just went to market with Theseus. We launched in Barcelona a couple weeks ago with HPE at HPE Discover. And we're really excited about it. It's a distributed compute engine that fully utilizes NVIDIA GPUs, high-end networking, storage. It also utilizes vectorized computing on x86 and ARM. But the main thing is it reduces the time to doing feature engineering as well as reducing the footprint.
Starting point is 00:04:28 Doing what would take hundreds of servers in a much, much smaller footprint, saving costs, doing it faster, and just more energy efficient. You really took the industry by storm with this Theseus announcement and it got a lot of attention from folks. I think you opened some eyes about what you were capable of doing. You know, one thing that I wanted to ask you about, I know that you've talked to a lot of customers about what they're trying to do and you've talked about this performance wall. Can you talk about what you mean by that and why Theseus is such an important part of the solution for climbing that wall, if you will? Absolutely. You know, one of the things that we've noticed is people want things faster.
Starting point is 00:05:14 You know, with ChatGPT, if a response took two, three minutes, it really wouldn't be a conversation with data. And so what makes these things really powerful is you get responses back in, you know, a second or two. And, you know, I remember a time when, you know, doing linear regressions or logistic regressions took a decent amount of time. You can now train really complex models, you know, in seconds. There's a time when running XGBoost at a couple terabytes would take, you know, hours of training time, you know, and now that's seconds. And so the whole industry has really pushed how fast we can do machine
Starting point is 00:05:57 learning. And that's, you know, really not even talking about, you know, really advanced AI. And so the training times of machine learning, deep learning have accelerated so much. What we're seeing is people are spending less than 20, 30% of their time actually doing training. Most of their time is spent doing pre-processing, feature engineering, getting the data into the format where you can train a model.
Starting point is 00:06:26 And that's taking significantly longer now than ever before. And the problem is only getting worse. Roughly every time a server gets five times faster doing training or inference, it's requiring almost 50 times the amount of ETL compute footprint to keep up with that same speed. And that's just getting really difficult to manage. If you have terabytes of data and you're using Spark, and all of a sudden you're like, hey, this is great,
Starting point is 00:07:01 but I need this to happen two times faster, three times faster. It's not that you need two to three times more servers. Sometimes you might need five to ten times more servers. And it's just really starting to grow exponentially. And it's this kind of asymptotic problem. Eventually, it's as fast as it's going to be. If you have a finite amount of data and you want it faster, there's just this barrier to doing it any faster.
Starting point is 00:07:28 And that's the wall. What we've seen with this is there are a lot of companies where they're like, if I could just do ETL and feature engineering faster, if I could bring all of my data together from different data silos, amalgamate it in a certain way, and then feed it into these models for either training or inference or graph analytics or geospatial intelligence, I can derive real value. And it's not even about how much it would cost. It's we can't build another data center.
Starting point is 00:07:56 We can't really scale this to any larger of a cluster because it's just prohibitively expensive or we're just out of room. And so how do we shrink that footprint back down? How do we push those SLAs to new levels? And that's where accelerators come in. Using all the same underlying technology that has made graph and geospatial and AI and ML so fast for ETL and feature engineering, we can get, you know, significantly faster results.
Starting point is 00:08:35 But it's hard. It takes a lot of work. And, you know, we really couldn't get there without a lot of the open standards that we, you know, kind of depend on: Apache Arrow, Substrait, Ibis. And so what we've done is we've, you know, really kind of taken a holistic approach to what's slowing down these systems and systematically removed a lot of these barriers so we can get over the wall or get around the wall or break through the wall, however you want to say it. But we just have to be able to, you know, really accelerate this data munging. That makes a lot of sense. When you think about open source and you talked about the importance of open source, when you talk to customers, why is that so critical? Open source is really important for two reasons. One, it accelerates innovation.
Starting point is 00:09:20 You never know what's coming next. And being able to kind of take, you know, an idea and extend it further or, you know, integrate it into a new solution allows people to kind of quickly iterate on new ideas. And open source isn't just about code being free. It's about kind of sharing what's possible as quickly as possible. So we can, you know, really kind of continue to diversify ideas. And so if how you connect systems together, how you build systems, if all the foundations were closed source, you're constantly repeating the exact same thing. There's a lot of redundancy and just sort of deadweight loss of engineering. And open source really ameliorates that.
Starting point is 00:10:14 You know, Arrow is a really great example of this because it standardizes how data is represented across different languages and hardware. And even something that simple on paper is actually a really hard, pretty large library to build and maintain. But it's extremely powerful. If I don't have to copy and convert data when I'm going from a Java application to a Python application, if I don't have to convert data into a new format when I'm going from CPU to GPU, I free up so many system resources. And when we start doing this at scale, it's really powerful. It really starts to allow pipelines to be much bigger, faster, more resilient.
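To make the Arrow point concrete, here is a minimal sketch of the shared-format idea using the pyarrow package; the file name is hypothetical, and any Arrow-aware consumer (Java, Rust, a GPU engine) could read the same bytes.

```python
# A hedged sketch of Arrow's shared in-memory format.
import pyarrow as pa
import pyarrow.ipc as ipc

table = pa.table({"user_id": [1, 2, 3], "amount": [10.0, 5.5, 7.25]})

# Write the table once in the Arrow IPC file format
# ("shared.arrow" is a hypothetical path)
with ipc.new_file("shared.arrow", table.schema) as writer:
    writer.write_table(table)

# Any Arrow-aware reader can memory-map those bytes back without
# deserializing or converting them: the layout on disk is the same
# columnar layout used in memory.
with pa.memory_map("shared.arrow") as source:
    shared = ipc.open_file(source).read_all()

print(shared)
```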
Starting point is 00:10:59 And open source allows people to really do these things, you know, in a more nimble and novel way. And that's just kind of the open standard side. You know, the other side of this is, you know, systems like PyTorch. PyTorch allows people to really, you know, create new deep learning frameworks and models. It allows people these graceful on-ramps for when they want to bring, you know, new innovation to market. There's, hey, if you build it with this framework, you know, people can use it quickly, it gets integrated with some of their existing code. It just allows, you know, people to really kind of iterate faster on ideas without having to start over from the ground up or wait for a specific vendor, you know, to release something new. And so open source really is about accelerating ideas and reducing a lot of waste. That makes a lot of sense. One of the things that you
Starting point is 00:11:59 announced that I thought was really interesting, especially given your background in GPUs, is that Theseus supports a number of different architectures from a processor standpoint: x86 CPUs, ARM processors, and GPUs. And you also said even more accelerators in the future. Why did you make that decision to be so broad in terms of your support? And where do you think customers are going with their deployments of different architectures and heterogeneity? Excuse me. The reason why we went so broad is the fact that there's still a lot of x86 and ARM hardware available to users.
Starting point is 00:12:48 And there are times where you want to use whatever computing resource you have available just to get the job done. There's a scarcity of NVIDIA GPUs. Everyone is talking about how hard they are to acquire. The lead times are pretty long. And we didn't want to limit people with how they were doing ETL and feature engineering. Part of this is we really wanted to improve
Starting point is 00:13:18 people's experiences really regardless of what infrastructure they had today. Now, it is faster, it is cheaper, and it's a smaller footprint using NVIDIA GPUs and high-end networking like InfiniBand or RoCE, and the performance numbers are extremely impressive with NVIDIA GPUs. But we should still be able to run these systems on x86 and ARM
Starting point is 00:13:46 because there's a lot of spot pricing with ARM in particular that's extremely cost-effective. There's a lot of innovation happening in the ARM space, where ARM systems are getting really large memory and really, really high core counts. And, you know, it's just really adding a lot of value to customers to be able to write code and target, you know, really a lot of different underlying systems. When we talk about, you know, and more hardware to come, there's a lot of innovation up and down the stack. It's not just in the compute, you know, as we think about it today, but it's also in networking.
Starting point is 00:14:23 We want to support an increasing number of networking backends. And then storage. Substrait, this kind of data analytics intermediate representation layer, data analytics IR for short, allows us to map logic to different engines without having to necessarily rewrite code for an engine. That engine, we typically think about that as a compute engine: x86, ARM, NVIDIA GPUs.
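As an illustration of that "write the logic once, target different engines" idea, here is a hedged sketch using Ibis, the Python front end mentioned elsewhere in this conversation. It shows SQL generation for two backends rather than Substrait itself (which plays the same intermediate-representation role), and the table and column names are hypothetical.

```python
# A hedged sketch of engine-portable logic with Ibis (assumes a
# recent Ibis version). The expression is defined once against an
# unbound table; "events" and its columns are hypothetical.
import ibis

t = ibis.table({"user_id": "int64", "amount": "float64"}, name="events")
expr = (
    t.filter(t.amount > 0)
     .group_by("user_id")
     .aggregate(total=t.amount.sum())
)

# The same expression compiles for different engines without rewriting it
print(ibis.to_sql(expr, dialect="duckdb"))
print(ibis.to_sql(expr, dialect="trino"))
```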
Starting point is 00:14:53 But that compute engine could be computational storage. It could be filtering and doing projections at the lowest level, where the data is residing at rest. And so we built the system to take advantage of all the acceleration we have today, but there are many people working on computational storage right now. Wouldn't it be great if, when these storage systems go to market, there's an engine out there that could immediately leverage those computational storage systems, that had high-quality integration and, through open standards, knew how to interact with
Starting point is 00:15:35 them very gracefully? And so not only did we want Theseus to be really good at what's in the market and available today, but what's going to be in the market soon. And I do think that computational storage is kind of the next frontier of data analytics. It's not just accelerating kind of your in-memory computation. It's not just gracefully falling back from
Starting point is 00:15:57 accelerator memory to system memory to disk. But it's also pushing computation to the storage devices and eventually the networking devices as well, being able to do smarter compression and decompression as data is in transit. Yeah, that makes a lot of sense. Now you announced this with HPE
Starting point is 00:16:21 and they said that they're going to fold it into their data lake. I'm sure others will follow. What have you been hearing from customers more broadly about the Theseus launch, and why is it so important to have partners like HPE at your side? People are excited. So that's great. One of the things, you know, when we were still in quasi-stealth, people always asked, well, is it a SaaS? Is this going to be our next database? Is this going to be our next data lakehouse,
Starting point is 00:16:50 data warehouse? One of the things that we addressed, I think fairly well, hopefully well, by partnering with HPE is that, you know, we're not trying to be a database. We're not trying to be a data lake. You know, I've said this a lot. Data has gravity, but people still need really great computational systems. They need accelerated systems. They need to be able to do more with
Starting point is 00:17:16 their data with, you know, less hardware, less footprint. And this is why partnering is, you know, really important to us. You know, HPE, with Ezmeral Unified Analytics and GreenLake, their hybrid cloud SaaS platform, they have really large users. They have their own data fabric. It's just a very well-established company, and their software growth is, you know, phenomenal. And so now taking, you know, all this innovation in hybrid cloud and, you know, their ability to allow users to spin up different engines, whether it's Trino or Spark, was already kind of pushing the bounds on, you know, composability and, you know, how we think about systems of the future. It's not one size fits all.
Starting point is 00:18:08 There's no silver bullet. And so the partnership with HPE just felt really natural because they already had the system that allowed users a lot more freedom to do what they wanted to do with their data, use the engines that they want. And so introducing another engine, ways to interact with different engines through Ibis, this kind of unified Python front end, it just felt natural.
Starting point is 00:18:30 It worked. But it really solidified to our customers that we're not trying to change how you do things. In fact, we want you to own your code. We want that code to be as widely deployable as possible. We just wanted to be able to target, you know, accelerated systems as well. And a lot of customers are really happy about that. You know, too often people want to kind of lure you into their new state-of-the-art really fast system
Starting point is 00:18:56 and then lock you in, build walls that kind of keep you as their customer. And, you know, we really wanted to build ramps and bridges. You know, people, you know, want to write code and target some other system than ours and then target, you know, Theseus for certain use cases. That's completely fine. We're very happy with what's happening in the single node space. You know, like DuckDB, for instance. It's an amazing single node engine.
Starting point is 00:19:21 It's extremely fast. And if you have a problem that's a terabyte or smaller, you know, there are enough, you know, ARM servers and x86 servers with terabytes of system memory, just use DuckDB. You know, it's a really great tool for a lot of problems, you know, not even small scale, you know, fairly large, you know, data problems. And how cool is it that you can, you know, write code and run it on your laptop for, you know, really small scale problems, scale to these, you know, really large fat nodes, and, you know, do computation there. And then, you know, also run that on distributed GPUs. And customers really were just like, ah, thank you. You know, like that's refreshing.
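For context, the laptop-scale workflow Josh is praising looks roughly like this with DuckDB; the Parquet file is a hypothetical stand-in.

```python
# A hedged sketch of single-node analytics with DuckDB: query Parquet
# files in place, no cluster required. "events.parquet" is hypothetical.
import duckdb

result = duckdb.sql("""
    SELECT user_id, SUM(amount) AS total
    FROM 'events.parquet'
    WHERE amount > 0
    GROUP BY user_id
""").df()

print(result.head())
```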
Starting point is 00:20:04 Very different. You know, people, I think, really expected us to build a SaaS. And, you know, we would tell people, we're not building a SaaS. And they're like, ah, yeah, sure. We're going to wait for, you know, whenever you launch your SaaS. And the launch really reiterated that, you know, we're focused on building open standards to allow people to have more options, to be more efficient, to build better systems, but not to be as prescriptive of what those systems have to be. That makes a lot of sense. We're headed into 2024. I think this time last year, we were just starting to hear about ChatGPT and what that would change in 2023 around AI and the acceleration of AI and generative AI. As you look forward and you're talking with other folks in the industry, you obviously have your finger on the pulse of what people are thinking about in terms of data pipelines and what they're trying to do with them.
Starting point is 00:21:02 What do you think we're going to be talking about next year? And where do you think Voltron fits into that broader conversation? I do think that open standards are going to be kind of the next evolution of open source, not just open source and, you know, hey, this is open source, but, you know, we changed how X, Y, and Z work. I think people are really going to start to lean into open standards a lot more. Hey, this system is great, but why can't it also run code that was written for this other system? Modularity, interoperability, composability, extensibility.
Starting point is 00:21:40 I think those four words are going to become much more common in enterprises. Why does this system, you know, not allow me to bring this type of data or API, you know, to it? And I think people are going to start to, you know, really request that, you know, more systems start to adopt these open standards. You know, Arrow in, Arrow out, you know, talking with, you know, kind of these universal front ends like Ibis. And we're already starting to see that. And so I think that's going to be, you know, one trend in 2024. It's going to remain strong. I think the other one is acceleration.
Starting point is 00:22:20 You know, we've lived in this world for too long of just, oh, just scale out, just throw more hardware at it. I think performance is finally coming back into relevance. People are excited about performance again because costs matter. And with all the different things going on around the globe economically, you just can't spin up thousands of nodes and just have these long-running overnight jobs. People want things to happen faster and more efficiently. And so I think, you know, we're going to start to see a lot more focus on the intersection of HPC and data analytics. I think, you know, kind of those two worlds, they've already grown apart, but they're going
Starting point is 00:22:58 to come back together in 2024. And one area in particular that I'm excited about is computational storage. I think, you know, we're going to see a lot of computational storage startups, a lot of, you know, existing storage players really start to push
Starting point is 00:23:13 the bounds of computational storage. And hopefully, open standards and kind of computational storage will, you know, kind of see eye to eye early on. So maybe we will have a few less of the mistakes that we saw in the big data space
Starting point is 00:23:28 where you had competing standards happening. That's fantastic. I've learned a lot in this interview. Thanks for being on the show. I can't wait to hear more from you and your team. And I'm sure the listeners online are also intrigued by Theseus and what you're delivering to the marketplace. Where would you suggest folks find out more and talk to your team about an evaluation?
Starting point is 00:23:52 Absolutely. VoltronData.com, our website. I think it's one click away to learn more about Theseus, get a demo. We're excited to really work with large-scale enterprises to address their data challenges and hopefully be able to kind of accelerate their data systems. We'll also have some new joint marketing with HPE in early 2024.
Starting point is 00:24:21 But right now, the Voltron Data webpage. Very soon, if there are any kind of issues that people are having with any of our open source products that we support, like Arrow or Ibis or Parquet, drop us a line. We're happy to have our team talk with them, evaluate, and see how we can help. Awesome. Josh, thank you so much for being on the program today. It was really fun. Thank you, Allyson. Thanks for joining the Tech Arena. Subscribe and engage at our website, thetecharena.net. All content is copyright by the Tech Arena.
