In The Arena by TechArena - Building Composable Data Systems with Voltron Data CEO Josh Patterson
Episode Date: January 9, 2024. TechArena host Allyson Klein chats with Voltron Data CEO Josh Patterson about the delivery of Theseus, a composable data system framework that gives developers new interoperability and flexibility for AI-era data challenges.
Transcript
Welcome to the Tech Arena, featuring authentic discussions between tech's leading innovators and our host, Allyson Klein.
Now, let's step into the arena. Welcome to the Tech Arena. My name is Allyson Klein, and today I'm really delighted
to have Josh Patterson, CEO and co-founder of Voltron Data with me. Josh, it's a real pleasure.
Welcome to the program. Thanks, Allyson. So you guys are delivering really interesting solutions
to the market, and I was so excited to hear that you
were going to come on. Why don't you just start with an introduction of Voltron and your background
as the co-founder of the company? Well, to start, Voltron Data, we formed the company to really push
enterprise data solutions to new bounds. One of the things that we saw coming in the market
for a while now is compute systems weren't really keeping up with downstream AI, machine learning,
geospatial, graph analytics systems. Really, all the tools that allow you to derive more
insights from your data or make more business-relevant decisions faster
were getting faster, running at a much larger scale.
And the data pipelines, ETL, feature engineering,
were really holding things back.
And so Voltron Data, we wanted to address this in two ways.
One was through open source and open standards, primarily allowing people to build on open standards without the risk that things are just purely open source. For software that's meant to be used across language boundaries, system boundaries,
compute boundaries, you really want to make sure that people know that it's reliable,
that there are hot fixes to security vulnerabilities, that the ecosystem is staying as up-to-date as possible. And so that's what we've been doing for the first two years of our
company. We'll continue to do that. And we're really proud of all the work that we've been doing across the open source ecosystem,
really allowing more efficient data movement, data processing, and bringing all these disparate frameworks and capabilities together more gracefully.
The second part of what our company does is around acceleration, making compute faster, essentially.
This really kind of goes well with the work that I did at NVIDIA.
Several of us were heavily involved with RAPIDS.
RAPIDS is a GPU-accelerated collection of libraries that allows you to do data frame processing, machine learning,
graph analytics, all these different things on NVIDIA GPUs. But RAPIDS in and of itself was a little bit difficult to use. And it was held back by a lot of the existing frameworks at the
time. Spark, Dask, and other distributed compute engines were really designed from the ground up
for CPUs.
And because of that, they weren't accelerator-native.
They didn't take advantage of all the different acceleration possible gracefully. And we thought that we could build a system that could fully leverage acceleration and
all the open standards that we've been working on and championing to really provide a state-of-the-art
compute experience to really keep up with the demands of machine learning,
deep learning, and all these other types of systems.
And that's Theseus.
And so we recently just went to market with Theseus.
We launched in Barcelona a couple weeks ago
with HPE at HPE Discover.
And we're really excited about it.
It's a distributed compute engine that
fully utilizes NVIDIA GPUs, high-end networking, storage. It also utilizes vectorized computing
on x86 and ARM. But the main thing is it reduces the time it takes to do feature engineering as well as the footprint,
doing what would take hundreds of servers in a much, much smaller footprint, saving costs, doing it faster, and being more energy efficient.
You really took the industry by storm with this Theseus announcement and it got a lot of attention from folks.
I think you opened
some eyes about what you were capable of doing. You know, one thing that I wanted to ask you about,
I know that you've talked to a lot of customers about what they're trying to do and you've
talked about this performance wall. Can you talk about what you mean by that and why Theseus is
such an important part of the solution for climbing that wall, if you will?
Absolutely. You know, one of the things that we've noticed is people want things faster.
You know, with ChatGPT, if a response took two, three minutes, it really wouldn't be
a conversation with data.
And so what makes these things really
powerful is you get responses back in, you know, a second or two. And, you know, I remember a time
when, you know, doing linear regressions or logistic regressions took a decent amount of time. You can
now train really complex models, you know, in seconds. There was a time when running XGBoost
on a couple terabytes would take, you know, hours of training time, and now
that's seconds. And so the whole industry has really pushed how fast we can do machine
learning. And that's, you know, really not even talking about, you know, really advanced
AI. And so the training times of machine learning,
deep learning have accelerated so much.
What we're seeing is people are spending
less than 20, 30% of their time actually doing training.
Most of their time is spent doing pre-processing,
feature engineering, getting the data into the format
where you can train a model.
And that's taking significantly longer now than ever before.
And the problem is only getting worse.
Roughly every time a server gets five times faster doing training or inference,
it's requiring almost 50 times the amount of ETL compute footprint
to keep up with that same speed.
And that's just getting really difficult to manage.
If you have terabytes of data and you're using Spark,
and all of a sudden you're like, hey, this is great,
but I need this to happen two times faster, three times faster.
It's not that you need two to three times more servers.
Sometimes you might need five to ten times more servers.
And it's just really starting to grow exponentially.
And it's this kind of asymptotic problem.
Eventually, it's as fast as it's going to be.
If you have a finite amount of data and you want it faster, there's just this barrier
to doing it any faster.
And that's the wall.
What we've seen with this is there are a lot of companies where they're like, if I could
just do ETL and feature engineering faster, if I could bring all of my data together from
different data silos, amalgamate it in a certain way, and then feed it into these models for either training or inference
or graph analytics or geospatial intelligence,
I can derive real value.
And it's not even about the cost.
It's we can't build another data center.
We can't really scale this to any larger of a cluster
because it's just prohibitively expensive or we're just out of room.
And so how do we shrink that footprint back down?
How do we push those SLAs to new levels?
And that's where accelerators come in.
Using all the same underlying technology that has made graph and geospatial and AI and ML
so fast,
we use that for ETL and feature engineering, and we can get significantly faster results.
But it's hard. It takes a lot of work. And, you know, we really couldn't get there without
a lot of the open standards that we depend on: Apache Arrow, Substrait, Ibis. And so what we've done is taken a holistic approach
to what's slowing down these systems and systematically removed a lot of these barriers
so we can get over the wall or get around the wall or break through the wall, however you want
to say it. We just have to be able to really accelerate this data munging. That makes a lot of sense.
When you think about open source and you talked about the importance of open source, when you talk to customers, why is that so critical?
Open source is really important for two reasons.
One, it accelerates innovation.
You never know what's coming next. And being able to kind of take,
you know, an idea and extend it further or, you know, integrate it into a new solution
allows people to quickly iterate on new ideas. And open source isn't just about
code being free. It's about sharing what's possible as quickly as possible,
so we can really continue to diversify ideas.
And so if all the foundations, how you connect systems together, how you build systems, were closed source, you'd constantly be repeating the exact same thing.
There's a lot of redundancy and just sort of deadweight loss of engineering.
And open source really ameliorates that.
You know, Arrow is a really great example of this because it standardizes how data is
represented across different languages and hardware.
And even something that simple on paper is actually a really hard, pretty large library to build and maintain.
But it's extremely powerful. If I don't have to copy and convert data when I'm going from a Java application to a Python application,
if I don't have to convert data into a new format when I'm going from CPU to GPU, I free
up so much system resources.
And when we start doing this at scale, it's really powerful.
It really starts to allow pipelines to be much bigger, faster, more resilient.
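To make that zero-copy idea concrete, here's a minimal sketch in Python using pyarrow. It writes a table to the Arrow IPC stream format, the same byte layout a Java, Rust, or C++ Arrow reader would consume directly, so the handoff is a buffer share rather than a convert-and-copy. The column names are purely illustrative.

```python
import pyarrow as pa

# Build a columnar table once, in Arrow's standard in-memory layout.
table = pa.table({"user_id": [1, 2, 3], "score": [0.9, 0.4, 0.7]})

# Serialize to the Arrow IPC stream format -- the same bytes any
# Arrow implementation in any language can read without conversion.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
buf = sink.getvalue()

# Reading it back maps the buffer directly; no row-by-row parsing
# or format conversion is needed on the receiving side.
reader = pa.ipc.open_stream(buf)
assert reader.read_all().equals(table)
```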
And open source allows people to really do these things, you know, in a more nimble and novel way. And
that's just kind of the open standard side. You know, the other side of this is, you know,
systems like PyTorch. PyTorch allows people to really, you know, create new deep learning
frameworks and models. It gives people these graceful on-ramps for when they want to bring new innovation to
market: hey, if you build it with this framework, people can use it quickly,
it gets integrated with some of their existing code. It just allows people to really
kind of iterate faster on ideas without having to start over from the ground up or wait for a specific vendor to release something new. And so open source really is about accelerating
ideas and reducing a lot of waste. That makes a lot of sense. One of the things that you
announced that I thought was really interesting, especially given your background in GPU, is that Theseus supports a number of different architectures
from a processor standpoint:
x86 CPUs, ARM processors, and GPUs.
And you also said even more accelerators in the future.
Why did you make that decision to be so broad
in terms of your support? And
where do you think customers are going with their deployments of different architectures
and heterogeneity? Excuse me. The reason why we went so broad is the fact that there's still a lot of x86 and ARM hardware available to users.
And there are times where you want to use whatever computing resource you have available
just to get the job done.
There's a scarcity of NVIDIA GPUs.
Everyone is talking about how hard they are to acquire.
The lead times are pretty long.
And we didn't want to limit people
with how they were doing ETL and feature engineering.
Part of this is we really wanted to improve
people's experiences really regardless
of what infrastructure they had today.
Now, it is faster, it is cheaper, and it's a smaller footprint
using NVIDIA GPUs and high-end networking like InfiniBand or RoCE,
and the performance numbers are extremely impressive.
But we should still be able to run these systems on x86 and ARM
because there's a lot of spot pricing with ARM in particular that's extremely cost effective.
There's a lot of innovation happening in the ARM space, where ARM is getting really large
memory systems and really high core counts. And, you know, it just adds a lot of value for customers to be able to write
code and target, you know, really a lot of different underlying systems.
When we talk about, you know, and more hardware to come, there's a lot of innovation up and
down the stack.
It's not just in the compute, you know, as we think about it today, but it's also in
networking.
We want to support an increasing number of networking backends.
And then storage.
Substrait, this kind of data analytics intermediate representation layer,
data analytics IR for short,
allows us to map logic to different engines
without having to necessarily rewrite code for an engine.
That engine, we typically think about that as a compute engine,
x86, ARM, and NVIDIA GPUs.
But that compute engine could be computational storage.
It could be filtering and doing projections at the lowest level
where the data is residing at rest.
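To illustrate the idea, here's a hedged sketch in Python using Ibis with the ibis-substrait package (its API has shifted across releases, so treat this as illustrative rather than definitive): an engine-agnostic expression is compiled into a Substrait plan, a protobuf that any Substrait-consuming engine, whether CPU, GPU, or in principle computational storage, could execute. The table name and columns are hypothetical.

```python
import ibis
from ibis_substrait.compiler.core import SubstraitCompiler

# An unbound, engine-agnostic table and expression; nothing here
# is specific to any one compute engine.
t = ibis.table({"id": "int64", "amount": "float64"}, name="orders")
expr = t.filter(t.amount > 100.0).select("id")

# Compile to a Substrait plan: a protobuf intermediate representation
# that different engines can consume without rewriting the query.
plan = SubstraitCompiler().compile(expr)
print(len(plan.SerializeToString()), "bytes of engine-portable plan")
```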
And so we built the system
to take advantage of all the acceleration we have today, but there are many people working on
computational storage right now. Wouldn't it be great if, when these
storage systems go to market, there's an engine out there that could immediately leverage those computational storage systems,
one that had high-quality integration and, through open standards, knew how to interact with
them very gracefully?
And so not only did we want Theseus to be really good at what's in the market and available
today, but what's going to be in the market soon.
And I do think that computational storage
is kind of the next frontier of data analytics.
It's not just accelerating
kind of your in-memory computation.
It's not just gracefully spilling from
accelerated memory to system memory to disk.
But it's also pushing computation to the storage devices
and eventually the networking devices as well,
being able to do smarter compression and decompression
as data is in transit.
Yeah, that makes a lot of sense.
Now you announced this with HPE
and they said that they're going to fold it
into their data lake.
I'm sure others will follow.
What have you been hearing from customers more broadly about the Theseus launch and why is it so important to have partners like HPE at your side? People are excited. So that's
great. One of the things that, you know, when we were still in quasi-stealth, people always
asked was, well, is it a SaaS?
Is this going to be our next database?
Is this going to be our next data lakehouse,
data warehouse?
One of the things that we addressed,
I think fairly well, hopefully well,
by partnering with HPE is that,
you know, we're not trying to be a database.
We're not trying to be a data lake.
You know, I've said this a lot. Data has gravity, but people still need really great
computational systems. They need accelerated systems. They need to be able to do more with
their data with, you know, less hardware, less footprint. And this is why partnering is, you
know, really important to us. You know, HPE, with Ezmeral Unified Analytics and GreenLake, their hybrid cloud SaaS platform, they have a really large user base.
They have their own data fabric.
It's just a very well-established company, and their software growth is, you know, phenomenal. And so now taking, you know, all this
innovation in hybrid cloud and, you know, their ability to allow users to spin up different
engines, whether it's Trino or Spark, was already kind of pushing the bounds on, you know,
composability and, you know, how we think about systems of the future.
It's not one size fits all.
There's no silver bullet.
And so the partnership with HPE just felt really natural because they already had the system
that allowed users a lot more freedom
to do what they wanted to do with their data,
use the engines that they want.
And so introducing another engine,
ways to interact with different engines
through Ibis, this kind of unified Python front end, it just felt natural.
It worked.
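As a hedged sketch of what that unified front end looks like in practice (assuming a recent Ibis release with the DuckDB backend installed; the data here is made up), the same expression code executes on DuckDB locally and could in principle be pointed at a different backend without rewriting the query:

```python
import ibis

# Connect to one backend; swapping this line for another supported
# backend is the only engine-specific code in the script.
con = ibis.duckdb.connect()  # in-process DuckDB

# An in-memory table and an engine-agnostic expression.
t = ibis.memtable({"region": ["east", "west", "east"],
                   "sales": [10.0, 20.0, 5.0]})
expr = t.group_by("region").aggregate(total=t.sales.sum())

# Execution is delegated to whichever backend `con` points at.
print(con.execute(expr))
```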
But it really solidified to our customers that we're not trying to change how you do things.
In fact, we want you to own your code.
We want that code to be as widely deployable as possible.
We just wanted to be able to target,
you know, accelerated systems as well.
And a lot of customers are really happy about that.
You know, too often people want to kind of lure you into their new state-of-the-art really fast system
and then lock you in,
build walls that kind of keep you as their customer.
And, you know, we really wanted to build ramps and bridges.
You know, people want to write code and target some other system than ours and then target Theseus for certain use cases.
That's completely fine.
We're very happy with what's happening in the single node space.
You know, like DuckDB, for instance.
It's an amazing single node engine.
It's extremely fast. And if you have a problem that's a terabyte
or smaller, you know, there are enough ARM servers and x86 servers with terabytes of
system memory, just use DuckDB. You know, it's a really great tool
for a lot of problems, not even small scale, you know, fairly large
data problems. And how cool is it that you can, you know, write code and run it on your laptop for,
you know, really small scale problems, scale to these, you know, really large fat nodes,
and, you know, do computation there. And then, you know, also run that on distributed GPUs.
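As a minimal illustration of that single-node path (the file name is hypothetical), DuckDB runs in-process and can aggregate a Parquet file directly, with no cluster to stand up:

```python
import duckdb

# In-process database: no server, no cluster.
con = duckdb.connect()

# Query a Parquet file in place and pull the result into pandas.
df = con.execute(
    "SELECT region, SUM(sales) AS total "
    "FROM read_parquet('events.parquet') "
    "GROUP BY region"
).df()
print(df)
```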
And customers really were just like, ah, thank you. You know, like that's refreshing.
Very different.
You know, people, I think, really expected us to build a SaaS.
And, you know, we would tell people, we're not building a SaaS.
And they're like, ah, yeah, sure.
We're going to wait for, you know, whenever you launch your SaaS. And the launch really reiterated that, you know, we're focused on building open standards to allow people to have more options, to be more efficient, to build better systems, but not to be as prescriptive of what those systems have to be.
That makes a lot of sense.
We're headed into 2024. I think this time last year, we were just starting to hear about ChatGPT and what that would change in 2023 around AI and the acceleration of AI and generative AI.
As you look forward and you're talking with other folks in the industry, you obviously have your finger on the pulse of what people are thinking about in terms of data pipelines and what they're trying to do with them.
What do you think we're going to be talking about next year?
And where do you think Voltron fits into that broader conversation?
I do think that open standards are going to be kind of the next evolution of open
source, not just open source as in, you know, hey, this is open source, but, you know, we
changed how X, Y, and Z work.
I think people are really going to start to lean into open standards a lot more.
Hey, this system is great, but why can't it also run code that was written for this other system?
Modularity, interoperability, composability, extensibility.
I think those four words are going to become much more common in enterprises.
Why does this system, you know, not allow me to bring this type of data or API, you know, to it?
And I think people are going to start to, you know, really request that, you know, more systems start to adopt these open standards.
You know, Arrow in, Arrow out, you know, talking with kind of these universal front ends like Ibis.
And we're already starting to see that.
And so I think that's going to be, you know, one trend in 2024.
It's going to remain strong.
I think the other one is acceleration.
You know, we've lived in this world too long of just, oh, just scale out, just throw more hardware at it.
I think performance is finally coming back into relevance. People are
excited about performance again because costs matter. And with all the different things going on
around the globe economically, you just can't just spin up thousands of nodes and just have
these long running overnight jobs. People want things to happen faster and more efficiently.
And so I think, you know, we're going to start to see a lot more focus on the intersection
of HPC and data analytics.
I think, you know, kind of those two worlds have already grown apart, but they're going
to come back together in 2024.
And one area in particular that I'm excited about is computational storage.
I think, you know, we're going to see a lot of computational storage startups, a lot of existing storage players really start to push the bounds of computational storage.
And hopefully, open standards and kind of computational storage will, you know, kind of see eye to eye early on.
So maybe we will have fewer of the mistakes
that we saw in the big data space
where you had competing standards happening.
That's fantastic.
I've learned a lot in this interview.
Thanks for being on the show.
I can't wait to hear more from you and your team.
And I'm sure the listeners online are also intrigued
by Theseus and what you're delivering to the marketplace.
Where would you suggest folks find out more and talk to your team about an evaluation?
Absolutely. VoltronData.com, our website. I think it's one click away to learn more about Theseus,
get a demo. We're excited to really work with large-scale enterprises
to address their data challenges
and hopefully be able to kind of
accelerate their data systems
as well as we'll have
some new joint marketing with HPE
in early 2024.
But right now, the Voltron Data webpage.
And very soon, if there are any kinds of issues that
people are having with any of our open source products that we support, like Arrow or Ibis or
Parquet, drop us a line. We're happy to have our team talk with them, evaluate, and see how we can
help. Awesome. Josh, thank you so much for being on the program today. It was really fun.
Thank you, Allyson.
Thanks for joining the Tech Arena. Subscribe and engage at our website, thetecharena.net. All content is copyright by the Tech Arena.