The Data Stack Show - 133: Building the Data Warehouse for Everything Else with Sammy Sidhu of Eventual
Episode Date: April 5, 2023

Highlights from this week's conversation include:

Sammy's background in data and tooling (2:46)
Going from multipurpose engineering to a CTO position (5:14)
Changes in technology and deep learning models (7:31)
The state of self-driving and adoption (13:49)
What is Eventual and what are they solving in the space? (20:54)
What are Daft and DataFrames and how do they work? (28:11)
Building a query optimizer (33:42)
Sammy's take on what is going on in data and future possibilities (45:18)
Eventual's future and its impact on the space (51:44)
Final thoughts and takeaways (53:47)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
Kostas, I'm so excited that the people that we get to talk to on the show
continually amaze me.
We're going to talk with Sammy,
who's building Eventual,
but he was at DeepScale,
self-driving AI technology
acquired by Tesla.
He was at Lyft,
built the Level 5 team there,
working on self-driving stuff.
The team was acquired by Toyota.
And is now building tooling to help people who do those sorts of things have a much better experience shipping sort of large scale projects and models around complex data. And what I want to ask Sammy is,
he's had a dedicated focus on a very similar type of problem
over a pretty long period of time.
And if you think back,
and we talk a lot about the recent history of data technology,
but if you think back to sort of, I guess, 2015,
when he started at DeepScale, you know, there were still a lot of limitations in terms of running models at scale
and, you know, data storage and all this other sort of stuff, right? And so I can't wait to hear
the trends that he thinks are most important from his perspective, and then what, you know,
what he's building at Eventual based on all that experience.
How about you?
Yeah, I have plenty to talk about with him.
First of all, we have to talk about data frames.
We have to talk about Pandas, the Python ecosystem around data.
And I'd like to hear more about like Eventual itself, like
what it is and what's the vision.
And why they built this thing.
So I think we are going to have like a very interesting conversation, especially
because we have like a person here who has done like all these things around
training models and building models and self-driving cars and like all that stuff.
And who today is like starting from scratch, like a company that has to do
with like developer tooling and data warehouses in ML, but I think that says
a lot about like current state of like tooling and technology that people have in ML.
And I want to hear that from him.
Like I'm very interested, like to see what's going on and why he made his decision.
Right.
So let's do that.
Let's talk with him.
All right.
Well, let's dig in.
Let's do it.
Sammy, welcome to the Data Stack Show.
So much to talk about.
So thank you for joining us.
Glad to be here.
Okay, you have an amazing story and you seem to have this knack for getting acquired by large automotive corporations multiple times in a row, which is really fun. But take us back to the beginning.
So can you just give us sort of the overview of your journey in data? Oh, yeah, for sure. So once again, I'm Sammy.
It all started when I went to Berkeley, where I focused on high performance computing,
aka making things run fast, and on deep learning. So this is kind of the era where
deep learning started taking off, neural networks were starting to seep into things.
And I had found a research
lab that focused on putting the two together. So I worked on everything from making models train
really quickly on large supercomputers to making small neural networks. And my professor and PI,
who I worked with, had the idea of starting a company. So they started a company called DeepScale,
which I joined as one of the early engineers. And during that process, we started with three people
to where we ended up with nearly 40. And I was a CTO towards the end. So there I worked on
everything from building deep learning compilers, training new novel research models, to building
entire data engineering stacks to process things like
point clouds, images, radar, you name it. Towards the end of that, we actually got acquired by Tesla,
the autopilot team, where the majority of my team got absorbed into autopilot working on training
models or building infrastructure, taking all the learnings that we developed at DeepScale at the
time. After that whole ordeal, I went to Lyft, where I continued working on self-driving, where they were a little bit earlier. I brought
everything I've learned from before and trained better models for perception, built entire data
engineering pipelines, and really refined that process of how do you actually ship for self-driving.
I did that for about three years, and my team was acquired once again, by Toyota this time.
After that, I was like, hey, you know what? I picked up a lot and I'm really kind of tired
of the tooling I've had to deal with using systems that were designed for tabular data,
things like Spark and BigQuery and kind of adapting it to make it work with images and
point clouds and whatnot. So I was like, hey, things can be a lot better. And my co-founder and I, who I had met at Lyft,
decided to start Eventual to kind of build a data warehouse
for everything else.
Love it.
And so much to ask about Eventual.
One thing I'd like to start with,
because I think it'd be really helpful
for a lot of our listeners, and I'm just plain curious,
going from being sort of a multi-purpose engineer, obviously with an emphasis on sort of
infrastructure that's feeding sort of heavy duty data science stuff,
very practical, wearing lots of hats like early stage company, and then you become CTO.
What was that transition like? And what was sort of, what were the things that like really stuck out to you sort of becoming CTO, as opposed to like, I'm executing a lot of practical stuff
day to day to like, you know, get the software to work? Yeah, it's an interesting shift.
When you're an engineer, the kind of fulfillment you get is
like, what did I accomplish this week or today? Or what feature did I ship? And then when you
kind of transition into like a manager, right, or a tech lead, it's like, what features or what
product or what impact did my team ship? So you kind of have to, like, change your mindset a little
bit to get more fulfillment and happiness from that and
then when you're a CTO, it's even one degree removed from that, which is: what is my company
shipping? How are they making their customers happy? And the thing that you can't measure is,
are the decisions I'm making now, which have an impact a year from now, going to pay off? Right? So
you kind of have to, like, modify the discount factor in your head for every step of the way.
Yeah, that's super tough.
Was it a steep learning curve
sort of thinking
about that discount factor and having to think
much further ahead than you
ever had in sort of more
practical engineering roles?
I think so. Because in the beginning
when you transition,
it's like you still try to do a little bit of your old job
while doing some of the new responsibilities.
And what you end up doing
is a bad job at both.
And so you kind of have to like
learn how to like,
okay, I have a team to kind of
manage the things I used to do.
I need to now focus and get better
at the things that are now
more important for me.
Yeah, that makes total sense.
Thank you for indulging me on that.
No, it's just that's, you know,
sort of like engineer to manager to CTO.
You know, those are just very different, as you said.
I'd love to get your perspective
on the sort of the problem space
that you operated in and are still operating in
over a pretty long period of time, right?
So you were building a lot of this stuff back at deep scale. And then you sort of saw that through
to Lyft. And then, of course, you worked on Toyota. and now are building your own thing. And so over that span of time, you know, there have been a number of things that have changed,
right?
I mean, the amount of data available to us or the amount of like non-tabular, like complex
data seems to me to have grown a significant amount of that time.
I mean, you actually even have like deep learning generating like a bunch of that net-new data,
which is really crazy to think about. It's not just like capturing, you know,
images of the real world, but then also the technology has changed a lot as well. Right.
So like infrastructure, you know, multiple new technologies in sort of the data science
space have come out. Other things haven't changed. As you look back sort of over
your experience, and maybe even sort of tying that into things that led you to founding Eventual,
what are the main changes you've seen over that time period since you started at DeepScale?
Yeah, it's really interesting. I think there's been step functions along the way. So I think initially when you're talking about 2015, 2016,
the model you can ship,
the model that you would build for doing perception
or whatever task you were doing,
was kind of limited by how long it would take
to train on a single machine.
So it would be like you would have a server
racked up somewhere in your office
and you typically wouldn't want to wait more than three to four days
to train it.
And that kind of set the limit
of how much data you could train on
and how big your model could be.
And at the time, you would kind of just like,
your iteration time would be limited by that.
So you had to be very careful with what you trained
and the data that went into it.
And then kind of the next step
that happened is
things like distributed training
became very ubiquitous.
We had frameworks
that made it a lot easier.
We had people who had
more expertise with it.
With that, the volume of data
was now proportionate
to how many GPUs
you can train on in parallel.
And with that,
that gave us a huge explosion
in both the amount of data we can train on.
So we went from things that were, you know,
tens of thousands of images or hundreds of thousands of images
to now millions of images that we train on at any given time.
And then the second thing was that our iteration time
went from something like four or five days
to maybe a day or sometimes even hours.
And with that, you could actually crank out a lot more iterations of your model.
And in kind of self-driving, the thing I believe is the person who gets the most iteration
essentially wins.
Because every time you can do something, get feedback and improve on that,
it's kind of like the sharper journey that you can take for improving your overall system.
So that was the next step.
And then we kind of hit a point where the models weren't really advancing anymore. Or another way
to put that is the models were no longer the bottleneck. Before there was a point where if I
just, you know, change my model to the latest and greatest paper, I would just get like a,
you know, a jump in performance. But we kind of hit a point where that didn't make a difference
anymore. And the point that was really crucial was, well, how is my data quality?
How can I improve my dataset?
Are things badly annotated?
Do I have examples with conflicting truth?
And that's where the data game became really important.
So that's where we were: instead of changing the model, we would just dive into the data set, find the failures, figure out what to do with it, and then iterate on the data sets. And these data systems became very important.
It's interesting that back when you had these, let's call them, physical constraints,
you mentioned that you had to be really careful,
which makes a ton of sense, right?
Because you're trying to maximize the amount of,
you're trying to maximize every hour that this model is running,
and you're still trying to create a fast cycle.
Do you think anything was lost?
A, did you have to be,
like, did you have the license to be less careful?
Like in a world where you can just run this thing on so many GPUs simultaneously
and like you don't have that physical constraint,
like, do you have to be less careful?
And then are there consequences for that
where it's like, well, I mean,
maybe that was helpful in some ways.
It kind of was.
The analogy I draw in my head is,
if you think about programmers way back in the day,
where if you wrote code and ran it through the compiler and it took forever,
you had to be very careful what you decided to compile, right?
Or things that if you're putting it through a punch card,
you have to be very careful what you're doing.
But nowadays, we kind of just write code and smash compile,
and then we get an error warning right away. Yeah.
Then that's kind of the same question of, like, oh, were programmers back in the day
better? And my answer is, like, I think programmers today get to focus on higher-level concepts
rather than, like, doing the exact job of the compiler. I kind of view it the same way,
which is: having just a lot more GPUs with a lot more compute is simply just trading off
like human time for computer time.
Yeah, that makes total sense.
That makes total sense.
That's such a good analogy.
One thing I'd love to, we actually have had,
we had a guest on the show previously
who had also worked in a self-driving space,
which is really interesting.
And one conversation we had with him,
this is actually, this has been a while ago,
I think, Kostas, maybe a year and a half ago.
Peter from Aquarium, I think.
Yeah, I know Peter.
Oh, okay, great.
We worked in the same lab together.
Oh, no way.
You worked in the same lab.
Okay, that's awesome.
I wasn't aware of that connection.
Okay, this is great.
So we'll
get an updated take from you on Peter's take from a while ago. So there's been an immense amount of
work that you and Peter both have personally invested into the self-driving space. But for
the common person out there, the news headlines tend to run ahead of the practical experience of self-driving.
So where, like, could you describe the state of self-driving from the perspective of someone who's like built literally like core technology that's enabling this to become a reality in terms of like mass adoption of this or sort of, you know, implementation?
Yeah, it is interesting. It also depends on
what part of the landscape you're looking at. So on one side, you have what I call the bottom-up
self-driving, which is things like, you know, if you bought a car five years ago, it would have
features like AEB, which is automatic emergency braking. Like if you're going too fast for something, it would brake automatically.
So that was kind of the bare bones
of safety features that you would have in a car.
But now you have things like adaptive cruise control,
automated lane keep,
and it's kind of going higher and higher every year.
And then you have the other front,
which is what I call like top-down,
which is you're having things like robo-taxis
with no steering wheels,
and eventually that technology will trickle down to the everyday man.
So they're both making progress.
But if you actually think about, you know, the bottom-up approach,
the progress can be gradual, right?
You can have a car every year where the features get better.
And I think Tesla has shown this example quite a bit,
which is it works in more places and it's a
little bit better, and you have companies like Mercedes and, you know, Audi also putting out
these features. But for the Level 5 thing, where I get into a taxi with no driver, no steering wheel,
that's like a binary threshold. Yeah. It works or it doesn't. Yeah. So for that one, I do imagine
one day we'll pass that threshold.
But for now, the bottom-up approach, the stuff I did at DeepScale and
Tesla, I feel like is making the most progress.
Yeah. Are the technologies that drive both of those, what's the relationship between
sort of the underlying work? Like was the baseline work feeding both of those efforts or
are they approached pretty differently?
They're completely different. So if you're just tackling highways, it's actually very simple
systems. And we can actually build systems that can drive 90% of the highways in the US without
too much difficulty. Wow. Yeah. Wow. So it's not too bad. And we are seeing cars that are starting
to do that. Yeah.
However, when you start thinking about like,
oh, I want to do the off ramps, on ramps and the cities,
then that's when things just get very hard. I think self-driving is the most insane case
of the long tail I've ever seen
where the last 10% of work is just ungodly.
Yeah.
How do you...
I mean, certainly part of that last 10% of work is dealing with actual changes,
right? I mean, maps have actually gotten incredibly good at incorporating user feedback,
even in real time, which is amazing, right? I mean, you've seen this happen over the last
couple of years where, like, you know, Google Maps and Apple Maps, right, will, like, prompt you and say, like,
is the wreck still there or whatever,
right.
You know,
or is the construction still there?
And so those feedback loops are amazing.
So those maps are getting better.
But when you like,
how does that change?
Right.
Because traffic patterns change,
construction happens,
all of those, like, incorporating user feedback into navigation instructions
to quickly recalculate which street you need to go on
is related to that.
But when you have a car with no steering wheel,
do you approach that differently?
I mean, they're still using map technology,
but it's pretty different, or it would seem different.
Yeah, so the mapping technology is interesting for that.
But also just even general perception, just trying to understand what's going on around the car.
It's quite interesting because for the case of not having a steering wheel in the car, you kind of need to get to a critical point where you can adapt to the change in the world as fast as the change is happening in the world.
So I'll give an example. When I was working in self-driving,
one thing that came out of nowhere was we had models and mathematical models
to kind of represent the motion
for pedestrians on the street,
how a person would walk on the street,
or if they might be on a bike or a motorcycle.
But then out of nowhere,
suddenly SF had thousands of electric scooters.
Oh, yeah.
People would now be on the sidewalk and the road.
Yeah, the bird craze, yeah.
Yeah, it was wild.
And you would see all these random scooters in the street.
And then now your whole set of priors before are now useless.
Yep.
Right?
So you now have to adapt to that change rapidly.
Yep.
And so that's kind of how it is with self-driving.
Why it's so hard is the world isn't like static.
It changes.
And you need to be able to keep your data loops,
your model loops, your ability to ship
as fast as the world is changing.
Yeah, that makes total sense.
One really specific question is,
did you work on distributing models,
like distributing updated models to the actual like
fleets themselves
right because like you have this
challenge of like getting data to update
the model but then you actually have to redistribute that to
the fleet is that problematic or is it
is that actually pretty streamlined now
I would say
it's pretty streamlined. The hard part
is doing everything
from the point where you
have a trained model to being like
I can safely deploy this.
You have to do simulation.
You have to simulate this model
and the whole stack on tens of
thousands of GPUs and simulate
over all your historical data.
Do things like hardware simulation
and then finally do a small rollout
and then finally the full rollout. It's usually a lot of work, and it's why self-driving is so ops-intensive.
Yeah. Yes. Okay, so no, no listener can complain about QA anymore.
I would say a lot of the staff at self driving companies are QA.
Yeah. Yeah, that's wild. Okay, I could keep going.
One more question for me, but then I'm going to hand the mic off to Kostas. Do you own a car? And
if so, what kind of car do you own? Because you've been acquihired by multiple car companies.
So I just need to know this. So I'm a horrible person to ask this because I love driving. So I drive like a...
I actually am a car guy.
Oh, nice.
I drive like a 1990s BMW that I work on.
Oh, I love it.
And then for my daily commute, I have a Toyota RAV4.
Yeah, totally.
That's great.
I have an old 1985 Land Cruiser that I work on.
So I'm the same way.
It's like this is, you know, pretty low tech, but super fun.
Yeah.
I mean, it's a stick shift.
I enjoy driving it.
Yeah.
Love it.
All right.
Kostas.
Thank you, Eric.
All right, Sammy.
So you have founded something new that's called Eventual, right?
So can you tell us a little bit more about, like, first of all, like, what Eventual
is?
And then I'd like to ask you what made you start working on that.
Yeah, the way I would sum up Eventual is that
we're building the data warehouse for everything else.
So you could think of things like BigQuery or Presto or Athena.
And these work amazingly for things like tabular data.
Or anything that fits in an Excel
spreadsheet. But what if you
have thousands or millions
of images or video or 3D
scans? It doesn't
really work quite well. What we're
doing is building something native for that.
And to do that, we're actually
building an open source query engine
to help query this type of data.
The way I like to explain why we need a
new query engine is to think about what is kind of
the natural user interface for tabular data. If you ask most people, hey, what's the most natural
interface to tabular data, they'll tell you SQL. I agree that SQL makes a lot of sense for tabular
data. But if I'm starting to talk about images and video, do I really want
to use SQL to query video or like images or random complex data? It doesn't really make quite a lot
of sense. What does make sense is having something where the ecosystem is. If you do any type of
machine learning or complex data processing, you're probably using Python. You're probably
using tools like PyTorch and TensorFlow
and various image processing libraries there.
So we're building a data frame library that's distributed,
utilizing a Ray cluster underneath,
but is native to the Python ecosystem.
You can use your normal Python functions,
your normal Python objects, but under the hood,
we have a really powerful vectorized compute engine
written in Rust, and also have a powerful query engine and query optimizer and all the
special things that you would want in a data frame.
So, all right.
Talking about like data warehousing, like for imaging and for like, let's
say data types that are outside of the typical relational model that people
have been experiencing so far.
So if this thing does not exist today, right?
Like how were you doing the work that you were doing all these years, right?
Like what was the state of the tooling there?
And how good of an experience was it?
Yeah.
So for companies I worked at or companies I've helped out,
the first step of the process, which usually does not change,
is the equivalent of having a bunch of files on your desktop folder.
The way that most people do it is they have a bunch of individual images
or video just sitting in the S3 or Google Storage bucket.
And in the beginning, that's completely fine. They use that directly to train. But then they're
like, hey, you know what, I want to kind of version what data I have or keep track of some
metadata. And so you end up building a system that first, you know, you have the individual
files and something like S3, but then you use something like BigQuery or Postgres to store
all the metadata.
Then it kind of starts evolving from there, which is, okay, now we need an easy way to access the data.
So then you typically build an abstraction on top of it using, like, a workflow engine. So now you have something where it's like the tabular data is here, the complex data is here.
You have a pointer that points to it, and then you have something like Airflow or Spark to kind of process it.
And then what ends up happening is that this spaghetti gets piled more and more on top,
until eventually you have three teams managing this one system that's very limited.
And the work we're trying to do is kind of build an engine that kind of bridges the two,
which is you can process all the tabular data that you need, write really expressive queries,
but then also have things like image columns, video columns, and you can kind of interact
with both in one place.
The next step for the data warehouse is how do you actually store the data together?
Such that things don't go stale from one another, things are versioned together
and a schema is tied together as well.
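To make that pattern concrete, here is a minimal sketch of the "metadata table plus pointer to a blob in object storage" setup described above. The bucket name, column names, and helper are made up for illustration.

```python
# A minimal sketch (not from the episode) of the "pointer to a blob" pattern:
# tabular metadata in one system, raw files in object storage, glue code in between.
import io

import boto3
import pandas as pd
from PIL import Image

s3 = boto3.client("s3")

# Metadata would live in Postgres/BigQuery; here it's just a DataFrame with a
# column that points at an object key in S3.
metadata = pd.DataFrame(
    {
        "frame_id": [1, 2, 3],
        "vehicle": ["car_07", "car_07", "car_12"],
        "s3_key": ["frames/000001.jpg", "frames/000002.jpg", "frames/000003.jpg"],
    }
)

def load_image(key: str) -> Image.Image:
    """Dereference the pointer: fetch the raw bytes and decode them ourselves."""
    obj = s3.get_object(Bucket="my-complex-data-bucket", Key=key)
    return Image.open(io.BytesIO(obj["Body"].read()))

# Every downstream job repeats this join-by-hand between the two systems.
for row in metadata.itertuples():
    image = load_image(row.s3_key)
    # ... run a model, compute stats, write results back somewhere else ...
```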
So what made you leave the work that you were doing, right?
Like training the models and, like, all of that stuff, and turning to, like,
tooling, right? Because
my feeling is that from what I hear from you is that
yeah, like, okay, they found
additional work that's getting done, like,
it's amazing, but we have risked
probably, like, a point right now where
we're slowed down because, like,
the tooling is not there yet, right?
And I think, like, for anyone who has been, like, an engineer
for some time, and
regardless of what engineer you are, right, like data engineers or front-end
engineers or whatever, like you can see that not every field, let's say, has
access to the same type of tools, right?
Like what is happening like with front-end development, for example, like
the tooling that is out there, like all the different... I know that people keep
complaining all the time about
all the JavaScript
libraries and all these
things, but at the end, that's also an indication
of growth and
progress in terms of the tooling.
In my feeling, at least, if you
compare the tooling that a data engineer
or an ML engineer has today
compared to the tooling that
a front-end engineer has, there's no comparison
there. It's a
very different experience in terms of
how much the tools help you do
your job.
So,
how big
of a problem do you think this
is?
I guess the answer
is easy because you moved on
and you're building
a company around that.
But I'd love
to communicate
to the people
out there
the complexity
and how much
of a problem
it is for not
having the right
tools out there
to do your job.
Yeah.
That's a good question.
I would kind of
put it this way,
which is,
for the problems you mentioned for front-end,
and also I believe for tabular, there's kind of a path for graduation, if you will, right?
If you're starting off with a data science project or doing some data engineering,
and it's just small sets of data, you can use Pandas and you'll be completely fine.
And once you need to graduate to like, hey, I have more data now.
It's taking too long.
I can't fit on one machine anymore.
You have tools to go to. You have data warehouses.
You have things like you have a bunch of different tools that you can go to.
However, for the kind of domain that we're tackling, there isn't really a path besides
building custom infrastructure for the problem I'm solving.
So what's happening is that barrier can actually slow your progress quite a bit.
So what we see for a lot of startups
and people just doing projects
is that their data set size
that they can process and kind of comb through
is completely limited to whatever
they can process in one machine.
And if you think about what the implication of that is,
is that complex data typically is a lot larger too.
So if you're processing video or images, that's actually not very much data at all. So what we're trying to
do is kind of build a tool, you know, via Daft, to kind of give people a path to graduate,
to give people a way that like they can start off processing data this way. And when they need to
scale up, it will scale with them. And then finally, when they're a larger company around it,
we give them all the benefits of a data warehouse
that you typically find in something like Snowflake or BigQuery.
Yeah.
So, okay.
So what's Daft?
Yeah, I'm glad you asked.
So yeah, Daft is a Python data frame that's distributed,
that's essentially made for complex data.
And what is a data frame?
Yeah, a data frame is, if you're familiar with pandas,
or Polars, that's essentially a DataFrame.
A DataFrame is a way to represent a dataset
with a set of columns.
And it's kind of very similar to when you query a SQL database
and you essentially get a set of columns
and rows that represent those columns.
The thing with Daft that's a little bit different
is that a column can be something you're used to, like an
integer or, like, a float or string.
But it can also be an image or
a video. You can do operations like
okay, I'm loading in these MRI
data
into my data frame. I have things like
the patient name, the patient ID,
whatever. And then I can do something like an
explode on the MRI image
and actually get individual slices.
And I can then natively run a model on
these slices and determine, is there
cancer? And determine like, oh yeah, what's the
difference between these different frames? You can
run all these different operations just using the
normal tools you would use in Python.
So it gives the machine learning
engineer or data scientist a lot of
power with a very simple idea.
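A hypothetical sketch of that MRI workflow, written against a Daft-style DataFrame API. The reader, explode call, UDF hook, and column names below are assumptions for illustration and may not match Daft's actual interface.

```python
# Hypothetical sketch of the workflow described above; names are assumptions,
# not Daft's documented API.
import daft  # assumed: pip install getdaft

def score_slice(slice_image):
    # Stand-in for a real PyTorch/TensorFlow model call on one 2D slice.
    ...

# One row per scan: ordinary columns (patient id/name) next to a 3D image column.
df = daft.read_parquet("s3://hospital-data/mri_scans/")  # assumed reader

# Explode the volume into one row per 2D slice; the tabular columns stay attached.
slices = df.explode(daft.col("mri_volume"))

# Run normal Python code over the image column and add the result as a new column.
scored = slices.with_column(
    "tumor_score",
    daft.col("mri_volume").apply(score_slice, return_dtype=daft.DataType.float64()),  # assumed UDF hook
)

# Everything above is lazy; collect() triggers the optimized, distributed run.
high_risk = scored.where(daft.col("tumor_score") > 0.9).collect()
```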
How is this different from having, let's say, a standard data frame or even
Pandas, and using, like, a data type that corresponds to binary data, like a byte
array or something like that, right?
Because one of the things that I have noticed that all the stuff that we are
talking about here is primarily binary data, right? So what's the difference? What is needed on top of that, like to make
this experience like better with working with this data? Yeah, yeah. So I think the biggest
thing there is how do you actually represent it in a way that's efficient? So the actual
operations you do are fast. The second thing is, what does the user interface look like? And then the third thing is, how do you actually make this scalable? I think with Pandas,
you can put things like NumPy arrays or images inside the Pandas data frame as an object. But
the problem is then distributing it over a cluster or natively using what you're used to like
PyTorch or TensorFlow, is difficult.
So what we kind of do is we represent it itself
as an image column and can do things
very intelligently under the hood.
So for example, if you have something that is an image
and you want to send it across the cluster,
we can do things like keeping it in a format
that's most efficient, if it was like a JPEG, for example.
I think another thing that's pretty interesting is that
if you are going to distribute it,
the tools that you have are usually not as powerful.
So I think the biggest tool that most self-driving companies would use
is Spark.
And if you're using the Spark DataFrame API,
one of the key things there is that
you're kind of limited to the types that it can support.
So you could do things like integers and strings, and they do have a way to just make whatever
object you're dealing with into just a bunch of bytes.
But the problem is then is that's not very pleasant for the user.
I as a user have to constantly convert back and forth from whatever I'm doing to bytes.
And then whenever I'm trying to read it back, I convert bytes back to the thing I'm trying to do.
And that's just not very nice.
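For contrast, here is roughly what that bytes round-trip looks like with a Spark DataFrame, sketched with PySpark and Pillow. The paths, column names, and metrics are made up; the point is the decode-and-re-encode boilerplate inside every UDF.

```python
import io

from PIL import Image
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import BinaryType, DoubleType

spark = SparkSession.builder.getOrCreate()

# "image_bytes" is just an opaque binary column as far as Spark is concerned.
df = spark.read.parquet("s3://my-bucket/frames/")

@udf(returnType=DoubleType())
def brightness(image_bytes):
    # Every UDF first has to decode the raw bytes back into an image...
    img = Image.open(io.BytesIO(image_bytes)).convert("L")
    return float(sum(img.getdata())) / (img.width * img.height)

@udf(returnType=BinaryType())
def downscale(image_bytes):
    # ...do its work, then re-encode the result back to bytes for the next step.
    img = Image.open(io.BytesIO(image_bytes))
    img = img.resize((img.width // 2, img.height // 2))
    buf = io.BytesIO()
    img.save(buf, format="JPEG")
    return buf.getvalue()

df = df.withColumn("brightness", brightness("image_bytes"))
df = df.withColumn("thumbnail", downscale("image_bytes"))
```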
Yeah, I would also assume that like probably,
let's say query optimizers for these systems
are probably oblivious to this type of data, right?
Like they cannot really use the information
from a binary array or something like that
and optimize the query itself, right?
So do you see opportunities to add this kind of functionality if you are more semantically
aware of what is stored there?
Yeah, 100%.
So we have a very powerful query optimizer in DAF that kind of handles these use cases.
I think one of the things that happened, for example, with Spark is when people do get
frustrated using the DataFrame API, where you can only use bytes, they drop down to the very low-level API, which is called an RDD.
Yeah.
Which is just, I have a big collection of rows of whatever.
And you lose a query optimizer.
You lose all of the out-of-core processing.
You lose kind of all the benefits of a Spark DataFrame.
So for us, we can combine the best of both.
We have a DataFrame API that's very intuitive.
You can use GPUs very efficiently.
But we can also optimize quite a bit when you write your queries.
So whenever you do calls in DAF, it's very lazy.
It stacks up and makes a very good query plan.
And we find the most efficient way to actually run it on your cluster.
Yeah, absolutely. And I think it also completely defeats the whole purpose of
having something in Python, right? Which is how
easy it is to work with it. When you go to
the RDD level, you pretty much have to
read the publications that the folks did back in
Berkeley back in the day
to figure out how to work with these. When you reach the point where you have to write an RDD,
it feels almost like, oh, I write Java, but now it's probably better to use JNI or something like
that to go and interoperate with C and start writing C.
So that's not what you want to do here, right? Like, yeah, sure, if you want to optimize, do it.
But at the end, it shouldn't be like how everything happens, right?
That's just like a bad experience.
Question.
You keep talking about images, right? But I would
assume there are also other formats
that, like other types of data
there, right? I don't know,
like a radar probably is not
generating images, it generates something else, I don't
know. Or it can have audio,
right? Is there support enough
for this, or the focus
is on images right now?
They are supported.
So we kind of support audio,
these different types of modalities
that essentially wouldn't fit in a regular tool
that you're using there.
We've had some pretty interesting use cases.
We had a user who was dumping protobuffs
from a Kafka stream just into S3.
And they're like, hey, I want to just query
a bunch of these protobufs
without having to like ingest everything.
So what they could do with Daft is just say, okay,
read all these S3 files,
deserialize it using my proto schema,
and then find the ones that have these fields.
And rather than building, you know, ingesting
everything into BigQuery or some
big heavyweight data warehouse,
they could just spin up Daft and
essentially write their query in four lines of code.
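To spell out what that query is doing logically, here is the same scan written by hand in plain Python with boto3 and a generated protobuf class. The bucket, prefix, message type, and field are made up; the contrast is that Daft expresses and distributes this in a few lines instead of a single-machine loop.

```python
# What that scan does logically, written out by hand (names are hypothetical).
import boto3
from my_schemas_pb2 import EventRecord  # generated by protoc from the proto schema

s3 = boto3.client("s3")
matches = []

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="kafka-dump", Prefix="events/"):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket="kafka-dump", Key=obj["Key"])["Body"].read()
        record = EventRecord()
        record.ParseFromString(body)   # deserialize with the proto schema
        if record.user_id == 42:       # "find the ones that have these fields"
            matches.append(record)
```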
Oh, that's super cool.
And you mentioned like the query optimizer, right?
Like taking into consideration
like the semantics of the types that you are working with.
Tell us a little bit more about that.
Like how do you build like a query optimizer
with information related to an image, right?
Like how does this work?
Yeah.
So what we found is that
there's a lot of simple operations
you do in a query optimizer
that can give you like 90% of the speedups.
So things like, what we found is
if you're processing a data frame,
like most data scientists do,
you add every column you potentially could use
and you kind of just stack on top of it.
And so one of the simple things that we can do, for example, is say, okay, the columns we don't
need, let's not actually process them. We can do things like, oh, I actually only need this many
rows of data at the end. Let's only read those many rows of data from the very beginning.
So doing a lot of these operations can actually drastically speed things up. The thing where it gets interesting for complex data is we can actually factor out
a lot of the heavy computations
outside of the complex data.
So for example, if I have multiple tables
or data frames of things with images or audio
or something heavyweight,
and I want to do a join on something like a key
or some kind of timestamp, you
actually don't need to shuffle around the really heavyweight data.
You can actually figure out what is the data I actually need to keep or the data I want
to emit, and then compute that first and then send over the binary data.
So we do operations like these that are much more native for complex data, essentially.
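Here is a toy, single-machine illustration of that join trick with pandas (column names and data are made up): decide which rows survive the join using only the lightweight key columns, then attach the heavyweight payload to just those rows, rather than shuffling every image around.

```python
import pandas as pd

frames = pd.DataFrame(
    {
        "frame_id": [1, 2, 3, 4],
        "timestamp": [10.0, 10.1, 10.2, 10.3],
        "image": [b"<huge jpeg 1>", b"<huge jpeg 2>", b"<huge jpeg 3>", b"<huge jpeg 4>"],
    }
)
labels = pd.DataFrame({"frame_id": [2, 4], "label": ["pedestrian", "scooter"]})

# Step 1: join only the lightweight columns to learn which frame_ids survive.
surviving = frames[["frame_id", "timestamp"]].merge(labels, on="frame_id")

# Step 2: attach (or fetch) the heavyweight column for just those rows.
result = surviving.merge(frames[["frame_id", "image"]], on="frame_id")
```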
Yeah, that makes a lot of sense.
And you mentioned distributed processing at some point, and I think also you
mentioned Ray.
Tell us a little bit more about that.
Because from what I understand, we have, let's say, the data frame, which is the
API, how the user interacts with the data, but then somehow the actual
processing needs to happen, right?
And how does this work with Daft and Eventual?
And I don't know if there's any difference in there, but I'd love to hear about that.
Yeah, I would kind of break it down into multiple layers.
At the top layer, it's like what you mentioned: you have the user API, and this is where the user is telling Daft what they want to do, I want to select these
columns or run this model on this column. And essentially, that gets translated into what we
call a query plan or logical plan. And this is kind of like a compute graph, if you
will, of the operations that are going to happen. So then the next step there is we get
this plan, and we figure out what is the most optimal way to run this.
And finally, once we have that optimized plan, we can actually then schedule it onto a distributed cluster using Ray. So each of these steps for a given partition or a given subset of data
gets scheduled as an individual function on your cluster. So that's kind of like the three layers
of that. The part that gets really interesting, and what Eventual is kind of working on, is how you should be storing this data. So Daft makes it really easy to query data that's
just sitting around in S3, for example, but how should you store it to make it easily accessible,
have schemas and kind of all the benefits of a data warehouse? And that's kind of what
Eventual is building on the side. So Daft is an open source tool. It's really powerful, but
the stuff about how do you actually
catalog that data and store it is
kind of the main product of Eventual.
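A minimal sketch, not Daft's actual code, of that bottom layer: once the plan is optimized, each physical operation on each partition becomes a Ray task that Ray schedules across the cluster.

```python
import ray

ray.init()

@ray.remote
def process_partition(partition):
    # Stand-in for one physical operation (e.g. "decode images, run the model")
    # applied to one subset of the data.
    return [row * 2 for row in partition]

partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# One task per partition; Ray schedules them across the cluster.
futures = [process_partition.remote(p) for p in partitions]
results = ray.get(futures)  # [[2, 4, 6], [8, 10, 12], [14, 16, 18]]
```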
Yeah, that makes a lot of sense. And why Ray?
Why did you choose Ray for
why not Spark, for example?
I don't know, like something else.
Why Ray? So the main
reason, I mean, Ray is
pretty low level, which lets us kind of have a lot
of control of what we're doing. But the second part
is we're very
opinionated about not doing anything Java
related in our ecosystem with Python.
Like I can't tell you
how many weeks of my life I've probably wasted
when you get some random error
in Python and have to scroll through thousands of lines
of Java logs.
In a Spark cluster.
When you read the Spark logs in, I don't know,
CloudWatch or something,
you go through thousands of lines of Java and you figure out,
oh, I forgot a comma.
It's just not a fun experience.
We wanted just something that's very native,
very simple to use,
and something where if a user makes a mistake,
which they probably will,
that's really easy to debug.
Yeah, 100%.
I think like big part of the pain that people have with Spark
is actually like how to debug this thing.
I have heard, like, many horror stories around that.
And like what it means like to deal with all the stack traces
coming from the JVM.
Yeah.
I've had friends at FAANG companies that, when we were starting Eventual,
were like, hey,
if you could just find a better way to present Spark logs,
I would pay you a lot of money.
Yeah, yeah. I have heard from people, I think the worst thing that I have heard
is from people saying to me,
we just cannot find the logs.
Especially when you have
running Spark on EMR
and that kind of case in production,
it can get
extremely painful
to do the actual debugging.
And yeah,
it's one thing if you try to do that and you are like a Java developer.
It's another thing when you're primarily, you know, you're a data engineer and you're
writing your code like in Python, right?
And then you have like to go and figure out what the JVM is doing there.
Like it's, yeah, it's bad.
I can't get that.
Yeah.
And I think the other painful part is like,
kind of like the not like hair on fire,
but the other thing, which is like,
why is my program slow?
Why is it not running as fast as it could be?
And then just profiling and knowing
what's actually happening under the hood
is very hard with these JVM tools
and interopting with Python.
Yeah.
Yeah, 100%.
And you also mentioned Rust at some point.
So how does Rust fit into the equation here?
Because, okay, you were talking about Python
and being opinionated on that,
but now we also have Rust, right?
Yeah.
What's going on there?
So the thing is, when you're dealing with
these large amounts of data,
unfortunately, Python's not fast enough.
So under the hood, kind of like the user API is all in Python.
You can run Python functions, Lambda functions, Python objects.
But the stuff that's actually doing the processing under the hood is Rust.
So we've crossed the boundary from Python into Rust to actually do all the hard computing.
So at the top level, the user API is in Python, and the planning happens in Python,
but the functions that get called to actually do the number crunching are in Rust. Funny enough,
we started with C++ because that's my background, but there's actually not really that many C++
programmers anymore. So we're like, hey, let's make the investment. Let's move to Rust. And it's
actually been really amazing.
I'm like really happy we made that move.
So our whole core engine is written in Rust
and it makes things very performant
and actually really easy for contributors to jump in
and get their hands dirty.
So, by the way,
you know, that's
a little bit of a different question
from what we've been discussing so far.
But like as a C++ developer going to Rust,
like what was your experience?
And I know it's like a bit of a controversial question that I'm
asking right now because there's a lot
of
language wars happening
out there about C++
is dying, Rust is all the rage, no, Rust
is bad, go and use something else
or whatever, but
how was your experience?
Let's see.
So I would say, I mean, I love C++.
I've been doing it for over a decade.
But the thing that I really like about Rust over C++ is that C++ essentially is optimizing for backwards compatibility.
What that does mean is there are things that don't get improved over time.
The thing with Rust is when you start building it,
if you're just a noob, everything kind of comes with sensible defaults.
I think it's very underrated.
In C++, you can set things up such that it's optimized so it won't copy.
It won't do this. It won't do that.
But you have to set it up and you need to have someone experienced on your team
to kind of lay that groundwork.
But Rust, it comes out of the box.
And the things like the build system
and dependency management,
all of that come out
pretty good out of the box.
I think just coming out of the box strong
is so underrated.
Other than that,
I feel like you could technically
do everything in Rust.
It's just a lot more groundwork.
100%. I think it's
also, like, the difference in modern developer experience
between the
two ecosystems.
How was the experience of
working, like,
interoperating between Python and Rust?
I know that they work pretty
well together, but how was
your experience with that?
Oh, it was just night and day compared
to C++. Like with C++, there are some tools around it, but they're not great. So usually you
end up writing stuff in C++ and then use something like pybind, or you use something like Cython to
kind of bridge the two. And writing Cython code just sucks. It's not Python, which is not fun. It's
not C++, which is what you're used to.
It becomes this weird in-between.
But with Rust, they have this project called PyO3.
Yep.
And it's been amazing.
You kind of just write your Rust code,
declare what you want it to be,
and it just magically works in Python.
Nice, nice.
I think we should have another episode at some point,
going deeper into that stuff, because it's very fascinating.
There are some very interesting lessons in terms of
how to build a good developer experience
from these really complicated systems.
Building a compiler and all the ecosystem around the compiler
because it's not just the compiler itself.
It's huge.
So I think we should do that at some
point. What we're seeing
in the data ecosystem, it's funny, is
the whole Python data ecosystem is kind of
migrating to Rust. It's kind of cool.
If you look, Polars is written in Rust as well.
I believe you guys had
ByteWax on the show as well.
They're the same thing.
100%. Stuff like with
Materialize,
all these things,
like timely dataflow.
Yeah, there's a lot of work
getting done right now
when it comes to data
using Rust.
This is very fascinating.
And you mentioned
a couple of different projects there.
And I want to ask you,
there are many things happening, right? A lot of innovation.
We see Polars, for example. There's even stuff like Ibis, I don't know how it's called,
how they pronounce it. Whatever. There are many projects out there that start from
the data frame concept or the Pandas concept, and they try to build on top of that. As a person who's in a way doing something similar,
tell us a little bit more about how you feel about what's going on right now in the industry,
things that get you excited, what you are paying attention to, and what you would recommend we also pay attention to.
Yeah, I mean, Pandas is sticky.
Pandas is very sticky.
And, you know, I had a hard time understanding why for a while.
And then I went to PyData,
which is a conference on a lot of the numerical tools within
Python.
I went to a talk that was
teaching Pandas
developers or Pandas users
how to use Python.
It occurred to me that there's an entire
population of people
who know Pandas but not
Python, which had never occurred to me.
Yeah.
Wow. That's... wow. Okay.
But they were teaching like, oh, this is a for loop. This is how you make a function. Like they were teaching like these operations where I was like, wow,
like, I never realized that there are people who know a framework within a language, but not the language.
And so I think when you build tools that kind of cater to that crowd, you kind of unlock
a lot of the data scientists and users who are used to these types of tools.
That's why I think IBIS is really cool because it gives you the API of something like Pandas,
but then you can target a backend like BigQuery or whatever else where you don't need to change
that much code.
Yep.
Yep.
A hundred percent. That's a very interesting project.
They have a crazy amount of support for different backends,
which is awesome.
You can use from DuckDB to, I don't know,
like Trino and Snowflake or whatever,
and still use the same code.
It's very interesting.
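For context, a small sketch of the swap-the-backend idea being described here, using Ibis with DuckDB (table and column names are made up, and exact connect/read signatures vary across Ibis versions and backends).

```python
import ibis

con = ibis.duckdb.connect()  # could just as well be BigQuery, Trino, Snowflake, ...
events = con.read_parquet("events.parquet", table_name="events")

# The same dataframe-style expression works regardless of the backend.
expr = (
    events.filter(events.country == "US")
    .group_by("device")
    .aggregate(n=events.user_id.count())
)

print(ibis.to_sql(expr))  # compiles to the backend's SQL dialect
print(expr.execute())     # ...and only runs when asked
```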
Sorry, I interrupted you.
No, that's cool.
Yeah.
I mean, these things... I think the data frame concept is here to stay.
I just think what the future data frame looks like is still unclear.
What should we be looking at for getting like a glimpse of the future
around that stuff, like who are the teams, outside of Eventual,
obviously, doing interesting stuff in this space?
I think there's a lot of cool concepts
that we should pay attention to
and I think are important for the future.
I think one of the things that DuckDB is doing,
which is fantastic, is one is I can just query data
without worrying about the format or where it is
or anything like that. So I think the concept of like, the format doesn't really matter anymore.
I don't have to think about like, you know, Spark, you have to be very particular what exact version
of Parquet you're using and whatnot. Like that concept should just go away, right? The next part
is being federated. I don't have to ingest my data to query it. I should just be able to like,
give it an S3 path or a Google storage path and it should just work.
I think those are concepts that are a must-have in whatever new tool comes out.
The thing that I think DuckDB is not, right, is distributed. I think that's really important
because not everything for the future is going to be
on one machine. I think for a lot of
tabular data that might be the case for some companies, but there are cases where you need to go
distributed or handle fault tolerance. I think that's one of the things that Daft is focusing on.
And then finally, I think one of the things I'm really passionate about is
making sure these systems can integrate well with enterprise tooling. We've used JDBC for a long time,
and I hate it, and I think most people do too,
but a lot of these new open formats,
like the ones that Arrow and Voltron Data are building,
are really cool.
I'm really looking forward to that as well.
Yeah, Voltron Data and the Arrow ecosystem,
I think it does some very interesting things
and has a very interesting amplification effect
to this industry.
It's very interesting to see
what they are building and what the effect
that these things have.
They have a very strong relationship to
partners also, right?
Our core is built on
Arrow as well, for our deserialization.
I think it's an important
tool and it kind of just makes you interop
really easily.
Yeah, I mean
I think interoperability
is always a big issue
in the data infrastructure
space, and
it seems that Arrow is managing
to change that. Obviously
things do not change from one day to the other, but
it's amazing to see how fast, for example,
systems
like BigQuery and Snowflake speak
Arrow right now, right?
And that says a lot about
how
important and how powerful the concept
is. All right.
I want to give some time also to
Eric, because I'm monopolizing
the conversation here.
We definitely need to get you back.
We have a lot to talk about.
But before I give the microphone back to Eric,
something that you want to share with us about Eventual and Daft
that is exciting and is coming up soon?
Yeah, so Daft is doing its 0.1 release. We're
fully going into beta. We have a lot of
really cool features, including our entire new
core built in Rust. We're supporting
these different types of what we call Daft types,
like supporting images, videos,
and these other data types very
naturally. We're planning to do a launch
at the end of the month, and we implore you
to check it out. Our website is
getdaft.io. Check it out and
star the GitHub. Awesome. Eric. Yes. I think we have time for one more question, or maybe I should say one more topic, because I rarely stick to one more question. I wanted to
zoom out a little bit and talk about the different ways
you see currently, and then envision seeing, Eventual and sort of, you know,
Daft and the related technologies being used. And so if you think about the sort of obvious
ones, even from our conversation, right? You know, processing imagery in the context
of a self-driving car, right? Or algorithms that run, you know, to provide certain recommendations
and, you know, in an app like Instagram, right? Which is very image heavy, right?
But what are some of the other, like, interesting ways that you see this being used on complex data?
I mean, you mentioned you can
operate on audio files and other stuff like that. But there are sort of some of the obvious ones
that make sense to anyone who's listening. But what are some of the more non-obvious ones
that you think will be really influential? Yeah, it's a good question. So right now for us,
we're focusing on the most underserved market, which is people dealing with complex data,
you know, these things like images, videos, and what you mentioned.
But in the spectrum of complex data, there are things in between that are still unserved.
Like the example of, I think the big one I think of is recommendation systems.
So if you have something like Facebook and they're trying to recommend you,
oh, what post they should show you, some of the data that they process for that is a user, and one of the columns might be a
list of interactions that they might have had, or a list of, like, actions of what they've
done.
It's something that's kind of complex, but not like super complex.
But right now, if we try to operate that in existing systems, that would run very, very
slowly.
Yep.
So even these things of like nested data
is actually very slow in existing systems.
And that's something that we're planning to tackle next.
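A tiny illustration of the "nested but not super complex" shape being described, with made-up data in pandas: one row per user, with a column that is itself a list of interactions.

```python
import pandas as pd

users = pd.DataFrame(
    {
        "user_id": [1, 2],
        "country": ["US", "GR"],
        "interactions": [
            [{"post": "a", "action": "like"}, {"post": "b", "action": "share"}],
            [{"post": "c", "action": "like"}],
        ],
    }
)

# Flattening it out (one row per interaction) is conceptually simple...
flat = users.explode("interactions").reset_index(drop=True)
flat = pd.concat(
    [flat.drop(columns="interactions"), pd.json_normalize(flat["interactions"].tolist())],
    axis=1,
)

# ...but doing it efficiently at billions of rows is where purpose-built engines come in.
```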
Yeah.
Fascinating.
All right.
Well, congratulations on Eventual and Daft.
Super exciting.
We'll have you back on
so we can continue to dig into all of our juicy questions.
Thank you so much, Sammy.
Thank you.
Thanks for having me.
Kostas, talking with Sammy, what an incredible story, right? If you are in a Tesla and you're
driving down the highway and you let go of the wheel and the car is safely carrying you at 70
miles per hour without you giving the vehicle any input, Sammy is a huge part of why that's possible because of what he's
worked on. And I can't tell you how much I love that he drives an old BMW from the 90s that's a
stick shift, and he works on it himself as sort of a car guy, which is so great. I mean, that sort of, you know, that sort of
wonderful, wonderful story doesn't come along every day. So that was possibly my favorite part
of the episode. But I also just really appreciated, really appreciated his thoughtful perspective
on just the problem of dealing with complex data in general. And I was astounded by when, you know,
towards the end of the episode,
when we talked about like, okay,
you have like imagery, you know,
for Instagram and self-driving cars.
And obviously that's a huge problem space
for complex data.
You asked him what other things,
and he said, I mean, actually just hierarchical data,
you know, nested data is unbelievably slow when you
try to work with it at scale.
And so it's like, oh yeah, this is still really early innings for sort of solving
problems around complex data.
So excited to see what Eventual grows into.
Yeah, 100%. For me, okay, it was, like, an amazing opportunity talking with him because, first of all, we talked a lot about things where, at the end, the vision is to build
a new type of data warehouse that can be used by ML people that are working with non-tabular data.
But it's interesting to see like how many times, no matter like what we were talking about,
we ended up talking about the developer experience and how important this is and how also
like, silent this developer experience can be. Like, I think what he said
about the Pandas ecosystem was incredible.
Yeah.
I would never expect to hear something like that.
So tooling is important.
It's like the foundations that we need if we want to accelerate progress, right?
And that's why, like, I really enjoyed talking with Sammy, because he gave, like, some very
deep insights on why tooling and developer experience is important and how this can
be addressed and how Eventual is doing it, right, for the use case that they have, like
in their minds and the problems that they are trying to solve.
So let's have him back to the show again soon.
I'm sure we have more to talk about with him.
Indeed.
Well, subscribe if you haven't.
Tell a friend and we will catch you on the next one.
We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe on your favorite podcast app
to get notified about new episodes every week. We'd also love your feedback. You can email me,
Eric Dodds, at eric@datastackshow.com. That's E-R-I-C at datastackshow.com.
The show is brought to you by Rudderstack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.