The Data Stack Show - 161: The Intersection of Generative AI and Data Infrastructure with Chang She of LanceDB
Episode Date: October 25, 2023

Highlights from this week's conversation include:
- Chang's background and journey with Pandas (6:26)
- The persisting challenges in data collection and preparation (10:37)
- The resistance to change in using Python for data workflows (13:05)
- AI hype and its impact (14:09)
- The success and evolution of Pandas as a data framework (20:04)
- The vision for a next-generation data infrastructure (26:48)
- LanceDB's file and table format (34:35)
- Trade-offs in the Lance format (42:45)
- Introducing the vector database (46:30)
- The split between production and serving databases (51:14)
- The importance of unstructured data and multimodal use cases (57:01)
- The potential of generative AI and the balance between value and hype (1:01:34)
- Changing expectations of interacting with information systems (1:13:53)
- Final thoughts and takeaways (1:15:32)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome back to the Data Stack Show.
Kostas, today we're talking with really a legend, which I think is probably an appropriate term.
Chang She is one of the original co-authors of the Pandas library. So we're going back to a time before the modern cloud data warehouse,
when that work started. Absolutely fascinating story. And now he's working on some pretty
incredible tooling around unstructured data. And another fascinating story there,
and actually a lot in between. And this isn't going to surprise you, but I actually want to ask about the Pandas story.
I do want to talk about LanceDB, which is what he's working on.
But the Pandas library came out of the financial sector, which is really interesting.
And in a time when, you know, the technology they were using, we would consider legacy.
And now it's the lingua franca for people worldwide who are doing data science workflows.
And so the chance to ask him that story, I think is going to be really exciting.
But yeah, you probably have tons of questions about that, but also LanceDB.
Yeah, a hundred percent.
I mean, first of all, we're talking about a person who has been building foundational technology for data for many years now. So we definitely have to have a conversation with him about Pandas and the experience there, because I think, you know, history tends to repeat itself, right? So I'm sure there are many lessons to learn from what it meant back then to bring Pandas to the market and to the community out there. And these lessons are definitely applicable also today with new technologies, which I think is even more important now, because we're living in this moment in time where AI and LLMs and all these new technologies around data are coming out, but we're still trying to figure out what's the best way to work with them. So that's definitely something that we should start the conversation with.
And obviously, talk about LanceDB and see what made him get into building a new paradigm
in storing data, a table format, and what can happen on top of that, and what it means
to build a data lake that is, let's say, AI native.
What it means to build data infrastructure that can support the new use cases and the new technologies around like AI and ML.
So I think it's going to be a fascinating conversation.
And he's also like an amazing person himself, like very humble and very fun to talk with.
And there's going to be like a lot to learn from him.
So let's go and do it.
I agree.
Let's do it.
Chang, welcome to the Data Stack Show.
It's really an honor to have you on.
Thank you, Eric.
I'm excited to be here.
All right.
Well, we want to dig into all things LanceDB, but of course, we have to go back in history
first.
So you started your career as a quant in the finance industry.
So give us the overview and the narrative arc, if you will, of what led you to founding
Lance.
Yeah, absolutely.
So quite a journey.
So my name is Chang.
I'm CEO and co-founder of LanceDB.
I've been building data and machine learning tooling for almost two decades at this point.
As you mentioned, I started out my career as a financial quant.
And then I got involved in Python open source.
And I was one of the original co-authors of the Pandas library.
And after that became popular,
started a company for Cloud BI,
got acquired by Cloudera.
And then I was VP of Engineering at TubiTV,
where I built a lot of the recommendation systems,
ML op systems, and experimentation systems.
And throughout that whole experience,
I felt that tooling for tabular data was getting better and better. But when I was looking at unstructured data tooling, it was sort of a mess.
And at TubiTV, it was a streaming company, so we dealt a lot with images, videos, and other unstructured assets. And any project involved unstructured data
always took three to five times as long. My co-founder at the time was working at Cruise,
so he saw similar problems, but even at an even bigger scale. So we got together and sort of tried
to figure out what the problem was. And our conclusion was, it was because the data
engineering data infrastructure for AI was not built on solid ground. Everything was optimized
for tabular data and systems, you know, a decade old. And so once you build on top of this shaky foundation, things start to fall apart a little bit, right? It's like trying to build
a skyscraper
on top of a foundation for like a three-story condo. Yep. Makes total sense. Before we dig
into Lance, can we hear a little bit of the backstory about pandas? I mean, I think it's
really interesting to me for a number of reasons. I think, you know, when you think about open
source technologies, a lot of times you think of them sort of trickling down from, you know, the big companies, which in some ways you experienced, right, where there are these huge issues. But something that has become as popular as Pandas arising out of the financial industry is just interesting. So can you give us a little bit of the backstory there? Yeah. So we'd have to really go back in time to when I first started
working as a quant. So this was 2006. And at that time, data scientist wasn't really a job title.
When I graduated, I knew I loved working with data. And at that time, if you like
working with data, you went into quant finance. As a junior analyst, I spent a lot of time on
data engineering and data preparation, right? Loading data from the various data vendors that
we have, producing reports and data checks, validation, integrating that into our main sort of feature store,
which was just a Microsoft SQL server
at the time.
I was going to ask,
you always say feature store,
but it was probably...
Yeah, feature store also wasn't a word
at that time.
But there was a lot of like,
the scripts were written in Java
and the reports were produced in VBScript. And there was a lot of
Excel reports flying around. There was no data versioning. There was barely code versioning.
And everything was just a huge mess. And fast forward a couple of years,
one day, my colleague and roommate at this time, Wes McKinney, came up to me and said, hey, look at this thing I've been working on in Python.
And it was a sort of a closed source, you know, proprietary library for data preparation that he built in his group.
We were working at the same fund.
I sort of immediately fell in love with it.
And I was like, oh, this is the best thing ever.
And I started using it in my group and trying to push for using that, and also pushing for Python over Java and VBScript as the predominant data preparation tools and things like that.
And so, you know, initially there was definitely a lot of pushback on, oh, but Python
is not compiled, therefore it's not safe. Or like, you know, why do we want to use this when we
already have a bunch of code written? So it took us a while to sort of get buy-in. And then it also took a while then to get the company to agree
to actually open source the thing. And this was in an era sort of a little bit after the financial
crisis. At that time, Wall Street and hedge funds in general were extremely anti-open source.
Everything was considered sort of secret sauce. And there was a lot of unwillingness to open that. And it took maybe six months of work
from Wes to actually make that happen. And sort of the final trigger was essentially
him quitting to start a PhD program before they sort of relented and say, okay, fine,
we'll make this open source. Wow. I mean, you know, working on pandas sort of in the wake of
the financial crisis, what a unique experience. One question that comes up as you tell that story
that's really interesting is that you're sort
of talking about a period of time where a lot of the tools that are just the lingua franca of
anyone working in data, right? Whether you're more in the data engineering end of the spectrum or
sort of MLAI end of the spectrum that are really cloud warehouses and data lakes and
Python-based workflows, et cetera. But it was really interesting. One thing you said was,
I spent a lot of time on data collection and data preparation. You actually hear the same
phrase today, even though from a tooling standpoint, it's a wildly different landscape and far more
advanced than it was back then. Why do you think that is? Because people are saying the same thing
well over a decade later. Yeah. I think the problems are different today. And maybe this
is something that I think Kostas has lots of thoughts on here as well, given his experience. But I think in my day as a junior analyst, the biggest problems were things like data being delivered into an FTP, and sometimes it just didn't arrive on time. And most of these processes were very manual. And I think at that time, dataset sizes were a lot smaller. Today the problem might be a lot more downstream, and has to do a lot more with scale, rather than these sort of manual connections.
I do think that data sort of data accuracy and cleanliness is a problem that just hasn't been solved.
And I think a lot of it is just because the data that we work with
is generated by real world processes. And by definition, they're just super dirty.
And I think probably a third big factor is, you know, in finance, there was always a very big focus on data quality and data cleanliness. I remember going through the data with a fine-tooth comb to figure out, okay, did we forget to record a stock split, merger, or acquisition? Or does this share price look wrong because there was some data error? Because the data being wrong has an outsized impact in those use cases. But we could only handle it at small scale at that point. And now I think with internet data, if your log or event data is wrong a couple of times out of a billion, it's not going to affect your processes or your BI dashboards all that much. So I think the problems are different, but that commonality of data being generated by real-world processes is still the same. And so I think that, at the core, is why we still hear those same complaints over and over. Yeah. Fascinating. Okay. One more question for me before we dive into Lance.
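That fine-tooth-comb style of check can be sketched in a few lines. This is a hypothetical screen, with invented prices and an arbitrary 40% threshold, for flagging day-over-day moves that might be an unrecorded split, merger, or plain data error:

```python
import pandas as pd

# Hypothetical daily closing prices with an unrecorded 2-for-1 split:
# the raw price halves overnight even though nothing failed upstream.
prices = pd.Series(
    [100.0, 101.0, 50.6, 51.0],
    index=pd.to_datetime(["2007-03-01", "2007-03-02", "2007-03-05", "2007-03-06"]),
)

# Flag day-over-day moves larger than 40% as candidates for a missed
# split, merger, or data error, for a human to review.
returns = prices.pct_change().abs()
suspects = returns[returns > 0.40]
print(list(suspects.index.date))  # [datetime.date(2007, 3, 5)]
```

The threshold and the manual review step are the point: at small scale, every flagged row could be eyeballed, which is exactly what stops working once data volumes grow.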
So you talked about this paradigm of trying to essentially sell this idea of using Python
and there being resistance to that, which, you know, looking back, it's like,
whoa, that sounds crazy, you know, because going from Java to Python for these
kind of workflows, it makes sense. But of course, when you're in that situation,
and there are sort of established incumbent processes, frameworks, et cetera, people can be resistant to change.
Do you see a similar paradigm happening today, especially with the recent hype around AI,
where there's a smaller group advocating for a new way to do things and there's resistance
against that? Is there a modern day analog that you see in the industry?
That's a good question.
I mean, it's certainly hard to say, because I live in San Francisco now, right? So I think my bubble is basically all the different small groups of people being very crazy about trying new things that seem crazy. So in my immediate circles, it's actually hard to say. All I hear is, oh, have you tried this new thing that came out two days ago and already has a hundred thousand stars on GitHub? So it's actually hard for me to say, but I do think that there's sort of a very big impulse function that makes its way out. So, you know, in the general San Francisco, Silicon Valley tech bubble, it's very much like, oh, ChatGPT is so over, now it's whatever the latest open source model is. Whereas if you actually go out and talk to, you know, normal people in normal places, they're like, oh yeah, I've heard about this vaguely, but I don't really know what it is, because it doesn't have an impact on my daily life yet. Yeah, super interesting. All right, well, thank you for entertaining my questions about pandas. Tell us about LanceDB. What is it and what problem does it solve?
Yeah, actually, before we dive into that: Kostas, given your experience, I'd love to hear your take on some of that too.
I feel like you must have
very interesting stories there too.
Yeah, I mean, specifically for like the AI craziness
or like... Well, more about, you know, how have the problems in data engineering evolved, right?
When you first started out your career versus now, what are the things that you think people don't understand in data that they should?
Yeah.
I mean, I think the best way to understand the evolution of data infrastructure, or, I don't know, data technology in general, is to observe the changes in the people involved in the lifecycle of data, right? I'm sure you've seen that stuff too, because you were working on these things back then. But we're talking about close to 2008, 2010: there was a wealth of systems coming out, especially from big tech in the Bay Area, from Twitter, from LinkedIn. Some of them became very successful systems, like Kafka, for example, right?
But these systems were coming out from people, let's say a type of engineer, whose mindset was primarily "I build systems." Someone came to them and said, oh, we have this huge problem at scale right now and we don't know how to deal with it; go and figure it out. So the conversation would be more around the primitives of distributed systems and all these things, right? And these people used that language, actually, because they're systems engineers.
But if we fast forward to today and take a typical data engineer, they have nothing to do with that stuff. They are people coming more from the data domain, let's say, than from the systems engineering domain, right?
And that's inevitable to happen because as something becomes more and more common, we
need more and more people to go and work with that stuff.
So we can't assume that everyone will become a systems engineer at Lyft, Uber, Meta,
whatever, to go and solve problems out there, right?
And if we look at the people and the titles, they're all data scientists, data engineers, ML engineers, right? And if we track the evolution there,
I think this can help us a lot to understand both why some technologies are needed and why Python is important, right? Because, yeah, of course, these people are focusing more on the data, not on the infrastructure. Yes, writing in a dynamic language like Python, you might end up breaking a service at runtime, right? But when you work with data, that's not the case, because you're primarily experimenting. You're not putting something in production that's going to be a web server, right?
So it's all these things that they change the way that we interact with technology.
And let's say the weight of what is important, I think, changes.
And the developer experience has to change.
And that's what I think is the best indicator at the end of where things are today and where they should go.
And that's a question that I want to ask you actually about pandas.
Why, in your opinion, was pandas so successful in the end? Because you had ways to deal with that stuff before, right? It's not like a new problem appeared out of the blue. But what made Pandas the de facto framework, let's say, for anyone who is more on the data side of things, like data scientists, for example? Based on your experience, what have you seen out there?
Yeah, absolutely.
So I think a lot of this was just making it really easy to deal with real world data.
So when we first started out, it was very clear to us that Pandas had a lot of value
because we were using it on a daily basis for business-critical processes and systems.
But for a stranger, in the beginning, it was actually kind of hard for them to understand.
Because at the time, there were sort of a couple of different competing projects. And, well, now Pandas 2.0 plus is also Arrow-based, but in the very beginning, mechanically, pandas was just a thin wrapper around NumPy. And so a lot of the data veterans at the time really dismissed pandas as, oh, this is just a wrapper of convenience functions around NumPy, and I'll just use NumPy because I'm smarter than the average data person, and I'll just code up all this stuff myself. But I think what made Pandas successful, you know, most
of this credit obviously goes to Wes, was
we focused on sort of one vertical, one set
of problems at a time.
And we just made it, you know, 10 times easier to deal with data within that domain.
And so for people in that domain, it was very clear, oh, if I use pandas, you know, it's,
you know, I save like a ton of time than using the alternatives.
And then over time, we got a lot of pull requests and feature requests from the open source
community in adjacent domains.
And we sort of slowly expanded over time like that. And then finally the advent of data science, and the explosion of its popularity, made pandas into the popular library that it is today.
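To make that "ten times easier within one vertical" point concrete, here is a small, hypothetical sketch of the kind of messy real-world series early pandas users dealt with (column names and values are invented): parse dates, coerce types, fill a gap, and compute returns in a handful of lines:

```python
import pandas as pd

# Hypothetical daily price series with a gap and string-typed values,
# the kind of messy real-world data pandas made easy to clean up.
raw = pd.DataFrame({
    "date": ["2010-01-04", "2010-01-05", "2010-01-06", "2010-01-07"],
    "price": ["100.5", None, "101.25", "102.0"],
})
raw["date"] = pd.to_datetime(raw["date"])
raw["price"] = pd.to_numeric(raw["price"])

# Forward-fill the missing observation and compute daily returns.
clean = raw.set_index("date").ffill()
clean["return"] = clean["price"].pct_change()

print(len(clean))                   # 4
print(int(clean["price"].isna().sum()))  # 0
```

Each of these steps (type coercion, alignment on a datetime index, missing-data handling) would have been hand-rolled loops in the Java and VBScript world described earlier.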
Yeah, I think you put it very well.
I think you used some very strong terms there that I think describe the situation.
Not just with pandas, but with every technology out there. You talked about veterans. Yeah, you're always going to have people who know how to, I don't know, go down to the kernel and do something crazy there, right? But how many people have the time to get to this level of expertise? So I think when you get to the critical mass,
where a problem is big enough for the economy out there
that needs mass adoption,
then there are things that become more important
than, let's say, the underlying efficiency.
And that's efficiency for access to the technology.
And that's what I think pandas also did.
It's not like it's this beautiful, scientifically perfect way of solving problems, but it is pragmatic, and it is something that people understand as a tool that helps them be more productive, right? It's a little bit more of a product approach to technology, I would say. But in the end, that is important. And we see that many times, with stuff like Snowflake, for example, right? It's not like we didn't have data warehouses before, but suddenly data warehouses became much more accessible to business users, because they paid more attention to things like, okay, these people don't know what vacuuming is. Why do they have to vacuum their tables, right? Why would they learn that? Even engineers hate doing vacuuming on a table, you know? So I think there's a lot of value in that.
And it's like things get really exciting
because that's the point where a technology is ready for mass adoption.
That's my opinion.
And I think we are on another critical point when it comes to data.
And I think a lot of that stuff is going to be accelerated
because of AI and ML.
Because data will have much more impact in many more areas of everyday life. So more data needs to be processed, more data needs to be prepared and stored and collected and labeled, and all that stuff.
So the question today is, yeah, how do we build the next generation of infrastructure for data that is going to get us to 2030 and beyond, the way that, let's say, the systems like Spark that were built in 2008, 2010, 2012 brought us to where we are today.
And I think LanceDB is one of these solutions out there.
So tell us a little bit more about that: how Lance is changing things, and what gaps it's filling that the previous generation of data infra had.
Yeah, absolutely.
I think when we looked at the problems
dealing with unstructured data,
I think what we see is that unstructured data, like images and text and all that, is data that's really hard to ETL, data that's very hard for sort of tabular data systems to deal with. So you get really bad performance. What you end up having to do is kind of make
multiple copies of the data. One copy might be in a format
that's good for analytics, and the one copy that's good for training, and another copy that's good
for debugging in a different format. And then you end up having to have different compute systems on
top of these different formats. And then you have to then sort of create this Potemkin workflow on top of tooling that makes
it a lot harder right you can sort of hide it for a time all this mess under the hood but
it's a very leaky abstraction and over time it just sort of comes to the fore and so for us
you know our goal is to essentially fix that with the Riot Foundation.
And I think if you look at the history of data, every new generation of technology has
come with data infrastructure that's optimized for it.
So you start with something like Oracle for when database systems were first coming to
the fore and becoming popular.
And then when the internet became popular, we got a lot of JSON data to deal with.
And, you know, that's why NoSQL systems,
particularly Mongo, became super popular.
Then it was, you know, Spark, Hadoop, and then Snowflake.
And I think if you look out, you know,
five, 10 years down the road,
you know, AI is this next generation,
big generation of technology. And I think there
needs to be new data infrastructure that's optimized for AI. So that's sort of the core mission for LanceDB. We're trying to make a next-generation lakehouse for AI data. So the idea is that if you're managing large-scale unstructured datasets with LanceDB, you'll be able to analyze, train, evaluate, and serve just several times faster, with 10 times less development effort, at a fraction of the cost, right? The first product that we're putting out there is LanceDB, the vector database, but we'll have a lot more exciting things to follow as well.
Okay, that's awesome.
Okay, let's do the following.
You mentioned some very interesting stuff here,
like how you deal with unstructured data, for example, right?
So, especially for our audience out there, many of whom might never have had to deal with this type of data at scale.
Let's do the following.
Let's describe a pipeline, a typical pipeline of dealing with this data without LanceDB.
How someone would do it in a data lake or lake house today.
And then follow up with what LanceDB is adding to that.
So a data engineer who never had to deal with that until now can get a glimpse of what it means to work with this type of data.
Yeah.
So with traditional data, a lot of the data generation processes are generating things in JSON or CSV, or a lot of systems are just going straight to Parquet, and your life is kind of a lot easier then.
But with AI, a lot of times you're getting hardware data.
So you might be getting, let's say, a bunch of protobuf data that's coming off of sensors, with a time series to go with that.
You've got a bunch of images that correspond in time with some of those
observations.
And then you might have some text, right, that's produced by a user, or that is associated with certain products or something like that. So off the bat, you've got this plethora of data
in different formats that maybe either comes in from your API
or off of a Kafka stream or something like that.
Maybe the first stage is that gets dumped into some location in S3,
and then you would end up having to write some code to stitch that together, and maybe some of the metadata gets stored as a Parquet file, if you're lucky, or in some table.
And then your images are elsewhere, right? And then you have to maybe, if your data engineer is good,
they know to convert the protobuf into some sane format for analytics, right?
And then you have some JSON metadata
for like debugging and quick access, right?
So right off the bat, so you have these three pieces of data
that you have to coordinate all across your pipeline
and it doesn't really change.
And then when you get to training,
a lot of people are using, let's say, TFRecord. So you have to convert the data from these raw images on S3 into TFRecords or some tensor format. And then you go through your training pipeline, and once that comes out, then you need to do model eval.
And then, you know, TFRecords and other tensor formats are not that great for that. So you have to convert that back and then join it with your metadata, because you need to slice and dice your data to see model evaluation on different subsets of your dataset, things like that.
So that's what the pipeline looks like right now, even before it makes it into production.
So most of your effort is spent managing lots of different pieces of data, trying to match them on some key that may or may not be reliable, and switching between different formats as you go through these different stages, right? With Lance, the earlier you can switch into the Lance format, the easier it becomes, because you can store all of that data together. And whether you're doing scans or debugging
where you're pulling 10 observations out of a million
or something like that,
Lance still performs very, very well.
And so once you convert into Lance,
a lot of the pipeline down the road becomes simpler.
So you have one piece of data to deal with.
Lance is integrated with Apache Arrow.
So all of your familiar tooling
is already compatible with it.
And so you can sort of start to treat
that messy pile of data
as a much more organized table.
You know, I love math.
In math, it's like you always try to reduce a problem
to a previously known or solved state.
And so I think Lance is that, you know,
we want AI data to look and feel much more like tabular data.
And then everything is a lot easier.
You can apply a lot of the same tooling and principles.
Yeah, that makes total sense.
And actually, it's one of the things where I think vendors in the MLOps space, let's say, not failed, but maybe made some mistakes that ended up creating silos between the different parts of the data infrastructure. There was a lot of replication of data infrastructure there just to do the ML. And then, of course, data duplication is a very hard problem. I don't think people realize how hard it is to keep consistent copies of data. It might sound silly, especially to someone who uses the technology much more casually, but it is one of the biggest problems.
It's really hard to ensure that your data is always going to be consistent.
There are some very strong trade-offs there.
That's what we've learned from distributed systems, for example.
To me, and that's what I like about what you're saying, it makes sense to enrich or enhance the infrastructure that exists out there, and to reduce the new paradigm to something existing, rather than trying to create something completely separate and just ignore what was done so far. So personally, at least, I think it's a very good decision when it comes to Lance.
All right.
So tell us a little bit more technical stuff about Lance. It is a table format, I guess; we're talking about tables here. And it allows you to mix very heterogeneous types of data, so it's not just the tabular data that we've had until now. It is based on Parquet, right? Is Parquet used behind the scenes? Is this correct, or am I wrong here?
Oh, it's actually not Parquet-based.
So Lance is actually both a file format and a table format.
So the issue with Parquet is that the data layout is not optimized for unstructured data and not optimized for random access, which is important here. So we actually sort of wrote our own file format plus table format from scratch. And this is written in Rust.
I think maybe this goes back to Eric's question from earlier.
It's like Rust is one of those things that it might not be ubiquitous yet,
but it's definitely gaining popularity.
And I think Rust and Python play really well together.
And just that combination of both safety and performance, plus the ease of package management, is something that I think is very unique.
And it's pretty amazing as a developer.
It's also sort of very easy to pick up.
So we actually started out writing Lance in C++. And at the beginning of this year,
we made a decision for those same reasons to switch over to Rust. And we were Rust newbies.
And I think we were learning Rust as we were rewriting. And even then, I think it took us
about three weeks for me and Lei to rewrite roughly four, four and a half months of C++ code.
And I think more importantly, we just felt a lot more confident with every release to be able to say, this is not going to segfault if you just look at it wrong.
Yeah, yeah, yeah, 100%. I think you touched a very interesting point here, which, again, I think connects with the Pandas conversation that we had, and how these technologies become like the front-end, back-end kind of thing in application development: having kind of a similar paradigm when it comes to systems development, which is going to be extremely powerful. And we see that already, with so many good tools coming out in the Python ecosystem actually being developed in the back end, let's say, with Rust.
So whoever builds the libraries there for the bindings,
I think they've done an amazing job with Python.
But all right, that's awesome, actually,
because my next question would be about how do you deal with the columnar nature of Parquet?
And, yeah, like, okay, you've already answered that.
It's not like columnar format anymore.
But my question now is like, okay, let's say I have infrastructure already in place, right?
I have, like, my Parquet data lake.
Like Parquet is like the de facto like solution out there
when it comes to building data lakes.
Does it mean that if I want, like, to use Lance,
I have to go there and, like, convert everything into Lance?
And like, how is the migration or like the operability
between the different like storage formats working out there?
Yeah, this, so I mean, the short answer is yes,
you have to sort of migrate your data
if it's in Parquet or other formats.
This process, fortunately, is very easy.
It's literally two lines of code,
one to read existing formats into Arrow
and one to write it into Lance.
And I think this wider topic here is very exciting.
I think Wes actually just published a recent blog post on composable data systems.
And I think this is, this is the next big revolution in data systems.
I'm very excited about that.
So you can, you know, you previously, when you were building a database,
you had to literally build a whole database from like, the, you know, the parser to the planner,
you know, the execution engine, the storage, like indexing, you have to literally build everything.
Whereas now there's so many components out there that you can sort of innovate on one piece,
but create a whole system, but just using open source components that play well together.
This is what makes Apache Arrow such an important, and in my opinion, one of the most underrated projects in this whole ecosystem.
You don't see it.
You don't hear about it.
You're using the higher level tooling, but projects like Apache Arrow makes it just 10 times easier to build new
tools and for these different tools to work well with each other. Yeah, yeah, 100%. I totally agree on that. And we should, I don't know, at some point we should have an episode just talking about Arrow, to be honest. Because, as you say, the people who work more on the system side of data, they know about it obviously, but I think it is the unsung hero of what is happening right now, because it did create the substrate to go and build more modular systems over the data.
So we should do that at some point.
All right.
So let's go back to Lance.
So, the question is, okay, why did you have to build this new way of storing the data, right? You mentioned something already, like the point queries that, okay, columnar systems are not built for. But there's always this tension: there's, let's say, the bulk work that you need to do that makes columnar systems more performant, and also the point queries, the kind of queries that you need when you serve something, for example. How do you, let's say, balance these two with Lance? Yeah.
So Lance format actually is a columnar file format,
but the data is just laid out in a way that it supports both fast scans and fast point queries.
And so originally we designed it because of the pain points
that ML engineers voiced around dealing with image data.
So for debugging purposes or sampling purposes, you often wanted to get something like top 100 images that spread out across 1 million images or 100 million images.
And so with Parquet, you have to read out a lot more data
just to get one.
So your performance is very bad.
And so we sort of designed it for that.
And the sort of happy accident was
once we designed it that way,
we realized if you can support really fast random access,
so I think just purely on micro benchmarks
on like just taking a bunch of rows out of a big
data set, we beat Parquet by about a thousand times in terms of performance, right? If you're
talking about that order of magnitude of performance improvement, then it makes it a lot
easier and it makes a lot more sense to start building rich indices on top of the file format. And this is what led to LanceDB, the vector database. So now, on top of the format, we have a vector index
that can support, you know, vector search, full-text search, we can support SQL, and also
all the data is also stored together. And so this is something that I think like other vector databases can't do
is that the actual image storage
and other things have to go somewhere else.
And so now you go back to that complex state
of having to manage multiple systems.
And so for us, it was, I would say,
like a happy accident that came from
a good foundational design choice.
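To make the "vector index on top of the format" idea concrete, here is a deliberately naive sketch: a brute-force nearest-neighbor search in NumPy. This is not LanceDB's implementation (production vector databases use approximate indices such as IVF-PQ rather than a full scan); it only shows the query shape that such an index accelerates:

```python
import numpy as np

rng = np.random.default_rng(42)
vectors = rng.random((1000, 64), dtype=np.float32)  # 1000 stored embeddings
query = vectors[123] + 0.001                        # a query very close to row 123

# Brute force: L2 distance from the query to every stored vector,
# then take the five closest rows. An index avoids this full scan.
dists = np.linalg.norm(vectors - query, axis=1)
top5 = np.argsort(dists)[:5]
print(int(top5[0]))  # 123, the nearly identical vector wins
```

A vector index gives you roughly this result without touching all million rows, which is exactly where fast random access into the file pays off.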
Is there some kind of trade-off there?
I mean, what is the price that someone has to pay to have this kind of, let's say, performance
and flexibility at the same time?
Definitely.
So the trade-off here is that if you want to support fast random access, it's much harder to do data compression.
So you can't do file-level compression anymore.
You have to do sort of either within block
or just record-level compression.
So here, if you have pure tabular data,
then your file sizes in Lance will be bigger,
maybe like 50% bigger or 30% bigger than it would be in Parquet.
So that's the trade-off there.
Now for AI, let's say you're storing image blobs
in this dataset.
Now these image blobs are compressed
at the record level already.
So file level compression actually doesn't matter.
And the whole dataset size
is dominated by your image column, right?
So then for AI, actually,
this trade-off makes a lot of sense
because you're not really sacrificing that much.
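The trade-off Chang describes can be demonstrated with nothing but the standard library: compress a dataset once as a single blob (file-level) versus once per record (record-level), then fetch a single row. The numbers are purely illustrative and are not Lance's actual encoding:

```python
import zlib

# A toy "column" of 1000 records.
records = [f"row-{i}".encode() * 20 for i in range(1000)]

# File-level compression: one blob over all records. Smallest on disk,
# but reading row 500 means decompressing everything before it.
file_level = zlib.compress(b"".join(records))

# Record-level compression: each record compressed independently,
# so any row can be decompressed on its own (fast random access).
record_level = [zlib.compress(r) for r in records]

# Random access to record 500 touches exactly one small blob.
row = zlib.decompress(record_level[500])

# The price: independently compressed records are bigger in total.
print(len(file_level) < sum(len(r) for r in record_level))  # True
```

For image blobs that are already JPEG/PNG-compressed at the record level, that size penalty mostly disappears, which is Chang's point about the trade-off making sense for AI data.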
Yeah, makes sense, makes sense.
That makes total sense.
Like the trade-off there between like space
and like time complexity.
So like it's, I think it's like anyone who has done
any kind of like computer science, computer engineering,
like it's one of the most fundamental things.
Like, okay, we are going to store more information
so we can do like things faster and like vice versa.
So it depends on like what you optimize for in any case.
Okay.
So, all right.
We have, by the way, Lance, like, the format is open source, right?
Like it's something out there, like people can go and like use it, play around, do whatever
they want with it.
There's also like some tooling, I guess, around it, right?
Like you have like some tools that you can convert
like Parquet into Lance, for example.
And also the opposite.
Is it also possible to go from Lance to Parquet if you want?
Yep.
It's also the same two lines of code.
You read it onto Arrow and write into the other format.
Okay.
But what happens then if you have, let's say, a Lance file that also has images inside, and you want to go to Parquet? Like, how is this going to be stored in the Parquet?
Yeah.
So right now the, you know, the storage type is just bytes.
So it'd be for images, it would be like bytes.
Or if you're storing just image URLs, then they'd just be plain string columns, right? So a lot of the, so we're making extension types in Arrow to enrich the ecosystem, right? So Arrow right now does not understand, like, images or, you know, videos or audio. So we're going to
start making these image extension types for Arrow that certainly will work
well with Lance, but can be made to work well with Parquet and other formats as well.
And so that way, top-level tooling can understand, oh, this column of bytes is an image rather than
just, oh, this is just a bunch of bytes. And so then your visualization tooling,
BI tooling, data engineering
pipelines can make much smarter
decisions and inference based
on these things.
That makes total sense.
Okay, so we talked
about the underlying technology, which is
open source, when it comes
to the table and the file format.
But there's also a product on top of that, right?
Yeah.
So tell us a bit about the product.
What is the product?
Yeah.
So I love open source through and through.
So if money wasn't an object, I'd certainly spend my whole day just working on open source tooling.
But I think certainly it's very exciting also to build a product that the market
and folks want and will use.
So on top of Lance format, we're building LanceDB, the vector database. That's sort of the first step in our overall AI lakehouse.
And what makes this vector database different
is one, it's embedded. So you can start
in 10 seconds just by pip installing. There's no Docker to mess with. There's no external services.
The format actually makes it a lot more scalable, right? So, you know, on a single node,
I can do billion scale vector search within 10 milliseconds. It's also very flexible, right? Because you can
store any kind of data that you want, and you can run queries across a bunch of different paradigms.
And that whole combination makes it a lot easier for our users to get really high quality retrieval
and simplifies their production stack for systems and code.
And I think another really big benefit for LanceDB is the ecosystem integration.
So a lot of people have told me, once I started using it,
it's, oh, this is like if Pandas and vector databases
had a love child and called it LanceDB.
It was for people who are, you know, experimenting
and working with sort of putting data in and data preparation, all that. It was just much easier
to load data in and out of LanceDB with the existing tooling that they made.
And so this, you know, we again go back to our discussion of like, how do we sort of try to bring
new things back into an old paradigm
and use the existing tooling to solve these new problems. And I think one of the things that
new sort of vector databases, I think someone coined the term new SQL, N-E-W SQL,
as this new generation of databases. I don't know how I feel about that,
but certainly I think like
this new generation of databases
have kind of forgotten lessons,
painful lessons we've learned
over the last decade
of like data warehousing development, right?
So like columnar storage is not a thing
in a lot of these new databases.
Separation of compute and storage is not a thing in these new databases.
And so I think it's something that is very much worth doing,
especially as you're scaling up. And those are the things that we're building into LanceDB
that we're offering for generative AI users
that I think are pretty exciting.
All right.
One last question from me, and then I'll give the microphone back to Eric.
And it is related to what you were mentioning right now about database systems. So, okay. Traditionally, right, there was the OLAP and the OLTP kind of paradigm there, right? And they usually, not usually, like still today, they serve, let's say, very different workloads, right? And that, of course, dictates many different trade-offs, different people involved, all these things. So hearing you about LanceDB, I'm like, oh, that's great. Now I can have my embeddings there, for example, and I don't need to go to another system to do my filters or whatever with the metadata. But if I want to build an application, I'm still going to need another data store. That's probably like a Postgres database, right? Where I'm going to have some part of the business logic living there. Like, the state is going to be managed for the application in this system, right? And AI is one of these things that, you know, it feels at least to me that it's a much more front-end kind of data technology at the end than building pipelines for data warehousing, right? So how can we bridge these two? Because there's still a dichotomy there, right? Like, I'll still have my Postgres with my application state and LanceDB that is going to have the embeddings there and whatever else I need.
First of all, do you think that it is a problem?
And if it is a problem, like, do you see Lance like trying
to solve this in the future?
Yeah.
That's a really great question.
I think there's like two sort of gaps in what you were mentioning.
One is this in production, the production OLTP kind of database
and transactional database
versus the data store that's needed for AI serving.
So that's one kind of split right now.
The other is going from development
or research into production, right?
So this is like what you use in your data lake
versus what you use in production.
So I think the second one,
I think that's a much easier question.
So for Lance, you know,
because of the fact that we're good for random access as well,
you literally can use the same piece of data
in your lake house and also in production serving.
And this is something that is pretty exciting to me because there's very few data technologies that's good enough for both.
The first question, I think, there's no sort of absolute best answer.
So in my experience, I've seen installations where the production transactional store is
also the AI feature store and serving
store.
And although I would say that at scale, as companies scale up, that tends to be less
and less true.
And a lot of times these AI serving stores that supports vector search workloads have much more stringent requirements,
and the workloads tend to be much more CPU-intensive.
And so when you mix the two together, you end up creating trouble for both types of workloads.
So at scale, a lot of companies find it easier to separate the two.
Yeah, I'm not sure. I think at small scale, it's perfectly fine to have a single store. It, you know, simplifies your stack, and you keep everything together. Although I think the tooling and the user experience around, like, you know, Postgres, let's say, for vector search is kind of wonky, and the syntax is kind of bad.
And if you want high-quality retrieval,
you then have to figure out how to do full-text search index on your own
and then combine the stuff.
And performance also tend not to be great.
So I think the answer certainly depends.
It mostly depends on your scale and your use cases. If your sort of AI use case is very light and very small, you can certainly sort of put your expertise around that production database and just use, whatever, you know, pgvector and the full-text index that comes with Postgres.
That's certainly sufficient, but, you know, the sort of larger, more serious production
installations tend to be separate.
I think it'll stay that way.
Yeah.
And I think, and correct me if I'm wrong here, but I would assume that like the AI workloads
on the front end part, like primarily read type of workloads, like you primarily want
like to be able
to concurrently and really fast read and serve results.
Well, okay, obviously when you're managing the state of an application, it's like read
and write heavy, right?
Like you need transactions, you need all these things.
There's a very different set of trade-offs there.
So it sounds like it's hard to put them together at scale, at least, which makes sense.
All right.
One last thing before Eric gets the microphone.
How would someone play around with Lance?
What are your recommendations where they should go, both for the technology itself and also the product?
Yeah.
So the easiest way to start with LanceDB is just pip install lancedb.
And then in our GitHub, we have a repository.
It's under lancedb/vectordb-recipes. And there's about a dozen or so worked examples and notebooks,
both in JavaScript and also in Python
that you can use LanceDB and just step through.
And so these are like, you know,
building recommender systems,
building chatbots with ChatGPT, using the LanceDB integration with LangChain and LlamaIndex,
just, you know, building a host of tools.
We'll add to it
more and more as time goes on.
If you want to find out more about
the format itself,
go to the lancedb/lance repo. That's the file format.
There's a lot of reading material
and benchmarks.
A lot of it is,
if you're
familiar with Rust or C++,
you can also learn a lot just going through the Rust core code base. There's a lot of
interesting things that we do in there. Sounds good. All right, Eric, I'm sorry for
monopolizing the conversation, but the microphone is yours. No, it's absolutely fascinating. But Chang, one thing that I'm interested in
is the changes you're seeing in the landscape around data itself.
So when we think about unstructured data like images, et cetera,
of course, we can think about things like self-driving cars
or various AI applications like that.
But in an increasingly media-heavy world, do you see unstructured data as becoming a much
larger proportion of the data that companies are dealing with?
Yeah. Yeah, absolutely. I mean, I think, you know, I think people have like a
terabyte of photos just on their iPhones these days. So it's, it's, I think it's going to just
become much more important. And the data set sizes will dominate tabular data. And a lot of the use
cases will also become multimodal. So in a media-heavy world, when you have lots of images and videos,
how do you organize that data
and how do you query that data
also becomes critical, right?
So you want to be able to ask
your set of like a billion images
some questions in natural language
or using SQL or something like that.
And a lot of that is going to rely on extracting features from the images, but also a lot of times, like, embedding the images and using vector search and a combination
of these things. So I think that's going to become a lot more important in the next few years as
just AI and enterprise data becomes more multimodal. I also think that the relationship between data and downstream consumers will change.
I would say before AI and before machine learning, it was a very much waterfall-y way of designing
these pipelines where you come up with a schema
and you load data into it in that schema.
And then you sort of publish this and downstream consumers are like, okay, I can use this.
And maybe this is wonky for my use case, but this is what I got.
But I think now it's much more important that data pipeline stays very close to AI and ML
because the use cases there will determine the kind of schema, the kind of transformations,
and the trade-offs that you make with data.
Much more important than before.
Yeah, I think totally agree.
And I think one of the things that is going to accelerate this,
and I'm really fascinated to see how this plays out, but one of the interesting things about AI
in general is that it produces large quantities of unstructured data, right? And so you essentially
have a system that you're building using unstructured data that produces
a, you know, a massive amount of additional unstructured data, right? And so you have this
system that's a loop where in order to sort of meet the demand for, you know, additional AI
applications, like it's going to require a significant amount of infrastructure for
unstructured data.
Yeah, totally.
I mean, I think, especially in generative AI, right? If you have a million users producing new images, that's going to be kind of crazy.
Yeah, or even just unstructured chat conversations, even themselves as an entity.
Okay, one last question,
because we're right at the buzzer here, but where did the name Lance come from?
So we were just thinking about, like, you know, AI data, unstructured data, being sort of these large, heavy blobs, and how do you deal with them and still be very performant. And so we were thinking about, you know, things that seem fast but also have this connotation that we can deal with heavy things. And so I think we were watching some, like, you know, fantasy movie, I forgot the name, and there was like a jousting tournament. And so we're like, okay, we're calling it Lance. I love it. Yeah, that is actually a great analogy. Lances are gigantic, but they're used in, like, fast-motion, you know, sort of high-impact situations. So yeah. Yeah, so I'd love to ask one question for you guys too,
which is, you know, we spent a lot of time in the last hour talking about sort of, you know,
what's old in the evolution.
I'd love to get your take on what's new as well.
So in generative AI, obviously,
it's sort of the hot thing today.
So there's a lot of potential value
that we can clearly see.
There's also some hype, right?
So in your opinion, what do you think is the one,
the most underrated thing in generative AI
and what's the most overhyped thing?
That's a great question.
You know, I think one of the most underrated things will be the use cases that are not very sexy, but will essentially eliminate very low-value human work. So if we think about just one example: a friend called me the other day,
and they work at a company that processes huge quantities of PDFs in the medical space or
whatever. And they actually, because of the need for discovering these pieces of information in
them, and the formats are all very disparate, and it's very painful.
And so they literally have thousands of people who brute force this. And with AI, you know, it's like, oh, well, you can get the information that you need with a very high level of accuracy from files that are notoriously difficult to work with, that are in any format and in any order, right? That doesn't sound great. But I think
what excites me about that is, okay, if you take all of those people and free them up to be creative
with their work, as opposed to just sort of doing
brute force, you know, manual looking through PDFs for, you know, sort of needle in a haystack
information. I think that type of thing has the potential, you know, and that's just one example
and there are thousands across industries, has the potential to really unlock a lot of human
creativity to solve problems that's currently trapped in pretty low-level work. I think that's
really exciting. I think probably the most overhyped piece of it, and I haven't thought
through this, so you're getting this. I mean, I've thought through it a little bit, but I'll just do it live.
Is this notion that this is just going to take over people's jobs? And I'll give you a specific
example. I think that there's certainly potential for that, but I think the way that the media is
portraying that is really wide of the mark. Because one example recently is that I was working with a
group who is trying to, not trying to, but using LLMs to create long form content around a certain
topic to drive SEO, right? And I think a lot of people think like, okay, this is just sort of a silver
bullet where if you can give it a prompt and then you get something back, right? And it's that easy.
And like, okay, so SEO, the people who think critically about that content,
their jobs are all gone, right? And in reality, I think on the ground,
what we see at least is that there's sort of this,
there's kind of two modes.
There's one where it's like a very primitive, like you give a prompt and you ask for something
and what you get back is astoundingly good
for how simple the input is and how low effort it is. But when you need to
solve for like a very particular use case, you actually have to get very good at using these
tools. And it's not simple, right? Like understanding prompts, understanding all
of the knobs that are available in a tool like ChatGPT.
And then, I mean, things like what we're finding is that it's actually very useful to use the LLM tool itself to do prompt development for you, right? And so you get this sort of
iterative loop of prompts that can produce a prompt that actually gives you the output that
you want, right? You know, you're dealing with it
on an API level at that point.
And so, I don't know,
I just think that's overhyped
where it's like, man, to get really good at this,
like you actually have to be very creative
and get really deep into all of the ways
to tune it to actually use it.
So I don't know, that's my hot take.
You know, this is really interesting
is that it reminds me of the sort of hype cycle
in history around autonomous vehicles, right?
Is that the last 10 years,
every year was like,
oh, fully autonomous vehicles
are coming out next year.
And this is the kind of thing
this is very much similar
where it's like if your vehicle
is autonomous 80% of the time
in demos, it's amazing.
But you still have to hire a full-time driver. So that driver is not losing his job. So it feels like it's very similar here.
Yeah, for sure. And I think, you know, I don't know. I mean, I'm certainly not.
I'm not trying to suggest that there won't be some sort of massive displacement. And I think as adoption grows, a lot of those things will
be productized, right? And so certainly I see a future where it gets to a point where you can
sort of productize those things, but sort of the mass near-term displacement, I don't think is a
reality because you can't just give it a sentence and get back something that's highly accurate if you want to go beyond very simple use cases.
Totally.
Yeah, should I go next, Eric?
Yes.
Okay.
I'll start with what I find, I think, a very fascinating and very underrated kind of aspect of AI.
And that's that actually it enables a kind of like data flywheel.
And what do I mean by that?
For me, and obviously like, okay, I'm really into like the data stuff.
So I tend to like to look into more of that stuff.
But the reality is that there's a lot of data out there,
like much more data
than what we can like today like process.
And a big part of that, effectively, is because there is a lack of structure around this data.
And I think that LLMs can really help accelerate the process of actually creating datasets, and that can lead to products over these datasets at the end.
And that's, for me,
it's a very fascinating aspect of that.
Just the fact that I can give it a text, and it's not just that it'll give me a summary of that, it is that I can actually get it in a machine-understandable format, like as JSON that has very specific, predefined semantics that I can then use with the rest of the technology that I have there to do things. It's almost like a superpower, right? So I see the value of
adding the pictures in there, for example, or the audio files. But when the audio file turns into text, and after that turns into columns of topics or tags or paragraphs or speakers, that's crazy. Because if you wanted to do that until today, you pretty much had to be someone like Meta or Google that could hire thousands of people that would go and annotate that stuff, right? So that's one of the things that I think people underestimate when it comes to these systems.
I know it's a little bit more on the back end of things, but I think that's where the value starts being created.
And when it comes to the hype, I think one of the things that's... Especially for people who are spending a lot of time on Twitter, seeing all these things of like,
oh, okay, now you don't need developers to go and build applications.
You can just use chat GPT and autopilot and copilot.
And you can go and build a full product and make millions out of that.
That's like, okay, I'm sure that's bullshit.
There's no way that works, right?
It is an amazing tool for developers.
Amazing tool for developers.
I think it's, like, the first time that I could say, like,
after all these, like, years in tech,
like, that there's, like, a truly innovative new tool
for helping developers to be more productive.
But we are not anywhere close to, like, you know, having a robot developer that builds applications. This thing does not exist. And the other thing that I think people tend to forget is that everyone says that, probably, okay, customer support, for example, is going to be fully automated with AI and robots and all that stuff. But they forget that when people reach out for help, part of the help is also to connect.
There's like human empathy there and human relationships.
And these things like cannot be emulated at the end, right?
Like you can at some level, you can have companions, you can have like an AI that you talk to and
like all that stuff, right?
But at the end, doing business and working,
and I'm sure like companies at any scale will know about that
and they already know about that.
Like putting a face in front of the company,
it is important for the business itself.
So again, it will make customer success much more productive
and the people there being more creative
and like also having a more fulfilling job at the end.
But it's not like suddenly we are going to fire everyone who is like, you know, solving problems for people picking up the phone and we are going to have like AI to do that.
And I'm very curious to see what's going to happen in the creative industries. I think there are very interesting things, especially in cinema. I think we are just going to see an explosion of creativity at the end. That's my feeling. So there's a lot of value, in my opinion. I mean, I know that people think, oh, it might be another crypto situation, but I think it's a very different situation. There's a lot of work that has to be done to enable it still, but I think the future looks very interesting.
It certainly does.
How about you?
Certainly, you guys already took my number one answer,
so I have to come down.
I love asking you these questions because every conversation I have,
I think everyone comes up with better answers than me. I think on the overhyped front,
I would say there's a lot of excitement about autonomous agents. And I think we are at least sort of a year, if not more, away from really making that work very well.
Just what I see is that agents really struggle with that last mile accuracy that's required
for production and also performance.
If you have a complex question, or a task, you have to break that down into multiple steps, possibly a pretty long agent chain.
And these chains can start adding up in terms of time.
It takes like, you know, minutes or things like that where it just becomes not interactive
and it's much faster just to build something
sort of special purpose.
So I think this, like, "everything is going to become autonomous and we'll never have to work again" thing is not coming quite yet. In terms of underrated, I think, yeah, I totally agree. I think there are a lot of less sexy things that I think have the potential to produce a lot of value. So one big thing I see is I think it's going to change people's expectations
of how they can interact with information systems and knowledge bases.
So most websites and applications have a little search feature.
And without an exception, they all kind of suck.
And it's because they're all based on sort of text and, you know,
syntactical search. I think with the popularity of generative AI, our expectation is going to
just drastically change. Every search box becomes a semantic search box. And so any sort of tool
that doesn't live up to that promise in the next year or so, I think, is going to have trouble
retaining a lot of users.
You're going to go from,
oh, this search sucks because, of course, search sucks,
to, oh, your search sucks.
I'm going to go to your competitor who has semantic search built up already.
Yeah, I agree.
I was going to say,
if you think about the support agent
or the agent piece of it,
I would actually combine those two and say that the first
wave of that is going to be like better search. So for example, like if you think about documentation,
it's really compelling to have this idea that, you know, it's like, okay, well, you have all
these docs, and people have to comb through the docs to find this really specific answer to their question, and the search around it sucks because no one has time to
redo the indices with every new piece of data, and that's a horribly manual process, right?
But at the same time, if you give wrong information in an automated way relative to documentation, if someone's building an
application or something, I mean, say, a critical data workload
that's going to inform an ML model, that's going to do really important things for downstream
users, you can't really get that wrong, right? And so I agree, the first wave of
that is not going to be, like, docs go away and it's just a
chatbot that gives you your answer. It's going to be that it will help you search so much faster
and better than you ever have before.
Totally. Awesome. Well, Chang, this has been such a fun conversation.
Great questions. Thanks for making it a conversation. And yeah, we'd love to have you back on to dig back into all sorts of other fun topics.
Thank you. Thank you for having me. This was a lot of fun.
He's a co-author of the Pandas library, which is legendary. And the fact that he has built
multiple high-impact technologies and is a multi-time, multi-exit founder building data tooling
in sort of the data and MLOps space. I mean, all of those things together, it's really incredible. But when you talk with him, if you didn't know who he was,
you would just think this is just one of those really curious, really passionate,
really smart founders. And you said at the very beginning that he's humble, but I mean, that's almost an understatement. He would treat anyone
on the same level as him, no matter their level of accomplishment or technical expertise.
That really stuck out to me. And I also think the other thing that was really great about this
episode was it wasn't like he came out and
said, you know, I have an opinion about the way the world should be. And like, this is why we're
doing things like the Lance DB way. He just kind of had a very calm explanation of the problem
and a really good set of reasoning for why he needed to create a new file format, right? Which is
like shocking to hear, you know, because it's like, whoa, you know, you have like Parquet exists,
why do this, right? So it sounds really shocking on face value, but I mean, his description was
really compelling. And the story of how they actually sort of almost backed into creating
a vector database,
you know, because they invented this file format.
Just an incredible episode.
Yeah, yeah.
I mean, Chang is like one of these rare cases where you have both like an innovator and
a builder, which is like, I mean, it's hard to find an innovator.
It's hard to find a builder.
It's like even harder to find someone who combines these two.
And at the same time being, like, down to earth like him. I think this episode has pretty much everything.
I mean, it has lessons from the past that can be super helpful to understand
how we should approach and solve problems today. And there's a lot to learn from the story of Pandas
that is applicable today for everyone who's trying to build tooling
around AI and ML.
What I really enjoyed was that it was actually probably the first time
that we talked about something I think is very important,
which is how the infrastructure needs to evolve in order to accommodate these new use cases and actually accelerate innovation with AI and ML, which is still a work in progress.
And I think Chang provided some amazing insight into what the right directions are to do that. He said some very interesting things about not creating silos, and he
gave a very interesting example from mathematics, where he said
that in mathematics, when you have a new problem, you try to reduce
it to a known problem, right?
And that's how we should also build technology.
That was an amazing insight, to be honest.
And I think it's something that builders, especially, tend to forget,
and they tend to either replicate things
or create bloated solutions and all that stuff.
So there's a lot of wisdom in this episode.
I think anyone who's a data engineer
and wants to get a glimpse of the future,
of what it means to work
with the next generation of data platforms,
should definitely tune in and listen to Chang.
I agree.
Really an incredible episode.
Subscribe if you haven't.
You'll get notified when this episode goes live
on your podcast platform of choice.
And of course, tell a friend.
Many exciting guests coming down the line for you.
And we will catch you on the next one.
We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe on your favorite podcast app
to get notified about new episodes every week.
We'd also love your feedback.
You can email me, Eric Dodds, at eric@datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by RudderStack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.