The Data Stack Show - 161: The Intersection of Generative AI and Data Infrastructure with Chang She of LanceDB

Episode Date: October 25, 2023

Highlights from this week’s conversation include:

- Chang’s background and journey with Pandas (6:26)
- The persisting challenges in data collection and preparation (10:37)
- The resistance to change in using Python for data workflows (13:05)
- AI hype and its impact (14:09)
- The success and evolution of Pandas as a data framework (20:04)
- The vision for a next-generation data infrastructure (26:48)
- LanceDB's file and table format (34:35)
- Trade-offs in the Lance format (42:45)
- Introducing the vector database (46:30)
- The split between production and serving databases (51:14)
- The importance of unstructured data and multimodal use cases (57:01)
- The potential of generative AI and the balance between value and hype (1:01:34)
- Changing expectations of interacting with information systems (1:13:53)
- Final thoughts and takeaways (1:15:32)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. Welcome back to the Data Stack Show. Costas, today we're talking with really a legend, I think that's probably an appropriate term. Chang She is one of the original co-authors of the Pandas library. So, you know, we're going back to a time before the modern cloud data warehouse,
Starting point is 00:00:47 when that work started. Absolutely fascinating story. And now he's working on some pretty incredible tooling around unstructured data. And another fascinating story there, and actually a lot in between. And this isn't going to surprise you, but I actually want to ask about the Panda story. I do want to talk about LanceDB, which is what he's working on. But the Panda's library was, it came out of the financial sector, which is really interesting. And in a time when, you know, the technology they were using, we would consider legacy. And now it's lingua franca for people worldwide who are doing data science workflows. And so the chance to ask him that story, I think is going to be really exciting.
Starting point is 00:01:40 But yeah, you probably have tons of questions about that, but also Lance DB. Yeah, a hundred percent. I mean, it's first of all, we're talking about the person who has been around building foundational technology for data for like many years now. So we definitely have like to have a conversation with him, like about Pandas and like the experience there because i think you know like history tends like to repeat itself right so i'm sure that like there are many lessons to learn from what it means like to what it meant back then like to bring pandas to the
Starting point is 00:02:17 market and like to the community out there and these lessons they're definitely like applicable also today like with new technologies which i which I think is even more important today because of living in this moment in time where AI and tele-lens and all these new technologies around data are coming out, but we're still trying to figure out what's the best way to work with these technology. So that's definitely something that we should start the conversation with. And obviously, talk about LensDB and see what made him get into building a new paradigm in storing data, a table format, and what can happen on top of that, and what it means to build a data lake that is, let's say, AI native. What it means to build data infrastructure that can support the new use cases and the new technologies around like AI and ML. So I think it's going to be a fascinating conversation. And he's also like an amazing person himself, like very humble and very fun to talk with.
Starting point is 00:03:26 And there's going to be like a lot to learn from him. So let's go and do it. I agree. Let's do it. Kong, welcome to the Data Stack Show. It's really an honor to have you on. Thank you, Eric. I'm excited to be here.
Starting point is 00:03:38 All right. Well, we want to dig into all things LanceDB, but of course, we have to go back in history first. So you started your career as a quant in the finance industry. So give us the overview and the narrative arc, if you will, of what led you to founding Lance. Yeah, absolutely. So quite a journey.
Starting point is 00:04:02 So my name is Chung. I'm CEO co-founder of Lance CB. I've been building data and machine learning tooling for almost two decades at this point. As you mentioned, I started out my career as a financial quant. And then I got involved in Python open source. And I was one of the original co-authors of the Pandas library. And after that became popular, started a company for Cloud BI,
Starting point is 00:04:29 got acquired by Cloudera. And then I was VP of Engineering at 2BTV, where I built a lot of the recommendation systems, ML op systems, and experimentation systems. And throughout that whole experience, I felt that tooling for tabular data was getting better and better. But when I was looking at unstructured data tooling, it was sort of a mess. And at TubiTV, it was a streaming company, so we dealt a lot with images, videos, and other unstructured assets. And any project involved unstructured data always took three to five times as long. My co-founder at the time was working at Cruise,
Starting point is 00:05:13 so he saw similar problems, but even at an even bigger scale. So we got together and sort of tried to figure out what the problem was. And our conclusion was, it was because the data engineering data infrastructure for AI was not built on solid ground. Everything was optimized for tabular data and systems, you know, a decade old. And so once you built on top of this unshaky foundation, things start to fall apart a little bit, right? It's like trying to build a skyscraper on top of a foundation for like a three-story condo. Yep. Makes total sense. Before we dig into Lance, can we hear a little bit of the backstory about pandas? I mean, I think it's
Starting point is 00:05:57 really interesting to me for a number of reasons. I think, you know, when you think about open source technologies, a lot of times you think sort of trickle down from, you know, the big companies that are, you know, I mean, in some ways, right, which you experienced where, you know, there's these huge issues, but sort of something like an, you know, that has become as popular as Pandas sort of arising out of the financial industry is just interesting. So can you give us a little bit of the backstory there? Yeah. So we'd have to really go back in time to when I first started working as a quant. So this was 2006. And at that time, data scientist wasn't really a job title. When I graduated, I knew I loved working with data. And at that time, if you like working with data, you went into quant finance. As a junior analyst, I spent a lot of time on
Starting point is 00:06:52 data engineering and data preparation, right? Loading data from the various data vendors that we have, producing reports and data checks, validation, integrating that into our main sort of feature store, which was just a Microsoft SQL server at the time. I was going to ask, you always say feature store, but it was probably... Yeah, feature store also wasn't a word
Starting point is 00:07:16 at that time. But there was a lot of like, the scripts were written in Java and the reports were produced in VBScript. And there was a lot of Excel reports flying around. There was no data versioning. There was barely code versioning. And everything was just a huge mess. And fast forward a couple of years, one day, my colleague and roommate at this time, Wes McKinney, came up to me and said, hey, look at this thing I've been working on in Python. And it was a sort of a closed source, you know, proprietary library for data preparation that he built in his group.
Starting point is 00:08:02 We were working at the same fund. I sort of immediately fell in love with it. And I was like, oh, this is the best thing ever. And I started using it in my group and trying to push for using that and also pushing for Python over Java as sort of the MDB script as the predominant data preparation tools and things like that. And so, you know, initially there was definitely a lot of pushback on, oh, but Python is not compiled, therefore it's not safe. Or like, you know, why do we want to use this when we
Starting point is 00:08:37 already have a bunch of code written? So it took us a while to sort of get buy-in. And then it also took a while then to get the company to agree to actually open source the thing. And this was in an era sort of a little bit after the financial crisis. At that time, Wall Street and hedge funds in general were extremely anti-open source. Everything was considered sort of secret sauce. And there was a lot of unwillingness to open that. And it took maybe six months of work from Wes to actually make that happen. And sort of the final trigger was essentially him quitting to start a PhD program before they sort of relented and say, okay, fine, we'll make this open source. Wow. I mean, you know, working on pandas sort of in the wake of the financial crisis, what a unique experience. One question that comes up as you tell that story
Starting point is 00:09:44 that's really interesting is that you're sort of talking about a period of time where a lot of the tools that are just the lingua franca of anyone working in data, right? Whether you're more in the data engineering end of the spectrum or sort of MLAI end of the spectrum that are really cloud warehouses and data lakes and Python-based workflows, et cetera. But it was really interesting. One thing you said was, I spent a lot of time on data collection and data preparation. You actually hear the same phrase today, even though from a tooling standpoint, it's a wildly different landscape and far more advanced than it was back then. Why do you think that is? Because people are saying the same thing
Starting point is 00:10:33 well over a decade later. Yeah. I think the problems are different today. And maybe this is something that I think Kostas has lots of thoughts on here as well, given his experience. But I think in my day as a junior analyst, biggest problems were into an FTP, and sometimes it just didn't arrive on time. And most of these processes were very manual. And I think at that time, dataset sizes might be a lot more downstream and has to do a lot more with scale and then sort of these manual connections. I do think that data sort of data accuracy and cleanliness is a problem that just hasn't been solved. And I think a lot of it is just because the data that we work with is generated by real world processes. And by definition, they're just super dirty. And I think probably a third big factor is, you know, in finance, there was always a very big focus on data quality and data cleanliness. I remember going through just the data with a fine-tooth comb to figure out, okay, did we forget to record stock split merger acquisition?
Starting point is 00:12:18 Or did this share price look wrong because there was some data error. And because the data being wrong has an outsized impact in those use cases. But we can only handle it at small scale at that point. And now I think with internet
Starting point is 00:12:39 data, if your log or event data is wrong a couple times out of the billion, it's not going to affect your processes, your BI dashboards all that much. So I think the problems are different, but it's sort of that commonality of data being generated by real-world processes is still the same. And so I think that at the core, that's why we still hear those same complaints over and over. Yeah. Fascinating. Okay. One more question for me before we dive into Lance. So you talked about this paradigm of trying to essentially sell this idea of using Python and there being resistance to that, which, you know, looking back, it's like,
Starting point is 00:13:27 whoa, that sounds crazy, you know, because going from Java to Python for these kind of workflows, it makes sense. But of course, when you're in that situation, and there are sort of established incumbent processes, frameworks, et cetera, people can be resistant to change. Do you see a similar paradigm happening today, especially with the recent hype around AI, where there's a smaller group advocating for a new way to do things and there's resistance against that? Is there a modern day analog that you see in the industry? That's a good question. I mean, yeah, it's so certainly hard to say because I think I'm so I live in San Francisco
Starting point is 00:14:19 now. Right. So I think my bubble is basically the small group of, you know, all the different small groups of people being very crazy about trying new things that seem crazy. So in my immediate circles, it's actually hard to say. I think, you know, all I hear is like, oh, have you tried this new thing that has like, that came out two days ago and has you know a hundred thousand stars on github already and that kind of it's actually hard for me to say but i i do think that there's there's sort of a very big sort of impulse function that that makes its way out so you know in the san francisco in general sort of tech Silicon Valley tech bubble, it's very much like, oh, AI is like chat GPT is so over.
Starting point is 00:15:15 Now it's whatever the latest open source model is. whereas if you know actually go out and talk to some like you know normal people in normal places they're like oh yeah i've heard about this vaguely but i don't really know what it is because it doesn't have impact on my daily life yet yeah yeah super interesting yeah all right well thank you for entertaining my questions about pandas tell us aboutDB. What is it and what problem does it solve? Yeah, actually, so before we dive into that, I actually like, Kostas, given your experience, I'd love to hear your take on some of that too.
Starting point is 00:15:56 I feel like you must have very interesting stories there too. Yeah, I mean, specifically for like the AI craziness or like... Well, more about like, you know, sort of how has problems in data engineering evolved, right? When you first started out your career versus now, what are the things that you think people don't understand in data that they should? Yeah. I mean, I think the best way to understand the evolution of data infrastructure or, like, I don't know,
Starting point is 00:16:35 data technology in general is to see, to observe, like, the changes in the people involved in the lifecycle of data, right? If you are, like, I'm sure, like, you've seen also that stuff involved in the lifecycle of data, right? If you are, like, I'm sure like you've seen also that stuff because you were like working on these things like back then, but like we're talking about like close to 2010, 2008, there was like a wealth of like systems coming out, especially like from the big tech, like in, like in the Bay area, like from Twitter, like from LinkedIn, some of them became like very successful systems like Kafka, Bay Area, like from Twitter, like from LinkedIn.
Starting point is 00:17:06 Some of them became like very successful systems like Kafka, for example, right? But these systems were coming out like from people, like let's say type of engineer that was primarily the type of like, I build systems. Someone came to me and they were like oh we have this huge problem that scale right now and we don't know how like to deal with it like go and figure it out so the conversation would be
Starting point is 00:17:34 more around like the primities of like distributed systems well like all these things right which these people like use their language actually because they're like systems engineers, right? Engineers. But if we fast forward
Starting point is 00:17:49 today and take a typical data engineer, they have nothing to do with that stuff. They are people that are coming more from the data, let's say, domain, than from the systems engineering domain, right?
Starting point is 00:18:06 And that's inevitable to happen because as something becomes more and more common, we need more and more people to go and work with that stuff. So we can't assume that everyone will become the systems engineer of Leaf, Uber, Meta, whatever, to go and solve problems out there, right? And if we see the people like in the titles and like they're all like the data scientists, the data engineers, the ML engineers, right? And track the evolution there, I think this can help us a lot like to understand
Starting point is 00:18:39 both why like some technologies are needed or why Python is important, right? Because yeah, of course, like these people are focusing more on like the data, not Python is important, right? Because, yeah, of course, these people are focusing more on the data, not on the infrastructure. Yes, writing in a dynamic language like Python, when you run, you might end up service-breaking, right? But when you work with data, that's not the case because you're experimenting primarily. You're not putting something in production that's going to be a web server, right? So it's all these things that they change the way that we interact with technology. And let's say the weight of what is important, I think, changes. And the developer experience has to change.
Starting point is 00:19:20 And that's what I think is the best indicator at the end of where things are today and where they should go. And that's a question that I want to ask you actually about pandas. Because why, in your opinion, pandas was so successful at the end? Because you had ways to deal with that stuff before, right? Somewhere like, it's not like a new problem appear out of the blue,
Starting point is 00:19:50 right? But what like made Pandas like the de facto, let's say framework for anyone who is more in the data side of things, like the data scientists, for example,
Starting point is 00:20:02 based on your experience, what like, what you've seen out there? Yeah, absolutely. So I think a lot of this was just making it really easy to deal with real world data. So when we first started out, it was very clear to us that Pandas had a lot of value because we were using it on a daily basis for business-critical processes and systems. But for a stranger, in the beginning, it was actually kind of hard for them to understand.
Starting point is 00:20:35 Because at the time, there were sort of a couple of different competing projects. And in the beginning, now Pandas 2.0 plus is also arrow based. But in the very beginning, pandas sort of mechanically pandas was just a thin wrapper around NumPy. And so a lot of the data veterans at the time really dismissed pandas as, oh, this is just a wrapper of convenience functions around NumPy. And I'll just use NumPy because I'm smarter than the average data person, and I'll just code up all this stuff myself. But I think what was successful, what made Pandas successful, you know, most of this credit obviously goes to Wes, was we focused on sort of one vertical, one set of problems at a time.
Starting point is 00:21:29 And we just made it, you know, 10 times easier to deal with data within that domain. And so for people in that domain, it was very clear, oh, if I use pandas, you know, it's, you know, I save like a ton of time than using the alternatives. And then over time, we got a lot of pull requests and feature requests from the open source community in adjacent domains. And we sort of slowly expanded over time and sort of that. And then finally the, the, the sort of advent of data science and explosion of data science and popularity finally made pandas into the popular library that it is today.
Starting point is 00:22:13 Yeah, I think you put it very well. I think you used some very strong terms there that I think describe the situation. Not just with pandas like with every technology out there like you talk about veterans yeah like you always going to have like people who know how to I don't know go like down to the kernel and like do something crazy there right but how many people are like have like the time like to get into this level of expertise right so I think like when you you get to the critical mass, where a problem is big enough for the economy out there
Starting point is 00:22:50 that needs mass adoption, then there are things that become more important than, let's say, the underlying efficiency. And that's efficiency for access to the technology. And that's what I think pandas also did. It's not like it's this like beautiful, perfect in a scientific way, like way of solving like problems, like, but it is pragmatic and it is something that like people like understand as a tool that helps them be more productive, right?
Starting point is 00:23:23 It's a little bit of like more like a product approach, I would say, like in technology. But at the end, it is important. And we see that like many times with stuff like, like see like Snowflake, for example, right? It's not like we didn't have data warehouses before, but suddenly data warehouses became much more accessible to business users because they paid more attention into like,
Starting point is 00:23:46 okay, these people don't know what vacuuming is. Like why they have to vacuum their tables, right? Why they would learn that thing? Like it's like even engineers hate like doing vacuuming like on the table, you know? So I think that there's like a lot of value in that. And it's like things get really exciting because that's the point where a technology is ready for mass adoption.
Starting point is 00:24:10 That's my opinion. And I think we are on another critical point when it comes to data. And I think a lot of that stuff is going to be accelerated because of AI and ML. Because data will have much more impact in many more different areas of like everyday life. So more data needs to be processed, more data needs to be prepared and stored and like collected
Starting point is 00:24:31 and labeled and like all that stuff. So the question today is like, yeah, how we build the next generation of infrastructure for data that is going to get us us to 2013 and beyond. The way that, let's say, the system Spark or whatever was built in 2008, 2010, 2012 brought us to where we are today. And I think LanceDB is one of these solutions out there. So tell us a little bit more about that, like how Lans is changing
Starting point is 00:25:05 and what gaps it's filling that the previous generation of data infra had. Yeah, absolutely. I think when we looked at the problems dealing with unstructured data, I think what we see is unstructured data like images and text and all that, they're data that's really hard to ETL. The data that's very hard for sort of tabular data systems to deal with. So you get really bad performance. What you end up having to do is kind of make
Starting point is 00:25:42 multiple copies of the data. One copy might be in a format that's good for analytics, and the one copy that's good for training, and another copy that's good for debugging in a different format. And then you end up having to have different compute systems on top of these different formats. And then you have to then sort of create this Potemkin workflow on top of tooling that makes it a lot harder right you can sort of hide it for a time all this mess under the hood but it's a very leaky abstraction and over time it just sort of comes to the fore and so for us you know our goal is to essentially fix that with the Riot Foundation. And I think if you look at the history of data, every new generation of technology has
Starting point is 00:26:32 come with data infrastructure that's optimized for it. So you start with something like Oracle for when database systems were first coming to the fore and becoming popular. And then when the internet became popular, got a lot of like JSON data to deal with. And, you know, that's why NoSQL systems, particularly Mongo, became super popular. Then it was, you know, Spark, Hadoop, and then Snowflake.
Starting point is 00:26:58 And I think if you look out, you know, five, 10 years down the road, you know, AI is this next generation, big generation of technology. And I think there needs to be new data infrastructure that's optimized for AI. So that's sort of the core mission for LandCB. We're trying to make a next generation lake house for AI data. So the idea is that if you're managing large scale unstructured datasets. With Lans CB, you'll be able to analyze, train, evaluate, and serve just several times faster
Starting point is 00:27:28 with 10 times less development for effort at a fraction of the cost, right? The first product that we're putting out there is a Lans CB, the vector database, but we'll have a lot more exciting things to follow as well. Okay, that's awesome. Okay, let's do the following. You mentioned some very interesting stuff here,
Starting point is 00:27:51 like how you deal with unstructured data, for example, right? So let's, especially for our audience out there, which probably, okay, they haven't, many of them might never have liked to deal with this type of data at scale. Let's do the following. Let's describe a pipeline, a typical pipeline of dealing with this data without LensDB. How someone would do it in a data lake or lake house today. And then follow up with what LensDB is adding to that.
Starting point is 00:28:35 So a data engineer who never had to deal with that until now can get a glimpse of what it means to work with this type of data. Yeah. So for AI data, so for traditional data, a lot of the data generation processes are generating things in like JSON or CSV, or a lot of systems are just going straight to Parquet. And your life is kind of a lot easier then. But with AI, a lot of times you're getting hardware data. So you might be getting, let's say, a bunch of protobuf data that's coming off of sensors. That's a time series to go with that. You've got a bunch of images that correspond in time with some of those observations.
Starting point is 00:29:12 And then you might have some text, right. That's produced by a user or, you know, that is associated with certain products or something like that. Right. So, so off the bat, you've got this plethora of data
Starting point is 00:29:29 in different formats that maybe either comes in from your API or off of a Kafka stream or something like that. Maybe the first stage is that gets dumped into some location in S3, and then you would end up having to write some code to stitch that together and make some of the metadata gets stored as a parquet file, maybe if you're lucky, or in some table. And then your images are elsewhere, right? And then you have to maybe, if your data engineer is good, they know to convert the protobuf into some sane format for analytics, right? And then so you have essentially, and then you have some JSON metadata for like debugging and quick access, right?
Starting point is 00:30:22 So right off the bat, so you have these three pieces of data that you have to coordinate all across your pipeline and it doesn't really change. And then when you get to training, a lot of people are using, let's say like TF record. So there's, you have to convert data from these raw images from S3 and then into TF records or some format,
Starting point is 00:30:43 tensor format. And then once, and then you go you know, TF records or some format, tensor format. And then once, and then you go through your training pipeline and then once that comes out, then you need to do model eval. And then like, you know, like TF records
Starting point is 00:30:55 and other tensor formats are not that great. So you have to then convert that back and then join it now with your metadata because then you need to slice and dice your data
Starting point is 00:31:04 to see a model evaluation in different subsets of your data set, things like that. So that's what the pipeline looks like right now, even before it makes it into production. So most of your effort is spent in this managing lots of different pieces pieces of data trying to match them on some key that may or not may may not be reliable and switching between different formats as you go through these different stages right with lance the earlier you can switch into lance format the the easier it becomes where you can store all of that data together. And whether you're doing scans or debugging where you're pulling 10 observations out of a million or something like that,
Starting point is 00:31:50 Lance still performs very, very well. And so once you convert into Lance, a lot of the pipeline down the road becomes simpler. So you have one piece of data to deal with. Lance is integrated with Apache Arrow. So all of your familiar tooling is already compatible with it. And so you can sort of start to treat
Starting point is 00:32:13 that messy pile of data as a much more organized table. You know, I love math. In math, it's like you always try to reduce a problem to a previously known or solved state. And so I think Lance is that, you know, we want AI data to look and feel much more like tabular data. And then everything is a lot easier.
Starting point is 00:32:36 You can apply a lot of the same tooling and principles. Yeah, that makes total sense. And actually, it's one of the things that I think, like vendors in the ML ops, let's say space, like I think not failed, but maybe like there were like some mistakes there that ended up in like creating silos actually between like the different like parts of data infrastructure.
Starting point is 00:33:00 Like there was a lot of like replication of data infrastructure there just like to do the ML and then of course like data duplication is like a very hard problem I don't think like people realize how hard it is like to keep like consistent copies of data it might sound like silly like especially
Starting point is 00:33:20 someone who okay like uses technology and much more like casual but like it is one of the biggest problems. It's really hard to ensure that always your data is going to be consistent. There are some very strong trade-offs there. That's what we've learned from distributed systems, for example. To me, and that's what I like also from what you are saying, like it makes sense like to enrich or like enhance like the infrastructure that exists out there and like bring the new, like to reduce
Starting point is 00:33:51 the new paradigm to something existing than trying like to create something completely separated and just ignore what was like done like so far. So I think, I mean, personally, at least I think that like, it's a very good decision when it comes like to Lans. All right. So tell us a little bit more technical stuff
Starting point is 00:34:12 around like Lans. Is it like, it is a table format, I guess. We're talking about tables here. And it allows you like to mix very heterogeneous
Starting point is 00:34:22 type of like data. So it's not like just the tabular data that we have until now. It is based on Parquet, right? There is Parquet used behind the scenes. Is this correct or I'm wrong here? Oh, it's actually not Parquet-based. So Lance is actually both a file format and a table format.
Starting point is 00:34:41 So the issue with Parquet is that the data layout is not optimized for unstructured data and not optimized for random access, which is important here. So we actually wrote our own file format, plus the table format,
Starting point is 00:35:00 from scratch. So this is written in Rust. I think maybe this goes back to Eric's question from earlier. Rust is one of those things that might not be ubiquitous yet, but it's definitely gaining popularity. And I think Rust and Python play really well together. And just that combination of both safety and performance, plus the ease of package management, is something that I think is very unique. And it's pretty amazing as a developer.
Starting point is 00:35:36 It's also very easy to pick up. So we actually started out writing Lance in C++. And at the beginning of this year, we made a decision, for those same reasons, to switch over to Rust. And we were Rust newbies; we were learning Rust as we were rewriting. And even then, it took us about three weeks for me and Lei to rewrite roughly four, four and a half months of C++ code. And I think more importantly, we just felt a lot more confident with every release, to be able to say this is not going to segfault if you just look at it wrong. Yeah, yeah, yeah, 100%. I think you touched a very interesting point here, which, again, connects with the Pandas conversation that we had, and how these technologies split into more of a front-end, back-end kind of thing in application development,
Starting point is 00:36:48 having kind of a similar paradigm when it comes to systems development, which is going to be extremely powerful. And we see that already, with so many good tools coming out in the Python ecosystem actually being developed
Starting point is 00:37:04 in the back end, let's say, with Rust. So whoever builds the libraries for the bindings, I think they've done an amazing job with Python. But all right, that's awesome, actually, because my next question would have been about how you deal with the columnar nature of Parquet. And, okay, you've already answered that: it's not the same columnar format anymore.
Starting point is 00:37:36 But my question now is: okay, let's say I have infrastructure already in place, right? I have my Parquet data lake. Parquet is the de facto solution out there when it comes to building data lakes. Does it mean that if I want to use Lance, I have to go and convert everything into Lance? And how does the migration, or the interoperability between the different storage formats, work out there?
Starting point is 00:38:07 Yeah, this, so I mean, the short answer is yes, you have to sort of migrate your data if it's in Parquet or other formats. This process, fortunately, is very easy. It's literally two lines of code, one to read existing formats into Arrow and one to write it into Lance. And I think this wider topic here is very exciting.
Starting point is 00:38:26 I think Wes actually just published a recent blog post on composable data systems. And I think this is the next big revolution in data systems. I'm very excited about that. Previously, when you were building a database, you had to literally build the whole database, from the parser to the planner to the execution engine, the storage, the indexing. You had to build everything. Whereas now there are so many components out there that you can innovate on one piece and create a whole system just by using open source components that play well together.
Starting point is 00:39:07 This is what makes Apache Arrow such an important, and in my opinion one of the most underrated, projects in this whole ecosystem. You don't see it. You don't hear about it. You're using the higher-level tooling, but projects like Apache Arrow make it 10 times easier to build new tools and for these different tools to work well with each other. Yeah, yeah, 100%. I totally agree on that. And at some point we should have an episode just talking about Arrow, to be honest. Because, as you say, people who work more on the systems side of data know about it, obviously, but I think it is the unsung hero of what is
Starting point is 00:39:52 happening right now, because it did create the substrate to go and build more modular systems over the data. So we should do that at some point. All right. So let's go back to Lance. So why did you have to build this new way of storing the data, right? You mentioned something already: the point queries that columnar systems are not built for.
Starting point is 00:40:20 But there's also, let's say, the bulk work that you need to do, which makes columnar systems more performant, and also point queries, the kind of queries that you need when you serve something, for example. How do you, let's say, balance these two with Lance? Yeah. So the Lance format actually is a columnar file format, but the data is laid out in a way that supports both fast scans and fast point queries. And originally we designed it because of the pain points that ML engineers voiced around dealing with image data. So for debugging or sampling purposes, you often want to get something like the top 100 images spread out across 1 million images or 100 million images.
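The access-pattern problem Chang describes can be made concrete with some back-of-envelope arithmetic. The numbers below are hypothetical, not Lance or Parquet benchmarks; they just illustrate why a scan-oriented layout that can only fetch whole row groups pays a huge penalty when the rows you want are scattered.

```python
# Back-of-envelope read amplification for scattered point reads.
# All numbers are made up, purely to illustrate the shape of the problem.
total_rows = 100_000_000
rows_per_group = 1_000_000   # granularity a scan-oriented reader must fetch
bytes_per_row = 200
sampled_rows = 100           # e.g. "top 100 images spread across the dataset"

# Simplified worst case: each sampled row lands in a different row group,
# so the reader pulls 100 full groups just to return 100 rows.
scan_read = sampled_rows * rows_per_group * bytes_per_row

# A layout that supports row-level random access reads roughly just the rows.
point_read = sampled_rows * bytes_per_row

amplification = scan_read // point_read  # 1,000,000x in this toy model
```

Real readers can often skip to pages within a group, so the true gap is smaller than this worst case, but the asymmetry is the point: the scan-oriented cost scales with group size, not with the number of rows requested.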
Starting point is 00:41:24 And so with Parquet, you have to read out a lot more data just to get one row, so your performance is very bad. And so we designed it for that. And the happy accident was, once we designed it that way, we realized that if you can support really fast random access, so I think, just purely on micro-benchmarks
Starting point is 00:41:42 on just taking a bunch of rows out of a big dataset, we beat Parquet by about a thousand times in terms of performance, right? If you're talking about that order of magnitude of performance improvement, then it makes it a lot easier, and it makes a lot more sense, to start building rich indices on top of the file format. And this is what led to LanceDB, the vector database. So now, on top of the format, we have a vector index that can support vector search and full-text search, we can support SQL, and all the data is also stored together. And this is something that I think other vector databases can't do: the actual image storage
Starting point is 00:42:29 and other things have to go somewhere else. And so now you go back to that complex state of having to manage multiple systems. And so for us, it was, I would say, a happy accident that came from a good foundational design choice. Is there some kind of trade-off there? I mean, what is the price that someone has to pay to have this kind of, let's say, performance
Starting point is 00:42:55 and flexibility at the same time? Definitely. So the trade-off here is that if you want to support fast random access, it's much harder to do data compression. You can't do file-level compression anymore. You have to do either within-block or record-level compression. So here, if you have pure tabular data, then your file sizes in Lance will be bigger,
Starting point is 00:43:32 maybe 30% to 50% bigger than they would be in Parquet. So that's the trade-off there. Now for AI, let's say you're storing image blobs in this dataset. These image blobs are compressed at the record level already, so file-level compression actually doesn't matter, and the whole dataset size
Starting point is 00:43:49 is dominated by your image column, right? So then for AI, actually, this trade-off makes a lot of sense, because you're not really sacrificing that much. Yeah, makes sense, makes sense. That makes total sense: the trade-off there between space and time complexity.
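Chang's point that the trade-off is cheap for AI datasets is easy to see with some illustrative arithmetic; the sizes below are made up, chosen only to show what happens when an image column dominates the dataset.

```python
# Hypothetical dataset: image blobs dominate, tabular columns are a sliver.
image_column_gb = 100.0      # already compressed per record (JPEG etc.)
tabular_parquet_gb = 1.0     # tabular columns with file-level compression
tabular_lance_gb = 1.5       # ~50% larger without file-level compression

parquet_total = image_column_gb + tabular_parquet_gb   # 101.0 GB
lance_total = image_column_gb + tabular_lance_gb       # 101.5 GB

overall_overhead = lance_total / parquet_total - 1
print(f"{overall_overhead:.2%}")  # well under 1% overall
```

Even a 50% penalty on the tabular columns barely moves the total when record-level-compressed blobs make up most of the bytes.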
Starting point is 00:44:09 I think it's one of the most fundamental things for anyone who has done any kind of computer science or computer engineering: okay, we are going to store more information so we can do things faster, and vice versa. So it depends on what you optimize for in each case. Okay. So, all right.
Starting point is 00:44:26 By the way, Lance, the format, is open source, right? It's something out there; people can go and use it, play around, do whatever they want with it. There's also some tooling around it, I guess, right? You have tools that can convert Parquet into Lance, for example. And also the opposite: is it possible to go from Lance to Parquet if you want?
Starting point is 00:44:55 Yep. It's also the same two lines of code. You read it into Arrow and write it into the other format. Okay. But what happens then if you have, let's say, a Lance file that also has images inside, and you want to go to Parquet? How is this going to be stored in the Parquet? Yeah.
Starting point is 00:45:14 So right now the storage type is just bytes. So for images, it would be bytes. Or if you're storing just image URLs, then they'd just be plain string columns, right? So we're making extension types in Arrow to enrich the ecosystem. Arrow right now does not understand images or videos or audio. So we're going to start making these image extension types for Arrow that certainly will work well with Lance, but can be made to work well with Parquet and other formats as well. And so that way, top-level tooling can understand, oh, this column of bytes is an image, rather than
Starting point is 00:46:00 just, oh, this is just a bunch of bytes. And so then your visualization tooling, BI tooling, data engineering pipelines can make much smarter decisions and inference based on these things. That makes total sense. Okay, so we talked about the underlying technology, which is
Starting point is 00:46:20 open source, when it comes to the table and the file format. But there's also a product on top of that, right? Yeah. So tell us a bit about the product. What is the product? Yeah. So I love open source through and through.
Starting point is 00:46:33 So if money weren't an object, I'd certainly spend my whole day just working on open source tooling. But certainly it's very exciting also to build a product that the market and folks want and will use. So on top of the Lance format, we're building the LanceDB vector database. That's sort of the first step in our overall AI lakehouse. And what makes this vector database different
Starting point is 00:47:03 is one, it's embedded. So you can start in 10 seconds just by pip installing. There's no Docker to mess with. There are no external services. The format actually makes it a lot more scalable, right? So on a single node, I can do billion-scale vector search within 10 milliseconds. It's also very flexible, because you can store any kind of data that you want, and you can run queries across a bunch of different paradigms. And that whole combination makes it a lot easier for our users to get really high-quality retrieval, and it simplifies their production stack for systems and code. And I think another really big benefit for LanceDB is the ecosystem integration.
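Conceptually, the vector search described here boils down to a nearest-neighbour lookup over embeddings. The following is a brute-force pure-Python sketch of that idea only; the toy documents and two-dimensional vectors are made up, and real systems like LanceDB use approximate indices rather than an O(n) scan like this.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy "table" of embedded documents (vectors are made up).
docs = [
    ("vector databases", [0.9, 0.1]),
    ("jousting tournaments", [0.1, 0.9]),
]

query = [0.85, 0.15]  # embedding of the search query
best = max(docs, key=lambda doc: cosine_similarity(query, doc[1]))
print(best[0])  # "vector databases"
```

The engineering work in a real vector database is making this lookup fast at billion-row scale, via indexing and a storage layout that supports random access, rather than scanning every row as above.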
Starting point is 00:47:52 So a lot of people have told me, once I started using it, oh, this is like if Pandas and vector databases had a love child and called it LanceDB. For people who are experimenting and working with putting data in, data preparation, all that, it was just much easier to load data in and out of LanceDB with the existing tooling they already had. And so this, again, goes back to our discussion of how we try to bring new things back into an old paradigm
Starting point is 00:48:26 and use the existing tooling to solve these new problems. And I think one of the things about new vector databases, I think someone coined the term NewSQL, N-E-W SQL, for this new generation of databases. I don't know how I feel about that, but certainly I think this new generation of databases has kind of forgotten lessons, painful lessons we've learned over the last decade
Starting point is 00:48:56 of data warehousing development, right? So columnar storage is not a thing in a lot of these new databases. Separation of compute and storage is not a thing in these new databases. And I think it's something that is very much worth doing, especially as you're scaling up. And those are the things that we're building into LanceDB and offering for generative AI users that I think are pretty exciting.
Starting point is 00:49:28 All right. One last question from me, and then I'll give the microphone back to Eric. It's related to what you were just mentioning about database systems. So, okay, traditionally there was the OLAP and the OLTP kind of paradigm, right? And usually, not just usually, it's still true today,
Starting point is 00:49:49 they serve, let's say, very different workloads, right? And that, of course, dictates many different trade-offs, different people involved in all these things. So hearing you about LanceDB, I'm saying, oh, that's great. Now I can have my embeddings there, for example, and I don't need to go to another system to do my filters or whatever with the metadata. But if I want to build an application, I'm still going to need another data store. That's probably a Postgres database, right?
Starting point is 00:50:21 Where I'm going to have some part of the business logic living there. The state is going to be managed for the application in this system, right? And AI is one of these things that, at least to me, feels like a much more front-end kind of data technology at the end than building pipelines for data warehousing, right? So how can we bridge these two? Because there's still a dichotomy there, right? I'll still have my Postgres with my application state, and LanceDB, which is going to have the embeddings and whatever else I need. First of all, do you think that it is a problem?
Starting point is 00:51:07 And if it is a problem, do you see Lance trying to solve it in the future? Yeah. That's a really great question. I think there are two sort of gaps in what you were mentioning. One is in production: the production OLTP, transactional database versus the data store that's needed for AI serving.
Starting point is 00:51:34 So that's one kind of split right now. The other is going from development or research into production, right? So this is like what you use in your data lake versus what you use in production. So I think the second one, I think that's a much easier question. So for Lance, you know,
Starting point is 00:51:54 because of the fact that we're good for random access as well, you literally can use the same piece of data in your lakehouse and also in production serving. And this is something that is pretty exciting to me, because there are very few data technologies that are good enough for both. The first question, I think, has no absolute best answer. So in my experience, I've seen installations where the production transactional store is also the AI feature store and serving
Starting point is 00:52:26 store. Although I would say that at scale, as companies scale up, that tends to be less and less true. And a lot of times these AI serving stores that support vector search workloads have much more stringent requirements, and the workloads tend to be much more CPU-intensive. And so when you mix the two together, you end up creating trouble for both types of workloads. So at scale, a lot of companies find it easier to separate the two. Yeah, I'm not sure. I think at small scale, it's perfectly fine to have a single store. It simplifies
Starting point is 00:53:12 your stack and you keep everything together. Although I think the tooling and the user experience around, let's say, Postgres for vector search is kind of wonky, and the syntax is kind of bad. And if you want high-quality retrieval, you then have to figure out how to do a full-text search index on your own and then combine the results. And performance also tends not to be great. So I think the answer certainly depends. It mostly depends on your
Starting point is 00:53:47 scale and your use cases. If your AI use case is very light and very small, you can certainly put your expertise around that production database and just use whatever pgvector and full-text index comes with Postgres. That's certainly sufficient. But the larger, more serious production installations tend to be separate, and I think it'll stay that way. Yeah. And correct me if I'm wrong here, but I would assume that the AI workloads
Starting point is 00:54:19 on the front-end part are primarily read-type workloads: you primarily want to be able to read and serve results concurrently and really fast. Whereas when you're managing the state of an application, it's read- and write-heavy, right? You need transactions, you need all these things. There's a very different set of trade-offs there. So it sounds like it's hard to put them together, at scale at least, which makes sense.
Starting point is 00:54:49 All right. One last thing before Eric gets the microphone. How would someone play around with Lance? What are your recommendations on where they should go, both for the technology itself and also the product? Yeah. So the easiest way to start with LanceDB is just pip install lancedb. And then in our GitHub, we have a repository. It's under lancedb/vectordb-recipes. And there are about a dozen or so worked examples and notebooks,
Starting point is 00:55:27 both in JavaScript and also in Python, that you can use with LanceDB and just step through. These are things like building recommender systems, building chatbots with ChatGPT, using the LanceDB integrations with LangChain and LlamaIndex, building a host of tools.
Starting point is 00:55:46 We'll add to it more and more as time goes on. If you want to find out more about the format itself, go to the lancedb/lance repo. That's the file format. There's a lot of reading material and benchmarks. And if you're familiar with Rust or C++, you can also learn a lot just by going through the Rust core code base. There are a lot of interesting things that we do in there.
Starting point is 00:56:01 A lot of it is, if you're familiar with Rust or C++, you can also learn a lot just going through the Rust core code base. There's a lot of interesting things that we do in there. Sounds good. All right, Eric, I'm sorry for monopolizing the conversation, but the microphone is yours. No, it's absolutely fascinating. But Chang, one thing that I'm interested in is the changes you're seeing in the landscape around data itself. So when we think about unstructured data like images, et cetera,
Starting point is 00:56:39 of course, we can think about things like self-driving cars or various AI applications like that. But in an increasingly media-heavy world, do you see unstructured data becoming a much larger proportion of the data that companies are dealing with? Yeah, absolutely. I mean, I think people have a terabyte of photos just on their iPhones these days. So I think it's going to become much more important, and the dataset sizes will dominate tabular data. And a lot of the use cases will also become multimodal. So in a media-heavy world, when you have lots of images and videos,
Starting point is 00:57:27 how do you organize that data and how do you query that data also becomes critical, right? So you want to be able to ask your set of like a billion images some questions in natural language or using SQL or something like that. And a lot of that is going to rely on
Starting point is 00:57:43 extracting features from the images, but also a lot of that is going to rely on extracting features from the images, but also a lot of times like embedding the images and using vector search and a combination of these things. So I think that's going to become a lot more important in the next few years as just AI and enterprise data becomes more multimodal. I also think that the relationship between data and downstream consumers will change. I would say before AI and before machine learning, it was a very much waterfall-y way of designing these pipelines where you come up with a schema and you load data into it in that schema. And then you sort of publish this and downstream consumers are like, okay, I can use this.
Starting point is 00:58:35 And maybe this is wonky for my use case, but this is what I got. But I think now it's much more important that data pipeline stays very close to AI and ML because the use cases there will determine the kind of schema, the kind of transformations, and the trade-offs that you make with data. Much more important than before. Yeah, I think totally agree. And I think one of the things that is going to accelerate this, and I'm really fascinated to see how this plays out, but one of the interesting things about AI
Starting point is 00:59:13 in general is that it produces large quantities of unstructured data, right? And so you essentially have a system that you're building using unstructured data that produces a, you know, a massive amount of additional unstructured data, right? And so you have this system that's a loop where in order to sort of meet the demand for, you know, additional AI applications, like it's going to require a significant amount of infrastructure for unstructured data. Yeah, totally. I mean, I think, especially in generative AI, right? If you have a million users producing new images, that's going to be kind of crazy.
Starting point is 00:59:56 Yeah, or even just unstructured chat conversations, even themselves as an entity. Okay, one last question, because we're right at the buzzer here, but where did the name Lance come from? So we were just thinking about AI data, unstructured data, being these large, heavy blobs, and how do you deal with them and still be very performant? And so we were thinking about things that seem fast but also have this connotation that we can deal with heavy things. And I think we were watching some fantasy movie, I forgot the name, and there was a jousting tournament,
Starting point is 01:00:46 and so we were like, okay, we're calling it Lance. I love it. Yeah, that is actually a great analogy. Lances are gigantic, but they're used in fast-motion, high-impact situations. So yeah. So I'd love to ask one question for you guys too, which is, you know, we spent a lot of time in the last hour talking about what's old in this evolution. I'd love to get your take on what's new as well. So in generative AI, obviously, it's sort of the hot thing today.
Starting point is 01:01:20 So there's a lot of potential value that we can clearly see. There's also some hype, right? So in your opinion, what do you think is the most underrated thing in generative AI, and what's the most overhyped thing? That's a great question. You know, I think
Starting point is 01:01:44 one of the most underrated things will be the use cases that are not very sexy but will essentially eliminate very low-value human work. Just one example: a friend called me the other day, and they work at a company that processes huge quantities of PDFs in the medical space. Because of the need for discovering pieces of information in them, and because the formats are all very disparate, it's very painful. So they literally have thousands of people who brute-force this. You know, it's like, oh, well, you can get the information that you need, with a very high level of accuracy, from files that are notoriously difficult to work with, in any format and in any order, right? That doesn't sound sexy. But what excites me about that is, okay, if you take all of those people and free them up to be creative
Starting point is 01:03:03 with their work, as opposed to just doing brute-force manual looking through PDFs for needle-in-a-haystack information. That type of thing has the potential, and that's just one example and there are thousands across industries, to really unlock a lot of human creativity that's currently trapped in pretty low-level work. I think that's really exciting. I think probably the most overhyped piece of it, and I haven't fully thought through this, so you're getting this live. I mean, I've thought through it a little bit, but I'll just do it live. It's this notion that this is just going to take over people's jobs. And I'll give you a specific
Starting point is 01:03:54 example. I think that there's certainly potential for that, but I think the way that the media is portraying it is really wide of the mark. One example recently: I was working with a group who is using LLMs to create long-form content around a certain topic to drive SEO, right? And I think a lot of people think, okay, this is just a silver bullet where you can give it a prompt and then you get something back, right? And it's that easy. And so, okay, for SEO, the people who think critically about that content, their jobs are all gone, right? In reality, I think on the ground, what we see at least is that
Starting point is 01:04:51 there's kind of two modes. There's one where it's very primitive: you give a prompt and ask for something, and what you get back is astoundingly good for how simple and low-effort the input is. But when you need to solve for a very particular use case, you actually have to get very good at using these tools. And it's not simple, right? Understanding prompts, understanding all of the knobs that are available in a tool like ChatGPT. And then, what we're finding is that it's actually very useful to use the LLM tool itself to do prompt development for you, right? And so you get this sort of
Starting point is 01:05:37 iterative loop of prompts that can produce a prompt that actually gives you the output that you want, right? You know, you're dealing with it on an API level at that point. And so, I don't know, I just think that's overhyped where it's like, man, to get really good at this, like you actually have to be very creative and get really deep into all of the ways
Starting point is 01:05:58 to tune it to actually use it. So I don't know, that's my hot take. You know, what's really interesting is that it reminds me of the hype cycle in history around autonomous vehicles, right? For the last 10 years, every year it was like, oh, fully autonomous vehicles
Starting point is 01:06:13 are coming out next year. And this is the kind of thing that's very much similar, where it's like, if your vehicle is autonomous 80% of the time in demos, it's amazing. But you still have to hire a full-time driver. So that driver is not losing his job. So it feels like it's very similar here.
Starting point is 01:06:33 Yeah, for sure. And I'm certainly not trying to suggest that there won't be some sort of massive displacement. And I think as adoption grows, a lot of those things will be productized, right? So certainly I see a future where it gets to a point where you can productize those things. But the mass near-term displacement, I don't think, is a reality, because you can't just give it a sentence and get back something that's highly accurate if you want to go beyond very simple use cases. Totally. Yeah, should I go next, Eric? Yes.
Starting point is 01:07:14 Okay. I'll start with what I find a very fascinating and very underrated aspect of AI, and that's that it actually enables a kind of data flywheel. And what do I mean by that? Obviously, I'm really into the data stuff, so I tend to look more into that side. But the reality is that there's a lot of data out there, much more data
Starting point is 01:07:47 than what we can effectively process today. And a big part of that is because there is a lack of structure around this data. And I think that LLMs can really help accelerate the process of actually
Starting point is 01:08:06 creating datasets, and products on top of these datasets, at the end. And that's, for me, a very fascinating aspect of it. Just the fact that I can give it a text,
Starting point is 01:08:20 and it's not just that it'll give me a summary of it. It's that I can actually get it in a machine-understandable format, like JSON, with very specific, predefined semantics that I can then use with the rest of the technology I have to do things. It's almost like a superpower, right? So I see the value of adding the pictures in there, for example, or the audio files. But when the audio file turns into text, and after that turns into columns of topics or tags or paragraphs or speakers, that's crazy. Because if you wanted to do that until today, you pretty much had to be someone like Meta or Google that could hire
Starting point is 01:09:15 thousands of people to go and annotate that stuff, right? So that's one of the things that I think people underestimate when it comes to these systems. I know it's a little bit more on the back end of things, but I think that's where the value starts being created. And when it comes to the hype, I think one of the things is... Especially for people who are spending a lot of time on Twitter, seeing all these things of, oh, okay, now you don't need developers to go and build applications. You can just use ChatGPT and autopilot and Copilot, and you can go and build a full product and make millions out of that. That's, okay, I'm sure that's bullshit.
Starting point is 01:10:00 There's no way that works, right? It is an amazing tool for developers. Amazing tool for developers. I think it's the first time, after all these years in tech, that I could say there's a truly innovative new tool for helping developers be more productive. But we are not anywhere close to, you know,
Starting point is 01:10:23 having a robot developer build applications. This thing does not exist. And the other thing that I think people tend to forget: everyone says that, okay, customer support, for example, is going to be fully automated with AI and robots and all that stuff. But they forget that when people reach out for help, part of the help is also to connect. There's human empathy there, and human relationships. And these things cannot be emulated at the end, right? At some level, you can have companions, you can have an AI that you talk to, and
Starting point is 01:11:01 all that stuff, right? But in the end, for doing business and working, and I'm sure companies at any scale already know about this, putting a face in front of the company is important for the business itself. So again, it will make customer success much more productive, and the people there being more creative
Starting point is 01:11:24 and also having a more fulfilling job in the end. But it's not like suddenly we are going to fire everyone who is, you know, solving problems for people and picking up the phone, and have AI do that instead. And I'm very curious to see what's going to happen in the creative industries. I think there are very interesting things there, especially in cinema. I think we are just going to see an explosion of creativity in the end. That's my feeling. So there's a lot of value, in my opinion. I know that people think, oh, it might be another crypto situation, but I think it's a very different situation. There's a lot of work that still has to be done to enable it, but I think the future looks very interesting.
Starting point is 01:12:11 It certainly does. How about you? Well, you guys already took my number one answer, so I have to go further down my list. I love these questions because in every conversation I have, I think everyone comes up with better answers than me. On the overhyped front, I would say there's a lot of excitement about autonomous agents. And I think we are at least a year, if not more, away from really making that work very well. What I see is that agents really struggle with the last-mile accuracy that's required
Starting point is 01:12:55 for production, and also with performance. If you have a complex question or a task, you have to break it down into multiple steps, possibly a pretty long agent chain. And these chains can start adding up in terms of time. It takes minutes or something like that, where it just becomes not interactive, and it's much faster just to build something special-purpose.
Starting point is 01:13:22 So I think this "everything is going to become autonomous and we'll never have to work again" thing is not coming quite yet. In terms of underrated, I totally agree. I think there are a lot of less sexy things that have the potential to produce a lot of value. One big thing I see is that it's going to change people's expectations of how they can interact with information systems and knowledge bases. Most websites and applications have a little search feature. And without exception, they all kind of suck.
Starting point is 01:14:01 And it's because they're all based on text and, you know, syntactical search. I think with the popularity of generative AI, our expectations are going to drastically change. Every search box becomes a semantic search box. And any tool that doesn't live up to that promise in the next year or so, I think, is going to have trouble retaining users. You're going to go from, "oh, this search sucks because, of course, search sucks," to, "oh, your search sucks."
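The shift Chang describes, every search box becoming a semantic one, comes down to matching on meaning rather than on literal words. A toy sketch of that difference, with made-up three-dimensional vectors standing in for a real embedding model and a vector database:

```python
import math

# Tiny hand-made "embeddings": a real system would get these from an
# embedding model and store them in a vector database for fast lookup.
docs = {
    "How to reset your password":        [0.9, 0.1, 0.0],
    "Billing and invoice questions":     [0.0, 0.9, 0.1],
    "Recovering access to your account": [0.8, 0.2, 0.1],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def keyword_search(query, docs):
    # Syntactic search: only returns documents sharing a literal word with the query.
    q = set(query.lower().split())
    return [d for d in docs if q & set(d.lower().split())]

def semantic_search(query_vec, docs):
    # Semantic search: rank documents by embedding similarity; no shared words needed.
    return max(docs, key=lambda d: cosine(query_vec, docs[d]))

# "I'm locked out" shares no words with any title, so keyword search finds nothing...
print(keyword_search("I'm locked out", docs))   # prints: []
# ...but its (assumed) embedding sits near the account-recovery doc.
locked_out_vec = [0.8, 0.25, 0.1]
print(semantic_search(locked_out_vec, docs))    # prints: Recovering access to your account
```

The query vectors here are hand-tuned for the example; the design point is that similarity in embedding space is what lets "locked out" find "recovering access" with zero word overlap.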
Starting point is 01:14:32 I'm going to go to your competitor who has semantic search built in already. Yeah, I agree. I was going to say, if you think about the support agent piece of it, I would actually combine those two and say that the first wave of that is going to be better search. So for example, if you think about documentation, it's really compelling to have this idea that, you know, okay, well, you have all
Starting point is 01:15:00 these docs, and people have to comb through the docs to find a really specific answer to their question, and the search around it sucks because no one has time to redo the indices with every new data point. That's a horribly manual process, right? But at the same time, if you give wrong information in an automated way relative to documentation, say someone's building an application or a critical data workload that's going to inform an ML model doing really important things for downstream users, you can't really get that wrong, right? And so I agree, the first wave of that is not going to be docs go away and it's just a
Starting point is 01:15:46 chatbot that gives you your answer. It's going to be that it helps you search so much faster and better than you ever have before. Totally. Awesome. Well, Chang, this has been such a fun conversation. Great questions. Thanks for making it a conversation. And yeah, we'd love to have you back on to dig into all sorts of other fun topics. Thank you. Thank you for having me. This was a lot of fun. Chang is a co-author of the Pandas library, which is legendary. And the fact that he has built multiple high-impact technologies, is a multi-time, multi-exit founder building data tooling in the data and MLOps space, I mean, all of those things are really incredible. But when you talk with him, if you didn't know who he was, you would just think this is one of those really curious, really passionate,
Starting point is 01:16:55 really smart founders. And you said at the very beginning that he's humble, but I mean, that's almost an understatement. He would treat anyone on the same level as him, no matter their level of accomplishment or technical expertise. That really stuck out to me. And I also think the other thing that was really great about this episode was that he didn't come out and say, you know, I have an opinion about the way the world should be, and this is why we do things the LanceDB way. He just had a very calm explanation of the problem and a really good set of reasoning for why he needed to create a new file format, right? Which is shocking to hear, you know, because it's like, whoa, Parquet exists,
Starting point is 01:17:50 why do this, right? So it sounds really shocking at face value, but his description was really compelling. And the story of how they almost backed into creating a vector database because they invented this file format... just an incredible episode. Yeah, yeah. I mean, Chang is one of these rare cases where you have both an innovator and a builder. It's hard to find an innovator.
Starting point is 01:18:19 It's hard to find a builder. It's even harder to find someone who combines these two, and at the same time is as down-to-earth as he is. I think this episode has pretty much everything. It has lessons from the past that can be super helpful for understanding how we should approach and solve problems today. And there's a lot to learn from the story of Pandas that is applicable today for everyone who's trying to build tooling around AI and ML. What I really enjoyed was that it was probably the first time
Starting point is 01:18:58 that we talked about something I think is very important, which is how the infrastructure needs to evolve in order to accommodate these new use cases and actually accelerate innovation with AI and ML, which is still a work in progress. And I think Chang provided some amazing insight into the right directions for doing that. And he said some very interesting things about not creating silos. He gave a very interesting example from mathematics, where he said that in mathematics, when you have a new problem, you try to reduce it to a known problem, right? And that's how we should also build technology. Truly an amazing insight, to be honest.
Starting point is 01:19:41 And I think it's something that builders especially tend to forget, and they end up either replicating things or creating bloated solutions and all that stuff. So there's a lot of wisdom in this episode. I think anyone who's a data engineer and wants to get a glimpse of the future, of what it means to work with the next generation of data platforms,
Starting point is 01:20:04 they should definitely tune in and listen to Chang. I agree. Really an incredible episode. Subscribe if you haven't. You'll get notified when this episode goes live on your podcast platform of choice. And of course, tell a friend. Many exciting guests are coming down the line for you.
Starting point is 01:20:23 And we will catch you on the next one. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers.
Starting point is 01:20:49 Learn how to build a CDP on your data warehouse at rudderstack.com.
