The Data Stack Show - 179: Time Series Data Management and Data Modeling with Tony Wang of Stanford University

Episode Date: February 28, 2024

Highlights from this week’s conversation include:
Tony's background and research focus (3:35)
Challenges in academia and industry (6:15)
Ph.D. student's routine (10:47)
Academic paper review process (15...:26)
Aha moments in research (20:05)
Academic lab structure (23:09)
The decision to move from hardware to data research (24:43)
Research focus on time series data management (27:40)
Data modeling in time series and OLAP systems (32:01)
Issues and potential solutions for parquet format (37:32)
Role of external indices in parquet files (42:19)
Tony's open source project (47:11)
Final thoughts and takeaways (49:30)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. We have Tony Wang on the Data Stack Show today. Tony, we have a lot to talk about, both academia, the data industry, different kinds of selling, and some cool data stuff in general. But we'll start where we always do. Give us an overview of your background.
Starting point is 00:00:46 Yeah, I'm Tony. I'm a PhD student at Stanford University, one of the few people today still studying data systems and databases. Before that, I was at MIT for four years, studying mostly electrical engineering and hardware engineering. And before that, you know, I came to the U.S. from China when I was 16. I went to a private boarding school up in New Hampshire. I love to ski and I bike a lot, so California has been pretty good for both. It's one of the rare areas where you can drive like four or five hours, hopefully only, and ski and also have, you know, decent weather year round when you're not skiing. Yeah.
Starting point is 00:01:31 And it's a great place for database research too. So you kind of get all, you check all of your boxes. Parts of California are great for database research to be sure. Yes. Yeah. And that's like one of the reasons that I'm really excited to have Tony here today. Eric, like I think it's the first time that we have someone who is actually pursuing a PhD.
Starting point is 00:01:53 We have many people who have successfully done their PhDs and started companies, but someone who's like in the process of the PhD, I think it's like the first time. So I'm super excited to talk about what it feels like to do that, to do research. And learn, of course, like, what's, let's say, the state of the art right now, what academia is interested in. And most importantly, what's the connection between that and the industry out there, because there is a continuum, right? Like the things that happen in universities, especially in stuff like databases, they have
Starting point is 00:02:30 like an impact out there on the systems that we build tomorrow. So super excited to chat about that. What about you, Tony? Like, what would you like to talk about today? Sure. I can talk about that. I can also talk about the stuff I'm working on and my thoughts on different data
Starting point is 00:02:49 processing systems and what I hope will become more popular in the future from a technical perspective. Although I know that a lot of data products are also driven by other aspects as well that I have less of an insight on.
Starting point is 00:03:07 Yep, sounds good. So what do you think, Eric? Should we go and do it? Let's do it. I can't wait. All right, well, this is a really exciting episode for us because you are in the midst of doing a PhD, getting your PhD. And I don't think we've had anyone on the show who's like actively in a PhD program. And so we want to learn all about it.
Starting point is 00:03:35 You're doing some really interesting research on data systems. So let's just start there. Can you tell us what is your main area of study and focus? Because you're close to the end, right? Are you finalizing your thesis? I would hope so. Yeah. So I mostly work on data processing systems and mostly around cloud data processing, around quickly processing data in data lakes that people use today, like Apache Iceberg or Delta Lake, or even just buckets of Parquet files, which is unfortunately still way too common. Yeah.
Starting point is 00:04:17 Yeah. We had someone on the show recently, and the discussion was like, when is, you know, when are we going to move from Parquet? So how did you decide that you wanted to do a PhD? I mean, you know, data lakes are obviously, like, you know, very popular in industry and very widely used, but, you know, it's not every day that you meet someone who's, you know, actually studying those at a PhD level. So how did you end up going down that track? Well, how did I start to do a PhD was,
Starting point is 00:04:49 and why I ended up at Stanford in particular is, back when I was trying to decide what I was going to do after, because at college, I mostly worked on hardware, like Verilog and FPGA and GPU, low-level CUDA programming. So after that, I applied to some jobs like NVIDIA and then decided, well, maybe I'll pursue hardware research at Stanford University where some of the best hardware research is being done.
Starting point is 00:05:18 Yeah, I turned down my job offer at NVIDIA. In retrospect, maybe that was a poor financial decision. I was going to say, if they were offering options, you know, hindsight's 20-20. You know, I got my offer
Starting point is 00:05:31 in like March 2020 when the stock was like the lowest point from the COVID. So I took as much money as I personally had and I bought NVIDIA stock and then I decided
Starting point is 00:05:42 to just like go do my PhD program. Yeah, but halfway into my first year PhD program, I realized that, why am I in academia doing this stuff? And I look back at people I talked to at NVIDIA, I realized that NVIDIA is just going to dominate the hardware industry
Starting point is 00:06:03 and the cool stuff in hardware is being done in industry. I think it's very hard for people in academia to be able to move the needle in the state of the art in the hardware industry. Oh, interesting. Can you describe why that is a little bit? I mean, so just to make sure I understand what you're saying, you knew from your work studying hardware that NVIDIA was going to be the big player in the market and that's what was happening? I won't name anybody
Starting point is 00:06:32 but I talked to people at NVIDIA, I talked to people at AMD, I talked to people at Intel. People at NVIDIA, they were truly excited to be there. It was a level of excitement that I could not discern from people, you know,
Starting point is 00:06:46 And NVIDIA is like a software-based company, right? It's really hard for a hardware company to actually, like, get a software-driven culture because, like, you know, in other companies, maybe the company was started by hardware engineers.
Starting point is 00:07:01 The founders are hardware engineers. So, so like those people get more say and software gets maybe neglected and looked upon as something that's easy and not real. But at NVIDIA, I think it's really incredible how the leadership team is able to foster a culture where five out of six engineers are software engineers
Starting point is 00:07:22 and build this amazing software stack. And that's what really dooms the academic hardware projects, because there are many aspects. One is that your project just cannot possibly try to keep up at the very competitive nodes today, like 5 or 7 or whatever. So you might just miss a lot of... You might have an amazing design that works at
Starting point is 00:07:49 20 or 28 or whatever, but you might miss problems that would occur if you were trying to do it at a more competitive technology node. And the other aspect is the software, right? Like you could build some hardware, but, you know, to get people to use it, there's like a long way between Python code and your hardware. Now, of course, there's definitely value
Starting point is 00:08:15 in academic research, right? In designing new, like, hardware designs and stuff like that. That might inspire people in industry to pursue, like, you know, certain architectural decisions, but I was more on the side of trying to do something that can actually be used. And that is, unfortunately, not where I found that I should be focusing my time.
Starting point is 00:08:40 Yeah. Yeah. Interesting. And so it sounds like, and maybe I'm drawing the wrong conclusion here, but if academia is not really driving innovation on the hardware side, it does sound like on, for example, data processing systems, that there is a lot of innovation being driven in academia, and that's why you pursued that path? It's funny because in data processing systems, there are also very entrenched players like Snowflake and Oracle, but every now and then, in software, the barrier
Starting point is 00:09:14 to entry for building a system that's actually useful, I think, is a bit lower. You see cool academic projects like DuckDB, for example, taking huge traction in the data industry. And that really started as an academic project, right? Like that's not,
Starting point is 00:09:30 they did not have the resources of, say, AWS Redshift or Snowflake or something. It's just a couple of guys, you know, in the Netherlands. And there's also this project called Polars, which is like a Rust-based rewrite of Pandas. And that was really started by one guy. You know, it's like how one person could, you know, with, you know,
Starting point is 00:09:53 maybe tens of thousands of hours of coding could really try to displace, you know, one of the most popular data analytics libraries out there, Pandas, right? So that's a testament to, like, you know, if you're dedicated enough, you can really move the needle there in what people use in the real world.
Starting point is 00:10:14 Yeah, yeah. Okay, can you, this is so interesting. Kostas, I have a million questions, so I promise I won't steal the mic for the entire episode. But Tony, what does a typical week look like for you? And I know that's a difficult question because it probably changes, but I know that some of our audience
Starting point is 00:10:31 certainly have done a lot of post-secondary study, but a lot of them probably haven't. And so we don't really know what it's like to be a PhD student studying data systems. So can you just give us a glimpse into your role? So I'm very much on the applied side. And I know that people on the theoretical side, their days actually are a bit different. And I wouldn't actually say it's that different from working at a regular job because you
Starting point is 00:11:00 show up and you try to program. And, well, it's maybe a bit easier because you have fewer meetings and code reviews. Because, yeah, like, there are no code reviews. You can write whatever
Starting point is 00:11:20 you want, but whatever you want has to work. No one's asking you if you wrote your unit test. Yeah, I mean, most academic projects only work on the five benchmarks they write in their paper and nothing else. You just have to kind of get your code into that state, but if you actually want your code
Starting point is 00:11:38 to be, I guess, used elsewhere, it has to go beyond that, but that's typically not inside of the purview of academia. Yeah, that makes sense. Now, you mentioned you're on the applied side, but there are also people, your peers who are working more on the theoretical side, which sounds like a spectrum, but you're both sort of, say, studying data systems. Can you describe that spectrum to us?
Starting point is 00:12:00 What does it look like to be more on the theoretical side? So on the theoretical side, you might be like, so there are a lot of people working. Well, when I say theoretical side, people who are actually there might think they're more on the applied side. Again, it's all a matter of perspective. There are people at Berkeley, for example, working on distributed programming paradigms, like the Hydroflow project that tries to revolutionize how you do cloud programming and stuff like that. So this kind of paradigm shifting theoretical work, I would say. And then, yeah, they would probably spend more time working around programming languages and designing language specifications, doing some proofs, maybe to make sure that things work.
Starting point is 00:12:54 The last time I did a proof was in my real analysis class at MIT. So it's seven years ago. Yeah. Yeah. Yeah. Now, one thing that really struck me when we were chatting before we hit record was that I kind of had made this assumption that being in industry, for example, trying to run a data infrastructure company, you know, is like wildly different than what you do. And your response was, well, you know, not really. I still have to do a lot of sales. As a PhD student, can you explain that concept to us?
Starting point is 00:13:38 It was just so interesting to hear you talk about that. Okay. Like, a lot of a PhD student's time is spent writing and reviewing and rebutting papers or whatever and trying to change your writing
Starting point is 00:13:54 or your pitch. And then most people will tell you that writing an academic journal publication is like telling a story, which is not too different from what a lot of salespeople have told me about doing sales.
Starting point is 00:14:07 You have to say how your system has novelty, how your system is better than all the other systems out there and worthy of publication. There are people that review your papers and tell you if they think that your system has
Starting point is 00:14:24 met those goals. Yeah, yeah, yeah. That's really interesting. Could you describe, you know, who's the audience on the other end, right? You know, in industry you're trying to get someone to buy your technology, and it's similar, but how's the audience different, and what are the different audiences in the PhD world for what you do? Yeah, so this varies a lot by discipline. In the machine learning and the systems disciplines, when you submit your papers, they typically go through a review process where they're assigned to three or four or five other professors or even graduate
Starting point is 00:15:12 students who are hopefully versed in the research in the area that the paper is purported to be on. Whether that is true or not... You know, there's double-blind review, where you don't know who your reviewers are and reviewers don't know who you are. There's single-blind review, where you don't know who your reviewers are and reviewers know exactly who you are. Now, I'm not saying one is better than the other or whatever. And there's open review, where everybody knows, you know, who the counterparty is. So the academic review process is this huge thing
Starting point is 00:15:51 that, you know, people have been experimenting with over the years. But yeah, like, recently there are problems because in all the disciplines, there's been a huge influx of papers. Like if you look at the number of papers submitted to these conferences over the past, like, 10, 20 years, it's just been growing exponentially. So there's a huge strain on the review process.
Starting point is 00:16:12 And as a result, a lot of my peers, for example, in machine learning, might just get shitty reviewers for their papers. For example, a master's student could be reviewing a professor's work. And they just post reviews that are completely incoherent, even at top conferences like NeurIPS or whatever. Now that is obviously a downside, but I mean, nobody has figured out how to do better than this kind of review system. So I guess there are a lot of plus sides to the review system as well. Yep.
Starting point is 00:16:43 That makes sense. Now, how does the audience that you need to sell to, how much does that influence where you choose to focus your study? Or do you still feel like you have a lot of freedom to just pursue what you're interested in? Absolutely. In academia, there's this culture of novelty. Like, you're absolutely trying to do something novel that people have not done before. So I think this is good, you know, because maybe the point of academia is to do that. But it also limits the kind of work that people can do, right? For example, if you look at work like Polars, for example... Well, so, Polars, for example,
Starting point is 00:17:28 would not be a good academic project. It would be very hard to publish that anywhere because it's not really, like, using novel ideas. You know, it's rewriting Pandas in Rust. I mean, it's obviously awesome and very powerful, but when you say that, it's like, okay, that's what it is. That's exactly what the reviewer is going to say to reject this paper. So, you know, it kind of limits the scope of the projects that people in academia can do, and that could be very limiting at times. But otherwise, yeah, it does encourage, like, very risky ideas that might not have a good practical implementation at this moment, but, you know, somebody at Redshift or Snowflake might read this paper and be like, hey, I know exactly how to use this, and actually, you know, lead to a
Starting point is 00:18:19 significant impact in, like, other places, right? Yeah, yeah. Just out of curiosity, I know you've written several papers. How long does it take you to write a paper that you feel great about submitting for review? A long time. Writing a paper is a time-consuming process. Yeah.
Starting point is 00:18:45 So like a month or like nine months? Like, at least a week of intensive writing. Well, I mean, hopefully the work you put into writing the benchmarks or writing your actual system should take more than that. That might take a few months to a year. Yeah. Writing the paper, I mean, I think I spent too little time writing my papers. But yeah, people are gonna... like, you can never spend more
Starting point is 00:19:14 time writing your papers. And if you think about it, that's actually a weird perspective, right? Because you're spending all this time on a presentation or whatever when you should actually just be, like, maybe writing more unit tests to make sure your system works beyond the five cases written in the... But, you know, it's all a trade-off. And I mean, the proportion of time, you know, that you have to spend on sales versus engineering in real organizations, you know, people can make similar arguments, right?
Starting point is 00:19:40 Yeah, that makes total sense. I'm interested to know, and I know Kostas has a bunch of questions on the technical side, but, you know, as you pursued your research throughout the PhD program, have there been any surprising discoveries that you've made that you weren't expecting? An aha moment that you had during your... That's a much better way to put it. Thank you, Kostas. An aha moment. That's why I'm here. I just had
Starting point is 00:20:11 the guy you know, like the hammer. Yeah, I mean, that's also another thing with doing applied research versus theoretical research, right? When you're doing proofs or whatever, back when I was doing those, like there's definitely aha moments where you're like, oh yeah, I could just prove it using this way or that way. But I think when doing applied research, a lot of the things are a sequence of smaller things.
Starting point is 00:20:44 So like you can kind of see the project in your head. You can kind of see where it's going and you'll have a pretty good understanding of what is going to come out at the end. And you're incrementally improving your intermediate steps so that you can get to the end. For example, I'll give you an example,
Starting point is 00:21:04 which is when I'm working on, like, full-text indexing for Apache Iceberg or for logs or whatever, for Parquet files for logs, I definitely had some ideas at the beginning of how you could use this specific kind of index to speed up, like, substring queries on, like, terabytes of Parquet files or whatever, and have the index be only like one percent or 0.1 percent of all the file size. But then the index has problems, like maybe slow access time or whatever. And then gradually you start to look more and more at your index structure, and then it just becomes kind of obvious what you should do
Starting point is 00:21:46 once you have spent long enough looking at the algorithm. So I would say it's rarely like an aha moment, because it's like once you've looked long enough at the problem, everything just becomes kind of straightforward. And then it becomes kind of hard to present that in a paper or something, because then it's just straightforward. That solution makes total sense. Yeah, so I think an art of selling papers that I definitely have not yet mastered is how to present such, you know, maybe straightforward things in retrospect in an exciting fashion that, you know,
Starting point is 00:22:30 caters to people who have not thought a lot about this problem. Yeah, yeah, that's super interesting. All right, Kostas, I have to hand the mic over. I'm just going to keep asking questions. That's okay. I think the conversation is super, super interesting,
Starting point is 00:22:45 to be honest. So tell me, okay, let's talk a little bit about what you're doing now and let's start with Stanford. You're part of a lab there, I guess. There is a structure in academia, right? So tell us a little bit more about that.
Starting point is 00:23:01 What's the goal of, let's say, your team there or the lab? And how do you think in that? Like, what's the goal of, let's say, your team there or like the lab? And how do you think in that? The academic labs are run very differently. It really depends on the professor. Like some professors are very hands-on and some professors are very hands-off. I have a very hands-off professor, fortunately, so he gives me great freedom in what I can do in my projects. And I know other professors who might even write code for a student's project or, um, tell the student exactly what to do in his projects. So my professor's not like that. So yeah, like, in my lab different people
Starting point is 00:23:42 might be working on different things that they find interesting with different industry partners, potentially. Some people in my lab are working with NVIDIA. I work with maybe some other industry partners that are trying to use my stuff. Yeah, so really it's driven by you, like what projects you're interested in. Yeah. Is there always a connection with the industry out there? No. You don't have to. You don't have to work on something
Starting point is 00:24:10 that's going to be useful to industry. So what is the value that the industry brings to you as someone who's doing academic work? Well, it kind of helps you ground it in real problems. You might have this awesome idea of how you can do something, but people don't really
Starting point is 00:24:33 care about this. And it's hard to justify why I have to go through all the motions of writing this paper if the system I'm going to build is not useful. Yeah, that makes sense. So, okay, tell us a little bit more about what you are doing now. I mean, you said you were at MIT, you were more into hardware. Somehow you ended up doing research around data processing or data storage. You'll tell us more about that.
Starting point is 00:24:58 But first of all, how did you make this decision to move from hardware to getting into, let's say, more of what we can do with hardware when we have it already? Yeah, so as I mentioned in a pretty long ramble earlier, I think it's hard to do needle-moving work in hardware. And it's easier to build real systems that can provide real value to actual people if you're doing some kind of software research. But why data? That's funny. It's a very broad term, actually. So there are many things when we say data.
Starting point is 00:25:40 But why what you're doing now compared to doing, I don't know, like training models or doing AI or doing, I don't know, whatever else. So I used to work on like speeding up natural language processing models or whatever. First year of my PhD program, I actually took a leave and tried to do a startup. And I talked to, you know, hundreds of like potential customers or directors of machine learning,
Starting point is 00:26:06 data science. I'm like, hey, I can make a TensorFlow model 5-10% faster by speeding up matrix multiply. So I had some code that beat Intel MKL, which is Intel's way of multiplying matrices by
Starting point is 00:26:21 5-10% on the matrix sizes that I was extremely proud of as an academic achievement. But then I talked to these guys and they're like, yeah, you know, the slowest part of us
Starting point is 00:26:33 doing inferences is like getting this metadata from DynamoDB. So that takes 200 milliseconds or something like that. Whereas this matrix multiply in the TensorFlow model, it takes 200 microseconds to do this, right? So this was a really eye-opening experience
Starting point is 00:26:54 and also kind of, like, really forced me to try to talk to potential customers to understand use cases before I start working on research projects today. And, well, of course, that was, you know, kind of the starting point of why I wanted to go into data. And I was like, yeah, there are, like, a lot of inefficiencies in how people process data and stuff like that, and I gradually found the data field more and more interesting.
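The arithmetic behind that realization can be sketched roughly as follows. This is only an illustration: it assumes a request consists of nothing but the 200 ms metadata fetch and the 200 microsecond matrix multiply quoted above, and it reads "10% faster" as a 10% reduction in multiply time. The function name is hypothetical.

```python
def end_to_end_saving(fetch_s, matmul_s, matmul_speedup):
    """Fraction of total request latency saved when only the matmul gets faster."""
    before = fetch_s + matmul_s
    after = fetch_s + matmul_s * (1.0 - matmul_speedup)
    return (before - after) / before

# Assumed figures: 200 ms fetch, 200 us matmul, 10% matmul improvement.
saving = end_to_end_saving(fetch_s=200e-3, matmul_s=200e-6, matmul_speedup=0.10)
print(f"end-to-end latency saved: {saving:.4%}")
```

Under these assumptions the saving is on the order of one hundredth of a percent of the request, which is why the metadata lookup, not the multiply, was the part worth optimizing.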
Starting point is 00:27:21 And so that's where I spend most of my time today. Yeah. So, okay. We found an aha moment here, I think, right, Eric? I guess. Yes. Okay. So tell us more about what you're doing today, right?
Starting point is 00:27:37 What's your focus in your research? Well, I'm mostly focused on, I work on time-series data management. So I work on trying to build the, like, you know, so take a step back. Maybe it's like, so I think for business data and customer data and generic data management, people are moving to Parquet files, Delta Lake, Iceberg, whatever. They work really well. And then you are able to build all kinds of differentiated applications
Starting point is 00:28:05 and dashboards on top of the same data layer. Now, in time-series data management, that is still not the case. People are using Prometheus with its own scaling solution, like Loki, or Elasticsearch with their own UltraWarm or cold tier, or whatever you call it, to spill to S3. And then there may be some other completely different system to manage their traces. And so, so I just think that,
Starting point is 00:28:31 you know, we could probably make, like, Apache Iceberg and Delta Lake work for these time series monitoring use cases and store metrics and logs at high scale and still be able to do the things that Elasticsearch can do. Now, there's a lot of promising recent projects like Quickwit, for example, that claim huge performance benefits over Elasticsearch, right?
Starting point is 00:28:53 But the problem is it's still its own storage format, right? I really want to be able to store logs and metrics in Parquet files in Apache Iceberg and still be able to empower the use cases that people might want to do in Prometheus and Elasticsearch. Okay. And why is there this divergence between these two data-related, let's say, problems, right? Why did we end up having
Starting point is 00:29:22 systems that are in a way so different between the two, right? Like the Parquet world on one side with the OLAP systems there, and then we have all the time series systems like Prometheus and the rest that you talked about. So why did we end up in this reality? I think it's because, first of all, the Parquet world cannot efficiently support the use cases for
Starting point is 00:29:49 Prometheus and Elasticsearch. For example, if you store all your logs in Parquet files, and you try to do a substring query or some kind of text search, there's no other way than to scan all your logs and start doing regex in Spark or some kind of text search, there's no other way than to scan all your logs
Starting point is 00:30:05 and start doing regex in Spark or whatever. And that is horribly inefficient compared to like Elasticsearch where there is an inverted index that can answer this question in milliseconds. Now for Prometheus, I think it's more of an issue of data modeling. So in Prometheus, you have the notion of time series
Starting point is 00:30:29 and time series are tagged. And if you try to store those in Parquet files, it's not clear how you can do that to have this Prometheus data model translate over to the tabular data model in the Parquet world. What would be the columns? And what would the columns be clustered by? And how to get the kind of performance
Starting point is 00:30:49 that Prometheus can have? And of course, this is not just talking about data models and querying. There's also this big component of real-time capability. Prometheus and Elasticsearch were first invented and probably still used largely as real-time systems where real-time ingested data can be used in real-time.
Starting point is 00:31:10 Now, then how does this translate over to the Parquet world? Maybe you have some ClickHouse instance that is running and then that spills to Iceberg or Delta for longer-term storage or something like that. But I do believe that there's definitely got to be a bridge there.
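To make the contrast between scanning logs and consulting an inverted index concrete, here is a minimal sketch. It is not how Elasticsearch or any Parquet engine is implemented: the log lines, function names, and tokenization rule are all made up for illustration, and the index shown answers whole-token lookups rather than arbitrary substring queries.

```python
import re
from collections import defaultdict

# Hypothetical log lines standing in for rows of a Parquet log table.
logs = [
    "ERROR disk full on node-7",
    "INFO request served in 12ms",
    "ERROR timeout talking to node-3",
]

def scan(lines, needle):
    """Full scan: every line is inspected, like a regex filter over all files."""
    return [i for i, line in enumerate(lines) if needle in line]

def build_inverted_index(lines):
    """Map each lowercase token to the sorted ids of the lines that contain it."""
    index = defaultdict(set)
    for i, line in enumerate(lines):
        for token in re.findall(r"[a-z0-9]+", line.lower()):
            index[token].add(i)
    return {token: sorted(ids) for token, ids in index.items()}

index = build_inverted_index(logs)
print(scan(logs, "timeout"))       # work grows with the number of log lines
print(index.get("timeout", []))    # a single dictionary lookup
```

The scan cost grows with the volume of logs, while the index lookup cost depends mainly on how many lines match, at the price of building and storing the index.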
Starting point is 00:31:28 So you should be able to do things like run SQL across all your business data as well as your telemetry data and be able to join those sources and try to debug your issues or things like that. Okay, let's talk a little bit more about the data modeling part, what you mentioned. So, how is data modeled in the time series world? And why is this
Starting point is 00:31:53 different compared to what you do in an OLAP system with tabular data? So, you think about the data modeling, right? So Prometheus really has a system where there are time series chunks, and they are tagged by string tags. And you should think about your data as these chunks with tags, and you can quickly access a particular chunk. Now, if you think about translating in a tabular world,
Starting point is 00:32:27 you could think about maybe I have a couple of columns, right? One column would be timestamp. Another column would be the tag. And another column would be the value. So you could do it like this. But then what should you sort your tables by? Maybe if you sort your tables by the timestamp, then you would have good ingest performance
Starting point is 00:32:47 because the new data would just be appends. But then, you know, quickly retrieving all the data corresponding to a particular tag would be very slow. So then maybe you should sort your data based on tags, right? But then ingest becomes a problem because like your new data
Starting point is 00:33:05 ends up as super small files spread over a bunch of different partitions. So what are you going to do? I mean, ultimately, I think that, you know,
Starting point is 00:33:16 the Prometheus data model could be implemented on top of Parquet files. And in fact, I've done that as part of my research project and internships
Starting point is 00:33:24 and whatever. And I do believe that it's possible to do this. It takes a particularly good tabular data model and maybe some external indices. Yeah. Okay. That's interesting. So how does Prometheus solve that? Is it because of how... is it the storage problem at the end?
Starting point is 00:33:43 Like how you store the data on your storage? Or is it the lack of indexing, let's say, in the OLAP world? Because, okay, traditionally in OLAP-like systems you can think of, let's say, partitioning or bucketing and stuff like that as a lightweight version of an index, maybe, because you consider what the workload looks like and try to change the layout to make it faster. But we don't traditionally have indexes
Starting point is 00:34:13 like in other systems, right? So from your point of view, what's causing the problem here? So I think Prometheus is an integrated system. It integrates the real-time part, how it gets the real-time data and separates it out into these chunks and whatever, and then it can write these chunks
Starting point is 00:34:37 to its backend storage. But in the Parquet world, you've got to start piecing together different systems and things like that. That's the first thing. And second, since you talked about indexing, it's very interesting because typically these tags are high cardinality. So databases like M3DB might... There's a great talk by Rob from M3DB, I think, that talks about the kind of inverted indices and the FSTs, finite state transducers, that they build on these tags to quickly allow retrieval of a particular tag.
Starting point is 00:35:18 So this is the problem. If you have a billion Kubernetes pod names that are your tags, how do you actually quickly look up where a particular tag and its corresponding chunks are stored? Integrated systems like M3DB could have an inverted index that,
Starting point is 00:35:39 similar to Elasticsearch, can tell you exactly where the time chunks for a particular key are stored. Whereas in a Parquet file, if you have a column with a billion potential values, even if they're clustered together, it's pretty hard upfront, with no external indices, to figure out which Parquet file your tags are located in without scanning all the headers and footers of all the Parquet files in your data lake or whatever. And well, if your data happens to be sorted by time and this tag is actually
Starting point is 00:36:14 scattered across all of the tables, then you can forget about doing this efficiently. But it's not the end of the world. You can definitely build indices on top of these Parquet files that can perform similar functions to what an inverted index in M3DB could do to speed up this process, right? Which is actually along the lines of what I'm doing right now for my research. Okay, so from what I understand, correct me if I'm wrong here, the solution that you see there is not going and fundamentally changing the format itself, of course, but implementing indexing there to actually bring Parquet
Starting point is 00:37:08 closer to what, let's say, these heavily indexed systems like Elasticsearch do, right? Is this correct? Yeah. Yeah. That's interesting. So, okay. I have a question. There's actually a lot of conversation lately out there about Parquet, let's say, showing its age, right? Parquet was created like 2008, 2009, I don't remember exactly, but it was at least 10 years ago, right? Very different use cases out there. I mean, obviously, the format was designed primarily for traditional OLAP data warehousing use cases, which have very different latency
Starting point is 00:37:57 requirements, right? And even the hardware was so different back then and all that stuff. So people, especially, I think this conversation is also driven by the needs in ML use cases, have started talking about the need for, let's say, upgrading, updating, or maybe substituting Parquet. And there are companies out there that have built stuff. You have Meta that has this Alpha format, I think it's called. Then you have all the work from Google, if I remember correctly, with the Procella system that's from YouTube, where there is a lot of stuff about how we can complement
Starting point is 00:38:38 or change the way that we store data compared to Parquet. And of course, there are also other systems out there right now, like LanceDB, for example, right? They have their own format, trying to better accommodate, let's say, the use cases around ML. So how does the stuff you think about fit in this world, where the industry, in a way, is pushing for new formats?
Starting point is 00:39:06 They actually want to go pretty low, let's say, in the stack and go to the storage layer and rethink, let's say, the format that we are using there. Let me ask you a very simple question. CSVs are horrible in terms of efficiency. But yet yesterday, I downloaded data from a GitHub repo from Alibaba and the data format was in CSV. Do the Alibaba people know better data formats? Of course. But I mean, do they expect their users to? It's a question, right? No, I hear you. And actually, to be honest, I find your answer extremely interesting for someone who's coming from a PhD, to be honest,
Starting point is 00:39:54 because your approach is much more pragmatic and product-oriented than research-oriented. And I 100% agree, CSV is there and it's not going away anytime soon. We will still struggle with it. So I get what you are saying. And I think it makes total sense. It is important for building, let's say, a system that you can take out there in the market, right? And actually deliver value. But how can you defend that in research? How do you publish a paper on that?
Starting point is 00:40:31 Because going back to the conversation we had at the beginning, right? About the novelty. Well, I mean, hopefully the novelty around my research would be around these external indices, which are definitely not in the Parquet format,
Starting point is 00:40:48 and how they can speed up these queries on Parquet files. Yeah. And Parquet is actually not that bad. If you know how to use it properly, Parquet gives you huge flexibility in how you can define your data. For example, if you want random access into a column, people think it's impossible, but
Starting point is 00:41:08 you can just keep the column unencoded and then you can just random access the bytes. You can change the row group sizes to efficiently retrieve smaller chunks of your data. You can change the number of columns you're going to put in the table, along with the row group size, to tune the file size. You can change the encodings of the columns. You can even use custom encoding algorithms to encode your columns
Starting point is 00:41:39 before you put them into Parquet. There are so many things that you can do to Parquet that can improve its performance. Now, this is a question of whether all these things are supported by higher-level frameworks like Iceberg or Delta
Starting point is 00:41:53 that have a very opinionated way in how you should be managing these Parquet files for these OLAP workloads. If anything, I think Iceberg and Delta should be more flexible in allowing people to tune their own ways of using their Parquet files,
Starting point is 00:42:08 rather than us changing the Parquet file format itself, is what I think. Okay, I love that. I really like that. Okay, so let's talk a little bit more about the indexing that you are talking about here, right?
Starting point is 00:42:19 Like the part of your research. So when you are talking about external indices, right? What do they look like, and what are the use cases? Because you can index for many different reasons, and with many different algorithms and all that stuff there, but what are you trying to do here with these indices? Yeah, simple. So Postgres has all these kinds of indices that allow a regular Postgres database to do wonderful things. Like you can have a JSONB column
Starting point is 00:42:45 and build a GIN, a generalized inverted index, on it. And then you can suddenly do JSON path match and keyword search and all kinds of amazing things, right? So I look at Parquet the same way, you know, Parquet is your data and instead of Postgres pages or whatever,
Starting point is 00:43:00 you've got Parquet pages or whatever. So you should be able to build indices on these Parquet files, which do not have to look like Parquet files, that you can efficiently access at query time and that tell you what Parquet pages to read to get your data. Yeah.
Starting point is 00:43:15 So a higher level of that would be what row groups to fetch to get your data. For example, if you've got a column in a Parquet file that's composed of log messages, you should be able to build a text index on that column that is a lot smaller than the column itself in terms of storage footprint.
Starting point is 00:43:34 It still supports efficient access from S3, but will quickly tell you which row groups, across all the Parquet files in your data lake, contain the keyword that you're searching for. Similarly, you should be able to build an index on a JSON type in your Parquet. Like, if you keep your JSON as a string in Parquet, you should be
Starting point is 00:43:53 able to build an index on that. For example, it allows you to do Snowflake VARIANT-type querying. Yeah, without Snowflake. Right. So, you should be able to do all these things, but you just can't today. And that's what I'm working on.
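The keyword-to-row-group idea described here can be pictured with a small Python sketch. This is only an illustration under stated assumptions, not the index from Tony's research: the file names, row-group numbers, log messages, and the `build_keyword_index` and `lookup` helpers are all hypothetical, and a real index would sit in object storage in a compact format rather than in an in-memory dict.

```python
from collections import defaultdict

# Hypothetical layout: each (parquet_file, row_group) holds a few log messages.
ROW_GROUPS = {
    ("logs-001.parquet", 0): ["disk full on node-a", "retrying upload"],
    ("logs-001.parquet", 1): ["upload ok", "heartbeat"],
    ("logs-002.parquet", 0): ["timeout talking to node-b", "disk latency high"],
}

def build_keyword_index(row_groups):
    """Map each lowercase token to the set of (file, row_group) containing it."""
    index = defaultdict(set)
    for location, messages in row_groups.items():
        for message in messages:
            for token in message.lower().split():
                index[token].add(location)
    return index

def lookup(index, keyword):
    """Return the sorted (file, row_group) pairs worth fetching for a keyword."""
    return sorted(index.get(keyword.lower(), set()))

index = build_keyword_index(ROW_GROUPS)
print(lookup(index, "disk"))       # two row groups, one in each file
print(lookup(index, "heartbeat"))  # a single row group
```

With an index like this, a keyword search only has to fetch the listed row groups instead of scanning the log column of every file.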
Starting point is 00:44:09 And how do you... Okay, so let's say you have the storage there that remains intact with Parquet, and then you bring this new layer on top of it where you can create these indices and all the associated metadata to access the data very efficiently. How do you connect that then with the higher level in the stack, right?
Starting point is 00:44:32 Like with the query engine itself. As you said, people are already using stuff, and they shouldn't move away from that stuff. And I agree with you. Exactly. Likewise also with the query engines, right? So you might have something like Trino there, or you might have something like Spark, or you might have something, I don't know, like whatever.
Starting point is 00:44:50 How do you expose these indices in a way that can be, let's say, exploited by these query engines at the end without having to rewrite the query engine? You don't have to rewrite a query engine. Of course you could rewrite a query engine and integrate these indices, but you don't have to. So this is interesting because a lot of query engines already look at some metadata
Starting point is 00:45:15 before they even do your query. For example, Athena or BigQuery will tell you how much they think this query is going to cost you before you execute it. So in the same way, you know, the query engine could query this index and rewrite your query in a good way and dramatically reduce the cost.
Starting point is 00:45:33 For example, if you have your text-based inverted index, and you're using Spark or Trino as your query engine, you could query the index first to translate your text query, which would require the query engine to read the entire text column, into a very selective predicate on maybe the timestamp. So instead of running a query that's like SELECT * WHERE
Starting point is 00:45:56 log ILIKE '*ARN 12345*', you run a query that's like SELECT * WHERE timestamp BETWEEN x AND y. And that's a very small range. And that's provided by the index. Okay.
Starting point is 00:46:13 So you keep your query engine. You just rewrite your query. Sure, but someone has to rewrite the query, right? Yes. This would be part of a client library for the index. Okay. Okay. Okay. And then someone has to integrate that as part of the optimizer, for example, of Trino to do that? No. So it does not change the Trino optimizer because it just translates a very expensive predicate into a very cheap predicate. And the Trino optimizer already knows how to do predicate pushdown and all that stuff with very selective, timestamp-based filtering.
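The client-side rewrite discussed above could look roughly like the following Python sketch. Everything in it is an assumption made for illustration: the `KEYWORD_TIME_RANGES` lookup table, the `rewrite_query` helper, and the `logs`, `ts`, and `log` names are hypothetical, and the SQL is built by plain string formatting, which a real client library would replace with proper parameter handling.

```python
from datetime import datetime

# Hypothetical index output: keyword -> (earliest, latest) timestamp of matching rows.
KEYWORD_TIME_RANGES = {
    "arn:12345": (datetime(2024, 2, 1, 10, 0, 0), datetime(2024, 2, 1, 10, 5, 0)),
}

def rewrite_query(table, keyword, index):
    """Add a selective timestamp range to a substring search when the index knows the keyword."""
    slow = f"SELECT * FROM {table} WHERE log LIKE '%{keyword}%'"
    bounds = index.get(keyword.lower())
    if bounds is None:
        # The index has no entry, so fall back to the original full scan.
        return slow
    start, end = bounds
    return (
        f"SELECT * FROM {table} "
        f"WHERE ts BETWEEN TIMESTAMP '{start:%Y-%m-%d %H:%M:%S}' "
        f"AND TIMESTAMP '{end:%Y-%m-%d %H:%M:%S}' "
        f"AND log LIKE '%{keyword}%'"
    )

print(rewrite_query("logs", "ARN:12345", KEYWORD_TIME_RANGES))
print(rewrite_query("logs", "unknown", KEYWORD_TIME_RANGES))
```

One design choice in this sketch differs slightly from the spoken example: it keeps the original text predicate alongside the timestamp range, because the range may also cover rows that do not contain the keyword. The engine's existing predicate pushdown can then prune on `ts` before evaluating the expensive match.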
Starting point is 00:46:49 Okay. Sounds good. We are close to the end here. I think we could be talking for a couple more hours and I think we should do it in the future. I think we have a lot to talk about here. But you also have an open source project out there, Quokka, right? Tell us a little bit more about Quokka, what it is, and why people would be interested in it. So I started Quokka more as trying to bring fault tolerance to a streaming-based query engine like Trino. It's actually a query engine where I wrote
Starting point is 00:47:26 the logical and physical plan optimizers in Python. This is faster than Spark on EMR by two or three times, but I hit some bottlenecks in trying to actually support SQL. It is very hard today to support SQL in your query engine. And there are a lot of efforts out there to do that.
Starting point is 00:47:52 And in fact, I think just the other day, somebody was trying to propose a generic plug-in SQL logical planner and optimizer based on DataFusion, which would be good if it works. But yeah, I am trying to integrate some of my newer research into Quokka, and hopefully Quokka can be the first query engine that's natively
Starting point is 00:48:12 integrated with these indices built on Parquet files. Okay, that's awesome. Eric, I'll give the microphone back to you, because we can keep talking forever here. But I think I should give you a little bit more time to ask any questions that you might have. Yeah, well, I think we're right at the end.
Starting point is 00:48:29 Actually, one thing I've been thinking about throughout this whole conversation, Tony, is what are you interested in doing after you're done with your PhD? I mean, you're obviously on the applied side. So have you thought much about that? Yeah, I think I might be interested in doing a startup if I can figure out what to do. It's kind of hard to do a startup these days. It's always been hard for them.
Starting point is 00:48:58 But yeah. Or I might work someplace. There are a lot of very cool companies today working on these newer observability tools and things like that. So, yeah. Very cool. Well, if you end up starting a company or when you go work at a company, we'd love to have you back on to tell us about finishing the PhD and going into industry.
Starting point is 00:49:25 Yeah, I'm mostly focused on trying to graduate right now. Heads down. All right. Well, Tony, it's been such a good show.
Starting point is 00:49:33 We learned so much and good luck on selling to your audience here in the final stretch. Yes. Yes. Well,
Starting point is 00:49:41 I've already made the sale, so I'll know whether they decided to buy or not. Right, yes, it's the buy. I'm sure that you understand how difficult that is. Yeah, it's, yeah, closing the deal. Awesome. Well, best of luck and keep us posted.
Starting point is 00:50:00 All right. Thank you very much for your time. We hope you enjoyed this episode of the Data Stack Show. Be sure very much for your time. datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
