The Data Stack Show - 179: Time Series Data Management and Data Modeling with Tony Wang of Stanford University
Episode Date: February 28, 2024

Highlights from this week's conversation include:

Tony's background and research focus (3:35)
Challenges in academia and industry (6:15)
Ph.D. student's routine (10:47)
Academic paper review process (15:26)
Aha moments in research (20:05)
Academic lab structure (23:09)
The decision to move from hardware to data research (24:43)
Research focus on time series data management (27:40)
Data modeling in time series and OLAP systems (32:01)
Issues and potential solutions for the Parquet format (37:32)
Role of external indices in Parquet files (42:19)
Tony's open source project (47:11)
Final thoughts and takeaways (49:30)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by RudderStack, the CDP for developers.
You can learn more at rudderstack.com.
We have Tony Wang on the Data Stack Show today. Tony, we have a lot to talk
about, both academia, the data industry, different kinds of selling, and some cool data stuff in
general. But we'll start where we always do. Give us an overview of your background.
Yeah, I'm Tony. I'm a PhD student at Stanford University. One of the few people today still
studying data systems and databases. Before that, I was at MIT for four years,
studying mostly electrical engineering and hardware engineering. And before that,
I came to the U.S. from China when I was 16 and went to a private boarding school up in New
Hampshire. I love to ski and I bike a lot, so California has been pretty good for both. It's
one of the rare areas where you can drive four or five hours, hopefully only, and ski, and also
have, you know, decent weather year-round when you're not skiing.
Yeah.
And it's a great place for database research too.
So you kind of get all, you check all of your boxes.
Parts of California are great for database research to be sure.
Yes.
Yeah.
And that's like one of the reasons that I'm really excited to have Tony here today.
Eric, like I think it's the first time that we have someone who
is actually pursuing a PhD.
We have many people who have successfully done their PhDs and started companies,
but someone who's like in the process of the PhD, I think it's the first time.
So I'm super excited to talk about what it feels like to do that, do research.
And learn, of course, what's, let's say, the state of the art right now,
what academia is interested in.
And most importantly, what's the connection between that and the industry
out there, because there is a continuum, right?
Like the things that happen in universities, especially in stuff like databases, have
an impact out there on the systems that we build tomorrow.
So super excited to chat about that.
What about you, Tony?
What would you like to talk about today?
Sure.
I can talk about that. I can also talk about the stuff I'm working on and
my thoughts
on different data
processing systems, and what I hope
will become more popular in the future
from a technical
perspective. Although I know that
a lot of data products are also driven
by other aspects
as well that I
have less insight into.
Yep, sounds good.
So what do you think, Eric?
Should we go and do it?
Let's do it. I can't wait.
All right, well, this is a really exciting episode for us
because you are in the midst of doing a PhD,
getting your PhD. And I don't think we've had anyone
on the show who's like actively in a PhD program. And so we want to learn all about it.
You're doing some really interesting research on data systems.
So let's just start there. Can you tell us what is your main area of study and focus?
Because you're close to the end, right? Are you finalizing your thesis?
I would hope so. Yeah. So I mostly work on data processing systems, mostly around
cloud data processing, around quickly processing data in data lakes
that people use today, like Apache Iceberg or Delta Lake, or even just the
buckets of Parquet files, which is unfortunately still way too common.
Yeah.
Yeah.
We had someone on the show recently, and the discussion was like,
when are we going to move on from Parquet?
So how did you decide that you
wanted to do a PhD? I mean, data lakes are obviously very popular in industry
and very widely used, but it's not every day that you meet someone who's actually
studying those at a PhD level. So how did you end up going down that track?
Well, how I started to do a PhD,
and why I ended up at Stanford in particular:
back then I was trying to decide what I was going to do after college,
because at college, I mostly worked on hardware,
like Verilog and FPGAs and GPUs, low-level CUDA programming.
So after that, I applied to some jobs like NVIDIA
and then decided, well, maybe I'll pursue hardware research
at Stanford University where some of the best hardware research
is being done.
Yeah, I turned down my job offer at NVIDIA.
In retrospect, maybe that was a poor financial decision.
I was going to say,
if they were offering options,
you know,
hindsight's 20-20.
You know,
I got my offer
in like March 2020
when the stock
was at like the lowest point
from COVID.
So I took as much money
as I personally had
and I bought NVIDIA stock
and then I decided
to just like
go do my PhD program.
Yeah, but halfway into my first year PhD program,
I realized that,
why am I in academia doing this stuff?
And I look back at people I talked to at NVIDIA,
I realized that NVIDIA is just going to dominate
the hardware industry,
and the cool stuff in hardware is being done in industry.
I think it's very hard for people in academia to be able to move the needle in the state of the art in the hardware industry.
Oh, interesting.
Can you describe why that is a little bit?
I mean, just to make sure I'm seeing this right: you knew from your work studying hardware that NVIDIA was going to be the big
player in the market, and that's
what's happening?
I won't name anybody
but I talked to people at NVIDIA
I talked to people at AMD, I talked to people at Intel
people at NVIDIA
they were truly excited to
be there
It was a level of excitement that I could not discern
from people at the other companies.
And NVIDIA is like a software-based company,
right?
It's really hard for a hardware company to actually get a software-driven
culture, because at other companies,
maybe the company was started by hardware engineers.
The founders are hardware engineers.
So those people get more say,
and software maybe gets neglected
and looked upon as something that's easy and not real.
But at NVIDIA, I think it's really incredible
how the leadership team is able to foster a culture
where five out of six engineers are software engineers
and build this amazing software stack.
And that's what really dooms the academic hardware projects,
because there are many aspects.
One is that your project just cannot possibly
tape out at the very competitive nodes today,
like 5 or 7 nanometers or whatever.
So you might just miss a lot of...
You might have an amazing design that works at
28 nanometers or whatever, but you might
miss problems that would occur if you were trying to do it at a more competitive
technology node. And the other aspect
is the software, right?
Like you could build some hardware,
but, you know, to get people to use it,
there's like a long way between Python code and your hardware.
Now, of course, there's definitely value
in academic research, right?
In designing new, like, hardware designs
and stuff like that.
That might inspire people in industry
to pursue certain architectural decisions, but I was
more on the side of trying to do something that can actually be used.
And that, unfortunately, is not where I found
I should be focusing my time.
Yeah.
Yeah.
Interesting.
And so it sounds like, and maybe I'm drawing the wrong conclusion here, but if academia is not really driving innovation on the hardware side, but it does sound like on, for example, data processing systems, that there is a lot of innovation being driven in academia, and that's why you pursued that path? It's funny because data processing systems,
there's also very entrenched
players like Snowflake and Oracle,
but every now
but every now and then... the barrier
to entry of building a system that's
actually useful, I think, is a bit lower.
You see cool academic
projects like DuckDB, for example,
taking huge traction
in the data industry.
And that really started as an academic project, right?
They didn't have the resources of, say, AWS Redshift
or Snowflake or something.
It's just a couple of guys, you know, in the Netherlands.
And there's this project called Polars,
which is like a Rust-based rewrite of Pandas.
And that really started with one guy.
It's like how one person,
with maybe tens of thousands of hours of coding,
could really try to displace
one of the most popular data analytics libraries
out there,
Pandas, right?
So that's a testament to, you know,
if you're dedicated enough,
you can really move the needle in what people use in the real world.
Yeah, yeah.
Okay, this is so interesting.
Kostas, I have a million questions,
so I promise I won't steal the mic for the entire episode.
But Tony, what does a typical week look like for you?
And I know that's a difficult question
because it probably changes,
but I know that some of our audience
certainly have done a lot of post-secondary study,
but a lot of them probably don't.
And so we don't really know what it's like
to be a PhD student studying data systems.
So can you just give us a glimpse into your role?
So I'm very much on the applied side.
And I know that people on the theoretical side, their days actually are a bit different.
And I wouldn't actually say it's that different from working at a regular job, because you
show up and you try to program.
Well, it's maybe a bit easier because you have fewer meetings
and
code reviews. Because,
yeah,
there are no code reviews.
You can write whatever
you want, but whatever you write has to work.
No one's asking you if you wrote your unit test.
Yeah, I mean, most academic
projects only work on the five
benchmarks they write in their paper and nothing
else.
You just have to kind of get your code
into that state, but if you actually want your code
to be, I guess, used elsewhere, it has
to go beyond that, but that's typically
not inside of the purview of academia.
Yeah, that makes sense.
Now, you mentioned you're on the applied side, but there are also people, your peers who
are working more on the theoretical side, which sounds like a spectrum, but you're both
sort of, say, studying data systems.
Can you describe that spectrum to us?
What does it look like to be more on the theoretical side?
So on the theoretical side... well,
when I say theoretical side, people who are actually there might think they're more on
the applied side. Again, it's all a matter of perspective. There are people at Berkeley,
for example, working on distributed programming paradigms,
like the Hydroflow project that tries to revolutionize how you do cloud programming and stuff like that.
So this kind of paradigm shifting theoretical work, I would say.
And then, yeah, they would probably spend more time working around programming languages and designing language specifications, doing some proofs, maybe to make sure that things work.
The last time I did a proof was in my real analysis class at MIT.
So that was seven years ago.
Yeah.
Yeah. Yeah.
Now, one thing that really struck me when we were chatting before we hit record was that I kind of had made this assumption that being in industry, for example, trying to run a data infrastructure company, you know, is like wildly different than what you do.
And your response was, well, you know, not really.
I still have to do a lot of sales.
As a PhD student, can you explain that concept to us?
It was just so interesting to hear you talk about that.
Okay. So a lot of a PhD student's time
is spent writing
and reviewing
and rebutting papers
or whatever,
and trying to change
your writing
or your pitch.
And then most people
will tell you that
writing an academic journal publication
is like telling a story,
which is not too different
from what a lot of salespeople
have told me about doing sales.
You have to say
how your system has novelty,
how your
system is better than all the other
systems out there and worthy of publication.
There are people that review
your papers and tell you if
they think that your system has
hit those goals.
Yeah, that's really interesting. Could you describe, you know,
who's the audience on the other end? In industry, you're trying to get someone to
buy your technology, and it's similar, but how is the audience different? What are the different
audiences in the PhD world for what you do?
Yeah, so this varies a lot by discipline.
In the machine learning and systems disciplines, when you submit your
papers, they typically go through a review
process where they're assigned to three or four or five other professors or even graduate
students who are hopefully versed in the research area that the paper is purported to
be on.
Whether that is true or not... You know, there's double-blind review, where you don't know who your reviewers are and reviewers
don't know who you are. There's single-blind review, where you don't know who your reviewers are but
the reviewers know exactly who you are. Now, I'm not saying one is better than the other or whatever.
And there's open review, where everybody knows who the counterparty is.
So the academic review process is this huge thing
that, you know,
people have been experimenting with over the years.
But yeah,
recently there are problems, because in all the disciplines,
there's been a huge influx of papers.
Like if you look at the number of submitted papers to these conferences
over the past 10 or 20 years, it's just been growing exponentially.
So there's a huge strain on the review process.
And as a result, a lot of my peers, for example, in machine learning, might just get shitty reviewers for their papers.
For example, a master's student could be reviewing a professor's work.
And they just post reviews that are completely incoherent,
even at top conferences like NeurIPS or whatever.
Now that is obviously a downside, but I mean, nobody has figured out how to do better than
this kind of review system.
So I guess there are a lot of plus sides to the review system as well.
Yep.
That makes sense. Now, how does the audience that you need to sell to, how much does that influence where you choose to focus your study?
Or do you still feel like you have a lot of freedom to just pursue what you're interested in?
Absolutely.
In academia, there's
this culture of novelty. You're absolutely trying to do
something novel that people have not done before. So I think this is good, you know, because maybe
the point of academia is to do that, but it also limits the kind of work that people can do, right?
For example, Polars
would not be a good academic project. It would be very hard to publish that anywhere, because
it's not really using novel ideas. It's rewriting
Pandas in Rust. I mean, it's obviously awesome and very powerful, but when you say that, okay, that's what it is,
that's exactly what the reviewer is going to say to reject this paper. So, you know, it kind of limits
the scope of the projects that people in academia can do, and that can be very limiting at times.
But otherwise, it does encourage very risky ideas that might not have a good practical
implementation at this moment, but, you know, somebody at Redshift or Snowflake might read
this paper and be like, hey, I know exactly how to use this, and it actually leads to a
significant impact in other places, right?
Yeah, yeah.
Just out of curiosity,
I know you've written several papers.
How long does it take you to write a paper that you feel great about submitting for review?
A long time.
Writing a paper is a time-consuming process.
Yeah.
So like a month or like nine months?
Like at least like a week of intensive writing.
Well, I mean, hopefully the work you put into writing the benchmarks
or writing your actual system should take more than that.
That might take a few months to a year.
Yeah.
Writing the paper, I mean, I think I spent too
little time writing my papers, but people will tell you that you can never spend too much
time writing your papers. And if you think about it, that's actually a weird perspective, right? Because
you're spending all this time on the presentation or whatever, when you should actually just be, like, maybe writing more unit tests
to make sure your system works
beyond the five cases written in the paper.
But, you know, it's all a trade-off.
And I mean, about the proportion of sales
you have to do versus engineering in organizations,
you know, people can make similar arguments, right?
Yeah, that makes total sense. I'm interested to know, and I know Kostas has a bunch of questions on
the technical side, but as you pursued your research throughout the PhD program, have
there been any surprising discoveries that you've made that you weren't expecting? An aha moment
that you had during your...
That's a much better way to put it. Thank you, Kostas. An aha
moment.
That's why I'm here.
Yeah, I mean, that's also another thing with doing applied
research versus theoretical research, right? When you're doing proofs or whatever,
back when I was doing those,
there are definitely aha moments where you're like, oh yeah,
I could just prove it using this way or that way.
But I think when doing applied research,
a lot of the things are a sequence of smaller things.
So like you can kind of see the project in your head.
You can kind of see where it's going
and you'll have a pretty good understanding
of what is going to come out at the end.
And you're incrementally improving
your intermediate steps
so that you can get to the end.
I'll give you an example,
which is when I was working
on full-text indexing for Apache Iceberg, or for
Parquet files for logs. I definitely had some ideas at the
beginning of how you could use this specific kind of index to
speed up substring queries on terabytes of Parquet files or whatever, and have
the index be only like one percent or 0.1 percent of the total file size. But then the index has
problems, like maybe slow access time or whatever. And then gradually you start to
look more and more at your index structure, and then it just becomes kind of obvious what you should do
once you have spent long enough looking at the algorithm. So I would say it's rarely like an
aha moment, because once you've looked long enough at the problem, everything just becomes
kind of straightforward. And then it becomes kind of hard to present that in a paper or something, because
then it's just straightforward. The solution makes total sense.
Yeah, so I think there's an art to selling papers that I have definitely not yet mastered:
how to present such, you know, maybe straightforward-in-retrospect things in an
exciting fashion
that, you know,
caters to people
who have not
thought a lot about this problem.
Yeah, yeah, that's super interesting. All right, Kostas,
I have to hand the mic over, otherwise I'm just going to keep
asking questions.
That's okay. I think the conversation is super, super interesting,
to be honest.
So tell me, okay,
let's talk a little bit about
what you're doing now
and let's start with Stanford.
You're part of a lab there, I guess.
There is a structure in academia, right?
So tell us a little bit more about that.
What's the goal of,
let's say, your team there,
or the lab? And how do you fit in that?
Academic labs are run very differently. It really depends on the professor.
Like some professors are very hands-on and some professors are very hands-off.
I have a very hands-off professor, fortunately, so he gives me great freedom in what I can do in my projects. And I know other professors
who might even write code for a student's project, or tell the student exactly what to do
in their projects. So my professor's not like that. So yeah, in my lab, different people
might be working on different things that they find interesting, with different industry partners, potentially.
Some people in my lab are working with NVIDIA.
I work with maybe some other industry partners that are trying to use my stuff.
Yeah, so really it's driven by you, like what projects you're interested in.
Yeah. Is there always a connection with the industry
out there? No.
You don't have to.
You don't have to work on something
that's going to be useful to industry.
So what is
the value that the industry brings to
you as someone who's doing
academic work?
Well, it kind of
helps you ground it in real problems.
You might have this awesome idea of how you can do something, but people don't really
care about this.
And it's hard to justify why I have to go through all the motions of writing this paper
if the system I'm going to build is not useful.
Yeah, that makes sense.
So, okay, tell us a little bit more about what you are doing now.
I mean, you said you were at MIT, you were more into hardware.
Somehow you ended up doing research around data processing or data storage.
You'll tell us more about that.
But first of all, how did you make this decision to move from hardware
to getting into, let's say, more of what we can do with hardware when we have it already?
Yeah, so as I mentioned in a pretty long ramble earlier, I think it's hard to do needle-moving work in hardware.
And it's easier to build real systems that can provide real value to actual people if you're doing some kind of software research.
But why data?
That's funny.
It's a very broad term, actually.
So there are many things when we say data.
But why what you're doing now, compared to doing, I don't know, training models or doing AI or
whatever else?
or whatever.
First year of my PhD program, I actually took a leave
and tried to do a startup.
And I talked to, you know, hundreds of like potential customers
or directors of machine learning,
data science. I'm like, hey, I can make
a TensorFlow model
5-10% faster
by speeding up matrix multiplication.
by speeding up matrix multiplier.
So I had some code that
beat Intel MKL, which is
Intel's way of multiplying matrices
by
5-10%
on the matrix sizes
that I was extremely proud of
as an academic achievement.
But then I talked to these guys
and they're like,
yeah, you know,
the slowest part of us
doing inferences
is like getting this metadata
from DynamoDB.
So that takes 200 milliseconds
or something like that.
Whereas this matrix multiply in the TensorFlow model
takes 200 microseconds, right?
So this was a really eye-opening experience,
and it also really forced me
to try to talk to potential customers
to understand use cases
before I start working on research projects today.
That was, you know, kind of
the starting point for wanting to go into data. I was like, yeah,
there are a lot of inefficiencies in how people process data and stuff like
that, and I gradually found the data field more and more interesting.
And so that's where I spend most of my time today.
Yeah.
So, okay.
We found an aha moment here, I think, right, Eric?
I guess.
Yes.
Okay.
So tell us more about what you're doing today, right?
What's your focus in your research?
Well, I'm mostly focused on,
I work on time-series data management.
So, let me take a step back.
I think for business data and customer data and generic data management,
people are moving to Parquet files, Delta Lake, Iceberg, whatever.
They work really well.
And then you are able to build all kinds of differentiated applications
and dashboards on top of the same data layer.
Now, in time-series data management, that is still not the case.
People are using Prometheus with its own scaling solution,
or Loki, or Elasticsearch with their own UltraWarm or cold tier,
or whatever you call it, to spill to S3.
And then there may be some other completely different system to manage their traces.
And so
I just think that,
you know,
we could probably make Apache Iceberg and Delta Lake work for these
time-series monitoring use cases, store metrics and logs at high scale,
and still be able to do the things that Elasticsearch can do.
Now, there's a lot of promising recent projects
like Quickwit, for example,
that claim huge performance benefits
over Elasticsearch, right?
But the problem is
it's still its own storage format, right?
I really want to be able to store
logs and metrics in Parquet files
in Apache Iceberg
and still be able to empower the use cases that
people might want to do in Prometheus and Elasticsearch.
Okay. And why is there this
divergence between these two data-related, let's say, problems? Why did we end up having
systems that are in a way so different between the two, right?
Like the Parquet world on one side with the OLAP systems there,
and then we have all the time-series systems like Prometheus
and the rest that you talked about.
So why did we end up in this reality?
I think it's because, first of all, the Parquet world
cannot efficiently support
the use cases for
Prometheus and Elasticsearch.
For example, if you store all your logs
in Parquet
files,
and you try to do a substring
query or some kind of text
search, there's no other
way than to scan all your logs
and start doing regex in Spark or whatever.
And that is horribly inefficient
compared to like Elasticsearch
where there is an inverted index
that can answer this question in milliseconds.
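To make the contrast concrete, here is a minimal sketch in Python of that brute-force scan, using pyarrow. The file paths and the log_line column name are hypothetical placeholders, not anything from the episode.

import re
import pyarrow.parquet as pq

pattern = re.compile("ARN 12345")
matches = []
# With no index, every file and every row has to be touched.
for path in ["logs/part-000.parquet", "logs/part-001.parquet"]:
    table = pq.read_table(path, columns=["log_line"])
    for line in table.column("log_line").to_pylist():
        if pattern.search(line):
            matches.append(line)
# The cost grows with total log volume; an inverted index answers the same
# question in milliseconds because it never reads non-matching rows.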
Now for Prometheus,
I think it's more of an issue of data modeling.
So in Prometheus, you have the notion of time series
and time series is tagged.
And if you try to store those in Parquet files,
it's not clear how you can do that
to have this Prometheus data model
translate over to the tabular data model in the Parquet world.
What would be the columns?
And how would the columns be clustered?
And how do you get the kind of performance
that Prometheus can have?
And of course, this is not just talking about data models
and querying.
There's also this big component of real-time capability.
Prometheus and Elasticsearch were first invented and probably
still used largely as real-time systems, where freshly
ingested data can be used
in real time.
Now, then
how does this translate over to the
Parquet world? Maybe you have some
ClickHouse instance that is running
and then that spills to Iceberg
or Delta for longer-term
storage or something like that.
But I do believe that there's definitely got to be a bridge there.
So you should be able to do things like run SQL across all your business data
as well as your telemetry data and be able to join those sources
and try to debug your issues or things like that.
Okay, let's talk a little bit more about the data modeling part,
what you mentioned. So,
how is data modeled in the time series
world?
And why is this
different compared to what you do
in an OLAP
system with tabular data?
So,
think about
the data modeling, right?
Prometheus really has a system where there are time series chunks, and they are tagged by string tags.
And you should think about your data as these chunks with tags, and you can quickly access a particular chunk. Now, if you think about translating that into a tabular world,
you could think about maybe I have a couple of columns, right?
One column would be timestamp.
Another column would be the tag.
And another column would be the value.
So you could do it like this.
But then what should you sort your tables by?
Maybe if you sort your tables by the timestamp,
then you would have good ingest performance
because the new data would just be appends.
But then, you know, quickly retrieving
all the data corresponding to a particular tag
would be very slow.
So then maybe you should sort your data
based on tags, right?
But then ingest becomes a problem,
because your new data
gets written as super small files
over a bunch
of different partitions.
So what are you going to do?
I mean,
ultimately,
I think that,
you know,
the Prometheus data model
could be implemented
on top of Parquet files.
And in fact,
I've done that
as part of my
research projects
and internships and whatever.
And I do believe that it's possible to do this,
with a particularly good tabular data model and maybe some external indices in addition.
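For a concrete picture of that sort-order trade-off, here is a minimal sketch in Python with pyarrow, using the timestamp/tag/value schema from the discussion above. The toy values and file names are hypothetical.

import pyarrow as pa
import pyarrow.parquet as pq

# One row per sample, Prometheus-style: a tagged time series of values.
rows = pa.table({
    "timestamp": pa.array([1700000000, 1700000000, 1700000015, 1700000015], pa.int64()),
    "tag":       ["pod=api-7f9c", "pod=db-0", "pod=api-7f9c", "pod=db-0"],
    "value":     [0.42, 0.11, 0.45, 0.12],
})

# Sorted by timestamp: ingest is cheap (new batches just append at the end),
# but reading one tag's series touches every row group.
pq.write_table(rows.sort_by("timestamp"), "ts_sorted.parquet")

# Sorted by tag: one tag's series is clustered and cheap to read, but each
# ingest batch now scatters small writes across many tag ranges.
pq.write_table(rows.sort_by([("tag", "ascending"), ("timestamp", "ascending")]),
               "tag_sorted.parquet")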
Yeah.
Okay.
That's interesting.
So how does Prometheus solve that?
Is it a storage problem at the end,
like how you store the data on your storage?
Or is it the lack of indexing, let's say, in the OLAP world?
Because, okay, traditionally in OLAP systems,
you can think of, let's say, partitioning or bucketing and stuff like
that as a lightweight version of an index, maybe, because you consider what the workload looks like
and try to change the layout to make it faster.
But we don't have traditional indexes
like in other systems, right?
So, from your point of view,
what's causing the problem here?
So I think Prometheus is like an integrated system.
And it integrates the real-time part
of how it gets the real-time data,
separates them out into these chunks and whatever,
and then it can write these chunks
to its backing storage.
But in the Parquet world,
you've got to start piecing together different systems and things like that.
That's the first thing.
And second, since you talked about indexing, it's very interesting because typically these tags are high cardinality.
So databases like M3DB might... There's a great talk by Rob from M3DB, I think, that talks about the kind of inverted
indexes and the FSTs, like finite state transducers, that they build on these tags to quickly allow
retrieval of a particular tag.
So this is the problem.
If you have a billion Kubernetes pod names that are your tags, how do you actually quickly look up
where a particular tag and its
corresponding chunks are stored?
In
integrated systems like M3DB,
they can have
an inverted index that,
similar to a full-text search, can tell you
exactly where the time chunks
for a particular key are stored.
Whereas in a Parquet file, if you have a column with a billion potential values, even
if they're clustered together, it's pretty hard upfront, with no external
indices, to figure out which Parquet file your tags are located in without scanning the
headers and footers of all the Parquet files in your data lake or whatever.
And if your data happens to be sorted by time and this tag is actually
scattered across all of the tables, then you can forget about doing this efficiently.
But I guess it's not the end of the world. You can definitely build indices on top of these Parquet files
that can perform similar functionality to what an inverted index
in M3DB could do to speed up this process, right?
Which is actually a lot along the lines of what I'm doing right now
for my research.
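As an illustration of that idea, here is a minimal sketch of an external tag index: a small side file mapping each high-cardinality tag to the Parquet files and row groups that hold its chunks, so a lookup never touches Parquet footers. The names and the JSON layout are hypothetical stand-ins; a real index over a billion tags would use a compressed structure like the FSTs mentioned above, but the access pattern is the same.

import json

# tag -> list of [parquet file, row group] locations, built at ingest time.
tag_index = {
    "pod=api-7f9c": [["metrics/part-000.parquet", 0],
                     ["metrics/part-017.parquet", 3]],
    "pod=db-0":     [["metrics/part-002.parquet", 1]],
}
with open("metrics.tag.idx.json", "w") as f:
    json.dump(tag_index, f)

# Query time: one small index read replaces scanning every file's footer.
with open("metrics.tag.idx.json") as f:
    locations = json.load(f)["pod=db-0"]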
Okay, so from what I understand, correct me if I'm wrong here,
the solution that you see is not going and fundamentally changing the format itself, but building indices around it to actually bring Parquet
closer to what, let's say, these heavily indexed systems like Elasticsearch do, right? Is this
correct?
Yeah. Yeah.
That's interesting. So, okay. I have a question that it's actually, there's like a lot of conversation lately
out there about Parquet, let's say, showing its age, right?
Like Parquet was created like 2008, 2009, I don't remember exactly, but it was like
at least 10 years ago, right?
Very different use cases out there. I mean, obviously, the format was designed primarily
for traditional OLAP data warehousing use cases, which have very different latency
requirements, right? And even the hardware was so different back then and all that stuff.
So people, especially driven by this conversation,
and I think also driven by the needs of ML use cases, have started talking about the need of,
let's say, upgrading, updating, or maybe substituting Parquet. And there are companies out there
that have built stuff. You have Meta that has the Alpha format, I think it's called.
Then you have all the work from Google, if I remember correctly,
with the Procella system, which is kind of from YouTube,
where there is a lot of stuff there about how we can complement
or change the way that we store data compared to Parquet.
And of course, there are also other systems out there,
like LanceDB right now, for example, right?
They have their own format,
trying to accommodate more, let's say,
the use cases around ML.
So how does the stuff you think about fit in this world,
where the industry, in a way, is pushing for new formats?
They actually want to go pretty low, let's say, in the stack and go to the storage layer and
rethink, let's say, the format that we are using there. Let me ask you a very simple question.
CSVs are horrible in terms of efficiency. But just yesterday, I downloaded data
from a GitHub repo from Alibaba, and the data format was CSV. Do the Alibaba people know about
better data formats? Of course. But do they expect their users to? That's the question, right?
No, I hear you.
And actually, to be honest,
I find your answer extremely interesting for someone who's coming from a PhD,
because your approach is much more pragmatic
and product-oriented than research-oriented.
And I 100% agree, CSV is there
and it's not going away anytime soon. We will still
struggle with it. So I get what you are saying, and I think it makes total sense.
It is important for building, let's say, a system that you can take out there to the market, right? And actually deliver value.
But how can you defend that in research?
How do you publish a paper on that?
Because going back to the conversation
we had at the beginning, right?
About like the novelty.
Well, I mean, hopefully the novelty
around my research would be around
these external indices,
which are definitely
not in Parquet format,
and how they can speed up
these queries on Parquet files.
Yeah. And Parquet is actually
not that bad. If you know how to use it
properly, Parquet gives you huge
flexibility in how you can define your data.
For example, if you want random
access into a column, people think it's impossible, but
you can just keep the column unencoded and then you can just random-access the bytes.
You can change the row group sizes to efficiently retrieve smaller chunks of your data.
You can change the number of columns you're going to put in the table
alongside the row group size
to tune the file size.
You can change the encodings of the columns.
You can even use custom encoding algorithms
to encode your columns
before you put them to Parquet.
There's so many things
that you can do to Parquet
that can improve
its performance. Now, this is a question of whether
all these things are supported
by higher-level frameworks like
Iceberg or Delta
that have
a very opinionated way in how you
should be managing these Parquet files
for these OLAP workloads.
If anything, I think Iceberg and Delta
should be more flexible in allowing people
to tune their own ways
of using their Parquet files,
rather than changing
the Parquet file format itself,
is what I think.
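For one concrete example of those knobs, here is a minimal sketch with pyarrow. The column names and sizes are made up; the writer options shown (row group size, per-column encodings, per-column compression) are the kinds of settings being described.

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "timestamp": pa.array(range(1_000_000), pa.int64()),
    "value":     pa.array([0.0] * 1_000_000, pa.float64()),
})

pq.write_table(
    table,
    "tuned.parquet",
    # Smaller row groups allow finer-grained range reads, e.g. from S3.
    row_group_size=64_000,
    # Per-column encodings require dictionary encoding to be disabled.
    use_dictionary=False,
    column_encoding={
        "timestamp": "DELTA_BINARY_PACKED",  # compact for sorted integers
        "value": "PLAIN",                    # leave values unencoded
    },
    # Leaving 'value' uncompressed keeps fixed-width values at predictable
    # byte offsets within a page, which is what enables random access.
    compression={"timestamp": "zstd", "value": "none"},
)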
Okay, I love that.
I really like that.
Okay, so let's talk a little bit more
about the indexing
that you are talking about here, right?
Like the part of your research.
So when you are talking about
external indices,
what do they look like,
and what are the use cases? Because you can index for many different reasons,
and with many different algorithms and all that stuff there, but what are you trying to do here
with these indices? Yeah, simple. So Postgres has all these kind of indices that allow a regular
Postgres database to do wonderful things. Like you can have a JSONB column type
and build a GIN,
a generalized inverted index, on it.
And then you can suddenly do JSON path matching
and keyword search
and all kinds of amazing things, right?
So I look at Parquet the same way.
You know, Parquet is your data,
and instead of Postgres pages or whatever,
you've got Parquet pages or whatever.
So you should be able to build indices
on these Parquet files,
which do not have to look like Parquet files,
that you can efficiently access at
query time and that tell you what Parquet
pages to read to get your data.
Yeah.
So a higher level
of that would be like what row groups to fetch
to get your data. For example, if you've got
a column in a Parquet
file that's composed of log messages,
you should be able to build a text index
on that column that is a lot smaller than the column itself
in terms of storage footprint.
It still supports efficient access from S3,
but will quickly tell you which row groups
in all the Parquet files in your data lake
contain the keyword that you're searching for.
Similarly, you should be able
to build an index on a JSON type
in your Parquet. Like, if you keep your
JSON as a string in Parquet, you should be
able to build an index on that.
For example, it allows you to do, like, Snowflake
variant-type querying.
Yeah, without Snowflake.
Right. So,
you should be able to do all these things,
but you just can't today.
And that's what I'm working on.
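To make the shape of such an index concrete, here is a minimal sketch of a keyword index built outside the Parquet files, mapping each token to (file, row group) postings. The naive tokenizer, file names, and JSON persistence are hypothetical simplifications of what a real text index would use.

import json
from collections import defaultdict
import pyarrow.parquet as pq

def build_keyword_index(paths, column="log_line"):
    # token -> set of (file, row group) postings.
    postings = defaultdict(set)
    for path in paths:
        pf = pq.ParquetFile(path)
        for rg in range(pf.num_row_groups):
            chunk = pf.read_row_group(rg, columns=[column])
            for line in chunk.column(column).to_pylist():
                for token in line.split():  # naive whitespace tokenizer
                    postings[token].add((path, rg))
    return {tok: sorted(locs) for tok, locs in postings.items()}

index = build_keyword_index(["logs/part-000.parquet", "logs/part-001.parquet"])
with open("logs.keyword.idx.json", "w") as f:
    json.dump(index, f)

# At query time, a search for "ERROR" reads only the row groups listed in
# index["ERROR"] instead of scanning every file in the data lake.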
And how do you...
Okay, so let's say you have the storage there
that remains intact with Parquet,
and then you bring this new layer on top of it
where you can create these indices
and all the associated metadata
to access the data very efficiently.
How do you connect that then with the higher levels of the stack, right?
Like with the query engine itself. As you said, people are already using stuff, and they
shouldn't have to move away from that stuff.
And I agree with you.
Exactly.
Likewise also with the query engines, right?
So you might have something like Trino there,
or you might have something like Spark,
or you might have something, I don't know, like whatever.
How do you expose these indices in a way that can be,
let's say, exploited by these query engines at the end,
without having to rewrite the query engine?
You don't have to rewrite a query engine.
Of course, you could rewrite a query engine and integrate these indices, but you don't have to.
So this is interesting, because a lot of query engines
already look at some metadata
before they even execute your query.
For example, Athena or BigQuery will tell you
how much they think this query is going to cost you
before you execute it.
So in the same way, you know,
the query engine could query this index
and rewrite your query in a good way
and dramatically reduce the cost.
For example, if you have your text-based inverted index,
and then you're using Spark or Trino as your query engine,
you could query the index first to translate your text query,
which would otherwise require the query engine to read the entire
text column, into a very
selective predicate on maybe the timestamp.
So instead of running
a query that's like SELECT * WHERE
log LIKE '%ARN 12345%',
you run
a query that's like SELECT * WHERE
timestamp BETWEEN x AND y.
And that's a very small range.
And that's provided by the index.
Okay.
So you keep your query engine.
You just rewrite your query.
Sure, but someone has to rewrite the query, right?
Yes.
This would be like a part of a client library for the index.
Okay. And then someone has to integrate that as part of the optimizer,
for example, of Trino, to do that?
No. So it does not change the Trino
optimizer, because it just translates a very expensive predicate into a very cheap predicate.
And the Trino optimizer already knows how to do predicate pushdown and all that stuff for a very selective, timestamp-based filter.
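Putting those pieces together, here is a minimal sketch of the rewrite flow just described. The lookup_time_range helper stands in for the index client library; it, and the table and column names, are hypothetical.

def lookup_time_range(keyword):
    # Hypothetical stand-in for the index client: a real implementation would
    # derive narrow timestamp ranges from the keyword's postings.
    return 1700000000, 1700000300

def rewrite(keyword):
    # Expensive original, forcing a full read of the log column:
    #   SELECT * FROM logs WHERE log_line LIKE '%<keyword>%'
    lo, hi = lookup_time_range(keyword)
    # Cheap rewrite: the engine already knows how to push the timestamp
    # predicate down, and the LIKE now runs on the few surviving rows.
    return (f"SELECT * FROM logs "
            f"WHERE timestamp BETWEEN {lo} AND {hi} "
            f"AND log_line LIKE '%{keyword}%'")

print(rewrite("ARN 12345"))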
Okay. Sounds good. We are close to the end here. I think we could be talking for a couple of more
hours and I think we should do it in the future. I think we have a lot to talk about here.
But you also have an open source project out there like Quokka, right?
Tell us a little bit more about Quokka, what is it, and why people would be interested
in it.
So I started Quokka as more of trying to bring fault tolerance to a streaming-based
query engine like Trino. It's actually a query engine
where I wrote
the logical and physical
plan optimizers in Python.
It's faster than Spark on
EMR by two or three times, but
I hit some bottlenecks in trying
to actually support SQL.
It is very hard today to support SQL in your query engine.
And there are a lot of efforts out there to do that.
And in fact, I think just the other day,
somebody was proposing a generic pluggable SQL logical plan
and optimizer based on DataFusion,
which would be good if it works.
But yeah, I am trying to
integrate some of my newer research
into Quokka, and hopefully Quokka can
be the first query engine that's natively
integrated with these indices built on
Parquet files.
Okay, that's awesome. Eric,
I'll give the microphone back to you,
because we can keep talking forever here.
But I think I should
give you a little bit more time to ask any questions that you might have.
Yeah, well, I think we're right at the end.
Actually, one thing I've been thinking about throughout this whole conversation, Tony, is what are you interested in doing after you're done with your PhD?
I mean, you're obviously on the applied side.
So have you thought much about that? Yeah, I think I
might be interested in
doing a startup if I can figure out what to do.
It's kind of hard
to do a startup these days.
It's always been hard, though.
But yeah.
Or I might work
someplace. There are a lot of
very cool companies today
working on these newer observability tools and things like that. So yeah.
Very cool. Well, if you end up
starting a company, or when you go work at a company, we'd love to have you back on
to tell us about finishing the PhD and going into industry.
Yeah,
I'm mostly focused
on trying to graduate
right now.
Heads down.
All right.
Well, Tony,
it's been such a good show.
We learned so much
and good luck
on selling
to your audience
here in the final stretch.
Yes.
Yes.
Well,
I've already made the sale,
so I'll know
whether they decided to buy or not.
Right, yes, it's the buy.
I'm sure that you understand how difficult that is.
Yeah, closing the deal.
Awesome.
Well, best of luck and keep us posted.
All right.
Thank you very much for your time.
We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe and visit us at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers.
Learn how to build a CDP on your data warehouse
at rudderstack.com.