Disseminate: The Computer Science Research Podcast - High Impact in Databases with... Aditya Parameswaran
Episode Date: October 21, 2024. In this High Impact episode we talk to Aditya Parameswaran about some of his most impactful work. Aditya is an Associate Professor at the University of California, Berkeley. Tune in to hear Aditya's story! The podcast is proudly sponsored by Pometry, the developers behind Raphtory, the open source temporal graph analytics engine for Python and Rust.
Links: EPIC Data Lab; Answering Queries using Humans, Algorithms and Databases (CIDR'11); Potter's Wheel: An Interactive Data Cleaning System (VLDB'01); Online Aggregation (SIGMOD'97); Polaris: A System for Query, Analysis and Visualization of Multi-dimensional Relational Databases (INFOVIS'00); Coping with Rejection; Ponder
You can find Aditya on: Twitter, LinkedIn, Google Scholar
Hosted on Acast. See acast.com/privacy for more information.
Transcript
Hello and welcome to Disseminate the Computer Science Research Podcast, Jack here.
Today we have another installment of our high-impact series and the podcast is brought to you by Pometry.
Pometry are the developers behind Raphtory, the open-source temporal graph analytics engine for Python and Rust.
Raphtory supports time travelling, multilayer graphs, and comes out of the box with advanced analytics like community evolution,
dynamic scoring, and temporal motif mining.
It is blazingly fast, scales to hundreds of millions of edges on your laptop,
and connects directly to all your data science tooling,
including Pandas, PyG, and Langchain.
Go check out what the Pometry guys are doing at www.raphtory.com,
where you can dive into their tutorial for the new 0.8.0 release.
In this episode we're going to be talking to Aditya Parameswaran. Aditya is an associate
professor at the University of California, Berkeley, where he co-directs the EPIC Data Lab.
This is a lab that is targeted at low/no-code data tooling with a special emphasis
on social justice applications.
Aditya also was, until its recent acquisition by Snowflake, the president of Ponder, a company that he founded with his students based on popular data science tools developed
at Berkeley. And, describing his research interests, Aditya develops
human-centered tools for scalable data science, and he does this by synthesizing techniques from data systems and human-computer interaction.
Aditya, welcome to the show.
Thanks for having me, Jack. It's a pleasure to be here.
The pleasure's all ours.
So for our new listeners, any new listeners we've got today,
the High Impact series was inspired by a blog post on the most influential database papers by Ryan Marcus. And Aditya holds,
I guess, the number one spot for 2011 with his CIDR paper. Let me get the title correct here.
It is Answering Queries Using Humans, Algorithms, and Databases. But before we do get into
that, it's customary on the podcast to start off with your
story, your journey and get into kind of know more about you rather than the kind of the highlight
reel that I read out at the start of the show. So yeah, I guess in your own words, Aditya,
kind of what has been your journey so far? Tell us more about your career.
Sure. So yeah, just to give you a little bit of background.
I did my undergrad in India at one of the Indian Institutes of Technology.
Then in 2007, I started as a PhD student at Stanford.
Actually, at the time, I wasn't even sure what I wanted to work on.
I just wanted to do computer science and I wanted to do research. So I started my PhD at Stanford.
That led to this paper in 2011 at CIDR on crowdsourcing, following which in 2013, I started as a postdoc at MIT
for a year, following which I ended up at the University of Illinois, Urbana-Champaign,
two hours south of Chicago. And then in 2019, I moved back to the Bay Area and started as faculty
at UC Berkeley. So that's my journey. Somewhere along the way, once I moved back to the Bay Area,
we started a company that got acquired by Snowflake last year. So yeah, originally,
when I started as a PhD student, I wasn't sure I wanted
to do databases. I sampled a bunch of different topics. I worked a little bit on recommender
systems, information extraction. And then in about 2010, I started getting fascinated by
crowdsourcing, which is what led to this CIDR 2011 paper.
Awesome. Just going back right to the very beginning, you said that you knew you wanted to go into research and you knew you wanted it to be in computer science. What was the original motivation? Even when you were, say, a 10-year-old or 15-year-old, did you always know you wanted to pursue academia as a career path, or what was the attraction?
Okay, so going back to when I was a kid, I was fascinated by programming. There was this time, when I was maybe 10 to 12 years old, we had a computer science class, and we were doing programming in GW-BASIC, so that dates me a little bit. And so we had a
project to complete, which was sort of like a, let's call it a semester long project back then.
So we used to do it after hours. It was like, you spend time in the computer lab from
three to five working on this project. And I got my project done really quickly, in a matter of, let's say,
a month. Then for the remaining two months, I would go to the computer lab, helping out others
finish their projects. And the gnarlier the bugs, the gnarlier the logic, the more fun I had
disentangling their code bases and figuring out what they wanted to actually do. So that led
to my fascination with computer science broadly. And all that I knew I
wanted to do was do computer science, do programming, and just to raise the level of
difficulty a bit, right? So that led me to pursue an undergrad degree in computer science,
one of the IITs. And then I was like, okay, I could go become a software engineer at Google.
And I had an offer right out of undergrad to do that.
But I was like, okay, let me raise the level of difficulty a little bit.
I want to solve hard problems.
And then I applied for a PhD program and got into Stanford.
So I was very happy about that.
I originally applied in program analysis.
So that was my area of interest when I applied for a PhD program.
But what I realized after having spent some time doing program analysis
is that most of the problems there are undecidable.
And I was like, I don't want to work on stuff that is so hard.
I want it to be hard, but not so hard that it's undecidable.
So somewhere between easy and undecidable. Turn the difficulty down a little bit.
Yeah, yeah, yeah.
A little bit lower than undecidable.
Yes.
So that's when I chanced upon databases
and data management broadly defined.
The thing that drew me to databases
and data management is, well, yes, there's a classical field of database systems.
And yes, there are a lot of people
who work on classical problems in database systems.
But in some sense, it gives you a little bit of a blank canvas to do stuff that spans theory and systems in all areas that touch data management.
So anything that doesn't even look like a database system.
This paper from CIDR 2011 certainly doesn't look like a database
system, but it still falls under the banner of data management and the database systems field,
which I think is really nice for the research community. It's pretty welcoming. It's pretty
open-minded. So I found my way to that community.
Nice. Yeah. It's got something for everyone, right? Which is a good community to be part of, for sure. Cool, so let's talk about Answering Queries Using Humans, Algorithms, and Databases then. So cast your mind back to 2011, give us the elevator pitch, and tell us about the background of this work and the motivation that went into it.
Yeah. So in 2011, or 2010, when I was in search of a thesis topic, I had worked on recommendation systems for a little bit and
information extraction, but I wasn't really, my heart wasn't in either of those topics. And I
really wanted to do something different. And then along came crowdsourcing. So this was,
thanks to the advent of platforms like Mechanical Turk, where you could pay people, like, a dollar to do tasks for you, be it labeling data, ranking search results, content moderation, you name it, right? So you have the power of humans at your fingertips to do various sorts
of data processing tasks. And while the HCI community, the human computer interaction
community started looking at crowdsourcing and figuring out, okay, how do you best harness the
crowds from an interface standpoint? I felt that there was a natural opportunity here to think of
things from a data processing standpoint, especially given that crowds are
error prone, they take a long time to answer questions. So how do you figure out how to
best do broader tasks using humans? If you wanted to filter a bunch of items, if you wanted
to sort a bunch of items, if you wanted to cluster a bunch of items, think your standard relational
primitives, and then figure out how to best orchestrate that
with the crowd, given certain objectives, be it cost, accuracy, latency. So once we started
framing that vision, around that same time, Alkis Polyzotis, who is my co-author on the paper,
was visiting Stanford. And it turns out my advisor at the time, Hector Garcia-Molina, was already involved with a
different set of people, a different student. So he's like, if you want to collaborate with
Alkis, go for it. And so I ended up collaborating with Alkis and we fleshed out this paper together.
It was my first real paper where I felt like I was driving it end to end and Alkis treated me like a peer to his credit, which is very
generous of him because he was already, I think, a tenured faculty member at UC Santa Cruz.
And so we had a blast writing this paper. I think that even though I'm framing this as,
how do you best harness a crowd for data processing? The vision of the paper goes
beyond that. We were claiming that
this would be not just crowds, but also machine learning primitives. So you could imagine that
you could, that's where the algorithms in the title comes from. So you can imagine,
instead of asking the crowd, you could ask a machine learning model to label some data or
process some data in some way. So that was the story behind how the paper came to be.
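To make the crowd-as-data-operator idea concrete, here is a minimal sketch in Python of one of the simplest strategies: asking several error-prone workers the same yes/no filtering question and taking a majority vote. The ask_worker function is a hypothetical stand-in for posting a task to a platform like Mechanical Turk; it is not from the paper, just an illustration of the cost-versus-accuracy trade-off discussed above.

import random
from collections import Counter

def ask_worker(item, accuracy=0.7):
    # Hypothetical stand-in for posting a yes/no task to a crowd platform.
    # Here we simulate a worker who answers correctly with some probability.
    truth = item["label"]
    return truth if random.random() < accuracy else (not truth)

def crowd_filter(item, num_votes=5):
    # Ask an odd number of workers and take the majority vote.
    # More votes cost more money but reduce the chance of a wrong answer.
    votes = [ask_worker(item) for _ in range(num_votes)]
    return Counter(votes).most_common(1)[0][0]

items = [{"id": i, "label": random.random() < 0.5} for i in range(10)]
kept = [it for it in items if crowd_filter(it)]
print(f"{len(kept)} of {len(items)} items passed the crowd filter")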
So what impact has the paper had over the years? Have you followed its progress and seen kind of
the citations and been like, oh, it's inspired this, it's inspired that?
Yeah. So I think the crowdsourcing work in the database community had quite a bit of an impact in terms of other folks trying to figure things out,
once we set out that vision of saying, hey, how do you take the crowd and harness it for data processing tasks?
And to be fair, and giving credit where credit's due, there were a couple of other groups who came up with a similar vision at around the same time. There was CrowdDB from Berkeley, as well as Qurk, Q-U-R-K, from MIT. And those two systems happened at roughly the same time as ours.
So, I mean, citations get allocated the way they are. It's a little random, but they were,
I would say they are concurrent work to ours. And since then, there's been a bunch of different
work trying to, how do you fine tune
this, right? How do you crank up the wheel and make it so that you get as much mileage for your
buck, right? And so I would say hundreds, if not thousands of papers have been written by the
database community on how to best harness a crowd. Now, some of this has gone in more esoteric
directions, perhaps more niche directions than I would like.
But I think the basic ideas around thinking about how do you account for mistakes made by crowd workers and how do you assemble kind of error-proof pipelines with crowds, I think, is an effect of all of the work that we did. And I think perhaps even more relevant is
two different kind of ways this line of work influenced practice. I think one way this line
of work influenced practice is this large kind of content moderation and tagging and labeling shops that a lot of the large internet companies do
as a matter of just gathering training data.
And this they used to do as early as 2012 onwards.
So the Facebooks and Googles of the world
were spending millions on crowdsourcing,
and our approaches would have helped them save, let's say, 20% to 50% of those costs, right?
Which is not an insubstantial amount at the time. Now, though, with large language models being
even more hungry for human-labeled data, and with the reinforcement learning approaches that crucially rely on human feedback,
it's even more important to revisit
some of these crowdsourcing approaches,
though I don't know if some of the folks
who've actually been using crowds for training
large-scale models have actually referred back
to this literature.
So yeah, I think it could be even more influential
than it has been.
So that's one note on how it's been influential. The other way it's been influential is in my
own work. We're just starting some exciting new work which draws on the
lessons of crowdsourcing but in a completely different domain, so I'm happy to talk about that,
or we can switch to a different question if you'd like.
Yeah, no, that's actually a really nice segue into what I was going to ask
next, which is what you're currently working on at the moment. So yeah, let's talk about that.
Yeah, yeah. So I have a bunch of different projects, but I wanted to mention one that was
directly influenced by this. Right now, LLMs are wonderful, right?
It's really easy to get your large language model
to write you a poem or whatever, right?
Or act as a pair programmer.
So all of that is great.
But in many cases,
you want to have the large language model
repeatedly process some type of data, right?
So for example, you want to get summaries
for each of the products
that you are showing to a user, right? Or you want to process a bunch of reviews for sentiment
and generate a sentiment for each of those reviews, right? Sentiment score for each of those reviews.
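As a minimal sketch of that kind of repeated processing, here is what mapping an LLM over a batch of reviews might look like in Python. The call_llm function is a hypothetical placeholder for whatever model API you use; the point is simply that the same prompt template gets applied row by row, like a relational operator.

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder: swap in a real model API call here.
    # Returning a canned answer keeps the sketch self-contained.
    return "neutral"

def sentiment_score(review: str) -> str:
    # Apply the same prompt template to every row, like a map operator.
    prompt = (
        "Rate the sentiment of this product review as one of "
        "positive, neutral, or negative. Reply with a single word.\n\n"
        f"Review: {review}"
    )
    return call_llm(prompt).strip().lower()

reviews = [
    "Arrived quickly and works exactly as described.",
    "Stopped working after two days, very disappointed.",
]
scores = [sentiment_score(r) for r in reviews]
print(scores)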
Now, LLMs are very much like the crowd, right? In that they are error-prone, they easily hallucinate, make mistakes,
pay attention to certain portions of your prompt over others,
just like humans pay attention more to certain portions
of the instructions over others.
And so can we revisit all of this literature, right?
Like the crowdsourcing literature
and figure out how to best orchestrate LLMs
for data processing tasks,
which is going to become even more and more important as we start to rely on LLMs for various
types of production applications. So this was, ten, maybe twelve years to the day, I guess,
in CIDR 2023 we set forth a vision for how to use the principles from crowdsourcing, declarative
crowdsourcing as we called it, for prompt engineering for LLMs. How do you best
harness LLMs, just the way we set out how we could harness crowds?
That's fascinating stuff. Yeah, I find myself using LLMs more and more in my day-to-day life, but they do
have errors, right? They hallucinate, they create some kind of funky stuff. The citations are
an interesting one, I think, where you say, oh, I don't know, go find me some citations for
this thing in distributed transactions or whatever, and they'll look really, really realistic,
but I know they're bogus because you can't find them on Google Scholar. So, on harnessing that crowd, I don't know how far along with realizing
the vision you are, but how do you treat them? How do you map the LLMs, like, is one
LLM a human essentially, or how do you treat them? What's the sort of model there?
Yeah, so there are similarities and differences. So I spoke about the similarities, in that
they hallucinate, they make mistakes,
they pay attention to certain portions of the prompt
and so on.
LLMs are also different
from humans in that
first is pragmatic stuff
like cost models and so on.
So LLM, you have to pay
based on inputs and outputs
and there's a cost per token.
Humans don't operate that way.
I mean, you fix a price
for a task and you go for it.
Humans and LLMs both make mistakes
in unpredictable ways,
but you could,
the one thing about humans
is that you could ask other humans
and you could say,
hey, this person said this,
what do you think, right?
And so,
by asking enough humans,
you could get to convergence on
what the right answer is.
In the LLM world, that's less easy to do.
You could certainly turn up the temperature knob for an LLM and get more kind of randomness
in your LLM answers, or you could ask other LLMs, but there is an inherent lack of independence
there, in some sense, in that
the LLMs are all kind of drawing on similar sources. So you're not truly getting independent
observations. So that I think is a little bit of a difference. Yeah, so I think those would be the
most prominent ones: the cost model, and the fact that in the crowdsourcing world, you could
sort of ask multiple humans
and figure out what the right answer is.
But in the LLM world, it's a little different.
That said, some of the same principles
in terms of techniques still apply.
You could, so one of the big techniques
in the crowdsourcing literature was,
rather than doing a big task all en masse,
how do you decompose it into smaller tasks
that you know can be done accurately, right?
So imagine if you wanted to sort a thousand items. With humans, they're going to make mistakes if you
ask them to sort a thousand items. But if you ask them, hey, rank these five items for me,
they're more likely to do that correctly. Very similar with LLMs. In fact, if you ask them to
sort a thousand items, what they'll end up doing, unlike humans who might struggle with the task and do something, but you'll still find the items in some order, even if it's not perfect.
LLMs will just simply make up new items.
It's like half of the stuff is stuff that came from the original set
and then the rest is just like random stuff.
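Here is a minimal sketch, in Python, of the decomposition idea described above: rather than asking an LLM to sort everything at once, you rank small batches and also validate that the model returned a permutation of what you gave it, to catch the made-up items Aditya mentions. The call_llm helper is again a hypothetical placeholder, not any particular API.

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder: swap in a real model API call here.
    return ""

def rank_batch(items, criterion, batch_size=5, max_retries=3):
    # Ask for a ranking of a small batch, and reject answers that are not
    # a permutation of the input (i.e. the model invented or dropped items).
    assert len(items) <= batch_size
    prompt = (
        f"Rank the following items from best to worst by {criterion}. "
        "Return one item per line, exactly as given, with no additions.\n"
        + "\n".join(items)
    )
    for _ in range(max_retries):
        answer = [line.strip() for line in call_llm(prompt).splitlines() if line.strip()]
        if sorted(answer) == sorted(items):
            return answer
    return list(items)  # fall back to the original order if validation keeps failing

def rank_all(items, criterion, batch_size=5):
    # Decompose a big ranking task into batch-sized pieces that the model
    # is more likely to get right; merging the batches is a separate step.
    return [
        rank_batch(items[i:i + batch_size], criterion, batch_size)
        for i in range(0, len(items), batch_size)
    ]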
That's brilliant.
Yeah, it just makes this new thing up as well and puts it in your order.
Yeah, you had 1,000 items, now you've got 1,500.
No, that's fascinating.
That's really cool.
Now, I think the future for LLMs is going to be really interesting,
to see the impacts they can have going forward.
There's a lot of investment going into them,
at so many different levels as well, right? So yeah, looking forward to seeing what that brings and what your research in the area brings as well. But you also
mentioned that you've got a lot of plates spinning at the same time and are working on other
research projects as well, so what else have you got on your plate at the
moment?
Yeah, so at the moment, the other big projects that we have ongoing are trying to think about how do you use...
Okay, so it's all going to be LLMs because that's what's on my mind these days.
But how do you use large language models to make sense of large PDF document corpora?
Let's call it that. So collections of PDF documents.
And one specific target application that we've been focusing on a lot is police misconduct. So we're working with folks who are trying to make sense of massive volumes of police misconduct data
that are arriving thanks to new state legislation in the state of California. So SB 1421 and SB 2, I believe,
which led to police departments being required to release all of their police misconduct
information. So imagine now you have millions of PDF documents, right? And our
journalist friends described this as the police departments
basically give you a stack of paper, but they don't give you a stack of paper in the way you
want it. They'll just throw it up in the air and then wait for it to all fall down and then they'll
collect that and then give it to you. So it's not organized in any which way. And it goes beyond
not being organized, right? It's like PDF documents are inherently a hard format to work with. And now you're further in this adversarial relationship with these police departments
in that they're not incentivized to give it to you in a way that you can make sense of it.
Now, come LLMs. Can you be like, hey, LLM, here's my PDF. Answer me these questions,
right? Who are the police officers mentioned? What is the location? What is the date, right?
Basic stuff so that you can be like, okay, journalist comes in, they're trying to investigate
a particular cop. They want to find other incidents that that cop has been named in.
Can you help them with that, right? Or can you, if the journalist wants to ask questions like, okay,
how often is there canine, use of canines in police misconduct cases? How often is there use of batons?
How often is there mental health issues involved?
How often is there drug use involved?
Whatever, right?
Like they want to investigate these sorts of things.
Combing through millions of PDFs is impossible.
LLMs help you get part of the way there.
But the challenge is that LLMs, as we just discussed, they hallucinate, they make mistakes.
Simply giving them a PDF and say, hey, answer me these questions, it doesn't work.
So you need to figure out how do you do it in a manner that's more reliable.
So one standard technique is you can chunk up the document and say, hey, look at these
portions of the document and then kind of work your way to the end. But that doesn't quite work
because you can't really interpret a document chunk
in and of itself.
You need the context before it to make sense of a given chunk.
So there's all kinds of challenges in making sense of large collections of documents.
And this is not just one application here just to illustrate the challenges, but we
are trying to be as general as we can in saying, hey, how do you support what I like to call document
analytics on large document collections?
Awesome. Yeah, when you explained that, a few things were bouncing around my head. Is there an explainability
or provenance aspect to the work as well? In the sense of, okay, I'm saying there were
this many canine incidents,
then it will actually link you back to,
oh, this is how I figured this out, right?
So there's that kind of angle to it as well,
rather than it making up 10 reports and being like,
yeah, there were canines used in everything.
Totally, totally.
Yeah, provenance is extremely important here
because these journalists are not going to trust
any random numbers we throw at them
saying there's X number of incidents. They're not going to trust it. So what you want is pointers back
to the source, right? So you want to be able to say, hey, there's these incidents and here are
the excerpts that led me to believe that these were the... So they do want to be able to double
check everything. And so it's not so much about simply saying, here
is an aggregated answer; you also need to provide the provenance associated with it.
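Here is a minimal sketch, in Python, of the kind of provenance check being described: ask the model for structured fields plus the verbatim excerpt it relied on, then verify that the excerpt actually appears in the source text before trusting the answer. The call_llm helper and the prompt format are hypothetical illustrations, not the lab's actual pipeline.

import json

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder: swap in a real model API call here.
    return "{}"

def extract_with_provenance(document_text: str) -> dict | None:
    prompt = (
        "From the incident report below, return JSON with keys "
        "'officers', 'location', 'date', and 'excerpt', where 'excerpt' is a "
        "verbatim quote from the report supporting the answer.\n\n" + document_text
    )
    try:
        answer = json.loads(call_llm(prompt))
    except json.JSONDecodeError:
        return None
    # Provenance check: the quoted excerpt must actually occur in the source.
    excerpt = answer.get("excerpt", "")
    if excerpt and excerpt in document_text:
        return answer
    return None  # reject answers we cannot point back to the source for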
Yeah, maybe this next question is getting too much into actual implementation details, but
how does an LLM actually read PDF data? Because I'm not really too sure
on the format of PDF, like what it actually looks like,
I guess it's a bespoke format
that Adobe created at some point, maybe.
How do you actually feed that into an LLM?
Yeah, so I think what we end up doing,
I think there are mechanisms to do it
where you treat PDFs as images.
So that's one approach.
But what we end up doing usually
is just like applying some OCR techniques and then providing the OCR output to an LLM.
Now that causes some loss in information, but that's a trade-off, right?
Text is a much easier format for LLMs to understand.
It's cheaper.
And so that's why we opt for OCRing the PDF first before we provide it to an LLM.
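For anyone curious what that looks like in practice, here is a minimal sketch in Python using pdf2image and pytesseract to OCR a PDF page by page before handing the text to a model. This assumes the poppler and Tesseract binaries are installed locally; the file name is just an example.

from pdf2image import convert_from_path
import pytesseract

def ocr_pdf(path: str) -> str:
    # Render each PDF page to an image (needs poppler installed),
    # then run Tesseract OCR on it (needs the tesseract binary installed).
    pages = convert_from_path(path)
    return "\n\n".join(pytesseract.image_to_string(page) for page in pages)

text = ocr_pdf("incident_report.pdf")  # example file name
# The extracted text can now be chunked and passed to an LLM for extraction.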
But there are challenges with even things like OCR, right?
So for example, you're doing OCR
and you have some name that's redacted, right?
The OCR might just skip over that name entirely.
And then you kind of just,
you have the words before the name and after the name,
just like attached to each other with no gap in between.
And then, so the LLM is like confused,
what's going on here?
So sometimes you're going to have to go
and patch the output of the OCR, being like,
okay, here's actually underscore, underscore, underscore,
of this length,
which is basically indicating
that something was redacted, right?
So OCR output isn't perfect, but it gets you part of the way there.
Nice, awesome. I look forward to seeing all
this research come out of your group over the next few years. I'm sure it'll be fascinating
to see how it all develops, for sure. Cool, yeah, so in the next part of the podcast
we're going to shift gears and do another bit of a retrospective, I guess.
We've spoken about your journey and career so far, but I want
to get your take on what you're most proud of in your career. You know,
you work on a lot of things like social justice and stuff, so it must be really rewarding to
see your work have some real-world impact. So yeah, I guess the question is, what are you most proud of?
Yeah, I think I'll answer the question in two ways.
I think that, in terms of what I'm most proud of...
so in August, yeah, August this year will mark my 10 years of being faculty.
Right. So that's a milestone. Thank you.
So what I'm most proud of through my faculty career
is just the impact that I've had
through my students.
I'm very proud of the students
that I've graduated.
Some of them have ended up
becoming entrepreneurs.
Some of them have ended up
becoming professors.
Others have ended up in industry.
And that's really what I'm most proud of. And especially seeing the trajectory of some of these students, where they come in unsure of what they want to do, and then they pick up things that you wouldn't expect them to pick up. And eventually, they supersede you in knowledge and ability, right?
And that's really what you want as an advisor.
So that's been really, really gratifying.
And the students that have graduated are truly amazing.
And I'm very grateful to have had the chance to work with them.
In terms of bits of work that I'm most proud of, I have this variety
of different projects that I feel quite proud of. And I can give you a couple of examples.
One project, a couple of projects that I'm pretty proud of in the last five years were to do with
data science tooling. And this led to the startup Ponder, which was eventually acquired by Snowflake. And so the students who led both those projects, Devin Petersohn and Doris Lee, each targeted
the open source community and tried to address challenges that data scientists and data analysts
face when they are trying to use open source packages in tools like computational notebooks, right? So Doris Lee, for example, identified that when you're trying to do data analysis, you
often need to write dozens of lines of code to get to visualizations.
And this is just a barrier to data analysis, right?
So every time you want to get an insight, you need to write dozens of lines of code,
right?
That's annoying. And pick
your favorite visualization library, Matplotlib, Plotly, what have you, right? In none of these
libraries is it easy to get a single visualization, right? And so what we asked was the question, hey,
can you get visualizations in situ during analysis without prompting, right? Can you just
get visualization recommendations during data analysis in
a data analysis library like Pandas? And so our tool Lux synthesized lessons that we had developed over
the last five to eight years on visualization recommendations and built it all into a usable
tool that sits within computational notebooks and provides visualization recommendations
out of the box. So you print your data frame, you get visualization recommendations. So it's
really, really neat, a lot of impact. I think maybe half a million downloads and users,
which is pretty neat. So I'm pretty proud of that.
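To give a feel for the workflow being described, here is roughly what using Lux looks like in a Jupyter notebook, based on its documented usage; the CSV path and the intent column names are just examples.

import lux  # registers itself on pandas DataFrames when imported
import pandas as pd

# Example dataset path; any tabular CSV works.
df = pd.read_csv("colleges.csv")

# Displaying the DataFrame in a notebook now shows a widget with
# recommended visualizations alongside the usual table view.
df

# Optionally steer the recommendations by declaring an intent
# (hypothetical column names, for illustration only).
df.intent = ["AverageCost", "Region"]
df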
A separate project, which also led to Ponder, was the Modin project,
which was basically around scalable data frames. So also centered around Pandas. And Pandas,
if you didn't know, I should have mentioned, is the most popular data science library, right?
Like, it is the reason why Python is so successful, right? But at the same time, Pandas is a beast,
right? There's like 500 functions within Pandas, a lot of redundancy,
it's a mess, there's no optimization, it often gives out-of-memory errors, and so on. And so Devin's
thesis centered on this tool Modin, which is a drop-in replacement for Pandas, so it preserves the
Pandas API but applies database and distributed computing techniques to scale that up, right? So out of the box, you get speedups because we now have thought about this more carefully,
right?
And we can leverage multiple cores.
We can leverage query optimization, all being applied to this new beast, which is DataFrames.
And so that, again, was a successful open source project.
It's continuing to be successful.
I think it has like a million downloads a month or something.
So that's pretty amazing.
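The drop-in aspect is easy to see in code: the documented way to adopt Modin is to change a single import and keep the rest of your Pandas code as-is. A minimal sketch (the CSV path is just an example):

# Before: import pandas as pd
import modin.pandas as pd  # the one-line change Modin asks for

# Everything below is unchanged Pandas-style code; Modin parallelizes it
# across cores using an execution engine such as Ray or Dask under the hood.
df = pd.read_csv("large_dataset.csv")
summary = df.groupby("category")["value"].mean()
print(summary.head())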
So both of these projects I'm very proud of because they had a lot of impact and a lot of usage.
And the students, Doris and Devin, both now at Snowflake, spent a bunch of time listening to the open source community and drawing from what their challenges are and how you address them, rather than simply
saying, hey, we built another tool, and chucking it over the wall at people.
Right.
So that's something that we departed from in these two projects.
Yeah, definitely.
Kind of know your customer, right?
Know your customer,
like kind of know the end user and know what their pain points are and then
fix their pain.
But if you can solve a problem for someone,
make someone's life easier,
like, I don't know, that's a tool that's going to get used, right?
And I guess that speaks like half a million downloads,
a million downloads, right?
There's some impact there.
This is a podcast on high impacts and there is high impact.
So, yeah, that's fantastic.
Yeah, and I guess they've now been integrated
into Snowflake's platform, I guess, and kind of all of the I don't know how many users Snowflake have as well.
Right. So they'll be all they'll be all benefiting from it as well. So that's that's fantastic.
Exactly. Yep. Cool. So, yeah, the next section we're going to jump on to is sort of about motivation.
And I dropped this question on you yesterday in an email, kind of last minute, and asked about what your favorite papers are. So I don't know if anything's come up. But yeah, what is your favorite paper, or papers?
Yeah, so I thought about it since yesterday, and I still don't have a good answer for you.
I feel like through every phase of my career, there are papers that I've been inspired by. And I think the kinds of
papers that I, okay, so here are some timeless ones that I feel like I go to time and time again,
and I feel excited to reread whenever I reread them. So one example of this, and this is
reflected in, I guess, the style of research that I do as well. One of these papers
is the paper that kind of laid out the foundation for Tableau. So Tableau is a
visualization platform, or visual analytics platform, and the paper's title
is Polaris; that was the name of the system before Tableau. And so in this paper,
they laid out a way to think about visualization, visual analytic systems. And so they laid out an algebra for how do you specify what a user is seeing on a visualization canvas and how do you
translate that into data processing queries in the backend. And so I think what I liked about that paper and what I took away from it is this seamless
blending of user interface aspects with data processing aspects in a way that combines
the best of data management and HCI principles.
And so, of course, the paper and subsequent tool have been enormously impactful, right?
So Tableau had a big IPO and then was acquired by Salesforce a few years ago.
So it's had a lot of impact.
Another paper that I'd like to mention that I think also embodies this principle of bridging HCI and data management is, well, I'll mention two from my colleague, Joe Hellerstein, at Berkeley.
One is called online aggregation. So again, thinking about end users and the fact that
you don't want to wait until all of the results are generated when you could quickly get a sense of
what's going on, right? So for an aggregate query, online aggregation basically talks about
how do you provide approximate results
as the results are being generated.
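As a minimal illustration of the online aggregation idea (not the paper's actual algorithm), here is a Python sketch that streams over rows and reports a running average with a rough confidence interval, so the user can stop as soon as the estimate is good enough.

import math
import random

def online_average(values, report_every=1000):
    # Stream over the data, maintaining a running mean and variance
    # (Welford's method), and periodically report an approximate answer
    # with a rough 95% confidence interval.
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n % report_every == 0:
            stderr = math.sqrt(m2 / (n - 1) / n)
            yield n, mean, 1.96 * stderr  # estimate and half-width of the CI

data = (random.gauss(50, 10) for _ in range(10_000))
for seen, estimate, halfwidth in online_average(data):
    print(f"after {seen} rows: {estimate:.2f} +/- {halfwidth:.2f}")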
A separate paper also from Joe,
which is quite influential, is Potter's Wheel.
So this is trying to do data cleaning
and how do you figure out what is a good metric
for cleaning up your data, right?
So Joe and Vijayshankar Raman used this metric called minimum description length
to figure out what's the best type to induce on any given column
so you can clean your data.
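Here is a toy Python sketch of that minimum description length intuition (a simplification, not Potter's Wheel itself): for each candidate type, the cost is the bits to encode the values that fit the type plus the cost of spelling out the exceptions, and the type with the smallest total wins.

def description_length(values, parser, fitted_cost_bits=32):
    # Values that parse under the candidate type are encoded compactly;
    # exceptions are charged the full cost of their raw characters.
    total = 0
    for v in values:
        try:
            parser(v)
            total += fitted_cost_bits
        except ValueError:
            total += 8 * len(v) + 1  # raw string plus an "exception" flag bit
    return total

def infer_column_type(values):
    candidates = {
        "integer": int,
        "float": float,
        "string": str,  # always succeeds, so it acts as the baseline
    }
    costs = {name: description_length(values, p) for name, p in candidates.items()}
    return min(costs, key=costs.get)

print(infer_column_type(["12", "7", "n/a", "42"]))  # likely "integer" with one exception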
In addition, they figured out an algebra for data cleaning,
which they then operationalized in this tool called Wrangler a decade later.
And then that led to this company called Trifacta, which Joe founded and then was acquired a couple of years ago.
So the reason I bring up these three papers is because it's all bridging human-computer interaction and database principles.
It's the style of work that I enjoy doing and I take
inspiration from. So thinking about end user needs while also trying to address some of the
scalability challenges.
We'll put links to those in the show notes. And I've definitely got one
more to add with the Polaris paper, which I need to put onto my
monotonically increasing reading list.
Sounds good.
Yeah, cool. So the next thing I want to touch on
is setbacks. We've talked about all the high-impact, great work that you've done over the
years, but progress is obviously non-linear, right? There's the doubt,
there's the rejection, the setbacks. So I want to ask how you deal with
those setbacks and what your approach is, if you have a systematic way of approaching rejection and setbacks.
Yeah, so okay, on this particular topic, it turns out
I gave a keynote at a VLDB PhD symposium workshop which happened remotely,
because it was a remote conference, so I think peak COVID at the time.
And so I have a Loom video that addresses this very question. It's about how to deal with
rejection. So I'm happy to share the link.
Yeah, share the link. We'll put that in the show
notes so the listener can go find it. Sounds good. So in that brief 10 minute video, I talk about how as a PhD student, I had a bunch of setbacks.
So I think my first eight conference submissions were rejected.
OK, so imagine a PhD student who deals with that much rejection.
Like, one would think, why did I even persist? Right?
I mean, that's a ridiculous amount of rejection to get as an early-stage
PhD student.
And so it took me until my fourth year when I actually started getting papers accepted.
And these eight rejections were not actually all the same paper.
So imagine three different papers, all getting rejected multiple times, two to three times,
right?
And then all of these papers eventually, in year four, got published somewhere, right?
Sometimes by simply changing the audience, sometimes by changing the introduction, changing
random things that you wouldn't think have a huge impact, but they do, right?
So yeah, I think that gave me a lot of thick skin.
I think like now when I deal with rejection as an early stage faculty member or post-tenure, I realize that I don't take it that personally or I shouldn't take it personally.
It's just par for the course. Just learn the lessons from it and move on. Right. And at the same time, I think it's important to know, A, that there's a lot of randomness involved in everything that you do as an academic.
There's a matter of taste, right?
So some people like the stuff that you're doing.
It doesn't mean it's bad work.
It's just that some people don't like it.
And so you figure out how to best appease people who may have a different research taste than you, but you reframe the work in a way that you think may improve the work, but you stand by it, right? If the work is
good, it'll eventually get in somewhere. So that's how I think about rejection. The other place where
I've had a lot of rejection is in, I've applied for a variety of faculty jobs at various points,
I've gotten rejected.
I've also gotten rejected from grants several times. Again, there's a question of persistence,
a question of learning from what went wrong, try to get as much feedback as possible and try to
improve on it. Of course, even to this day, rejection stings, right? Like if I get a paper
rejected, I instinctively get angry, right?
Like, I'm like, damn it, reviewer number two.
Exactly, damn it, reviewer number two.
And so it's important
to not react immediately. Give it a day or two, let it sink in. Others have said this better
than I have, but let the rejection digest, then come back and try to figure out what to do.
And sometimes this means just tossing it in again.
Just because it was a random reviewer who did not understand the value of your work,
even though you felt that they should have, doesn't mean you stop submitting that work.
You resubmit.
Sometimes, if enough reviewers are
giving you the same signal that they didn't get the point of the work or there's some fundamental
flaw, that means you should go ahead and try to fix that before you resubmit, right? Simply
chucking it over the wall and expecting a different outcome this time doesn't actually help. So you
should try to see what the signs are and try to read between the lines of the reviews. So often
the reviewers say, hey, here's a hundred different things that are wrong with the paper, but
they are often fixated on one thing which is more of a deal breaker than the others, right? If they
say, oh, here's a typo, here's this additional experiment you could have added, here's
a discussion you could have added about this.
These are fixables and most likely these are not what led to the rejection.
So it's a matter of, it's an art to read a review and try to identify what led to the
eventual outcome.
Like what was the deal breaker for this reviewer?
And sometimes it's implicit, right?
Sometimes it's like, I was
just not excited enough, or I thought the writing was sloppy, or I thought that this project has no
real impact, right? And so distilling that lesson will take time and effort, and
you get better at it with experience.
Yeah, no, I think that's really good, solid advice on how to approach it. I think someone said to me, treat it as an opportunity
to make it better, rather than it being necessarily a reflection on you as an individual,
and have that detachment between yourself and the work. Obviously it's a lot easier said
than done, right? We still get angry initially when it feels like it's an attack on you as a person. But yeah, detach yourself from the work and see it as an opportunity
to make things better, and it'll get there in the end. There are so many stories of, I
guess, influential papers over the years that have been rejected three, four, or five times that have
gone on to have crazy impact. So yeah, keep plugging away.
Yeah, and on that particular note, right, every single one of my award-winning papers has been rejected at least once. So that's
the statistic for you.
Yeah, there we go. You actually want to get rejected once,
then it'll win the best paper, right? That's what we need now. Yeah, exactly. Some causation there.
Exactly.
Cool.
Awesome.
Yeah, so the next question is a question that I borrowed from the regular format of the podcast that I really, really like.
And that's about the creative process and how you approach idea generation and then
selecting projects.
And yeah, what is your process for that?
Do you have a process for it?
Are you systematic, or is it more serendipitous, kind of a shower-thought sort of thing?
Yeah, so I would say there isn't a lot of staring at a blank piece of paper and coming up with ideas;
I don't necessarily do that all that well. Most of my inspiration comes from
reading other papers. When I read papers I often get inspired about follow-up work, and so
reading the literature often helps me make sense of the world. Following what's going on in
industry also helps me make sense of the world and try to identify opportunities to go and improve things.
Right. So that's another source of inspiration for me.
Sometimes inspiration is also retrospective.
So often we have a couple of projects on various themes and then you realize that, hey, these are all different facets of the same equation, right? And maybe we could kind of combine all of these together
and you end up with a grander, bigger vision than what you started out with.
And the most important way to get ideas or figure out ideas
is to brainstorm with your students, right?
That's where I've created most of my ideas: just discussions with smart
people. And in terms of ways in which I've personally looked at problem
selection, it's often looking at existing approaches and figuring out, can you keep the user interface for the most part and then improve the backend
somehow, right?
That's kind of a philosophy that has worked really well for me.
Or rather, how do you take a process or a tool that's
popular but is broken and then rip out pieces of it?
It doesn't necessarily mean the backend, but rip out pieces of it and replace them with different pieces
that will help make it better. Concrete examples. So a big focus of my work over the last decade has
been on spreadsheets, right? Spreadsheets are amazing, the most popular data management tool
out there used by billions of people, right?
Except that spreadsheets do not scale, right?
And if you try to use spreadsheets
on a million rows or more,
it's going to complain.
Spreadsheets often crash and hang
with as few as 50,000 rows, 100,000 rows.
So we asked the question,
okay, can we preserve the spreadsheet look and feel,
the spreadsheet interface, and then rip out the backend and allow it to scale to arbitrarily large datasets, right?
And so, that led to a bunch of interesting questions.
We built a system called DataSpread.
We figured out how do you represent data?
How do you index it?
How do you do queries efficiently and so on? Similar philosophy is applied to
Modin, the project that I mentioned earlier, where you keep the Pandas API and then you rip
out the backend and you try to see if you can make it better, right? Lux also follows a similar
design principle in that we were like, okay, we don't want to destroy the user experience.
We want to change the user experience in that they get visualization recommendations out of the box, but they do it in a drop-in kind of fashion, right? So you enhance
existing tools, don't replace it, right? And I believe that this kind of philosophy of
looking at popular tools that are obviously fulfilling a need but are broken in some way, be it in usability, in intelligence,
in scalability, and then replacing components of it with better components, preserving
everything else about it, I think is a recipe that has worked well.
Yeah, I can definitely see how that instantly leads to better adoption as well, right? Because they're already using
the tool and you've made it better for them. And it also gets around
the problem of, well, for me anyway, rather than going and learning a whole new tool, I can just keep
using what I'm using and I get this new fun, cool, extra awesome stuff for free, which is a much
better experience, right? So yeah, I can definitely see how that's a recipe for success.
Yeah, I mean,
you're dealing with user inertia otherwise, right? Users don't want to give up their existing
tools. You say, hey, here's this new tool, all you have to do is learn this new language or
use this new interface, and they'll be like, you know what, I'm good, thanks. And so if you're like,
no, no, you can continue to use all of the scripts that you built, you can continue to use all of the Excel files that you have.
You just need to use this instead.
It's a drop-in replacement.
That's a mantra.
Then you get instant adoption, right?
Like it's a game changer.
But it's a lot harder, right?
Like because now you're constrained by what this tool is doing, right?
You have to do the Pandas API.
You can't do something else. You
have to do Excel. You can't do something else. You would love to do something else where you
could be like, you know what? Maybe data frames don't need to be ordered. Maybe Excel doesn't
need to be ordered. Maybe if I drop a good fraction of the API, a good fraction of the
commands, life will be easier. It'll be more elegant, it'll be easier to scale up. Well, that
doesn't work; that's not a drop-in replacement anymore.
Yeah, it's that sort of practical approach: okay, I'm going to play by your rules here, Excel,
but I'm going to make it better. It makes things harder, but then there's the trade-off, right?
There's always trade-offs, and I think this definitely falls on the right side of that
trade-off, because it's worth putting up with some of
the crude and difficult aspects of these things, because it gets the adoption and
you're making people's lives better. Definitely. Cool. My next question, of which I've got two more big-level topics I want to cover off, is
bridging the gap. We've spoken across the podcast about some of the real-world impact your work has had, tools
like Lux, and how Lux was integrated into
Ponder, right? That was the flow for one of them, right?
Yeah. Cool, so I just want to get your take on what the current interaction between academia and industry is like,
what the problems are, and how it can maybe be improved.
Yeah, so since my work is
fairly user-centric, I am very much informed by what users are currently doing and what problems they are facing.
And then we try to build tools that will help plug the gaps or help be enhancements
to existing workflows rather than replacements thereof.
So I spend a lot of time thinking about what do people use?
What do data practitioners, be it in kind of nonprofits
and small like underfunded organizations
all the way to like the big industry behemoths,
what do they use and what do they care about
and what are their concerns, right?
So a lot of my work does involve
going and talking to people, right?
So a lot of the papers that we publish
would be more traditionally regarded as human-computer interaction papers, because they are user studies and user surveys and need-finding studies and so on.
So these are all the things that you take for granted from an HCI community.
You don't find as much of it in the database community. So overall, I do think that if I were to think about lessons for the database
community, especially for PhD students and folks in academia, like talking to real users,
I think is really, really important. Even if your work isn't on tools that are in the data science or BI world, even if what you're doing is,
quote unquote, hardcore database stuff, right? Even if your target audience is
data engineers, or hardcore computer science folks, still going and talking to them
and learning about their problems and identifying their concerns
or constraints is still informative, right?
Because it'll still help you realize whether the problem that you're working on is the
right problem or not.
So I think that's, I'm a huge believer in that talking to users is always helpful.
If not anything, you'll get some confidence that what you're doing is the right problem,
right?
So that's good. The second thing that I advocate for, at least for the database community,
is, there's a lot of work that is kind of one-off algorithmic papers
that you see in both VLDB and SIGMOD, which is like,
hey, here's this algorithm to do this one thing, right?
And algorithms papers are great.
And I've written my fair share of algorithms papers.
But over the last decade, I've insisted more and more that with these algorithms papers, you adopt the algorithm in the context of a real system.
Only through that process will you really learn.
And so I think those are my two kind of takeaways for my work and for the database community. The first is talk to users as much as possible.
The more you're informed by their need,
be it data engineers all the way to like business analysts who don't know programming, right?
Like irrespective of where you are in the spectrum
and where you land,
still think it's useful to talk to users.
Second lesson is if you're doing more algorithmic work
rather than systems work,
embed it in a real system. Take a real system where you think your stuff can be used. If you're doing, I don't know, graph
databases, a graph algorithm, right, perhaps you should be implementing your stuff in Neo4j
or some other graph database. So that's just kind of a takeaway for folks
in the community who are thinking of things from a more algorithmic
lens.
Awesome, I think those are two really good messages there. The talk-to-users one is
really cool, because I did a lot of work on concurrency control, right,
and the mantra there is, let's make serializability as fast as possible because
that's the best and that's what everyone needs. And then you go and look at what real systems actually
do: they run at weak isolation levels, no one's actually using it. So why are we optimizing for this case that no one's using? There's a disconnect there,
right? So we need to bridge that disconnect, and we do that by talking to people who actually
use the damn systems, right, and build applications on them. So you need to understand your audience,
speak to them, and understand their problems. So no, I really like that.
And the second one, getting your hands
dirty and trying to implement it in a real system, and seeing the interaction
effects and how it actually plays out in practice, is great, because we can all say
this thing will be faster, but until you actually go and put it in the real world, then,
I guess, yeah, the proof of the pudding's in the eating, right? You've got to go and do it. So
I think they're two really good points. Awesome. So yeah, it's time for the last
point now, really, and that's what you think are the most promising directions for
future research and the exciting trends that you see at the moment. And obviously
LLMs are maybe going to feature in this answer, I'm not too sure. But yeah, what's your own take on the future, I guess?
Yeah, yeah, yeah.
Your guess is correct.
So I think maybe this is also a time for me
to describe a little bit this lab,
the Epic Data Lab that you mentioned right at the top.
So we are, as part of the lab,
we are thinking about how do you build low-code
and no-code tools for data work, broadly defined.
And so this is ranging all the way from data extraction, data cleaning, to building machine learning models, visualization, sensemaking.
So the entire spectrum of data work.
And so if you were to say, hey, how do you build low-code and no-code tools for that?
This is a vision that's been there for decades, right?
It's not like it's new, right?
How do you make it easy for people to get insights from data?
How do you make it easy for people to extract information,
integrate it, prepare it, clean it, what have you, right?
All of this is decades-plus old problems.
Now we have this new capability of large language models, right?
So I do believe that
you pick every stage in this pipeline, data extraction, data cleaning, data transformation,
what have you. If you were to say, okay, now large language models is a component,
how would that change the equation, right? So I do believe that it makes things better in certain ways. So you can interpret fuzzy input better from users.
You can operate on unstructured data better,
like the document example that I gave earlier.
It'll also help you synthesize programs.
So it can synthesize SQL, it can synthesize Pandas scripts,
it can generate a bunch of different program fragments for you.
So now you have a way to handle fuzzy inputs and generate fuzzy outputs, as well as synthesize programs.
However, we also know that LLMs by themselves are not going to work because they can hallucinate, make mistakes, blah, blah, blah. So how do you build in the remaining ecosystem around this LLM that will help you do things like data cleaning or data extraction, data transformation?
The way that we're doing this is, okay, so perhaps a user comes in to some interface.
This doesn't need to be a chatbot always.
Chat is perhaps a poor interface for most data-centric tasks. But let's say if you
want to do the data cleaning, you come into some interface where you specify your task. And this
could be in natural language, but it could also be in the form of an intuitive web UI. It could be
in the form of DSL. It could be in the form of examples, demonstrations, any number of flexible means of specification.
This gets fed into some kind of LLM-based synthesis approach, which considers a bunch
of different interpretations of what a user had in mind.
And then the process doesn't end there where it just picks one approach and then does it.
It takes these approaches, and then we figure out a way to show it back to the user and
say, hey, here are various ways in which I interpreted your idea. You wanted me to do this.
Here are ways I can accomplish this. Pick between these options which one you want. And if you want
to restart the process or if you want to change your query entirely, we can do that as well. So
how do you engage in a dialogue between the system interpretations for what the user had
and the user so that
they can guide the system to
what they wanted accomplished?
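Here is a minimal Python sketch of that dialogue loop, purely as an illustration of the idea rather than the Epic Lab's actual system: generate several candidate interpretations of a fuzzy task as code, show them to the user, and let the user pick or restart. The call_llm helper is a hypothetical placeholder.

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder: swap in a real model API call here.
    return "df.dropna()"

def propose_interpretations(task: str, n: int = 3) -> list[str]:
    # Ask for several distinct candidate programs for the same fuzzy request.
    return [
        call_llm(
            f"Interpretation {i + 1} of {n}: write a short pandas snippet that "
            f"accomplishes the following task, noting any assumptions.\nTask: {task}"
        )
        for i in range(n)
    ]

def choose(task: str) -> str | None:
    candidates = propose_interpretations(task)
    for i, snippet in enumerate(candidates, start=1):
        print(f"[{i}] {snippet}\n")
    picked = input("Pick an option number, or press Enter to rephrase the task: ")
    return candidates[int(picked) - 1] if picked.strip() else None

# Example: choose("remove rows with missing values from my sales table")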
And the hope is that this would also operate
on structured data,
semi-structured data, what have you, right?
So that's a vision of the Epic Lab.
So it's like a human-centered
approach for making sense of data.
Low-code and no-code is the name of the game.
But it's also like flexible interfaces that are not just constrained by chat, but like
what else can you provide, right?
Can you provide a GUI?
Can you provide examples?
Can you provide a demonstration?
And then how does the system make sense of all of that?
So it's going to bring together techniques from HCI, databases,
programming languages, and program synthesis. And of course, yeah, LLMs are a component of it,
but to really harness the power of LLMs you need all of these other disciplines.
Yeah, I really like the interactive aspect of that, and how you can imagine someone going through
that process and it saying, oh, these are the things I could do, do you want to do this? And
that sort of interactive, iterative experience
sounds like a really nice user experience as well.
But yeah, no, that sounds awesome.
I'm sure there's going to be some really cool work
coming out of your lab for the foreseeable future.
That's for sure.
So yeah, I guess that's the end of the podcast.
Thank you so much for coming on.
It's been a fascinating chat.
I'm sure the listener will have absolutely loved it as well.
Where can we find you on social media? Are you on any of the platforms, LinkedIn, Twitter?
I'm trying to stay away from social media, but I am on Twitter and LinkedIn.
Yeah, I'm on all of them, but I'm trying to stay away from them.
Okay, that's probably the healthiest thing to do, to be honest. I try, but they always reel me back in. But anyway, cool.
Thanks so much for having me, a fun set of questions.
Yeah, it's been an absolute pleasure, and I guess we'll see you all next time for
some more awesome computer science research. Thank you.