Disseminate: The Computer Science Research Podcast - High Impact in Databases with... Aditya Parameswaran

Episode Date: October 21, 2024

In this High Impact episode we talk to Aditya Parameswaran about some of his most impactful work. Aditya is an Associate Professor at the University of California, Berkeley. Tune in to hear Aditya's story! The podcast is proudly sponsored by Pometry, the developers behind Raphtory, the open source temporal graph analytics engine for Python and Rust.

Links:
EPIC Data Lab
Answering Queries using Humans, Algorithms and Databases (CIDR'11)
Potter's Wheel: An Interactive Data Cleaning System (VLDB'01)
Online Aggregation (SIGMOD'97)
Polaris: A System for Query, Analysis and Visualization of Multi-dimensional Relational Databases (INFOVIS'00)
Coping with Rejection
Ponder

You can find Aditya on:
Twitter
LinkedIn
Google Scholar

Hosted on Acast. See acast.com/privacy for more information.

Transcript
Starting point is 00:00:00 Hello and welcome to Disseminate the Computer Science Research Podcast, Jack here. Today we have another installment of our high-impact series, and the podcast is brought to you by Pometry. Pometry are the developers behind Raphtory, the open-source temporal graph analytics engine for Python and Rust. Raphtory supports time travelling, multi-layer modelling, and comes out of the box with advanced analytics like community evolution, dynamic scoring, and temporal motif mining. It is blazingly fast, scales to hundreds of millions of edges on your laptop, and connects directly to all your data science tooling, including Pandas, PyG, and LangChain.
Starting point is 00:00:58 Go check out what the Pometry guys are doing at www.raphtory.com, where you can dive into their tutorial for the new 0.8.0 release. In this episode we're going to be talking to Aditya Parameswaran. Aditya is an associate professor at the University of California, Berkeley, where he co-directs the EPIC Data Lab. This is a lab that is targeted at low/no-code data tooling with a special emphasis on social justice applications. Aditya was also, until its recent acquisition by Snowflake, the president of Ponder, a company that he founded with his students based on popular data science tools developed at Berkeley. Describing his research interests now, Aditya develops
Starting point is 00:01:41 human-centered tools for scalable data science, and he does this by synthesizing techniques from data systems and human-computer interaction. Aditya, welcome to the show. Thanks for having me, Jack. It's a pleasure to be here. The pleasure's all ours. So for our new listeners, any new listeners we've got today, the High Impact series was inspired by a blog post on most influential database papers by Ryan Marcus. And Aditya holds the, I guess, the number one spot for 2011 with his CIDR paper. Let me get the title correct here. It is answering queries using humans, algorithms, and databases. And so, but before we do get into
Starting point is 00:02:23 that, it's customary on the podcast to start off with your story, your journey and get into kind of know more about you rather than the kind of the highlight reel that I read out at the start of the show. So yeah, I guess in your own words, Nidhiya, kind of what has been your journey so far? Tell us more about your career. Sure. So yeah, so just to give you a little bit of background. So I started as a PhD student. I did my undergrad in India at one of the Indian Institutes of Technology. Then in 2007, I started as a PhD student at Stanford working. Actually, at the time, I wasn't even sure what I wanted to work on.
Starting point is 00:03:02 So I just wanted to do computer science and I wanted to do research. So I started my PhD at Stanford. So that led to this paper in 2011. Sorry, 2011 at CIDA on crowdsourcing, following which in 2013, I started as a postdoc at MIT for a year, following which I ended up at the University of Illinois, Urbana-Champaign, two hours south of Chicago. And then in 2019, I moved back to the Bay Area and started as faculty at UC Berkeley. So that's my journey. Somewhere along the way, once I moved back to the Bay Area, we started a company that got acquired by Snowflake last year. So yeah, originally, when I started as a PhD student, I wasn't sure I wanted
Starting point is 00:03:45 to do databases. I sampled a bunch of different topics. I worked a little bit on recommender systems, information extraction. And then in about 2010, I started getting fascinated by crowdsourcing, which is what led to this CIDR 2011 paper. Awesome just kind of going back there right to the very beginning you said that you wanted to you knew you wanted to go into research and you knew you wanted to kind of be it being computer science what was the sort of the original sort of motivation that even when you're sort of like a 10 year old 15 year old did you always knew like kind of you wanted to pursue academia initially as sort of a career path or what was the attraction as well? I was okay so going back to when I was a kid I was fascinated by programming there was this time when I was I think maybe 10-12 years old we had a computer science class and I was so and we were
Starting point is 00:04:39 doing programming in GW basic right so that was so that dates me a little bit. And so we had a project to complete, which was sort of like a, let's call it a semester long project back then. And I, so we used to do it after hours. So it was like, you spend time in the computer lab from three to five working on this project. And I was so, I got my project done really quickly. So in a matter of, let's say, a month, then the remaining two months, I would go for on the computer lab, helping out others finish their projects. And the Nalia, the bugs, the Nalia, the logic, the more fun I had kind of like disentangling their code bases and figuring out what they wanted to actually do. So that led to my fascination with computer science broadly. And all that I knew I
Starting point is 00:05:25 wanted to do was do computer science, do programming, and just to raise the level of difficulty a bit, right? So that led me to pursue an undergrad degree in computer science, one of the IITs. And then I was like, okay, I could go become a software engineer in Google. And I had an offer right out of undergrad to do that. But I was like, okay, I could go become a software engineer in Google. And I had an offer right out of undergrad to do that. But I was like, okay, let me raise the level of difficulty a little bit. I want to solve hard problems. And then I applied for a PhD program and got into Stanford. So I was very happy about that.
Starting point is 00:05:56 I originally applied in program analysis. So that was my area of interest when I applied for a PhD program. But what I realized then, what I realized after having spent some time doing program analysis, that most of the problems there are undecidable. And I was like, I don't want to work on stuff that is so hard. I want it to be hard, but not so hard that it's undecidable. So somewhere between easy and undecidable. A little bit, yeah. A little bit, a little bit it's undecidable. So somewhere between easy and undecidable. Turn it typically down a little bit.
Starting point is 00:06:26 Yeah, yeah, yeah. A little bit, a little bit lower than undecidable. Yes. So that's what I chance upon databases and data management broadly defined. The thing that drew me to databases and data management is it's very, well, yes, there's a classical field of database systems.
Starting point is 00:06:42 And yes, there are a lot of people who work on classical problems in database systems. But in some sense, it gives you a little bit of a blank canvas to do stuff that spans theory and systems in all areas that touch data management. So anything that doesn't even look like a database system. This paper from CIDR 2011 certainly doesn't look like a database system. This paper from CIDR 2011 certainly doesn't look like a database system, but it still falls under the banner of data management and the database systems field, which I think is really nice for the research community. It's pretty welcoming. It's pretty open-minded. So I found my way to that community. Nice. Yeah. It's got something for everyone,
Starting point is 00:07:23 right? Which is a good community to be part of for sure cool so let's talk about answering queries using humans algorithms and databases then so cast your mind back to 20 2011 then give us the elevator pitch and yeah tell us about tell us about the background of this work and the motivation and that kind of went into it. Yeah. So 2011 or 2010, when I was in search for a thesis topic, right? So I was like, I had worked on recommendation systems for a little bit and information extraction, but I wasn't really, my heart wasn't in either of those topics. And I really wanted to do something different. And then along came crowdsourcing. So this was, thanks to the advent of platforms like Mechanical Tuck, where you could pay people like a dollar to do tasks for you, be it labeling data, doing rankings, search results, content moderation, you name it, right? So you have the power of humans at your fingertips to do various sorts of data processing tasks. And while the HCI community, the human computer interaction
Starting point is 00:08:31 community started looking at crowdsourcing and figuring out, okay, how do you best harness the crowds from an interface standpoint? I felt that there was a natural opportunity here to think of things from a data processing standpoint, especially given that crowds are error prone, they take a long time to answer questions. So how do you figure out how do you best do broader tasks using humans? So if you wanted to filter a bunch of items, if you want to sort a bunch of items, if you wanted to cluster a bunch of items, think your standard relational primitives, and then figure out how to best orchestrate that with the crowd, given certain objectives, be it cost, accuracy, latency. So once we started
Starting point is 00:09:14 framing that vision, around that same time, Alkis Polizotis, who is my co-author on the paper, was visiting Stanford. And turns out my advisor at the time, Hector Garcia Molina, was already involved in a different side of people, the different student. So he's like, if you want to collaborate with Alkis, go for it. And so I ended up collaborating with Alkis and we fleshed out this paper together. It was my first real paper where I felt like I was driving it end to end and Alkis treated me like a peer to his credit, which is very generous of him because he was already, I think, a tenured faculty member at UC Santa Cruz. And so we had a blast writing this paper. I think that even though I'm framing this as, how do you best harness a crowd for data processing? The vision of the paper goes
Starting point is 00:10:03 beyond that. We were claiming that this would be not just crowds, but also machine learning primitives. So you could imagine that you could, that's where the algorithms in the title comes from. So you can imagine, instead of asking the crowd, you could ask a machine learning model to label some data or process some data in some way. So that was the story behind how the paper came to be. So what impact has the paper had over the years? Have you followed its progress and seen kind of the citations and been like, oh, it's inspired this, it's inspired that? Yeah. So I think the crowdsourcing work in the database community had, I think, quite a bit of an impact in terms of other folks trying to figure out.
Starting point is 00:10:48 Once we set out that vision of saying, hey, how do you take the cloud and harness it for data processing tasks? And to be fair and giving credit where credit's due, there were a couple of other groups that were doing who came up with similar vision that are on the same time. So there was CrowdDB from Berkeley, as well as Quirk, Q-U-R-K from MIT. And those two systems happened at roughly the same time as ours. So, I mean, citations get allocated the way they are. It's a little random, but they were, I would say they are concurrent work to ours. And since then, there's been a bunch of different work trying to, how do you fine tune this, right? How do you crank up the wheel and make it so that you get as much mileage for your buck, right? And so I would say hundreds, if not thousands of papers have been written by the database community on how to best harness a crowd. Now, some of this has gone in more esoteric
Starting point is 00:11:44 directions, perhaps more niche directions than I would like. But I think the basic ideas around thinking about how do you account for mistakes made by crowd workers and how do you assemble kind of error-proof pipelines with crowds, I think, is an effect of all of the work that we did. And I think perhaps even more relevant is two different kind of ways this line of work influenced practice. I think one way this line of work influenced practice is this large kind of content moderation and tagging and labeling shops that a lot of the large internet companies do as a matter of just gathering training data. And this they used to do as early as 2015, 2012 onwards. So they were Facebook, the Facebooks and Googles, then Facebooks.
Starting point is 00:12:41 The Googles of the world were spending millions on crowdsourcing, and our approaches would have helped them save, let's say, 20% to 50% of those costs, right? Which is not an insubstantial amount at the time. Now, though, with large language models being even more hungry for human-labeled data. And with the reinforcement learning approaches that crucially rely on human feedback, it's even more important to revisit some of these crowdsourcing approaches, though I don't know if some of the folks
Starting point is 00:13:13 who've actually been using crowds for training large-scale models have actually referred back to this literature. So yeah, I think it could be even more influential than it has been. So yeah, I think that's one more influential than it has been so yeah that's i think that's one one note on how it's been influential the other way it's been influential is in my own work and i think this is we're just starting some exciting new work which is draws on the
Starting point is 00:13:37 lessons of crowdsourcing but in a completely different domain so i'm happy to talk about that or we can switch to a different question if you'd like yeah no that's actually a really nice segue into kind of what i was going to ask next is kind of what are you currently working on at the moment so yeah let's talk about that yeah yeah so this is so i have a bunch of different projects but i wanted to mention one that has was directly influenced by this which is the the fact that if you were to, right now, LLMs are wonderful, right? It's really easy to get your large-signature model to write you a poem or whatever, right? Like it's an act as a pair programmer.
Starting point is 00:14:14 So all of that is great. But in many cases, you want to have the large-signature model repeatedly process some type of data, right? So for example, you want to get summaries for each of the products that you are showing to a user, right? Or you want to process a bunch of reviews for sentiment and generate a sentiment for each of those reviews, right? Sentiment score for each of those reviews.
Starting point is 00:14:39 Now, LLMs are very much like the crowd, right? In that they are error-prone, they easily hallucinate, make mistakes, pay attention to certain portions of your prompt over others, just like humans pay attention more to certain portions of the instructions over others. And so can we revisit all of this literature, right? Like the crowdsourcing literature and figure out how to best orchestrate LLMs for data processing tasks,
Starting point is 00:15:05 which is going to become even more and more important as we start to rely on LLM for various types of production application. So this was 10 years, maybe 12 years to the day, I guess, inside of 2023, we set forth a vision for how to use the principles from crowdsourcing declarative crowdsourcing as we called it for prompt engineering for llms right how do you best harness llms just the way we set out we could harness crowds that's fascinating stuff yeah i mean because i mean i find myself using llms more and more in my day-to-day life but yeah like they do have errors right they hallucinate they create some kind of funky stuff the citations is a is a is a is an interesting one I think where you say oh I don't know go find me some citations for
Starting point is 00:15:55 for this thing in distributed transactions whatever and they'll look really really realistic but I know they're bogus because you can't find them on google scholar right so you can't kind of harnessing that proud and and I don't know how far along with kind of realizing the vision you are but how do you treat that how do you map the llms to like being it's like one lm a human essentially or is it how do you treat them what's the sort of the model that yeah so there are similarities and differences so i i spoke about the similarities and that they are they hallucinate they they make mistakes, they pay attention to certain portions of the prompt
Starting point is 00:16:27 and so on. LLMs are also different from humans in that first is pragmatic stuff like cost models and so on. So LLM, you have to pay based on inputs and outputs and there's a cost per token.
Starting point is 00:16:39 Humans don't operate that way. I mean, you fix a price for a task and you go for it. Humans and LLMs both make mistakes in unpredictable ways, but you could, the one thing about humans is that you could ask other humans
Starting point is 00:16:56 and you could say, hey, this person said this, what do you think, right? And so you could, by asking enough humans, you could get to the, get to convergence on what the right answer is.
Starting point is 00:17:08 In the LLM world, that's less easy to do. You could certainly turn up the temperature knob for an LLM and get more kind of randomness in your LLM answers, or you could ask other LLMs, but there is an inherent lack of independence there in some sense that the LLMs are but there is an inherent lack of independence there in some sense that the LLMs are all kind of drawing on similar sources. So you're not truly getting independent observations. So that I think is a little bit of a difference. Yeah, so I think those would be the most prominent ones, the cost model and how the fact that in the crowdsourcing world, you could sort of ask multiple humans and figure out what the right answer is. But in the L crowdsourcing world, you could sort of ask multiple humans
Starting point is 00:17:45 and figure out what the right answer is. But in the LLM world, it's a little different. That said, some of the same principles in terms of techniques still apply. You could, so one of the big techniques in the crowdsourcing literature was, rather than doing a big task all en masse, how do you decompose it into smaller tasks
Starting point is 00:18:02 that you know can be done accurately, right? So imagine if you wanted to sort a thousand items. With humans, they're going to make mistakes if you ask them to sort a thousand items. But if you ask them, hey, rank these five items for me, they're more likely to do that correctly. Very similar with LLMs. In fact, if you ask them to sort a thousand items, what they'll end up doing, unlike humans who might struggle with the task and do something, but you'll still find the items in some order, even if it's not perfect. LLMs will just simply make up new items. It's like half of the stuff is stuff that came from the original set and then the rest is just like random stuff.
Starting point is 00:18:48 That's brilliant. Yeah, just make this new thing as well and put it in your order. Yeah, you had 1,000 items, now you've got 1,500. No, that's fascinating. That's really cool. Now, I think the future for LMS is going to be really interesting to see the impacts they can have kind of going forward. There's a lot of investment going into them,
Starting point is 00:19:08 so many different levels as well, right? So, so yeah looking forward to seeing kind of what that brings and what your research in the area brings as well but you also mentioned that you've kind of got a lot of plates spinning at the same time and working on other sort of research projects as well so yeah what else have you kind of got on your plate at the moment yeah so at the moment the other big projects that we have ongoing are trying to think about how do you use. Okay, so it's all going to be LLMs because that's what's on my mind these days. But how do you use large language models to make sense of large PDF document corpora? Let's call it that. So collections of PDF documents. And one specific target application that we've been focusing on a lot is police misconduct. So we're working with folks who are trying to make sense of massive volumes of police misconduct data
Starting point is 00:19:58 that are arriving thanks to new state legislation in the state of California. So SB 1421 and SB 2, I believe, which led to police departments being required to release all of their police misconduct information. So imagine now you have a million, millions of PDF documents, right? And our journalist friends described this as the police departments basically give you a stack of paper, but they don't give you a stack of paper in the way you want it. They'll just throw it up in the air and then wait for it to all fall down and then they'll collect that and then give it to you. So it's not organized in any which way. And it goes beyond not being organized, right? It's like PDF documents are inherently a hard format to work with. And now you're further in this adversarial relationship with these police departments
Starting point is 00:20:49 in that they're not incentivized to give it to you in a way that you can make sense of it. Now, come LLMs. Can you be like, hey, LLM, here's my PDF. Answer me these questions, right? Who are the police officers mentioned? What is the location? What is the date, right? Basic stuff so that you can be like, okay, journalist comes in, they're trying to investigate a particular cop. They want to find other incidents that that cop has been named in. Can you help them with that, right? Or can you, if the journalist wants to ask questions like, okay, how often is there canine, use of canines in police misconduct cases? How often is there use of batons? How often is there mental health issues involved?
Starting point is 00:21:28 How often is there drug use involved? Whatever, right? Like they want to investigate these sorts of things. Combing through millions of PDFs is impossible. LLMs help you get part of the way there. But the challenge is that LLMs, as we just discussed, they hallucinate, they make mistakes. Simply giving them a PDF and say, hey, answer me these questions, it doesn't work. So you need to figure out how do you do it in a manner that's more reliable.
Starting point is 00:21:52 So one standard technique is you can chunk up the document and say, hey, look at these portions of the document and then kind of work your way to the end. But that doesn't quite work because you can't really interpret a document chunk in and of itself. You need the context before it to make sense of a given chunk. So there's all kinds of challenges in making sense of large collections of documents. And this is not just one application here just to illustrate the challenges, but we are trying to be as general as we can in saying, hey do you support document what i like to call document
Starting point is 00:22:26 analytics right on large document collections awesome yeah i i had it when kind of you explained there a few things kind of bouncing around my head kind of is there like an explainability provenance aspect to the work as well in a sort of a case of like okay i'm saying that there was this canine there's this many canine incidents or we're using this many incidents, then it will actually link you back to the kind of, oh, this is how I figured this out, right? So there's that kind of angle to it as well, rather than it making up 10 reports and being like,
Starting point is 00:22:52 yeah, there was canines using everything. Totally, totally. Yeah, provenance is extremely important here because these journalists are not going to trust any random numbers we throw at them saying there's X number of incidents. We are not going to trust it. So what you want is pointers back to the source, right? So you want to be able to say, hey, there's these incidents and here are the excerpts that led me to believe that these were the... So they do want to be able to double
Starting point is 00:23:21 check everything. And so it's not as much about simply saying yes the here is an aggregated answer but you also need to provide the provenance associated with it yeah maybe maybe this next question is getting too much into actual implement implementation details but how do you actually how does an llm actually read pdf data because i actually aren't really too sure on the format of pdf, like what it actually looks like when it's like, I guess it's a bespoke format that Adobe created at some point, maybe. How does it actually,
Starting point is 00:23:51 how do you actually feed that into an LLM? Yeah, so I think what we end up doing, I think there are mechanisms to do it where you treat PDFs as images. So that's one approach. But what we end up doing usually is just like applying some OCR techniques and then providing the OCR output to an LLM. Now that causes some loss in information, but that's a trade-off, right?
Starting point is 00:24:15 Text is a much easier format for LLMs to understand. It's cheaper. And so that's why we opt for OCRing the PDF first before we provide it to an LLM. But there are challenges with even things like OCR, right? So for example, you're doing OCR and you have some name that's redacted, right? The OCR might just skip over that name entirely. And then you kind of just,
Starting point is 00:24:38 you have the words before the name and after the name, just like attached to each other with no gap in between. And then, so the LLM is like confused, what's going on here? So sometimes you're going to have to go and patch the output of an OCR being like, okay, here's actually underscore, underscore, underscore, underscore of this many of this long,
Starting point is 00:25:00 that's this long, which is basically indicating that something was redacted, right? So OCR output isn't perfect, but it gives you part of the way there. Nice, awesome. that's this long uh which is basically indicating that something was redacted right so ocr output isn't perfect but give to you part of the way that nice awesome i'm look forward to seeing all this this research come out of your group over the next few years i'm sure it'll be fascinating to see how it all develops for sure cool yeah so i kind of in the next sort of the the the podcast we're going to kind of shift gears and kind of go look do another bit of a retrospective i guess
Starting point is 00:25:24 we've spoke about your genuine career so far but we're kind of i kind of want to kind of get your take on what you're kind of most proud of of the the of your career you know you work on a lot of things like social injustice and stuff so it must be really rewarding to kind of see your work have some real world impact so yeah i guess the question is, what are you most proud of? Yeah, I think I'll answer the question in two ways. I think that what I'm most proud of over. So in August 20, yeah, August this year will mark my 10 years of being faculty. Right. So that's a milestone. Thank you.
Starting point is 00:26:07 So what I'm most proud of through my faculty career is just the impact that I've had through my students. I'm very proud of the students that I've graduated. Some of them have ended up becoming entrepreneurs. Some of them have ended up
Starting point is 00:26:21 becoming professors. Others have ended up in industry. And that's really what I'm most proud of. And especially seeing the trajectory of some of these students where they come in and they're either unsure of what they want to do up things that you wouldn't expect them to pick up. And eventually, they supersede you in knowledge and ability, right? And that's really what you want as an advisor. So that's been really, really gratifying. And the students that have graduated are truly amazing. And I'm very grateful to have had the chance to work with them. In terms of bits of work that I'm most proud of, I have this variety
Starting point is 00:27:07 of different projects that I feel quite proud of. And I can give you a couple of examples. One project, a couple of projects that I'm pretty proud of in the last five years were to do with data science tooling. And this led to the startup Ponder, which was eventually acquired by Snowflake. And so the students who led both those projects, Devin Peterson and Doris Lee, they each targeted the open source community and try to address challenges that data scientists and data analysts face when they are trying to use open source packages in tools like computational notebooks, right? So Doris Lee, for example, identified that when you're trying to do data analysis, you often need to write dozens of lines of code to get to visualizations. And this is just a barrier to data analysis, right? So every time you want to get an insight, you need to write dozens of lines of code,
Starting point is 00:28:04 right? That's annoying. And pick your favorite visualization library, Matplotlib, Plotly, what have you, right? In none of these libraries is it easy to get a single visualization, right? And so what we asked was a question, hey, can you get Visualizations Institute during analysis without prompting, right? Can you just get visualization recommendations during data analysis in data analysis library like Pandas? And so our two lux synthesized lessons that we had developed over like the last five to eight years on visualization recommendations and built it all into the usable
Starting point is 00:28:40 tool that sits within computational notebooks provides visualization recommendations out of the box. So you print your data frame, you get visualization recommendations. So it's really, really neat, a lot of impact. I think maybe half a million downloads and users, which is pretty neat. So I'm pretty proud of that. Separate project, which also led to Ponder, was this Modin project, which was basically around scalable data frames. So also centered around Pandas. And Pandas, if you didn't know, should have mentioned, is the most popular data science library, right? Like it is the reason why Python is so successful, right? But at the same time, Pandas is a beast,
Starting point is 00:29:22 right? There's like 500 functions within pandas a lot of redundancy it's a mess there's no optimization often gives out of memory errors and so on and so devin's thesis centered on this tool moden which is it's a drop-in replacement for pandas so preserves the pandas api but applies database and superior computing techniques to scale that up, right? So out of the box, you get speed ups because we now have thought about this more carefully, right? And we can leverage multiple cores. We can leverage query optimization, all being applied to this new beast, which is DataFrames. And so that, again, was a successful open source project.
Starting point is 00:30:02 It's continuing to be successful. I think it has like a million downloads a month or something. So that's pretty amazing. So both of these projects I'm very proud of because they had a lot of impact and a lot of usage. And the students, Doris and Devin, for now at Snowflake, spent a bunch of time listening to the open source community and drawing from what are their best, what are their challenges and how do you address them rather than simply saying, Hey, we build another tool and chuck it over the wall at the people. Right. So that's something that we departed from in these two projects.
Starting point is 00:30:35 Yeah, definitely. Kind of know your customer, right? Know your customer, like kind of know the end user and know what their pain points are and then fix their pain. But if you can solve a problem with someone, make someone's life easier, like you're going to have so much, like, I don't know, that's a tool that's going to get used, right, as well.
Starting point is 00:30:49 And I guess that speaks like half a million downloads, a million downloads, right? There's some impact there. This is a podcast on high impacts and there is high impact. So, yeah, that's fantastic. Yeah, and I guess they've now been integrated into Snowflake's platform, I guess, and kind of all of the I don't know how many users Snowflake have as well. Right. So they'll be all they'll be all benefiting from it as well. So that's that's fantastic.
Starting point is 00:31:11 Exactly. Yep. Cool. So, yeah, the next kind of one sort of section we're going to we're going to jump on to sort of like motivation. And I dropped this question on you yesterday in an email with kind of last minute, and asked about what your favorite papers are. So I don't know if anything's kind of come up. But yeah, what is your favorite paper, papers? Yeah, so I thought about it since yesterday, and I still don't have a good answer for you. I feel like through every phase of my career, there are papers that I've been inspired by. And I think the kinds of papers that I, okay, so here are some timeless ones that I feel like I go to time and time again, and I feel excited to reread whenever I reread them. So one example of this, and this is reflecting in, I guess, the style of research that I do as well one of these papers is this paper that kind of laid out the foundation for Tableau right so Tableau is
Starting point is 00:32:12 visualization platform or visual analytics platform and the paper is is paper its title is Polaris so that was the name of the system before Tableau. And so in this paper, they laid out a way to think about visualization, visual analytic systems. And so they laid out an algebra for how do you specify what a user is seeing on a visualization canvas and how do you translate that into data processing queries in the backend. And so I think what I liked about that paper and what I took away from it is this seamless blending of user interface aspects with data processing aspects in a way that combines the best of data management and HCI principles. And so, of course, the paper and subsequent tool have been enormously impactful, right? So Tableau had a big IPO and then was acquired by Salesforce a few years ago.
Starting point is 00:33:11 So it's had a lot of impact. Another paper that I'd like to mention that I think also embodies this principle of bridging HCI and data management is, I mentioned two from my colleague, Joe Hellestein at Berkeley. One is called online aggregation. So again, thinking about end users and the fact that you don't want to wait until all of the results are generated when you quickly get a sense of what's going on, right? So for an aggregate query. So online aggregation basically talks about how do you provide approximate results as the results are being generated. A separate paper also from Joe,
Starting point is 00:33:53 which is quite influential, is Potter's Wheel. So this is trying to do data cleaning and how do you figure out what is a good metric for cleaning up your data, right? So Joe and Vijay Shankar used this metric called minimum description length to figure out what's the best type to induce on any given column so you can clean your data. In addition, they figured out an algebra for data cleaning,
Starting point is 00:34:22 which they then operationalized in this tool called Wrangler a decade later. And then that led to this company called Trifacta, which Joe founded and then was acquired a couple of years ago. So the reason I bring up these three papers is because it's all bridging human-computer interaction and database principles. It's the style of work that I enjoy doing and I take inspiration from. So thinking about end user needs while also trying to address some of the scalability challenges. We'll put links to that in the show notes. And I've definitely got one more to add to my reading list there with the Polaris papers, but I need to put onto my monotonically increasing reading list. Sounds good.
Starting point is 00:35:03 Yeah, cool. So the next thing i kind of want to want to touch on is is setbacks so we've talked about all the high impact great work that you've done over the over the years but i progress is obviously non-linear right there's there's this doubt there's like the rejection the setbacks so yeah i want to kind of kind of ask how you deal with those setbacks and what your approach is if you have a systematic way of approaching sort of rejection and setbacks yeah so okay so on this particular topic turns out i gave a keynote at some vldb phd symposium workshop which was which happened remotely because we were i think it was you know it was a remote conference. So I think peak COVID at the time. And so I have a Loom video that addresses this very question. So it's about how to deal with
Starting point is 00:35:51 the rejection. So I'm happy to share the link. Yeah, share the link. We'll put that in the show notes so the listener can go find it. Sounds good. So in that brief 10 minute video, I talk about how as a PhD student, I had a bunch of setbacks. So I think my first eight conference submissions were rejected. OK, so imagine imagine a PhD student who deals with that much rejection. Like one would think, why did I even persist? Right. Like, I mean, that's a ridiculous number of amount of rejection to get as an early stage PhD student. And so it took me until my fourth year when I actually started getting papers accepted.
Starting point is 00:36:34 And eventually all of those, so these eight rejections are not actually all the same paper. So imagine three different papers, all getting rejected multiple times, two to three times, right? And then all of these papers, eventually in in year four they all got published somewhere right sometimes by just simply changing the audience sometimes by changing the introduction changing the like random things that you wouldn't think have a huge impact but but they do right and so so yeah so i think that gave me a lot of thick skin. I think like now when I deal with rejection as an early stage faculty member or post-tenure, I realize that I don't take it that personally or I shouldn't take it personally.
Starting point is 00:37:31 It's just par for the course. Just learn the lessons from it and move on. Right. And at the same time, I think it's important to know, A, that there's a lot of randomness involved in everything that you do as an academic. There's a matter of taste, right? So some people like the stuff that you're doing. It doesn't mean it's bad work. It's just that some people don't like it. And so you figure out how to best appease people who may have a different research taste than you, but you reframe the work in a way that you think may improve the work, but you stand by it, right? If the work is good, it'll eventually get in somewhere. So that's how I think about rejection. The other place where I've had a lot of rejection is in, I've applied for a variety of faculty jobs at various points,
Starting point is 00:38:04 I've gotten rejected. I've also gotten rejected from grants several times. Again, there's a question of persistence, a question of learning from what went wrong, try to get as much feedback as possible and try to improve on it. Of course, even to this day, rejection stings, right? Like if I get a paper rejected, I instinctively get angry right like i'm like damn it reviewer number two exactly damn it reviewer number two and so it's important to not react immediately then give it a day or two let it sink in others have said this better than i have but like let it let it digest the rejection then come back and try to figure out what to do.
Starting point is 00:38:45 And sometimes this means just tossing it in again. Just because it was a random reviewer who did not understand the value of your work, even though you felt that they should have, doesn't mean you stop submitting that work. You resubmit. Sometimes it is like if enough reviewers are giving you the same signal that they didn't get the point of the work or there's some fundamental flaw, that means that you should go ahead and try to fix that before you resubmit, right? So simply tucking it over the wall and expecting a different outcome this time doesn't actually help. So you
Starting point is 00:39:19 should try to see what the signs are and try to read between the lines of the reviews right so often the reviewers say hey here's a hundred different things that are wrong with the paper but this they are often fixated on one thing which is more of a deal breaker than the others right if they say oh i like here's a typo here's here's this additional experiment you could have added here's this like here is i i didn't you could add a discussion about this. These are fixables and most likely these are not what led to the rejection. So it's a matter of, it's an art to read a review and try to identify what led to the eventual outcome.
Starting point is 00:39:59 Like what was the deal breaker for this reviewer? And sometimes it's implicit, right? Sometimes it's like i was just not excited enough or i thought the writing was sloppy or i thought that this project has no real impact right like and and so distilling that lesson will take time and effort and then often you get better at it with experience yeah no i think that's really good solid advice and how to kind of approach it it's uh i think yeah i think someone solid advice and how to kind of approach it it's uh i think yeah i think someone says to me almost like if you treat it as like an opportunity
Starting point is 00:40:30 to make it better and rather than it being sort of necessarily a reflection on you as an individual i have that i have that detachment between yourself and obviously it's a lot easier said than done right we still kind of get angry initially when it feels like you've been sort of like it's an attack on you person right rather than sort of but yeah detach yourself from the work and see it as an opportunity to make things better and yeah it'll um it'll get there in the end there's so many stories of i guess influential papers over the years that have been rejected three four or five times that have been gone on to have like crazy impact so yeah yeah keep plugging away so yeah and the on that on that particular note right like i've had every single one of my award-winning papers has been rejected at least once so yeah so that's
Starting point is 00:41:14 the statistic for you yeah there we go you actually i actually wanted to get rejected once then it'll win the best then it'll win the best paper right right? That's what we need now. Yeah, exactly. Some causation there. Exactly. Cool. Awesome. Yeah, so the next question is a question that I borrowed from the regular format that I really, really like for my podcasts. And that's about the creative process and how you approach idea generation and then selecting projects.
Starting point is 00:41:39 And yeah, what is your process for that? Do you have a process for it? Are you systematic or is it more sort of serendipitous and kind of in the shower shower thought sort of thing yeah yeah so i think i would say there isn't a lot of staring at a blank piece of paper and coming up with ideas like i don't necessarily do that all that well so often when I get most of my inspiration is by reading other papers so if I when I read papers often get inspired on follow-up work and and so reading the literature often helps me make sense of the world often following what's going on in industry also helped me make sense of the world and try to identify opportunities to go and improve things.
Starting point is 00:42:28 Right. So and so that's another source of inspiration for me. Sometimes inspiration is also retrospective. So often we have a couple of projects on various themes and then you realize that, hey, these are all different facets of the same equation, right? And maybe we could kind of combine all of these together and you end up with a grander, bigger vision than what you started out. So, and the most important way to get ideas or figure out ideas is to brainstorm with your students, right? Like that's where I created most of my ideas is just discussions with smart people. And so I think like the, in terms of ways in which I've personally looked at problem
Starting point is 00:43:13 selection, often looking at existing approaches, figuring out, can you keep the user interface for the most part and then improve the backend somehow, right? That's kind of a philosophy that has worked really well for me. So giving a couple of examples, or rather, how do you take a process or a tool that's popular that is broken and then report pieces of it? So it doesn't necessarily mean the backend, but report pieces of it that would, and then replace it with different pieces that will help make it better. Concrete examples. So a big focus of my work over the last decade has been on spreadsheets, right? Spreadsheets are amazing, the most popular data management tool
Starting point is 00:44:03 out there used by billions of people, right? Except that spreadsheets do not change, right? And if you try to use spreadsheets on a million rows or more, it's going to complain. Spreadsheets often clash and hang with as few as 50,000 rows, 100,000 rows. So we asked the question,
Starting point is 00:44:22 okay, can we preserve the spreadsheet look and feel, the spreadsheet interface, and then repart the backend and allow it to scale to arbitrarily large datasets, right? And so, that led to a bunch of interesting questions. We built a system called DataSpread. We figured out how do you represent data? How do you index it? How do you do queries efficiently and so on? Similar philosophy is applied to Modin, the project that I mentioned earlier, where you keep the Pandas API and then you rip
Starting point is 00:44:50 out the backend and you try to see if you can make it better, right? Lux also follows a similar design principle in that we were like, okay, we don't want to destroy the user experience. We want to change the user experience in that they get visualization recommendations out of the box, but they do it in a drop-in kind of fashion, right? So you enhance existing tools, don't replace it, right? And I believe that this kind of philosophy of looking at popular tools that are obviously fulfilling a need, but are broken in some way right be it in usability in intelligence in scalability and then replace components of it by by by with better components preserving everything else about it i think is a recipe that has worked well yeah i can definitely see how that list instantly leads to better adoption as well right because you kind of they're already using
Starting point is 00:45:44 the tool you've made it better better for them and also you kind of that initially sort of gets around the problem of well for me anyway if i rather than go and learning a whole new tool i just can keep using what i'm using and i get this new fun cool extra awesome stuff for free which is like a much better experience right so yeah i can definitely see how that's asking for success yeah it's like i mean you're dealing with user inertia right otherwise and users don't want to give up their existing tools you say hey here's this new tool like all you have to do is to learn this new language or use this new interface and be like you know what i'm i'm good thanks right and so if you're like
Starting point is 00:46:19 no no you can continue continue to use all of the scripts that you built you can continue to use all of the scripts that you built. You can continue to use all of the Excel files that you have. You just need to use this instead. It's a drop-in replacement. That's a mantra. Then you get instant adoption, right? Like it's a game changer. But it's a lot harder, right? Like because now you're constrained by what this tool is doing, right?
Starting point is 00:46:41 You have to do the Pandas API. You can't do something else. You have toas API. You can't do something else. You have to do Excel. You can't do something else. You would love to do something else where you could be like, you know what? Maybe data frames don't need to be ordered. Maybe Excel doesn't need to be ordered. Maybe if I drop a good fraction of the API, a good fraction of the commands, life will be easier. It'll be more elegant. it'll be more easy to scale up and well that doesn't work like that's not that's not a drop in replacement anymore yeah yeah it's that sort of practical having that kind of practical approach okay i'm going to play by your rules here excel
Starting point is 00:47:16 but i'm going to make it better but it makes things harder but then it has the trade-off right there's always trade-offs and i think it definitely this definitely falls on the right side of that trade-off because you get that it's worth putting up with some of the the kind of the crude and the difficult aspects of these things because it gets the adoption and you're making people's lives better definitely and cool my my next kind of question which kind of got two more sort of kind of big level sort of topics kind of want to cover off and that is bridging the gap and we've spoke about kind of across the podcast some of the the real world impact your your work your work has had tools like looks and how that's been sort of integrated into into into looks was integrated into pond into
Starting point is 00:47:54 ponder right which was then that's the the flow that was one of the ones that was managed right yeah cool so yeah i just kind of want to get your take on what the current interaction between academia and industry is like and how it can be in what the problems are and how it can maybe be improved yeah so since my work is fairly user centric so i am very much informed by what users are currently doing and what are the problems that they are facing. And then we try to build tools that will help plug the gaps or help be enhancements to existing workflows rather than replacements thereof. So I spend a lot of time thinking about what do people use? What do data practitioners, be it in kind of nonprofits
Starting point is 00:48:45 and small like underfunded organizations all the way to like the big industry behemoths, what do they use and what do they care about and what are their concerns, right? So a lot of my work does involve going and talking to people, right? So a lot of the papers that we publish, what would be more traditionally regarded as human-computer interaction papers because they are user studies and user surveys and need-finding studies and so on.
Starting point is 00:49:13 So these are all the things that you take for granted from an HCI community. You don't find as much of it in the database community. So overall, I do think that if I were to think about lessons for the database community, especially for PhD students and folks in academia, like talking to real users, I think is really, really important. Even if your work isn't on tools that are in the data science or BI world, even if what you're doing is quote unquote, hardcore data based stuff, right? Even if your target audience is still like data engineers, or like hardcore computer science folks, still going and talking to them and learning about their problems and identifying their concerns or constraints is still informative, right?
Starting point is 00:50:08 Because it'll still help you realize whether the problem that you're working on is the right problem or not. So I think that's, I'm a huge believer in that talking to users is always helpful. If not anything, you'll get some confidence that what you're doing is the right problem, right? So that's good. The second thing that I advocate for, at least for the database community, is there's a lot of work that is kind of one-off algorithmic papers that you see in both BLEB and Sigma, which is like,
Starting point is 00:50:37 hey, here's this algorithm to do this one thing, right? And algorithms papers are great. And I've written my fair share of algorithms papers. But I've, over the last decade, I've insisted more and more that these algorithms papers when you adopt this algorithm in the context of the real system. Only through that process will you learn that. And so I think those are my two kind of takeaways for my work and for the database community. The first is talk to users as much as possible. The more you're informed by their need, be it data engineers all the way to like business analysts who don't know programming, right?
Starting point is 00:51:33 Like irrespective of where you are in the spectrum and where you land, still think it's useful to talk to users. Second lesson is if you're doing more algorithmic work rather than systems work, embed it in a real system. Take a real system where you think your stuff can be used if you're doing i don't know graph database a graph algorithm right perhaps you should be implementing your stuff in your 4j right like or or some other graph database so that's just kind of a takeaway for folks
Starting point is 00:52:01 in the community who are thinking of things in a from a more algorithmic lens awesome i think that's a two really good messages there that they talk to you this one's really cool because i mean i think i kind of did a lot of work on sort of concurrency control right and kind of the mantra there is let's make serializability as fast as possible because that's the best and that's what everyone needs and then you go and look at what people actually what real systems do they run at weak resolution levels no one's actually using it people are doing things so like why are we optimizing for this case that no one's using the disconnect there right so we need to bridge that disconnect that and we do that by talking to people who actually
Starting point is 00:52:36 use the damn systems right and build applications on it so you need to understand your audience right and speak to them and understand their their their their problems so no i really i really like that i'm definitely getting you getting your kind of uh your hands right and speak to them and understand their their their their problems so no i really i really like that i'm definitely getting you getting you kind of uh your hands dirty and trying to implement it in a real system as well as uh and kind of seeing the interaction effects and seeing how it actually plays out in practice is great because it's we can all say this thing will be faster but until you actually go and kind of put it in the real world then i guess yeah the proof of the pudding's in the eating right you've got to go and do it so now i think they're they're two two really good points awesome so yeah it's kind of time for the the last sort of the last
Starting point is 00:53:08 sort of point now really and that's kind of what you think are the most promising directions for kind of future research and sort of exciting trends that you see at the moment and obviously llns maybe you're going to feature in this answer i I'm not too sure. But yeah, kind of watch your own. You kind of take on the future, I guess. Yeah, yeah, yeah. Your guess is correct. So I think maybe this is also a time for me to describe a little bit this lab, the Epic Data Lab that you mentioned right at the top.
Starting point is 00:53:38 So we are, as part of the lab, we are thinking about how do you build low-code and no-code tools for data work, broadly defined. And so this is ranging all the way from data extraction, data cleaning, to building machine learning models, visualization, sensemaking. So the entire spectrum of data work. And so if you were to say, hey, how do you build low-code and no-code tools for that? This is a vision that's been there for decades, right? It's not like it's new, right?
Starting point is 00:54:08 How do you make it easy for people to get insights from data? How do you make it easy for people to extract information, integrate it, prepare it, clean it, what have you, right? All of this is decades-plus old problems. Now we have this new capability of large language models, right? So I do believe that you pick every stage in this pipeline, data extraction, data cleaning, data transformation, what have you. If you were to say, okay, now large language models is a component,
Starting point is 00:54:36 how would that change the equation, right? So I do believe that it makes things better in certain ways. So you can interpret fuzzy input better from users. You can operate on unstructured data better, like the document example that I gave earlier. It'll also help you synthesize programs. So it can synthesize SQL, it can synthesize Pandas scripts, it can generate a bunch of different program fragments for you. So now you have a way to handle fuzzy inputs and generate fuzzy outputs, as well as synthesize programs. However, we also know that LLMs by themselves are not going to work because they can hallucinate, make mistakes, blah, blah, blah. So how do you build in the remaining ecosystem around this LLM that will help you do things like data cleaning or data extraction, data transmission?
Starting point is 00:55:34 The way that we're doing this is, okay, so perhaps a user comes in to some interface. This doesn't need to be a chatbot always. Chat is perhaps a poor interface for most data-centric tasks. But let's say if you want to do the data cleaning, you come into some interface where you specify your task. And this could be in natural language, but it could also be in the form of an intuitive web UI. It could be in the form of DSL. It could be in the form of examples, demonstrations, any number of flexible means of specification. This gets fed into some kind of LLM-based synthesis approach, which considers a bunch of different interpretations of what a user had in mind.
Starting point is 00:56:16 And then the process doesn't end there where it just picks one approach and then does it. It takes these approaches, and then we figure out a way to show it back to the user and say, hey, here are various ways in which I interpreted your idea. You wanted me to do this. Here are ways I can accomplish this. Pick between these options which one you want. And if you want to restart the process or if you want to change your query entirely, we can do that as well. So how do you engage in a dialogue between the system interpretations for what the user had and the user so that they can guide the system to
Starting point is 00:56:49 what they wanted accomplished? And the hope is that this would also operate on structured data, semi-structured data, what have you, right? So that's a vision of the Epic Lab. So it's like a human-centered approach for making sense of data. Low-code and no-code is the name of the game, but it's like a human-centered approach for making sense of data. Low code and no code is the name of the game.
Starting point is 00:57:07 But it's also like flexible interfaces that are not just constrained by chat, but like what else can you provide, right? Can you provide a GUI? Can you provide examples? Can you provide a demonstration? And then how does the system make sense of all of that? So it's going to bring bring together techniques from hci databases programming languages and program synthesis as well as of course yeah lms are a component of it
Starting point is 00:57:31 but like to really harness the power of lms you need all of these other disciplines yeah i really like that sort of interactive aspect of that and how you can imagine kind of someone going through that process and it kind of saying oh these are the things i could do do you want to do this and that sort of interactive iterative sort of experience sounds a really nice user experience as well. But yeah, no, that sounds awesome. I'm sure there's going to be some really cool work coming out of EuroLab for the foreseeable future.
Starting point is 00:57:54 That's for sure. So yeah, I guess that's the end of the podcast. Thank you so much for coming on. It's been a fascinating chat. I'm sure the listener will have absolutely loved it as well. Where can we find you on social media or anything like you on any of the platforms linkedin twitter i'm trying to i'm trying to stay away from social media but i am on twitter and linkedin and uh yeah i'm on all i'm on all of them but i'm trying to stay away from them okay that's
Starting point is 00:58:22 probably the the healthiest thing to do to be honest i try but yeah they always reels me back in but anyway cool um thanks so much for having me a fun set of questions yeah yeah it's been an absolute pleasure and yeah i guess we'll see you all next time for some more awesome computer science research Thank you.
