Disseminate: The Computer Science Research Podcast - High Impact in Databases with... Joe Hellerstein
Episode Date: July 1, 2024. In this High Impact episode we talk to Joe Hellerstein. Joe is the Jim Gray Professor of Computer Science at UC Berkeley. Tune in to hear Joe's story and learn about some of his most impactful work. The podcast is proudly sponsored by Pometry, the developers behind Raphtory, the open source temporal graph analytics engine for Python and Rust.
Transcript
Hello and welcome to Disseminate the Computer Science Research Podcast, as usual, Jack here.
Today is going to be another installment of our high-impact series, but before we get onto that,
I need to give a shout out to our sponsor, Pometry. Pometry are the developers behind Raphtory,
the open-source temporal graph analytics engine for Python and Rust. Raphtory supports time
traveling, multi-layer modeling, and comes out of the box with advanced analytics like community evolution, dynamic scoring, and temporal
motifs mining. It's blazingly fast, scales to hundreds of millions of edges on your laptop,
and connects directly to all your data science tooling, including Pandas, PyG, and LangChain.
So go check out what the Pometry guys are doing at www.raphtory.com, where you can dive into their tutorial for their new 0.8.0 release.
Today we are going to be talking to Joe Hellerstein. Now Joe is the Jim Gray Professor of Computer
Science at UC Berkeley, where his research focuses primarily in the area of data-centric systems
and he's interested in the way they drive computing. Joe has won numerous awards across his career. He's won
the Codd Award, an Alfred P. Sloan Research Fellowship, and he's also been featured
in Fortune magazine's Smartest in Tech list. He's also led numerous open source projects,
including Bloom, MADlib, and Telegraph. And he's also started several companies,
including Aqueduct and Trifacta. And for new listeners to the show: this type of episode is based off of a blog post by Ryan Marcus around the most influential people and papers in databases.
And Joe, you're number eight in the rankings at the moment.
So doing good.
Welcome to the show.
Thanks for having me.
It's fun.
Awesome stuff.
Cool. So I've given obviously the highlight reel there of everything you've achieved so far
in your career, or some of the things.
But yeah, help us color in between the lines now and tell us about your own journey in
your own words.
Let's see.
So I was born into the database research community out of IBM, Berkeley, and Wisconsin back in the day when those were
sort of the three dominant areas of database research. I did my undergrad at Harvard with a woman named Meichun Hsu, who later on went to HP Labs. And from there, right after college,
I went to IBM Research as a pre-doc intern in the lab that was famous for System R and R Star.
And they had a project at that time called Starburst.
So I was part of the Starburst team working on query optimization.
And that stuff apparently still ships in DB2.
So that was long ago good stuff, extensible query rewriting.
And then I went to start a PhD at Berkeley with Mike
Stonebraker on the Postgres project. But being a young man and full of beans, I decided after one
year that that wasn't where I was happy. And I transferred to Wisconsin, which is actually my
hometown, and finished my PhD there with Jeff Naughton. But I continued to work on Postgres.
So my PhD work is all in the context of the Postgres project. Mike Stonebraker was super supportive, helped recruit me back to Berkeley as a professor. And I've been there ever since, since 1995, leading my own research and expanding out of core database systems into all kinds of things that interest me where there's collaboration. Stonebraker left Berkeley a few years after I arrived,
which sort of forced me into collaborating with other folks.
And so I've been working on things like machine learning
and data visualization and networking and operating systems
and how those things connect to data management
over the course of my career.
Awesome.
So, going back to the, I don't know, 12-year-old, 13-year-old Joe, did you always want to become a researcher? Was that always the sort of desire from being very young?
[...] what was called a systems analyst, which means COBOL programmer. So I kind of fell in between them,
but I also have two big sisters who went on to grad school and became
professors. So I just did what everybody else did.
I was well-behaved as a child.
And so I just went into the family business.
Awesome stuff. And also it's the first time I'd heard about Starburst there.
Now the name of that: did you have Starburst in the US, the sweet, right?
Like the little sweets?
I believe they were there at that time.
Right, okay.
It's long enough.
Cool.
Awesome.
Cool.
Awesome.
So let's get on to what you're working on currently today then. So yeah, give us the high-level overview of the stuff you're working on today.
Yeah, well, the primary project I'm working on is a project called Hydro,
and it's the culmination of a very long effort on my part to bring declarative languages and their power to other areas of computer science.
And we did this in the early 2000s in declarative networking.
So we showed how you could write networking protocols in high-level languages and then compile them down from those high level languages to implementations
that made sense in different contexts. Like the same spec would give you a wireless protocol or
an internet protocol, depending on how you cost out links and reliability and stuff.
So there's a whole generation of work on declarative networking. What we're working
on right now is high level languages for distributed programming in general.
And the thesis here really is that there's only
one language that succeeds in scaling from one core to the globe, and that language is SQL.
It runs in your phone, you know, in SQLite, and you can take the same query that runs in your
phone and run it on Snowflake across the globe. It's pretty amazing. Most programmers don't think
of that as an option for writing general purpose code.
So we're trying to figure out where those gaps are and what are the optimizations required for general purpose programming to scale up and scale out across lots of machines.
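To make the phone-to-globe point concrete, here's a minimal sketch (illustrative, not from the episode): the very same declarative SQL text can be handed to SQLite locally or, unchanged, to a warehouse like Snowflake; only the connection and the engine's physical plan differ. The table and data here are made up.

```python
# Run one SQL string against a local, single-core engine (SQLite).
# The identical string could be submitted to a cloud warehouse instead;
# the declarative query doesn't change, only the engine behind it does.
import sqlite3

QUERY = """
SELECT region, COUNT(*) AS orders, SUM(amount) AS revenue
FROM sales
GROUP BY region
ORDER BY revenue DESC
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("emea", 10.0), ("amer", 25.0), ("amer", 5.0), ("apac", 8.0)],
)
for row in conn.execute(QUERY):
    print(row)
```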
And in the Hydro project, we've got a prototype going now after a couple of years that can do some pretty cool things,
ranging from we can optimize low-level protocols
like Paxos for high bandwidth as implemented in Hydro.
And we can also build systems infrastructure
like key value stores that auto-scale in Hydro as well.
And you get very tight specifications
that automatically optimize themselves.
Awesome. Is this already integrated into sort of GCP, AWS?
Or how is that path looking for it to become sort of general purpose, I guess?
Yeah, well, we're on a road with that and we actually have some developer support from Sutter Hill Ventures.
So they have a couple of developers who I actually manage who contribute to the project.
So we have some professional coding as well as grad student coding going into this.
And it's a Rust library.
And so it's all in the Rust ecosystem.
And we're not in general release right now.
Certainly you can go to the Hydro Project GitHub and play with it.
But we're still very much in research mode right now.
But I'd say in the coming 12, 18 months, we should have something that's worth playing with.
Awesome stuff.
Cool.
Yeah, kind of while we're talking about the Hydro project, and I mentioned Starburst earlier on, I see one of the key components is also called Cloudburst.
Is that in any way a link back to Starburst or is that just kind of Cloudburst sounds
cool, right?
Yeah.
Cloudburst sounded cool.
That's actually, that project is end of life.
That was a very early Hydro effort.
The folks who did the Cloudburst work, which,
by the way, is functions as a service with state. So think of Lambda with good state management.
So we were all very disappointed that Lambda didn't have any data when it first came out.
And we wrote a sort of an opinion piece about that that created a stir. But when we did something
about it, we built Cloudburst. So the folks who built that actually have gone on and started a company.
It was called Aqueduct, which you mentioned, but the name's changed to RunLLM, in a sign of the times.
But it's essentially a cloud-hosted environment, the infrastructure. And the target application there is doing retrieval-augmented generation: feeding retrieved documents into an LLM for API-driven software. So if you want an assistant for your open source package or your company's APIs, RunLLM will build you one and host it.
Sweet. We'll put a link to all these cool things in the show notes, so the interested listener can go and check them out. Cool. So yeah, this
podcast likes to talk about high impact,
right? It's called the high impact series. So let's have a little bit of a retrospective now on your career, Joe. So the first question I want to ask you is: what are you most proud of in your career so far? And does this necessarily correlate with your work that's been the most impactful?
That's a great question. I'm always most excited about what I'm currently working on, but it comes from somewhere. So looking back, early on I got interested in this idea of online aggregation, where you ask
a query where it's going to take hours to compute and maybe only produce one answer or a small table
of answers. Can we get early returns from that election, so to speak? Can we get a prediction
of what the final result is going to be while it's running? That early work got me interested
in how databases interact with humans, because it was really driven from human impatience.
And this, by the way, was the very early days of web browsers. So I also was very enthusiastic about the Netscape Navigator when it first came out. Its chief feature that I liked over Mosaic,
which preceded it, was that it had interlaced GIF support. So your images would start to
download over the small modem, the small baud rate modems
incrementally. So you could get an early view of what the pictures were going to be before they
fully downloaded. I just thought that was super great. I wanted to do that for queries.
So this was very much human driven work. And that made me realize that like
for compute-intensive or data-intensive tasks, you know, there should be some care for, like,
what's the user experience of that?
And that drove me generally to be interested
in interactivity and streaming
and all these kinds of things.
And that knits through my work in a lot of ways,
particularly as it meets declarative specifications.
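As a toy illustration of the online aggregation idea (a sketch in the spirit of the work, not the published algorithm): scan rows in random order and report a running estimate of the final answer, with a rough error bar, long before the scan finishes.

```python
# Online aggregation, cartoon version: a random-order scan means any prefix is
# a random sample, so we can print a running AVG with a CLT-style ~95% interval.
import math
import random

random.seed(0)
table = [random.gauss(100.0, 15.0) for _ in range(1_000_000)]  # made-up column
random.shuffle(table)  # random scan order

count, total, total_sq = 0, 0.0, 0.0
for value in table:
    count += 1
    total += value
    total_sq += value * value
    if count % 100_000 == 0:
        mean = total / count
        var = max(total_sq / count - mean * mean, 0.0)
        half_width = 1.96 * math.sqrt(var / count)
        print(f"after {count:>9,} rows: AVG ~= {mean:.3f} +/- {half_width:.3f}")

print(f"final AVG = {total / len(table):.3f}")
```

The early estimates land within a fraction of a percent of the final answer, which is exactly the early-returns experience being described.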
So I was super disappointed when MapReduce came out
because MapReduce was the exact opposite.
It was like, we're going to do a big batch job.
And between every stage, we're just going to put everything on disk.
And there'll be no outputs for you, sir, till we are all done.
And I just thought that was such a big step backwards.
And everybody was so excited about it.
And I sort of said, but you could stream most of the things that MapReduce does.
For instance, it did joins as reduces, which meant that you don't get any join output till the full join was computed,
which is just not required. This then fed into a whole bunch of work in my group on like,
when do you really need barriers? When do you really need to block and when don't you,
which led to the CALM theorem, which is this question of, like, what is coordination for?
And so a lot of sort of the deeper results, the CALM theorem, which I'm proud of, the ideas of online aggregation, ideas of streaming, declarative languages for networking and for distributed systems, a lot of this comes out from the beginning of this idea that, like, computers should give you answers right away. It's, you know, just being impatient and wanting to be interactive and not do batch work. Batch work is from, like, mainframes. I don't understand why people...
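To ground the streaming-joins point, here's a small sketch (illustrative, with simplifying assumptions like in-memory state and equality keys): a symmetric hash join emits results as soon as matching tuples arrive from either input, rather than blocking until one side is fully materialized the way a join-as-reduce does.

```python
# Symmetric hash join: build a hash table on *both* inputs and probe the
# opposite table as each tuple arrives, so join output trickles out early.
from collections import defaultdict

def symmetric_hash_join(left, right):
    """Interleave two (key, value) streams, yielding join matches early."""
    tables = (defaultdict(list), defaultdict(list))  # per-input build state
    iters = [iter(left), iter(right)]
    done = [False, False]
    while not all(done):
        for side in (0, 1):
            if done[side]:
                continue
            try:
                key, val = next(iters[side])
            except StopIteration:
                done[side] = True
                continue
            tables[side][key].append(val)
            for other in tables[1 - side][key]:  # probe the other side
                # normalize output as (key, left_value, right_value)
                yield (key, val, other) if side == 0 else (key, other, val)

orders = [(1, "order-a"), (2, "order-b"), (1, "order-c")]
users = [(1, "alice"), (2, "bob")]
for result in symmetric_hash_join(orders, users):
    print(result)  # each match prints without waiting for a barrier
```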
But yeah, it reminds me of a keynote you gave at VLDB in Sydney with Jeffrey Heer. I remember being kind of blown away. I think it was Wrangler, right, that was the tool that was demonstrated? I was blown away by it and was like, oh yeah, why are all tools not like that?
But yeah, it was really, really interesting. I remember it at the time; it springs to mind.
Cool.
And that's a really awesome answer to that question.
So kind of, again, building on this sort of retrospective sort of topic,
what's the most challenging project you've been part of?
Well, probably, you know, you brought up Wrangler and Jeff Heer.
Jeff and I took the Wrangler code with the student who wrote it,
Sean Kandel, and the three of us started a company, Trifacta,
to commercialize that work.
So this is interactive, essentially AI-driven synthesis of code
for data cleaning with visualization built around it.
Those words weren't the words that were used 10 years ago.
But doing that startup over the course of a decade was the hardest work that we did,
the hardest consistent ongoing work.
Doing a company that lives on, a 10-year journey for a company, is nights and weekends and a lot of heart. And just many, many people engaged
who you're mentoring and caring for
and collaborating with and learning from.
And that was a full body experience,
kind of in the same way that my other job
of being professor has been a full body experience.
But the professor thing is like a career and a title
and the company is like a chapter.
So when you asked me about projects,
that was the biggest pull.
Yeah, cool. Would you do anything differently if you went back and had to do this sort of journey again? Would you maybe do this different, that different?
Yeah, I don't indulge in that kind of thinking much. I'm kind of a forward-looking person.
That's the way to be, right?
Yeah. In part, this also goes along with my very countercultural, for San Francisco, point of view, which is that personal optimization is not something I like to think about very much. Most people out here, they're like, I'm going to, you know, do everything to the nth degree and make sure that I'm always being efficient in my athletics, in my eating, in how I get to work.
I'm like, you know what?
Get up in the morning, live life, and be joyous.
So part of that is also not looking back and saying, what if or what should I have done better?
That's fabulous.
But I suppose you'd like something that I can share with students and with people coming up. In that respect, I do think I never
worked a Saturday in my life because of religious reasons. And I think that's been very healthy for
me. So religion aside, having some discipline for downtime is, I think, something that I'm
very grateful I always did. So that's a thing I would do over again, religious or not, because that changes
in a lifetime, maybe over a lifetime many times. But I think dedicating time to downtime
in a disciplined way where you don't let yourself off the hook and then work,
that's really good, even regardless of how busy and committed you are. In terms of projects, I would say I had
wonderful mentors. They were mostly pretty hands-off. I wish I had learned to be a better
teammate. I think one of the cool things about doing a company was that it was
a collaboration that was forced by circumstance. We have to work together to grow.
As a professor, it's really easy to work on your own. And while I've collaborated a ton,
I've only really had one shared project, I would say, that was an ongoing campus project. That was
the Telegraph project with Mike Franklin, which was a great deal of fun. But there's something I
wish I was better at: teaming up on running things. It's a skill that I feel like I still need to work on.
Nice. It's funny what you said there, going back a second, about sort of everything being data-driven and over-optimizing. I'm a sucker for that as well. Like, I have this Oura ring now which tracks my sleep and everything, and you just get so, like, oh, I didn't get enough sleep last night, I only got seven hours 13 minutes, I need to get my eight hours. But yeah, trying to detach from that, and just give yourself a break a little bit and unplug from reality maybe, or from the modern world, is definitely needed. And not feeling guilty about taking time off as well, because I was a sucker for that during my PhD: I know I should be working. But yeah, sage advice for sure. Cool. Yeah, let's talk about motivation some more. So what are your favorite papers? I saw you do a tweet recently, actually, maybe along these lines, so that might be about one of the papers. But yeah, go for it, Joe.
I think, you know, I have papers that I enjoy teaching, because I teach a graduate class, and you go over them enough times and they're good vehicles for teaching.
Whether they still inspire me is hard to say, because I've just pored over them over the years.
But I think there's early papers in databases that just ring.
They just keep working.
They keep making sense.
So the System R sort of retrospective paper is just great.
The Pat Selinger paper on query optimization from 79, everybody calls the Bible of query optimization.
It's spot on and just, you know,
it was so well seen that this is the problem to be solved
and this is how to think about it.
And you almost can't appreciate it
unless you read other papers
like the Ingres query decomposition paper
from around the same time.
And they just look incoherent now
because at that time, nobody knew what the problem statement was. They were kind of noodling about.
Salinger got the problem statement right. And after that, everything flows, like decades of
research flow. And so there's papers like that that just make sense now because they fit our
mental model of the computation problem. But at the time, she was just pulling that out of thin air,
like putting structure on an unstructured design space. So that paper is great.
A less known paper that I love teaching is a paper by Mike Carey, Rakesh Agrawal, and Miron Livny on performance studies for concurrency control protocols. I don't remember the title off the top of my head. It's from the 80s.
And at the time, there were controversies over whether locking was better than optimistic
concurrency control. And there were papers that had come to completely opposing conclusions.
So the thing that's great about this paper is that it says, how could it be that science tells
us two opposite things? Well, it must be that the scientists were making
different assumptions. So let's crack open the space of like, what assumptions could you make
with a well-formulated simulation mechanism, and then see if we can set the knobs on this
simulation to explain how they came to these different conclusions and perhaps inform what a real world conclusion should be for this controversy. And it's just this beautiful
study and like, here are performance graphs that tell a story and they explain what's going on.
I love that paper, especially for students, because so many research papers we write,
they have performance study at the end that says, I win, you lose. I win, you lose. I win,
you lose under many parameters, right? I win, I win, I win. And those tell almost no story at
all except that binary story. This paper shows you what performance studies should be.
And it also brings to the fore the point that the graphs aren't the issue. The issue is
understanding the problem. The graphs are there to explain your understanding of the problem.
And I think so few research papers in computer science do that well.
So I love that paper.
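In that spirit, here's a cartoon of the knob-turning methodology being praised (entirely made up for illustration; the real paper's simulator is far richer): model locking as waiting on conflict, model optimistic concurrency control as redoing work on conflict, and watch the resource assumption flip the winner.

```python
# Toy locking-vs-OCC study: the point isn't the numbers, it's that stated
# assumptions (here, how costly wasted/redone work is) drive the conclusion.
import random

def simulate(policy, conflict_prob, redo_cost, n_txns=50_000, seed=1):
    """Return throughput as committed transactions per unit of simulated work."""
    rng = random.Random(seed)
    work = 0.0
    for _ in range(n_txns):
        work += 1.0  # the useful work of one transaction body
        if rng.random() < conflict_prob:
            if policy == "locking":
                work += 0.5  # block and wait; no work is discarded
            else:
                work += redo_cost  # abort at validation and redo the body
    return n_txns / work

for conflict_prob in (0.05, 0.5):
    for redo_cost, label in ((0.1, "ample resources"), (1.0, "scarce resources")):
        lock = simulate("locking", conflict_prob, redo_cost)
        occ = simulate("occ", conflict_prob, redo_cost)
        winner = "locking" if lock > occ else "occ"
        print(f"conflict={conflict_prob:.2f}, {label:>16}: "
              f"locking={lock:.2f}, occ={occ:.2f} -> winner: {winner}")
```

With redone work nearly free, OCC wins everywhere; make it expensive and locking pulls ahead, which is the shape of the reconciliation the paper performs.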
This year, I added a paper to that reading, which is by a postdoc of mine, revisiting results from Mike Stonebraker's group at MIT that had been done earlier on concurrency control, but in this case on multiprocessors, like thousand-core processors. They did this paper, Staring into the Abyss, about concurrency control with a thousand cores. So Tiemo
implemented all their stuff. He took their simulator and put it on a real machine and got
completely different answers. So their simulations appear to be all wrong. And again, he goes through
this sort of iterations of assumptions and setting up the problem so that you understand
it's not a simple story. It's not an I when you lose story. It's all performance studies are all about what is your context? What are your assumptions? And why do those assumptions and context lead to the conclusions? people who are like, you don't have enough graphs. This won't get into SOSP because you haven't studied it carefully enough, by which they mean you haven't put in the sweat equity to meet our
community's standard for sweat equity. But I look at the papers that come out in a lot of those
conferences and there's just garbage graphs. They have no bearing on what might connect to the real
world. They tell you very little about what parameters of the setup or the deployment would affect
the results.
And they've been done at great cost to student time.
And in these days,
cloud costs sometimes if they have GPUs involved and the conclusions there,
you can show me lots of graphs.
That doesn't mean that you have a fine grained understanding of the problem.
It just means you ran a lot of experiments and plotted them.
So I really like that
Carey-Agrawal-Livny paper as, like, my avatar for: should I even have a performance study, or have I already explained the solution and the graph is just there for window dressing? Right. So anyway,
those are a couple examples. Of course, you know, in like the data visualization side of my brain,
the work on Polaris that led to Tableau, the
grammar of graphics realized as software.
I think that's Jock Mackinlay's thesis.
That's beautiful work that I learned about in my travels into visualization land.
I'm trying to think of other stuff that kind of knocked my socks off.
I think the Chord paper by my colleague Ion Stoica, David Karger, Frans Kaashoek and others
is a lovely paper from peer-to-peer days that I thought was quite influential on my thinking.
I just thought it was super cool and went and ran off and worked on related things because I thought
it was cool. So those are some. That's awesome. Definitely a few more on the reading list there.
So yeah, good stuff. Yeah. So I guess, kind of, I'm not sure if we've covered this off really, but kind of on inspiration: which of these papers or people have had the biggest sort of impact on your career?
Well, people, I mean personal mentors, I've been super fortunate. So people at IBM, well, even earlier, Meichun Hsu, who people don't really know about so much. She was wonderful; she got me into the field as a college student.
She was a Harvard professor at the time.
But then at IBM, just the whole crew of people, starting with my boss, Hamid Pirahesh, very influential.
Guy Lohman, Laura Haas, Pat Selinger, Jim Gray.
He wasn't there at the time, but I met him through them.
And then Mike Stonebraker was super influential on my career.
He was my master's advisor and then really one of my PhD advisors
and a colleague over time, and an oversized personality for our field,
certainly, over the years.
And, you know, he's a source of friction sometimes.
And, you know, sometimes to create a spark, you need some friction.
Yeah, exactly.
If you want to make an omelette, you've got to break a few eggs, right?
That's right.
So I don't always have fun with Mike, but he's definitely forced me to think hard and sharpen my arguments and sometimes use him as a counterpoint to, you know, I'm not going to do it Mike's way, so what way am I going to do it?
So he's been great. Jeff Naughton, my thesis advisor, much gentler mentor,
saw me through some rough times. Everybody has rough times in their early career. I had my share.
So those are some people. And then, intellectually, since then, Christos Papadimitriou was an amazing sort of source of just, like, the joie de vivre of doing computer science and enjoying it. That guy, there's nobody like him for just joyful embrace of ideas. He's so exciting to hang out with and see lecture, because he just loves computer science, and he conveys that love through the way he lectures and talks about work and collaborates.
Jeff Heer, my collaborator at Trifacta, is a visualization thinker and builder.
Wonderful open source builder.
He led D3 and Vega and he's been very successful in open source, but also such a deep thinker and so articulate.
He's been a real influence.
So those are some of the folks that I've worked with who've been really inspiring.
That's awesome.
Whilst I've got you on, Joe, I want to ask you this question.
So I believe you are the author of this video,
and this had a big impact.
I actually showed someone at work the other day.
This is the crazy concurrency control video on YouTube.
Is that you that made that?
Yeah, that's me.
It's absolutely outstanding.
It's brilliant.
Even now, we were having dinner the other week.
We had an offsite at Neo.
And I said, you must have seen this.
And a few people hadn't seen it.
And everyone was absolutely loving it.
Yeah, tell us about that a little bit more.
Because I want to use that as the intro music to the show,
if that's possible as well.
I think I love it.
It's awesome.
I mean, you know, we're skirting copyright issues on that.
Just a bit, yeah.
Yeah, just a bit.
I don't think they care much.
When GarageBand came out, so I'm a musician.
That's the career I never pursued.
But I'm a pretty serious jazz trumpet player.
But when GarageBand first came out, it really was a revolution for me anyway.
But I think for the industry, suddenly it was super easy to make music, digital music, with very mediocre keyboard skills, which is what I have.
And so I started to do these kind of silly pastiche songs where I would, you know, I was like, today I'm going to do a heavy metal song.
And I will write lyrics about how unpleasant it is to give birth to babies and make a song for my wife.
And then tomorrow I'll do a disco song.
And so I have this album from my family of like silly songs that I wrote, which I will not share.
But in that process, one of the things I started doing, I was like, well, I'll just do songs for class.
And so I did the concurrency control song. And that one was just a karaoke.
I stole the background from a popular song.
And I thought I would do more.
And so the next one was going to be a recovery song set to the tune of Don't Worry, Be Happy.
And then that got repurposed as a birthday song for my mother-in-law.
It never saw the light of day.
I am a one-hit wonder in the database song department. But maybe I'll have more time, as I'm getting older, to do more songs. It's just, music's changed so much now. You know, even the concurrency control song is now an oldie.
Yeah, it's a classic though. Might have to leave it to the next generation to be more relevant. Cool, great stuff. Yeah, so
before we talk about that,
you mentioned kind of things were easy earlier on in your career.
And I kind of want to ask you about setbacks
and how you deal with setbacks and rejections.
It's part of doing business in this sort of,
in this industry.
So yeah, how do you deal with it?
Yeah.
Well, I'll tell the story of my grad school days
because that was the hardest one and probably the one many of your academic listeners will resonate with.
I wrote my first paper at Berkeley. It was my master's thesis, signed off by Mike Stonebraker,
Randy Katz. I can't remember who else, but famous people at Berkeley. Got to Wisconsin,
had sent it to SIGMOD. It was accepted. Very exciting. So I was like on the road.
Life's good.
I'm in the shower one day.
Oh, no, sorry.
Before this, I go on.
I gave the talk at SIGMOD.
Everybody said, oh, you're a very smart young man.
And then Surajit Chaudhuri, who is a couple years older than me, but he had gotten his PhD already.
So let's say five years older than me.
Came up to me afterwards.
He said, that can't be right.
I was like, dude, it's right.
I have a theorem and a proof. It's right. I'm a grad student. I'm smart. He's like, no, no, that can't
be right. And we argued and I was stubborn. Six months later, I'm in the shower. I'm like, oh my
God, it's not right. It's wrong. Here's a counter example. And I was sure my career was over. I had
published a paper. It was wrong. I was going to have to retract it, I guess, right? You can't say something in science that's wrong. And I went into the department at Wisconsin
and I thought that was it, end game. And I went to my advisor, Jeff Naughton, who's a
lovely human being, just gentle and thoughtful and many things that an advisor could and should be.
And he said, you know,
this happens all the time. People write papers, they have flaws and you know what they do? They
write another paper and then they publish that. And it's honest and it takes part, you know,
what was wrong with the previous paper and it tries to address it. And that's okay. So, you know,
what are the cases where your thing is right? And what are the cases where it's wrong? Let's work
that through. And so we worked it through. You know, it was kind of: conditionally, it's right under
certain circumstances.
And when it's wrong, here's what happens.
And maybe here's some empirical kind of heuristics you can put around it.
Publish that paper.
Met Surajit Chaudhuri again at some point.
Fessed up.
Every time, you know, I work with him closely.
We're co-editors of Foundations and Trends in Databases. And about every 12 months or so, I'm like, you know, Surajit, you were always smarter than me, and you were right about that paper. I have to just fess up because it's still,
you know, it still bites me a little bit, but the lesson I think is the one Jeff taught me,
which is that, you know, computer science, fast moving field, get your results out.
If they get published, that's great. That doesn't mean, you know, it's the end of the story.
Sometimes things need more work. And if there were flaws in your work, that's okay. But be open
about them, address them. It's an opportunity for more conversation in the community. And papers are
not like record albums. Or maybe they are, maybe they are. But the point is, you know, your second album isn't your last album. Maybe that's the way to think about it. You know, they're more like a conversation over time, and it's okay for that conversation to have some "actually, I didn't mean that" and "I'm sorry I said that." You know, it's a dialogue, or maybe it's a community dialogue. And that's hard to learn when you've only written one paper, because you feel like you only got to say one thing and you didn't get it quite right. Ah, but there's more chances, hopefully. In most people's lives... I've been real fortunate with that, but most people will get to write another paper if they keep at it.
Yeah, that's
lovely. Cool. So yeah, while we're talking about ideas and things, this next question is my favorite one. It's about the creative process. I love seeing how people's minds work. And everyone has a
different answer to this question. It's how do you approach idea generation? And then once you've
generated some ideas, how do you select what to work on for the next six months, two years,
five years, 10 years? Yeah, that's such a good question. And I've changed my approach over time.
I also think that it's good to change your approach over time. So I feel like I move back and forth between styles to some degree. So in terms of ideas, sometimes I want to continue stubbornly on the same idea. Lots of times I want to jump into a new idea. So I do have a little bit of, you know,
squirrel-chasing tendency on that. Early in my career particularly, opportunities to collaborate with smart people and work on new things were so exciting. But I felt like I had done a lot of small things, and I had a sense I was doing fine. You know, people would pat me on the head and say I was a clever lad, but I wanted to be remembered for something.
And I couldn't articulate to myself what it was that my through theme was, my research vision was.
And so in part, this was inspired by Jennifer Widom, who was another mentor at IBM when I was there.
And in some ways, a competitor who, again, like Mike Stonebraker, sometimes the friction there, I think, was good for me, but very different style.
She's a very meticulous person, and I'm less so.
She was also at Stanford and I was at Berkeley, so we had to compete.
But I could say exactly what Jennifer was working on.
She was so good at articulating what she was doing, and I thought, I need to be able to do that.
So part of the work that I'm still doing on declarative languages and how they can impact the rest of computer science was me saying, okay, I think this is one of the signature themes of my community.
And I think that if I plug away at this thematic advantage of database thinking, I should have an advantage in computing broadly.
So I'm going to take the tools of our trade, take our brand, so to speak,
and see what I can do with it. And so that's been deliberate that I wanted to have a series
of projects that had a through theme while also doing other stuff. I mean, I always have a
portfolio going and that's a luxury of being at Berkeley and having great students who can execute
on things when I'm distracted. They can take random ideas I have and run with them.
So that's kind of the portfolio thing of having some through themes
that I'm working on at all times
and then have space to explore new things
with collaborators who I find inspiring.
And that speaks both to idea generation,
also methodology,
because I think they're hard to separate.
The craft is the inspiration. You know, I think that's true in music or in art or
whatever, as it is in any creative field, that the doing of it leads to the outcomes
as much as some spark going off in your head, right? So having big systems projects is one
kind of doing. Doing creative collaborations with people who come from somewhere else is a different kind of doing. And in the doing of these things, you come up with ideas. And they're different. Like,
you know, the conversations I have with students where there's four of us in a room building an
artifact that's been going for a few years, it's very different than the conversations I have the
third time I'm meeting someone from another area. And we're trying to understand, we're
misunderstanding each other creatively. And, you know, it's just, they're just different. Yeah. Yeah. Something you mentioned
a second ago there about actually implementing something or doing something can kind of, a lot
of ideas can flow from the act of doing. And this was in the Red Book, which I believe you've coauthored, in the concurrency control section. It said something like, once you've read these protocols or whatever, you think you've learned them, but you've not actually learned them until you've actually gone and implemented them.
And once you've implemented them, then you've got a proper understanding of them.
And I just thought that kind of, that really hit home with me anyway, because I do that with a lot of things.
I'll read something, I'll go, oh yeah, I've got it.
But not until you've actually gone and put it into practice do you realize, oh, maybe I don't really fully understand that.
And then once you've done it, you think, oh, it'd be cool if I could tweak it in this direction, that direction.
So yeah. And those shower thoughts are like the eureka moments. I mean, not many people have those, right? Sometimes the narrative gets retrofitted, like, yeah, I was in a shower, or I was under a tree and an apple fell on my head, sort of thing. But yeah, it often doesn't happen like that, right? But yeah, cool.
I think that's something, by the way, that I humbly have learned from music practice.
And I think it's certainly true in athletics.
Anything that involves the human body, it's 100% true, which is being able to conceive of something and being able to execute on it are completely independent.
And, you know, the philosophers call this the mind-body problem.
How do you translate intent into action?
And, you know, it's maddening when you're trying to really refine a skill.
But I think even for the purely intellectual pursuits
of, you know, computer science, writing programs,
you know, for that matter, doing math,
I think the practice leads to the inspiration.
So, you know, very few people
are born brilliant mathematicians.
Most people, even great mathematicians,
it's the doing of the math that makes you better at math.
And it's certainly true, I think,
for the intricate, tricky bits of computer science
like concurrency control, that implementing it
so that you see enough permutations and combinations
of what can happen gives you a different intuition.
It gets your brain seasoned in a different way.
But I think it's true for a lot of things.
The apple falls on your head because you were sitting in the orchard, you know, and you got to spend your time in the orchard.
Yeah, I like that.
I like on the mind-body thing as well.
I mean, you wouldn't believe how many times I've envisioned scoring the winner for England in the World Cup.
I mean, the World Cup final. But that ain't happened.
So yeah.
But yeah.
Humility is also good.
I had actually a big epiphany this week
because I'm at a certain age where I'm like,
so, you know, what more skills will I be able to acquire
in this life, really?
I was washing the dishes and actually I was cooking.
I was making a cake for my wife's birthday and I was trying to scoop the batter out of the bowl.
And I had the bowl in my right hand and the spatula in my left, and I couldn't do it. I had
to swap hands. And this trumpet practice thought came into my head, which is, I really need to
practice my left hand more. You know, like any skill that you're not good at is clearly an
absence of practice. And I was like, that's ridiculous. Like, do we judge people on how ambidextrous they are?
Is that part of my personal optimization space that I want to pursue? Like, no,
I'm a righty. It's fine. All speaking to this idea that like, you know,
there are things in life that can be left unoptimized.
Yeah. Yeah. Yeah. That's very true. Awesome. So yeah, my, my next question,
and this is, I guess is kind of the mission statement of the podcast as a whole.
And it's about bridging the gap between academia and industry.
And obviously you've done that with various kinds of startups.
You've kind of crossed that bridge numerous times.
And so I kind of want to get your take on what you think the current interaction between academia and industry is like, and how it can maybe be improved going forward, if it needs improving at all.
Yeah, I mean, there's always room for improvement, right? Right after saying that things are fine, you should just... But I think in systems, we can certainly improve. I think it's a wonderful time to be an academic who's interested
in entrepreneurship. It's not as wonderful a time as three years ago when it was easier to raise
funds. And right now, if you're not doing AI, particularly language model-oriented,
foundation model-oriented things, it's hard to get funding. But it is the case that Silicon
Valley investors are looking for technical leadership and happy to talk to academics about what it would look like to start a company.
It's only the last few years that venture capitalists started coming to research conferences, but they do regularly now, certainly in databases.
I mean, they've been going in AI even maybe a little bit longer.
And that means that you'll trip across, if not actually the opportunity to start a company, then someone who has started a company.
When I was coming up, it was just Mike Stonebraker.
And if you didn't like the way Mike did things, you had no other role model really for entrepreneurship and research, at least in databases.
And now there's many.
Many of us have stories to tell.
Every entrepreneur I've met
overfits their advice to their experience,
myself included.
So if you only get one person's advice
on how to run things,
it's not going to work for you, very likely.
In fact, it's probably not going to work
for their next company either.
These so-called lessons are just data points.
You need to amalgamate a bunch of them.
Nowadays, you can get that.
You can go to SIGMOD and talk to 10 people who've had successful exits and talk to them
about what their companies look like.
You can talk to 10 more people who didn't, and hear what they learned from that process.
And maybe some of us who've had a little bit of both.
So I think it's great times now for entrepreneurship and learning from other people. Also, just the volume of material on
blog posts and stuff on how to start a company is just completely different than it was 10 years ago
even. So there's just lots of good advice out there. It's not that hard to find. Industrial,
like big industry research, on the other hand, I think is not in a great state right now. Microsoft continues to fly the flag
for Microsoft Research. I know that they are constantly battling about protecting it or
making it more relevant versus making it more fundamental. But they have a large and successful
research organization, and they're kind of the only one. I mean, there are other research
organizations, but they're not dedicated to doing the kind of foundational research that universities and Microsoft Research do. So
there's good people at other places. And I certainly don't mean to disparage the efforts
at other places. But most of what I see in industry coming into the conferences is we built
a thing. Here it is. And those papers aren't bad, but they're very rarely educational.
So if somebody says we built a thing and it scales, it's hard to project from that to its relevance to other stuff.
And often, you know, there's sort of, if anything, misleading because it's overfit to their environment.
You know, it worked at company X, big hyperscaler X, for task Y.
Nobody else has task Y.
And so those papers, I think, lead people down the garden path with some frequency
in ways that's not so good. And very few of them are thoughtful, honestly, in terms of thinking
outside what they built, because these are software engineers. So they describe what they
built. That's their experience. And there's nothing wrong with that. It's just anecdotal. That, I think, is the thing to emphasize when you read those papers: most of them are anecdotal.
I was recently teaching the Amazon Aurora paper, the first paper on Aurora, to my students.
And among those papers, I was really pleased at how methodical it was.
And it was clearly written by people who had a bigger perspective.
And that's an example of a good one where they're not a research group at all.
But there are researchers on that team
and it was written with the mind towards like,
what are the piece parts that make up the system?
And why did we choose these ones?
As opposed to, we built a thing, it scaled.
Contrast with the Spanner paper,
just to put it right out there,
the Google Spanner paper.
Here's four mechanisms for doing concurrency control. We use all of them. Spanner. The end.
Why so many? And how? And, you know, how did you arrive at that magic number of four?
Yeah, like, what were the other components that you didn't consider? Why did you discount those? Right.
So yeah. And that's a technically rich paper, I should say. A lot of these papers, it's like, we have idea one and we scaled it. At least with the Spanner paper, it's, we had three or four things we mixed together, and it's really complicated, and it works. There's no why. But at least it is deep. I mean, it's an important paper. In some sense, it's an interesting anecdote.
Yeah, for sure. I mean, it's kind of been quite impactful. There's always the Spanner versus Calvin thing, right? Like, those two papers came out roughly around the same time, right? I know when I was starting out, reading about distributed transactions and stuff, it was, yeah, you've got to pick a side, Spanner versus Calvin, for example. Anyway, I digress.
I never thought of it in those terms. That's interesting. That may be a product of your
times.
Cool. Well, let's talk about current trends and the future. So the first one I want to ask is: what's the most exciting advancement that you've observed recently?
Yeah, it's not fair, because, you know, LLMs are just so interesting. It's not interesting to say that they're interesting, but it is the biggest
thing that's hit computing since the Macintosh, let's say, or the iPhone, maybe. And I think of
it very much in those terms, like, it's cool. I want to play with it. Because, you know, it's not
like I'm going to understand it any better by reading the papers, because it doesn't matter.
And it's not clear I would understand it or anyone would understand it by reading the papers.
But playing with it and thinking about
what could I do with this object is super cool.
And it's the one research topic that I'm not working on
that if I weren't doing what I'm doing,
I probably would work on.
The distracting thing about LLMs
is that everybody's working on them.
And so I don't know how much I would add to humanity or computer
science or whatever by doing research in that space. I watched my very close collaborator and
colleague, Joey Gonzalez, who's a leader in that area. And he, you know, he does machine learning
systems. That's his full-time thing. He's not like adjacent sort of, I'm a database person who
is happy to do machine learning. And, you know, his group, they're just getting papers out every few weeks,
getting scooped every few weeks, getting more papers out.
It's exhausting.
And unclear, like if his group didn't exist,
would another group pop up and do the same work?
Maybe. I don't know.
I mean, Joey's really good, so I don't mean to discount him in any way.
But if it were me, I think that would be my worry. I prefer to do something off in my corner where people go, oh, I never thought of that. It just feels more, well, it's more controlled, for sure. And it feels like I might have a better chance of doing something higher impact, you know, where if I didn't do it, maybe it wouldn't have gotten done. So that's my hesitation about working in that space.
But man, you have this magic new question answering box.
And we're in the business at some level in databases.
One of the things we do is answer questions, answer queries.
It's not the only thing we do, but it's one of the big ones that we do.
Now there's a new box that answers questions in a completely different way,
scoped very differently in terms of what it can do and what it does wrong.
How do we bring these question answering schemes together to be accurate and efficient and all that?
The thing that really juices me, though, is the idea that the world used to be divided
into structured and unstructured data. And that's just not true anymore. So if you wanted to get tabular data out of videos or text or whatever, you can.
You can featurize it and get features out, I guess, columns, and then run SQL on it.
So now we can query everything with structured queries.
Should we?
Maybe not.
Maybe yes.
So now that also raises the question of what are structured queries good for?
When is it better to ask natural language queries? And this whole thing is like mutually recursive. You know, it's like, where does the natural language stuff end and the structured stuff begin? Where does the data end and the queries begin? Everything is everything. And it's a really mushy space in that sense and a very malleable one. So you can view that positively or negatively.
But boy, there's going to be a lot of change.
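As a small sketch of the featurize-then-query idea (illustrative only; extract_fields here is a hypothetical stand-in for what would, in practice, be an LLM or model call): pull structured columns out of unstructured text, land them in a table, and then ordinary SQL applies.

```python
# Turn unstructured text into columns, then query it with plain SQL.
import re
import sqlite3

def extract_fields(review: str) -> tuple[str, int]:
    """Hypothetical featurizer: (product, star rating) from free text.
    A real system might use an LLM here; a regex keeps the sketch runnable."""
    product = re.search(r"the (\w+)", review, re.IGNORECASE)
    stars = re.search(r"(\d) stars?", review)
    return (product.group(1).lower() if product else "unknown",
            int(stars.group(1)) if stars else 0)

reviews = [
    "Loved the blender, easily 5 stars.",
    "The blender broke in a week. 1 star.",
    "The kettle is fine, 4 stars from me.",
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (product TEXT, stars INTEGER)")
conn.executemany("INSERT INTO reviews VALUES (?, ?)",
                 [extract_fields(r) for r in reviews])

for row in conn.execute(
        "SELECT product, AVG(stars) FROM reviews GROUP BY product"):
    print(row)  # structured queries over what began as unstructured data
```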
Yeah.
Yeah.
I mean, it's kind of, it's almost, from a far,
I don't really dip my toe.
I've played with TrackGPT and I use it more and more actually
in my day-to-day life, trying to get exposed to it more.
But I find it very as a feel very daunting
because of the the fact that the the output is so fast and it's hard to like get the signal from the
noise basically and it feels very much like a bubble and it's like i'm like okay come back to
me in five years when you've got some like when you realize what like what the actual useful stuff
is um but yeah maybe i should be i should i don't know persist a little bit more but uh for sure
it's going to be interesting to see
what implications it has.
On the unstructured versus structured thing real quick,
I interviewed a guy on the podcast who had a paper,
I think it was at CIDR, and he was like,
we were talking about it off air.
He's like, yeah, you can kind of give these LLMs
just numerical data, just give it random CSV files or whatever.
And it can actually reason about those. And I was like, that's just wild. Like, I'm giving it some random numbers, even. And as I said, it was baffling. I was like, wow, that's mad.
Really interesting. Yeah. I guess leading off that question a little bit then,
what do you think is a promising direction for future research in databases, maybe in the shadows of LLMs would
be a way to scope this question. What's hidden around the corner that's not getting any sunlight
and could potentially have big impact? Well, I mean, that is the story of my day-to-day.
So I think declarative languages applied to other stuff is an idea whose time has really come. I've
been saying that for a long time,
so maybe you shouldn't trust me.
But particularly as we enter the world
where the authoring of code isn't the point,
because maybe generating code is something that LLMs can do,
what we need to be doing under the covers
is going from specification to efficient implementation,
which is the magic of query optimization and
turning declarative into imperative.
And the database community has tricks up its sleeve there.
So I'm doing that in the distributed systems environment.
One of the things I like about that environment is there's a lot of data movement.
A lot of the cost of building a distributed system is about what data goes where and when.
The when part not being as
typical in databases.
That's where the distributed systems reasoning and the concurrency control reasoning comes
in.
But that's kind of my jam right now.
But I think there's lots of other places.
Like you could think about languages for programming GPUs.
You know, CUDA's not great.
Halide's much better. Halide looks a
lot more like a query language. And maybe, you know, it came out of the programming languages
community really, but database people have a lot to say there. So generally in the connection
to programming languages and not, you know, traditionally that was, oh, what's a good
language for programming a database or programming over a database, you know, like object relational mappers and stuff.
I think there's stuff to be done where it's like, no,
just database ideas are important to compilers,
are important to programming language design and the tool chain.
And there's people like Max Wilsey here at Berkeley
who are programming languages researchers learning from database people.
So that, I think there's a lot of interest there
from the PL community if you go reach out. So I think that's exciting. Dan Suciu at Washington's doing a bunch of this on the theory side; we're doing stuff on this on the applied side; there's others. But it's certainly not getting the kind of shine that LLMs get. So I think that's a lot of fun.
Cool. Awesome. Just one last question from me now, Joe.
And that is, what does success look like for you from now over the rest of your career?
What is on the, what's your goals, objectives?
Yeah.
What does success look like going forward?
Those are the hard questions in life, right?
I'd like to see one of my projects succeed in open source.
I've never done that, actually.
So Trifacta, which I think was quite successful,
we had lots of users.
That was done through non-open source,
partly because open source is a vehicle
for getting software to programmers.
It's not necessarily an important vehicle
for getting software to end users who aren't programmers. But I care about programmers, and half of my work is in the space of improving the experience for programmers. So Hydro would be an example of a thing where I'd really like
to see open source adoption. And I've never played that game, really. It is a marketing
exercise like any other, really. I mean, what I've seen from my colleagues who've done this
successfully is you go to meetups, it's really just marketing. There's a certain kind of open
source marketing that I've never tried. Obviously the work has to be good, but you don't have
salespeople selling it and you don't write papers about it. You go off and do the thing that makes
open source successful. You build communities online, you go to meetups, you do all that kind
of stuff. And I think it's gratifying to see people use your code as software engineers,
which I haven't had.
I've had people, business analysts using my code,
marketing people using my code, data engineers using my code.
But I'm a programmer.
I'd like to see programmers use my code.
That'd be cool.
So that would be a form of success that I have yet to see that I would enjoy.
I do have code in Postgres and things like that.
So it's not like I haven't felt it, but it wasn't mine.
Postgres is Stonebraker's. I'm proud of it.
I'm proud of it.
So that'd be cool.
I'm super proud of my PhD students, and I'd like to see my current crop at least do well.
So that would be success.
And if I have more students after that, then that's always, that's a very personal one
because you spend six years with a person; they're as much family as they are products, right? So, I really want... you know, that would be success for me: to see all those folks have an impact. And I think as you get older, teaching is half of the fun. And in that sense, the knowledge isn't useful
unless you pass it along.
Well, that's a great message to end on, Joe.
Thank you so much
for speaking with me today.
It's been a fascinating chat
and I'm sure the listener
will have loved it as well.
So yeah, we'll see you all next time
for some more awesome
computer science research. Thank you.