Disseminate: The Computer Science Research Podcast - High Impact in Databases with... Joe Hellerstein
Episode Date: July 1, 2024. In this High Impact episode we talk to Joe Hellerstein. Joe is the Jim Gray Professor of Computer Science at UC Berkeley. Tune in to hear Joe's story and learn about some of his most impactful work. The podcast is proudly sponsored by Pometry, the developers behind Raphtory, the open source temporal graph analytics engine for Python and Rust.
Transcript
Hello and welcome to Disseminate the Computer Science Research Podcast, as usual, Jack here.
Today is going to be another installment of our high-impact series, but before we get onto that,
I need to give a shout out to our sponsor, Pometry. Pometry are the developers behind Raphtory,
the open-source temporal graph analytics engine for Python and Rust. Raphtory supports time
traveling, multi-layer modeling, and comes out of the box with advanced analytics like community evolution, dynamic scoring, and temporal
motifs mining. It's blazingly fast, scales to hundreds of millions of edges on your laptop,
and connects directly to all your data science tooling, including Pandas, PyG, and LangChain.
So go check out what the Pometry guys are doing at www.raphtory.com, where you can dive into their tutorial for their new 0.8.0 release.
Today we are going to be talking to Joe Hellerstein. Now Joe is the Jim Gray Professor of Computer
Science at UC Berkeley, where his research focuses primarily in the area of data-centric systems
and he's interested in the way they drive computing. Joe has won numerous awards across his career. He's won
the Codd Award, an Alfred P. Sloan Research Fellowship, and he's also been featured
in Fortune magazine's Smartest in Tech list. He's also led numerous open source projects,
including Bloom, MADlib, and Telegraph. And he's also started several companies,
including Aqueduct and Trifacta. And for new listeners to the show: this type of episode is based off of a blog post by Ryan Marcus around the most influential people and papers in databases.
And Joe, you're number eight in the rankings at the moment.
So doing good.
Welcome to the show.
Thanks for having me.
It's fun.
Awesome stuff.
Cool. So I've given obviously the highlight reel there of everything you've achieved so far
in your career, or some of the things.
But yeah, help us color in between the lines now and tell us about your own journey in
your own words.
Let's see.
So I was born into the database research community out of IBM, Berkeley, and Wisconsin back in the day when those were
sort of the three dominant areas of database research. I did my undergrad at Harvard with a woman named Meichun Hsu, who later on went to HP Labs. And from there, right after college,
I went to IBM Research as a pre-doc intern in the lab that was famous for System R and R Star.
And they had a project at that time called Starburst.
So I was part of the Starburst team working on query optimization.
And that stuff apparently still ships in DB2.
So that was long ago good stuff, extensible query rewriting.
And then I went to start a PhD at Berkeley with Mike
Stonebraker on the Postgres project. But being a young man and full of beans, I decided after one
year that that wasn't where I was happy. And I transferred to Wisconsin, which is actually my
hometown, and finished my PhD there with Jeff Naughton. But I continued to work on Postgres.
So my PhD work is all in the context of the Postgres project. Mike Stonebraker was super supportive, helped recruit me back to Berkeley as a professor. And I've been there ever since, since 1995, leading my own research and expanding out of core database systems into all kinds of things that interest me where there's collaboration. Stonebraker left Berkeley a few years after I arrived,
which sort of forced me into collaborating with other folks.
And so I've been working on things like machine learning
and data visualization and networking and operating systems
and how those things connect to data management
over the course of my career.
Awesome.
So, going back to the, I don't know, 12-year-old, 13-year-old Joe, did you always want to become a researcher? Was that always the sort of desire from being very young?
[...] what was called a systems analyst, which means COBOL programmer. So I kind of fell in between them,
but I also have two big sisters who went on to grad school and became
professors. So I just did what everybody else did.
I was well-behaved as a child.
And so I just went into the family business.
Awesome stuff. And also it's the first time I'd heard about Starburst there.
Now the name of that: did you have Starburst in the US, the sweet, right?
Like the little sweets?
I believe they were there at that time.
Right, okay.
It's long enough.
Cool.
Awesome.
Cool.
Awesome.
So let's get on to what you're working on currently today then. So yeah, give us the high-level overview of the stuff you're working on today.
Yeah, well, the primary project I'm working on is a project called Hydro,
and it's the culmination of a very long effort on my part to bring declarative languages and their power to other areas of computer science.
And we did this in the early 2000s in declarative networking.
So we showed how you could write networking protocols in high-level languages and then compile them down from those high level languages to implementations
that made sense in different contexts. Like the same spec would give you a wireless protocol or
an internet protocol, depending on how you cost out links and reliability and stuff.
So there's a whole generation of work on declarative networking. What we're working
on right now is high level languages for distributed programming in general.
And the thesis here really is that there's only
one language that succeeds in scaling from one core to the globe, and that language is SQL.
It runs in your phone, you know, in SQLite, and you can take the same query that runs in your
phone and run it on Snowflake across the globe. It's pretty amazing. Most programmers don't think
of that as an option for writing general purpose code.
So we're trying to figure out where those gaps are and what are the optimizations required for general purpose programming to scale up and scale out across lots of machines.
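To make the phone-to-globe point concrete, here's a minimal sketch (illustrative, not from the episode): the very same declarative SQL text can be handed to SQLite locally or, unchanged, to a warehouse like Snowflake; only the connection and the engine's physical plan differ. The table and data here are made up.

```python
# Run one SQL string against a local, single-core engine (SQLite).
# The identical string could be submitted to a cloud warehouse instead;
# the declarative query doesn't change, only the engine behind it does.
import sqlite3

QUERY = """
SELECT region, COUNT(*) AS orders, SUM(amount) AS revenue
FROM sales
GROUP BY region
ORDER BY revenue DESC
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("emea", 10.0), ("amer", 25.0), ("amer", 5.0), ("apac", 8.0)],
)
for row in conn.execute(QUERY):
    print(row)
```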
And in the Hydro project, we've got a prototype going now after a couple of years that can do some pretty cool things,
ranging from we can optimize low-level protocols
like Paxos for high bandwidth as implemented in Hydro.
And we can also build systems infrastructure
like key value stores that auto-scale in Hydro as well.
And you get very tight specifications
that automatically optimize themselves.
Awesome. Is this already integrated into sort of GCP, AWS?
Or how is that path looking for it to become sort of general purpose, I guess?
Yeah, well, we're on a road with that and we actually have some developer support from Sutter Hill Ventures.
So they have a couple of developers who I actually manage who contribute to the project.
So we have some professional coding as well as grad student coding going into this.
And it's a Rust library.
And so it's all in the Rust ecosystem.
And we're not in general release right now.
Certainly you can go to the Hydro Project GitHub and play with it.
But we're still very much in research mode right now.
But I'd say in the coming 12, 18 months, we should have something that's worth playing with.
Awesome stuff.
Cool.
Yeah, kind of while we're talking about the Hydro project, and I mentioned Starburst earlier on, I see one of the key components is also called Cloudburst.
Is that in any way a link back to Starburst or is that just kind of Cloudburst sounds
cool, right?
Yeah.
Cloudburst sounded cool.
That's actually, that project is end of life.
That was a very early Hydro effort.
The folks who did the Cloudburst work, which,
by the way, is functions as a service with state. So think of Lambda with good state management.
So we were all very disappointed that Lambda didn't have any data when it first came out.
And we wrote a sort of an opinion piece about that that created a stir. But when we did something
about it, we built Cloudburst. So the folks who built that actually have gone on and started a company.
It was called Aqueduct, which you mentioned, but the name's changed to RunLLM, in a sign of the times.
But it's essentially a cloud-hosted environment, the infrastructure. And the target application there is doing retrieval-augmented generation: feeding retrieved documents into an LLM for API-driven software. So if you want an assistant for your open source package or your company's APIs, RunLLM will build you one and host it.
Sweet. We'll put a link to all these cool things in the show notes, so the interested listener can go and check them out. Cool. So yeah, this
podcast likes to talk about high impact,
right? It's called the high impact series. So let's have a little bit of a retrospective now on your career, Joe. So the first question I want to ask you is: what are you most proud of in your career so far? And does this necessarily correlate with your work that's been the most impactful?
That's a great question. I'm always most excited about what I'm currently working on, but it comes from somewhere. So looking back, early on I got interested in this idea of online aggregation, where you ask
a query where it's going to take hours to compute and maybe only produce one answer or a small table
of answers. Can we get early returns from that election, so to speak? Can we get a prediction
of what the final result is going to be while it's running? That early work got me interested
in how databases interact with humans, because it was really driven from human impatience.
And this, by the way, was the very early days of web browsers. So I also was very enthusiastic about the Netscape Navigator when it first came out. Its chief feature that I liked over Mosaic,
which preceded it, was that it had interlaced GIF support. So your images would start to
download over the small modem, the small baud rate modems
incrementally. So you could get an early view of what the pictures were going to be before they
fully downloaded. I just thought that was super great. I wanted to do that for queries.
So this was very much human driven work. And that made me realize that like
for compute-intensive or data-intensive tasks, you know, there should be some care for, like,
what's the user experience of that?
And that drove me generally to be interested
in interactivity and streaming
and all these kinds of things.
And that knits through my work in a lot of ways,
particularly as it meets declarative specifications.
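As a toy illustration of the online aggregation idea (a sketch in the spirit of the work, not the published algorithm): scan rows in random order and report a running estimate of the final answer, with a rough error bar, long before the scan finishes.

```python
# Online aggregation, cartoon version: a random-order scan means any prefix is
# a random sample, so we can print a running AVG with a CLT-style ~95% interval.
import math
import random

random.seed(0)
table = [random.gauss(100.0, 15.0) for _ in range(1_000_000)]  # made-up column
random.shuffle(table)  # random scan order

count, total, total_sq = 0, 0.0, 0.0
for value in table:
    count += 1
    total += value
    total_sq += value * value
    if count % 100_000 == 0:
        mean = total / count
        var = max(total_sq / count - mean * mean, 0.0)
        half_width = 1.96 * math.sqrt(var / count)
        print(f"after {count:>9,} rows: AVG ~= {mean:.3f} +/- {half_width:.3f}")

print(f"final AVG = {total / len(table):.3f}")
```

The early estimates land within a fraction of a percent of the final answer, which is exactly the early-returns experience being described.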
So I was super disappointed when MapReduce came out
because MapReduce was the exact opposite.
It was like, we're going to do a big batch job.
And between every stage, we're just going to put everything on disk.
And there'll be no outputs for you, sir, till we are all done.
And I just thought that was such a big step backwards.
And everybody was so excited about it.
And I sort of said, but you could stream most of the things that MapReduce does.
For instance, it did joins as reduces, which meant that you don't get any join output till the full join was computed,
which is just not required. This then fed into a whole bunch of work in my group on like,
when do you really need barriers? When do you really need to block and when don't you,
which led to the CALM theorem, which is this question of, like, what is coordination for?
And so a lot of sort of the deeper results, the CALM theorem, which I'm proud of, the ideas of online aggregation, ideas of streaming, declarative languages for networking and for distributed systems, a lot of this comes out from the beginning of this idea that, like, computers should give you answers right away. It's, you know, just being impatient and wanting to be interactive and not do batch work. Batch work is from, like, mainframes. I don't understand why people...
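To ground the streaming-joins point, here's a small sketch (illustrative, with simplifying assumptions like in-memory state and equality keys): a symmetric hash join emits results as soon as matching tuples arrive from either input, rather than blocking until one side is fully materialized the way a join-as-reduce does.

```python
# Symmetric hash join: build a hash table on *both* inputs and probe the
# opposite table as each tuple arrives, so join output trickles out early.
from collections import defaultdict

def symmetric_hash_join(left, right):
    """Interleave two (key, value) streams, yielding join matches early."""
    tables = (defaultdict(list), defaultdict(list))  # per-input build state
    iters = [iter(left), iter(right)]
    done = [False, False]
    while not all(done):
        for side in (0, 1):
            if done[side]:
                continue
            try:
                key, val = next(iters[side])
            except StopIteration:
                done[side] = True
                continue
            tables[side][key].append(val)
            for other in tables[1 - side][key]:  # probe the other side
                # normalize output as (key, left_value, right_value)
                yield (key, val, other) if side == 0 else (key, other, val)

orders = [(1, "order-a"), (2, "order-b"), (1, "order-c")]
users = [(1, "alice"), (2, "bob")]
for result in symmetric_hash_join(orders, users):
    print(result)  # each match prints without waiting for a barrier
```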
But yeah, it reminds me of a keynote you gave at VLDB in Sydney with Jeffrey Heer. I remember being kind of blown away. I think it was Wrangler, right, that was the tool that was demonstrated? I was blown away by it and was like, oh yeah, why are all tools not like that?
But yeah, it was really, really interesting. I remember it at the time; it springs to mind.
Cool.
And that's a really awesome answer to that question.
So kind of, again, building on this sort of retrospective sort of topic,
what's the most challenging project you've been part of?
Well, probably, you know, you brought up Wrangler and Jeff Heer.
Jeff and I took the Wrangler code with the student who wrote it,
Sean Kandel, and the three of us started a company, Trifacta,
to commercialize that work.
So this is interactive, essentially AI-driven synthesis of code
for data cleaning with visualization built around it.
Those words weren't the words that were used 10 years ago.
But doing that startup over the course of a decade was the hardest work that we did,
the hardest consistent ongoing work.
Doing a company that lives on, a 10-year journey for a company, is nights and weekends and a lot of heart. And just many, many people engaged
who you're mentoring and caring for
and collaborating with and learning from.
And that was a full body experience,
kind of in the same way that my other job
of being professor has been a full body experience.
But the professor thing is like a career and a title
and the company is like a chapter.
So when you asked me about projects,
that was the biggest pull.
Yeah, cool. Would you do anything differently if you went back and had to do this sort of journey again? Would you maybe do this different, that different?
Yeah, I don't indulge in that kind of thinking much. I'm kind of a forward-looking person.
That's the way to be, right?
Yeah. In part, this also goes along with my very countercultural, for San Francisco, point of view, which is that personal optimization is not something I like to think about very much. Most people out here, they're like, I'm going to, you know, do everything to the nth degree and make sure that I'm always being efficient in my athletics, in my eating, in how I get to work.
I'm like, you know what?
Get up in the morning, live life, and be joyous.
So part of that is also not looking back and saying, what if or what should I have done better?
That's fabulous.
But I suppose you'd like something that I can share with students and with people coming up. In that respect, I do think I never
worked a Saturday in my life because of religious reasons. And I think that's been very healthy for
me. So religion aside, having some discipline for downtime is, I think, something that I'm
very grateful I always did. So that's a thing I would do over again, religious or not, because that changes
in a lifetime, maybe over a lifetime many times. But I think dedicating time to downtime
in a disciplined way where you don't let yourself off the hook and then work,
that's really good, even regardless of how busy and committed you are. In terms of projects, I would say I had
wonderful mentors. They were mostly pretty hands-off. I wish I had learned to be a better
teammate. I think one of the cool things about doing a company was that it was
a collaboration that was forced by circumstance. We have to work together to grow.
As a professor, it's really easy to work on your own. And while I've collaborated a ton,
I've only really had one shared project, I would say, that was an ongoing campus project. That was
the Telegraph project with Mike Franklin, which was a great deal of fun. But there's something I
wish I was better at: teaming up on running things. It's a skill that I feel like I still need to work on.
Nice. It's funny what you said there, going back a second, about sort of everything being data-driven and over-optimizing. I'm a sucker for that as well. Like, I have this Oura ring now which tracks my sleep and everything, and you just get so, like, oh, I didn't get enough sleep last night, I only got seven hours 13 minutes, I need to get my eight hours. But yeah, trying to detach from that, and just give yourself a break a little bit and unplug from reality maybe, or from the modern world, is definitely needed. And not feeling guilty about taking time off as well, because I was a sucker for that during my PhD: I know I should be working. But yeah, sage advice for sure. Cool. Yeah, let's talk about motivation some more. So what are your favorite papers? I saw you do a tweet recently, actually, maybe along these lines, so that might be about one of the papers. But yeah, go for it, Joe.
I think, you know, I have papers that I enjoy teaching, because I teach a graduate class, and you go over them enough times and they're good vehicles for teaching.
Whether they still inspire me is hard to say, because I've just pored over them over the years.
But I think there's early papers in databases that just ring.
They just keep working.
They keep making sense.
So the System R sort of retrospective paper is just great.
The Pat Selinger paper on query optimization from 79, everybody calls the Bible of query optimization.
It's spot on and just, you know,
it was so well seen that this is the problem to be solved
and this is how to think about it.
And you almost can't appreciate it
unless you read other papers
like the Ingres query decomposition paper
from around the same time.
And they just look incoherent now
because at that time, nobody knew what the problem statement was. They were kind of noodling about.
Salinger got the problem statement right. And after that, everything flows, like decades of
research flow. And so there's papers like that that just make sense now because they fit our
mental model of the computation problem. But at the time, she was just pulling that out of thin air,
like putting structure on an unstructured design space. So that paper is great.
A less known paper that I love teaching is a paper by Mike Carey, Rakesh Agrawal, and Miron Livny on performance studies for concurrency control protocols. I don't remember the title off the top of my head. It's from the 80s.
And at the time, there were controversies over whether locking was better than optimistic
concurrency control. And there were papers that had come to completely opposing conclusions.
So the thing that's great about this paper is that it says, how could it be that science tells
us two opposite things? Well, it must be that the scientists were making
different assumptions. So let's crack open the space of like, what assumptions could you make
with a well-formulated simulation mechanism, and then see if we can set the knobs on this
simulation to explain how they came to these different conclusions and perhaps inform what a real world conclusion should be for this controversy. And it's just this beautiful
study and like, here are performance graphs that tell a story and they explain what's going on.
I love that paper, especially for students, because so many research papers we write,
they have performance study at the end that says, I win, you lose. I win, you lose. I win,
you lose under many parameters, right? I win, I win, I win. And those tell almost no story at
all except that binary story. This paper shows you what performance studies should be.
And it also brings to the fore the point that the graphs aren't the issue. The issue is
understanding the problem. The graphs are there to explain your understanding of the problem.
And I think so few research papers in computer science do that well.
So I love that paper.
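In that spirit, here's a cartoon of the knob-turning methodology being praised (entirely made up for illustration; the real paper's simulator is far richer): model locking as waiting on conflict, model optimistic concurrency control as redoing work on conflict, and watch the resource assumption flip the winner.

```python
# Toy locking-vs-OCC study: the point isn't the numbers, it's that stated
# assumptions (here, how costly wasted/redone work is) drive the conclusion.
import random

def simulate(policy, conflict_prob, redo_cost, n_txns=50_000, seed=1):
    """Return throughput as committed transactions per unit of simulated work."""
    rng = random.Random(seed)
    work = 0.0
    for _ in range(n_txns):
        work += 1.0  # the useful work of one transaction body
        if rng.random() < conflict_prob:
            if policy == "locking":
                work += 0.5  # block and wait; no work is discarded
            else:
                work += redo_cost  # abort at validation and redo the body
    return n_txns / work

for conflict_prob in (0.05, 0.5):
    for redo_cost, label in ((0.1, "ample resources"), (1.0, "scarce resources")):
        lock = simulate("locking", conflict_prob, redo_cost)
        occ = simulate("occ", conflict_prob, redo_cost)
        winner = "locking" if lock > occ else "occ"
        print(f"conflict={conflict_prob:.2f}, {label:>16}: "
              f"locking={lock:.2f}, occ={occ:.2f} -> winner: {winner}")
```

With redone work nearly free, OCC wins everywhere; make it expensive and locking pulls ahead, which is the shape of the reconciliation the paper performs.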
This year, I added a paper to that reading, which is by a postdoc of mine, revisiting results from Mike Stonebraker's group at MIT that had been done earlier on concurrency control, but in this case on multiprocessors, like thousand-core processors. They did this paper, Staring into the Abyss, about concurrency control with a thousand cores. So Tiemo
implemented all their stuff. He took their simulator and put it on a real machine and got
completely different answers. So their simulations appear to be all wrong. And again, he goes through
this sort of iterations of assumptions and setting up the problem so that you understand
it's not a simple story. It's not an I when you lose story. It's all performance studies are all about what is your context? What are your assumptions? And why do those assumptions and context lead to the conclusions? people who are like, you don't have enough graphs. This won't get into SOSP because you haven't studied it carefully enough, by which they mean you haven't put in the sweat equity to meet our
community's standard for sweat equity. But I look at the papers that come out in a lot of those
conferences and there's just garbage graphs. They have no bearing on what might connect to the real
world. They tell you very little about what parameters of the setup or the deployment would affect
the results.
And they've been done at great cost to student time.
And in these days,
cloud costs sometimes if they have GPUs involved and the conclusions there,
you can show me lots of graphs.
That doesn't mean that you have a fine grained understanding of the problem.
It just means you ran a lot of experiments and plotted them.
So I really like that
Carey-Agrawal-Livny paper as, like, my avatar for: should I even have a performance study, or have I already explained the solution and the graph is just there for window dressing? Right. So anyway,
those are a couple examples. Of course, you know, in like the data visualization side of my brain,
the work on Polaris that led to Tableau, the
grammar of graphics realized as software.
I think that's Jock Mackinlay's thesis.
That's beautiful work that I learned about in my travels into visualization land.
I'm trying to think of other stuff that kind of knocked my socks off.
I think the Chord paper by my colleague Ion Stoica, David Karger, Frans Kaashoek and others
is a lovely paper from peer-to-peer days that I thought was quite influential on my thinking.
I just thought it was super cool and went and ran off and worked on related things because I thought
it was cool. So those are some. That's awesome. Definitely a few more on the reading list there.
So yeah, good stuff. Yeah. So I guess, kind of, I'm not sure if we've covered this off really, but kind of on inspiration: which of these papers or people have had the biggest sort of impact on your career?
Well, people, I mean personal mentors, I've been super fortunate. So people at IBM, well, even earlier, Meichun Hsu, who people don't really know about so much. She was wonderful; she got me into the field as a college student.
She was a Harvard professor at the time.
But then at IBM, just the whole crew of people, starting with my boss, Hamid Pirahesh, very influential.
Guy Lohman, Laura Haas, Pat Selinger, Jim Gray.
He wasn't there at the time, but I met him through them.
And then Mike Stonebraker was super influential on my career.
He was my master's advisor and then really one of my PhD advisors
and a colleague over time, and an oversized personality for our field,
certainly, over the years.
And, you know, he's a source of friction sometimes.
And, you know, sometimes to create a spark, you need some friction.
Yeah, exactly.
If you want to make an omelette, you've got to break a few eggs, right?
That's right.
So I don't always have fun with Mike, but he's definitely forced me to think hard and sharpen my arguments and sometimes use him as a counterpoint to, you know, I'm not going to do it Mike's way, so what way am I going to do it?
So he's been great. Jeff Naughton, my thesis advisor, much gentler mentor,
saw me through some rough times. Everybody has rough times in their early career. I had my share.
So those are some people. And then, intellectually, since then, Christos Papadimitriou was an amazing sort of source of just, like, the joie de vivre of doing computer science and enjoying it. That guy, there's nobody like him for just joyful embrace of ideas. He's so exciting to hang out with and see lecture, because he just loves computer science, and he conveys that love through the way he lectures and talks about work and collaborates.
Jeff Heer, my collaborator at Trifacta, is a visualization thinker and builder.
Wonderful open source builder.
He led D3 and Vega and he's been very successful in open source, but also such a deep thinker and so articulate.
He's been a real influence.
So those are some of the folks that I've worked with who've been really inspiring.
That's awesome.
Whilst I've got you on, Joe, I want to ask you this question.
So I believe you are the author of this video,
and this had a big impact.
I actually showed someone at work the other day.
This is the crazy concurrency control video on YouTube.
Is that you that made that?
Yeah, that's me.
It's absolutely outstanding.
It's brilliant.
Even now, we were having dinner the other week.
We had an offsite at Neo.
And I said, you must have seen this.
And a few people hadn't seen it.
And everyone was absolutely loving it.
Yeah, tell us about that a little bit more.
Because I want to use that as the intro music to the show,
if that's possible as well.
I think I love it.
It's awesome.
I mean, you know, we're skirting copyright issues on that.
Just a bit, yeah.
Yeah, just a bit.
I don't think they care much.
When GarageBand came out, so I'm a musician.
That's the career I never pursued.
But I'm a pretty serious jazz trumpet player.
But when GarageBand first came out, it really was a revolution for me anyway.
But I think for the industry, suddenly it was super easy to make music, digital music, with very mediocre keyboard skills, which is what I have.
And so I started to do these kind of silly pastiche songs where I would, you know, I was like, today I'm going to do a heavy metal song.
And I will write lyrics about how unpleasant it is to give birth to babies and make a song for my wife.
And then tomorrow I'll do a disco song.
And so I have this album from my family of like silly songs that I wrote, which I will not share.
But in that process, one of the things I started doing, I was like, well, I'll just do songs for class.
And so I did the concurrency control song. And that one was just a karaoke.
I stole the background from a popular song.
And I thought I would do more.
And so the next one was going to be a recovery song set to the tune of Don't Worry, Be Happy.
And then that got repurposed as a birthday song for my mother-in-law.
It never saw the light of day.
I am a one-hit wonder in the database song department. But maybe I'll have more time, as I'm getting older, to do more songs. It's just, music's changed so much now. You know, even the concurrency control song is now an oldie.
Yeah, it's a classic though. Might have to leave it to the next generation to be more relevant. Cool, great stuff. Yeah, so
before we talk about that,
you mentioned kind of things were easy earlier on in your career.
And I kind of want to ask you about setbacks
and how you deal with setbacks and rejections.
It's part of doing business in this sort of,
in this industry.
So yeah, how do you deal with it?
Yeah.
Well, I'll tell the story of my grad school days
because that was the hardest one and probably the one many of your academic listeners will resonate with.
I wrote my first paper at Berkeley. It was my master's thesis, signed off by Mike Stonebraker,
Randy Katz. I can't remember who else, but famous people at Berkeley. Got to Wisconsin,
had sent it to SIGMOD. It was accepted. Very exciting. So I was like on the road.
Life's good.
I'm in the shower one day.
Oh, no, sorry.
Before this, I go on.
I gave the talk at SIGMOD.
Everybody said, oh, you're a very smart young man.
And then Surajit Chaudhuri, who is a couple years older than me, but he had gotten his PhD already.
So let's say five years older than me.
Came up to me afterwards.
He said, that can't be right.
I was like, dude, it's right.
I have a theorem and a proof. It's right. I'm a grad student. I'm smart. He's like, no, no, that can't
be right. And we argued and I was stubborn. Six months later, I'm in the shower. I'm like, oh my
God, it's not right. It's wrong. Here's a counter example. And I was sure my career was over. I had
published a paper. It was wrong. I was going to have to retract it, I guess, right? You can't say something in science that's wrong. And I went into the department at Wisconsin
and I thought that was it, end game. And I went to my advisor, Jeff Naughton, who's a
lovely human being, just gentle and thoughtful and many things that an advisor could and should be.
And he said, you know,
this happens all the time. People write papers, they have flaws and you know what they do? They
write another paper and then they publish that. And it's honest and it takes part, you know,
what was wrong with the previous paper and it tries to address it. And that's okay. So, you know,
what are the cases where your thing is right? And what are the cases where it's wrong? Let's work
that through. And so we worked it through. You know, it was kind of: conditionally, it's right under
certain circumstances.
And when it's wrong, here's what happens.
And maybe here's some empirical kind of heuristics you can put around it.
Publish that paper.
Met Surajit Chaudhuri again at some point.
Fessed up.
Every time, you know, I work with him closely.
We're co-editors of Foundations and Trends in Databases. And about every 12 months or so, I'm like, you know, Surajit, you were always smarter than me, and you were right about that paper. I have to just fess up because it's still,
you know, it still bites me a little bit, but the lesson I think is the one Jeff taught me,
which is that, you know, computer science, fast moving field, get your results out.
If they get published, that's great. That doesn't mean, you know, it's the end of the story.
Sometimes things need more work. And if there were flaws in your work, that's okay. But be open
about them, address them. It's an opportunity for more conversation in the community. And papers are
not like record albums. Or maybe they are, maybe they are. But the point is, you know, your second album isn't your last album. Maybe that's the way to think about it. You know, they're more like a conversation over time, and it's okay for that conversation to have some "actually, I didn't mean that" and "I'm sorry I said that." You know, it's a dialogue, or maybe it's a community dialogue. And that's hard to learn when you've only written one paper, because you feel like you only got to say one thing and you didn't get it quite right. Ah, but there's more chances, hopefully. In most people's lives... I've been real fortunate with that, but most people will get to write another paper if they keep at it.
Yeah, that's
lovely. Cool. So yeah, while we're talking about ideas and things, this next question is my favorite one. It's about the creative process. I love seeing how people's minds work. And everyone has a
different answer to this question. It's how do you approach idea generation? And then once you've
generated some ideas, how do you select what to work on for the next six months, two years,
five years, 10 years? Yeah, that's such a good question. And I've changed my approach over time.
I also think that it's good to change your approach over time. So I feel like I move back and forth between styles to some degree. So in terms of ideas, sometimes I want to continue stubbornly on the same idea. Lots of times I want to jump into a new idea. So I do have a little bit of, you know,
squirrel-chasing tendency on that. Early in my career particularly, opportunities to collaborate with smart people and work on new things were so exciting. But I felt like I had done a lot of small things, and I had a sense I was doing fine. You know, people would pat me on the head and say I was a clever lad, but I wanted to be remembered for something.
And I couldn't articulate to myself what it was that my through theme was, my research vision was.
And so in part, this was inspired by Jennifer Widom, who was another mentor at IBM when I was there.
And in some ways, a competitor who, again, like Mike Stonebraker, sometimes the friction there, I think, was good for me, but very different style.
She's a very meticulous person, and I'm less so.
She was also at Stanford and I was at Berkeley, so we had to compete.
But I could say exactly what Jennifer was working on.
She was so good at articulating what she was doing, and I thought, I need to be able to do that.
So part of the work that I'm still doing on declarative languages and how they can impact the rest of computer science was me saying, okay, I think this is one of the signature themes of my community.
And I think that if I plug away at this thematic advantage of database thinking, I should have an advantage in computing broadly.
So I'm going to take the tools of our trade, take our brand, so to speak,
and see what I can do with it. And so that's been deliberate that I wanted to have a series
of projects that had a through theme while also doing other stuff. I mean, I always have a
portfolio going and that's a luxury of being at Berkeley and having great students who can execute
on things when I'm distracted. They can take random ideas I have and run with them.
So that's kind of the portfolio thing of having some through themes
that I'm working on at all times
and then have space to explore new things
with collaborators who I find inspiring.
And that speaks both to idea generation,
also methodology,
because I think they're hard to separate.
The craft is the inspiration. You know, I think that's true in music or in art or
whatever, as it is in any creative field, that the doing of it leads to the outcomes
as much as some spark going off in your head, right? So having big systems projects is one
kind of doing. Doing creative collaborations with people who come from somewhere else is a different kind of doing. And in the doing of these things, you come up with ideas. And they're different. Like,
you know, the conversations I have with students where there's four of us in a room building an
artifact that's been going for a few years, it's very different than the conversations I have the
third time I'm meeting someone from another area. And we're trying to understand, we're
misunderstanding each other creatively. And, you know, it's just, they're just different. Yeah. Yeah. Something you mentioned
a second ago there about actually implementing something or doing something can kind of, a lot
of ideas can flow from the act of doing. And this was in the Red Book, which I believe you've coauthored, in the concurrency control section. It said something like, once you've read these protocols or whatever, you think you've learned them, but you've not actually learned them until you've actually gone and implemented them.
And once you've implemented them, then you've got a proper understanding of them.
And I just thought that kind of, that really hit home with me anyway, because I do that with a lot of things.
I'll read something, I'll go, oh yeah, I've got it.
But not until you've actually gone and put it into practice do you realize, oh, maybe I don't really fully understand that.
And then once you've done it, you think, oh, it'd be cool if I could tweak it in this direction, that direction.
So yeah. And those shower thoughts are like the eureka moments. I mean, not many people have those, right? Sometimes the narrative gets retrofitted, like, yeah, I was in a shower, or I was under a tree and an apple fell on my head, sort of thing. But yeah, it often doesn't happen like that, right? But yeah, cool.
I think that's something, by the way, that I humbly have learned from music practice.
And I think it's certainly true in athletics.
Anything that involves the human body, it's 100% true, which is being able to conceive of something and being able to execute on it are completely independent.
And, you know, the philosophers call this the mind-body problem.
How do you translate intent into action?
And, you know, it's maddening when you're trying to really refine a skill.
But I think even for the purely intellectual pursuits
of, you know, computer science, writing programs,
you know, for that matter, doing math,
I think the practice leads to the inspiration.
So, you know, very few people
are born brilliant mathematicians.
Most people, even great mathematicians,
it's the doing of the math that makes you better at math.
And it's certainly true, I think,
for the intricate, tricky bits of computer science
like concurrency control, that implementing it
so that you see enough permutations and combinations
of what can happen gives you a different intuition.
It gets your brain seasoned in a different way.
But I think it's true for a lot of things.
The apple falls on your head because you were sitting in the orchard, you know, and you got to spend your time in the orchard.
Yeah, I like that.
I like on the mind-body thing as well.
I mean, you wouldn't believe how many times I've envisioned scoring the winner for England in the World Cup.
I mean, the World Cup final. But that ain't happened.
So yeah.
But yeah.
Humility is also good.
I had actually a big epiphany this week
because I'm at a certain age where I'm like,
so, you know, what more skills will I be able to acquire
in this life, really?
I was washing the dishes and actually I was cooking.
I was making a cake for my wife's birthday and I was trying to scoop the batter out of the bowl.
And I had the bowl in my right hand and the spatula in my left, and I couldn't do it. I had
to swap hands. And this trumpet practice thought came into my head, which is, I really need to
practice my left hand more. You know, like any skill that you're not good at is clearly an
absence of practice. And I was like, that's ridiculous. Like, do we judge people on how ambidextrous they are?
Is that part of my personal optimization space that I want to pursue? Like, no,
I'm a righty. It's fine. All speaking to this idea that like, you know,
there are things in life that can be left unoptimized.
Yeah. Yeah. Yeah. That's very true. Awesome. So yeah, my, my next question,
and this is, I guess is kind of the mission statement of the podcast as a whole.
And it's about bridging the gap between academia and industry.
And obviously you've done that with various kinds of startups.
You've kind of crossed that bridge numerous times.
And so I kind of want to get your take on what you think the current interaction between academia and industry is like, and how it can maybe be improved going forward, if it needs improving at all.
Yeah, I mean, there's always room for improvement, right? Right after saying that things are fine, you should just... But I think in systems, we can certainly improve. I think it's a wonderful time to be an academic who's interested
in entrepreneurship. It's not as wonderful a time as three years ago when it was easier to raise
funds. And right now, if you're not doing AI, particularly language model-oriented,
foundation model-oriented things, it's hard to get funding. But it is the case that Silicon
Valley investors are looking for technical leadership and happy to talk to academics about what it would look like to start a company.
It's only the last few years that venture capitalists started coming to research conferences, but they do regularly now, certainly in databases.
I mean, they've been going in AI even maybe a little bit longer.
And that means that you'll trip across, if not actually the opportunity to start a company, then someone who has started a company.
When I was coming up, it was just Mike Stonebraker.
And if you didn't like the way Mike did things, you had no other role model really for entrepreneurship and research, at least in databases.
And now there's many.
Many of us have stories to tell.
Every entrepreneur I've met
overfits their advice to their experience,
myself included.
So if you only get one person's advice
on how to run things,
it's not going to work for you, very likely.
In fact, it's probably not going to work
for their next company either.
These so-called lessons are just data points.
You need to amalgamate a bunch of them.
Nowadays, you can get that.
You can go to SIGMOD and talk to 10 people who've had successful exits and talk to them
about what their companies look like.
You can talk to 10 more people who didn't, and hear what they learned from that process.
And maybe some of us who've had a little bit of both.
So I think it's great times now for entrepreneurship and learning from other people. Also, just the volume of material on
blog posts and stuff on how to start a company is just completely different than it was 10 years ago
even. So there's just lots of good advice out there. It's not that hard to find. Industrial,
like big industry research, on the other hand, I think is not in a great state right now. Microsoft continues to fly the flag
for Microsoft Research. I know that they are constantly battling about protecting it or
making it more relevant versus making it more fundamental. But they have a large and successful
research organization, and they're kind of the only one. I mean, there are other research
organizations, but they're not dedicated to doing the kind of foundational research that universities and Microsoft Research do. So
there's good people at other places. And I certainly don't mean to disparage the efforts
at other places. But most of what I see in industry coming into the conferences is we built
a thing. Here it is. And those papers aren't bad, but they're very rarely educational.
So if somebody says we built a thing and it scales, it's hard to project from that to its relevance to other stuff.
And often, you know, there's sort of, if anything, misleading because it's overfit to their environment.
You know, it worked at company X, big hyperscaler X, for task Y.
Nobody else has task Y.
And so those papers, I think, lead people down the garden path with some frequency
in ways that's not so good. And very few of them are thoughtful, honestly, in terms of thinking
outside what they built, because these are software engineers. So they describe what they
built. That's their experience. And there's nothing wrong with that. It's just anecdotal. That, I think, is the thing to emphasize when you read those papers: most of them are anecdotal.
I was recently teaching the Amazon Aurora paper, the first paper on Aurora, to my students.
And among those papers, I was really pleased at how methodical it was.
And it was clearly written by people who had a bigger perspective.
And that's an example of a good one where they're not a research group at all.
But there are researchers on that team
and it was written with the mind towards like,
what are the piece parts that make up the system?
And why did we choose these ones?
As opposed to, we built a thing, it scaled.
Contrast with the Spanner paper,
just to put it right out there,
the Google Spanner paper.
Here's four mechanisms for doing concurrency control. We use all of them. Spanner. The end.
Why so many? And how? And, you know, how did you arrive at that magic number of four?
Yeah, like, what were the other components that you didn't consider? Why did you discount those? Right.
So yeah. And that's a technically rich paper, I should say. A lot of these papers, it's like, we have idea one and we scaled it. At least with the Spanner paper, it's, we had three or four things we mixed together, and it's really complicated, and it works. There's no why. But at least it is deep. I mean, it's an important paper. In some sense, it's an interesting anecdote.
Yeah, for sure. I mean, it's kind of been quite impactful. There's always the Spanner versus Calvin thing, right? Like, those two papers came out roughly around the same time, right? I know when I was starting out, reading about distributed transactions and stuff, it was, yeah, you've got to pick a side, Spanner versus Calvin, for example. Anyway, I digress.
I never thought of it in those terms. That's interesting. That may be a product of your
times.
Cool. Well, let's talk about current trends and the future. So the first one I want to ask is: what's the most exciting advancement that you've observed recently?
Yeah, it's not fair, because, you know, LLMs are just so interesting. It's not interesting to say that they're interesting, but it is the biggest
thing that's hit computing since the Macintosh, let's say, or the iPhone, maybe. And I think of
it very much in those terms, like, it's cool. I want to play with it. Because, you know, it's not
like I'm going to understand it any better by reading the papers, because it doesn't matter.
And it's not clear I would understand it or anyone would understand it by reading the papers.
But playing with it and thinking about
what could I do with this object is super cool.
And it's the one research topic that I'm not working on
that if I weren't doing what I'm doing,
I probably would work on.
The distracting thing about LLMs
is that everybody's working on them.
And so I don't know how much I would add to humanity or computer
science or whatever by doing research in that space. I watched my very close collaborator and
colleague, Joey Gonzalez, who's a leader in that area. And he, you know, he does machine learning
systems. That's his full-time thing. He's not like adjacent sort of, I'm a database person who
is happy to do machine learning. And, you know, his group, they're just getting papers out every few weeks,
getting scooped every few weeks, getting more papers out.
It's exhausting.
And unclear, like if his group didn't exist,
would another group pop up and do the same work?
Maybe. I don't know.
I mean, Joey's really good, so I don't mean to discount him in any way.
But if it were me, I think that would be my worry. I prefer to do something off in my corner where people go, oh, I never thought of that. It just feels more, well, it's more controlled, for sure. And it feels like I might have a better chance of doing something higher impact, you know, where if I didn't do it, maybe it wouldn't have gotten done. So that's my hesitation about working in that space.
But man, you have this magic new question answering box.
And we're in the business at some level in databases.
One of the things we do is answer questions, answer queries.
It's not the only thing we do, but it's one of the big ones that we do.
Now there's a new box that answers questions in a completely different way,
scoped very differently in terms of what it can do and what it does wrong.
How do we bring these question answering schemes together to be accurate and efficient and all that?
The thing that really juices me, though, is the idea that the world used to be divided
into structured and unstructured data. And that's just not true anymore. So if you wanted to get tabular data out of videos or text or whatever, you can.
You can featurize it and get features out, I guess, columns, and then run SQL on it.
So now we can query everything with structured queries.
Should we?
Maybe not.
Maybe yes.
So now that also raises the question of what are structured queries good for?
When is it better to ask natural language queries? And this whole thing is like mutually recursive. You know, it's like, where does the natural language stuff end and the structured stuff begin? Where does the data end and the queries begin? Everything is everything. And it's a really mushy space in that sense and a very malleable one. So you can view that positively or negatively.
But boy, there's going to be a lot of change.
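As a small sketch of the featurize-then-query idea (illustrative only; extract_fields here is a hypothetical stand-in for what would, in practice, be an LLM or model call): pull structured columns out of unstructured text, land them in a table, and then ordinary SQL applies.

```python
# Turn unstructured text into columns, then query it with plain SQL.
import re
import sqlite3

def extract_fields(review: str) -> tuple[str, int]:
    """Hypothetical featurizer: (product, star rating) from free text.
    A real system might use an LLM here; a regex keeps the sketch runnable."""
    product = re.search(r"the (\w+)", review, re.IGNORECASE)
    stars = re.search(r"(\d) stars?", review)
    return (product.group(1).lower() if product else "unknown",
            int(stars.group(1)) if stars else 0)

reviews = [
    "Loved the blender, easily 5 stars.",
    "The blender broke in a week. 1 star.",
    "The kettle is fine, 4 stars from me.",
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (product TEXT, stars INTEGER)")
conn.executemany("INSERT INTO reviews VALUES (?, ?)",
                 [extract_fields(r) for r in reviews])

for row in conn.execute(
        "SELECT product, AVG(stars) FROM reviews GROUP BY product"):
    print(row)  # structured queries over what began as unstructured data
```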
Yeah.
Yeah.
I mean, it's kind of, it's almost, from a far,
I don't really dip my toe.
I've played with TrackGPT and I use it more and more actually
in my day-to-day life, trying to get exposed to it more.
But I find it very as a feel very daunting
because of the the fact that the the output is so fast and it's hard to like get the signal from the
noise basically and it feels very much like a bubble and it's like i'm like okay come back to
me in five years when you've got some like when you realize what like what the actual useful stuff
is um but yeah maybe i should be i should i don't know persist a little bit more but uh for sure
it's going to be interesting to see
what implications it has.
On the unstructured versus structured thing real quick,
I interviewed a guy on the podcast who had a paper,
I think it was at CIDR, and he was like,
we were talking about it off air.
He's like, yeah, you can kind of give these LLMs
just numerical data, just give it random CSV files or whatever.
And it can actually reason about those. And I was like, that's just wild. Like, I'm giving it some random numbers, even. And as I said, it was baffling. I was like, wow, that's mad.
Really interesting. Yeah. I guess leading off that question a little bit then,
what do you think is a promising direction for future research in databases, maybe in the shadows of LLMs would
be a way to scope this question. What's hidden around the corner that's not getting any sunlight
and could potentially have big impact? Well, I mean, that is the story of my day-to-day.
So I think declarative languages applied to other stuff is an idea whose time has really come. I've
been saying that for a long time,
so maybe you shouldn't trust me.
But particularly as we enter the world
where the authoring of code isn't the point,
because maybe generating code is something that LLMs can do,
what we need to be doing under the covers
is going from specification to efficient implementation,
which is the magic of query optimization and
turning declarative into imperative.
And the database community has tricks up its sleeve there.
So I'm doing that in the distributed systems environment.
One of the things I like about that environment is there's a lot of data movement.
A lot of the cost of building a distributed system is about what data goes where and when.
The when part not being as
typical in databases.
That's where the distributed systems reasoning and the concurrency control reasoning comes
in.
But that's kind of my jam right now.
But I think there's lots of other places.
Like you could think about languages for programming GPUs.
You know, CUDA's not great.
Halide's much better. Halide looks a
lot more like a query language. And maybe, you know, it came out of the programming languages
community really, but database people have a lot to say there. So generally in the connection
to programming languages and not, you know, traditionally that was, oh, what's a good
language for programming a database or programming over a database, you know, like object relational mappers and stuff.
I think there's stuff to be done where it's like, no,
just database ideas are important to compilers,
are important to programming language design and the tool chain.
And there's people like Max Wilsey here at Berkeley
who are programming languages researchers learning from database people.
So that, I think there's a lot of interest there
from the PL community if you go reach out. So I think that's exciting. Dan Suciu at Washington's doing a bunch of this on the theory side; we're doing stuff on this on the applied side; there's others. But it's certainly not getting the kind of shine that LLMs get. So I think that's a lot of fun.
Cool. Awesome. Just one last question from me now, Joe.
And that is, what does success look like for you from now over the rest of your career?
What is on the, what's your goals, objectives?
Yeah.
What does success look like going forward?
Those are the hard questions in life, right?
I'd like to see one of my projects succeed in open source.
I've never done that, actually.
So Trifacta, which I think was quite successful,
we had lots of users.
That was done through non-open source,
partly because open source is a vehicle
for getting software to programmers.
It's not necessarily an important vehicle
for getting software to end users who aren't programmers. But I care about programmers, and half of my work is in the space of improving the experience for programmers. So Hydro would be an example of a thing where I'd really like
to see open source adoption. And I've never played that game, really. It is a marketing
exercise like any other, really. I mean, what I've seen from my colleagues who've done this
successfully is you go to meetups, it's really just marketing. There's a certain kind of open
source marketing that I've never tried. Obviously the work has to be good, but you don't have
salespeople selling it and you don't write papers about it. You go off and do the thing that makes
open source successful. You build communities online, you go to meetups, you do all that kind
of stuff. And I think it's gratifying to see people use your code as software engineers,
which I haven't had.
I've had people, business analysts using my code,
marketing people using my code, data engineers using my code.
But I'm a programmer.
I'd like to see programmers use my code.
That'd be cool.
So that would be a form of success that I have yet to see that I would enjoy.
I do have code in Postgres and things like that.
So it's not like I haven't felt it, but it wasn't mine.
Postgres is Stonebraker's. I'm proud of it.
I'm proud of it.
So that'd be cool.
I'm super proud of my PhD students, and I'd like to see my current crop at least do well.
So that would be success.
And if I have more students after that, then that's always, that's a very personal one
because you spend six years with a person; they're as much family as they are products, right? So, I really want... you know, that would be success for me: to see all those folks have an impact. And I think as you get older, teaching is half of the fun. And in that sense, the knowledge isn't useful
unless you pass it along.
Well, that's a great message to end on, Joe.
Thank you so much
for speaking with me today.
It's been a fascinating chat
and I'm sure the listener
will have loved it as well.
So yeah, we'll see you all next time
for some more awesome
computer science research. Thank you.