Disseminate: The Computer Science Research Podcast - High Impact in Databases with... David Maier
Episode Date: November 4, 2024
In this High Impact episode we talk to David Maier. David is the Maseeh Professor Emeritus of Emerging Technologies at Portland State University. Tune in to hear David's story and learn about some of his most impactful work. The podcast is proudly sponsored by Pometry, the developers behind Raphtory, the open source temporal graph analytics engine for Python and Rust. You can find David on: Homepage | Google Scholar
Transcript
Hello and welcome to Disseminate the Computer Science Research Podcast.
Jack here with another episode in our High Impact in Databases series.
I'm delighted to say I'm going to be talking to David Maier today.
But before we do that, shout out to our sponsor, Pometry.
Pometry are the developers behind Raphtory,
the open-source temporal graph analytics engine for Python and Rust.
Raphtory supports time travelling, multi-layer modelling, and comes out of the box with advanced analytics like community evolution, dynamic scoring, and temporal motif mining. It's blazingly fast, scales to hundreds of millions of edges on your laptop, and connects directly to all your data science tooling, including Pandas, PyG, and LangChain. So yeah, go check out what the Pometry guys are doing at www.raphtory.com, where you can dive into their tutorial for their latest 0.8 release.
Anyway, on to the podcast. So yeah, like I said at the top of the show, it's been a great pleasure to welcome David Maier to the show today. To tell you a bit more about David before he tells us his story: David is the Maseeh Professor of Emerging Technologies in the
Department of Computer Science at Portland State University. He's also the author of several kind
of key texts in the field of databases, including The Theory of Relational Databases. And across
his career, he has consulted with various companies to name a few, IBM, Microsoft, and Oracle.
And he's also won the Codd Award. And lastly, he's got a habit of coining terms for things. So yeah, he's famous for coining the term Datalog, which I'm sure many of our listeners will have come across at some point in their lives. So yeah, cool, welcome to the show, David.
Well, thank you for having me.
Awesome stuff. Cool, well, let's get started then. It's customary on the podcast for the guests to tell their story. So yeah, what has your journey been like so far?
And yeah, why did you become a database researcher?
What's the story there?
Well, it's interesting.
I always knew what I wanted to be, but what that was changed.
It was first a fireman, and then I wanted to be a scientist, and then I wanted to be
a professor, probably influenced by the fact that my father was a professor of mathematics.
And so I started studying math.
When I got to college, I was able to take some computer science courses. And so I ended up
with a double major, both in math and computer science. And then for reasons I cannot completely
reconstruct, I decided to apply to graduate school in computer science rather than mathematics.
In retrospect, it was a great decision. So I ended up in Princeton because I'd been reading things in cellular automata and those sort of theoretical topics.
I was interested at the time, and I saw all these names like, you know, Church being there and Turing and von Neumann. And little did I know
that they were all gone. Gödel, he might've still been there when I was there, but
they weren't around and they certainly weren't part of the computer science department.
But anyway, I got there, fell in with my advisor, Jeff Ullman, and actually my first research was quite theoretical.
I mean, I had a math background, so I was coming to that.
And so it was on NP-completeness of some sequence problems. Then someone who had done a postdoc at Toronto, and had started getting interested in databases there with Dennis Tsichritzis, came to Princeton for another postdoc or something. And all of a sudden, like all of
Jeff Ullman's other students were doing relational database theory. And so I was finishing up my
thesis, but at the same time I was getting involved in writing papers with them. And so then that was how I got off into databases.
So it was first the very theoretical stuff, theory of relational databases,
and then started moving into query processing and then into more database systems work
and have sort of stayed more or less around that area since then.
Yeah, so that kind of brings us up to today.
So what are you working on at the moment, David?
Ah, well, so I'm actually an emeritus professor now,
and what I've mostly been spending time on
is helping younger faculty get going.
And so I have one person I'm working with,
Primal Pappachan,
who I know is a listener to your podcast
on fine-grained privacy issues. I'm working with
Banafsheh Rekabdar on some ML problems, multimedia. I talk to people at Oregon State frequently about using large language models to learn schema updates. I've inherited one graduate student who is working on, call it data alignment, which is useful with temporal and spatial data: trying to reuse it and combine it. And then the latest thing I've gotten into is single-photon photography. So using single photon cameras to construct distance maps
by time of flight kind of things.
And it doesn't seem related to databases,
but if you stand back far enough,
it really is kind of a problem
of computing aggregates over data streams
of a certain sort.
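(As a rough illustration of that framing, here's a minimal Python sketch of the flavor of the computation: per-pixel aggregation of photon arrival times, with depth read off the histogram peak. The event format, bin width, and function names are assumptions for illustration, not any particular camera's API.)

```python
# Hypothetical sketch: per-pixel depth from a stream of single-photon
# detection events, treated as a streaming aggregate.
from collections import defaultdict

C = 299_792_458.0        # speed of light, m/s
BIN_WIDTH = 50e-12       # assumed 50-picosecond timing bins

# histograms[(x, y)][bin] = photon count for that arrival-time bin
histograms = defaultdict(lambda: defaultdict(int))

def on_photon(x, y, arrival_time_s):
    """Fold one photon event into its pixel's arrival-time histogram."""
    histograms[(x, y)][int(arrival_time_s / BIN_WIDTH)] += 1

def depth_estimate(x, y):
    """Histogram peak ~ round-trip time; distance = c * t / 2."""
    hist = histograms[(x, y)]
    if not hist:
        return None
    peak_bin = max(hist, key=hist.get)
    round_trip = (peak_bin + 0.5) * BIN_WIDTH
    return C * round_trip / 2.0
```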
And so I've worked in streaming data a lot, so I was able to bring something to the table on that.
Yeah, when you first said that, it seems like a big pivot from what you've been doing before. But when you break it down, there are some transferable skills there to that problem as well.
Cool. I wanted to... you mentioned that you spend a lot of your time helping younger faculty, and I know you have this thing on your website about your advice to mid-career researchers. Can you tell us about that?
Sure. This goes back to when I was a young faculty member at Stony Brook. And some of my cohort were already getting asked
to be on program committees.
And I wasn't.
And I was a little worried about that.
And I saw my advisor at a meeting
and he said, don't worry about it.
It'll come.
And he was right.
And it wasn't that long before
there were more things than I could do,
but you kind of have this temptation, you know, that you've been waiting for this. And then all of a sudden the opportunities come and you start seizing them. And I found that I was getting overwhelmed by that. And there were other things: conference organizing, being on study panels, things like that. And I realized that I had to be able to not accept every opportunity that came by. And I think it's a temptation among young faculty. And so, you know, I started thinking,
well, I need to have, I need to say no to most things. And so I need to have some reasons to
say yes. So, you know, so some of those reasons is, you know, I like to do things that will bring
me into touch with people outside my area. So one of the things I said yes
to was there's a board on mathematical sciences and analytics at the National Academies in the
U.S. And I served on that for multiple years as token database person, but got to meet a lot of
interesting people in other areas. You know, some things just have a lot of interesting people in other areas. Some things just have a
lot more payoff for the amount of investment. So being on a panel at a conference, make up a few
slides, but you get as much face time as if you wrote a paper. And then the other thing was coming
up with reasons not to say yes. So if somebody says, oh, you'd be the best person for it,
well, that's flattering, but it may be that, you know,
the third best person or the 10th best person for it would do an adequate job.
Or they might tell you, oh, if you don't do it, it won't get done.
Well, maybe it's something we should stop doing.
Yeah, that's a great point.
So then, you know, advising:
I've gotten some feedback from young faculty who've read this. And the one thing I said is
at the very end is, you know, making sure you make time for your family as you're going. And
a lot of them wanted that validation that, that that was important.
I myself, when I had young kids, sort of made a decision that I wouldn't, you know, work on work stuff from dinnertime until they went to bed, and I was able to do that for many years. What I hadn't calculated in is, you know, as they got older, bedtime moved from eight to nine to ten. And so trying to get back and do some useful work after ten wasn't great. So at some point, you know, they had homework and I could do some things myself at that time.
Yeah, no, I think that's great advice. So, trying to say no more. I mean, I'm terrible for it as well. It's really hard to say no a lot of the time, right? And you can end up doing five things and getting five B's rather than three A's, right? So you've got to streamline yourself a little bit and be more selective. So I mean, that's really solid.
Yeah, I've helped out some young faculty by just writing the word no on a piece of paper and saying, put this on your bulletin board, or, you know, put it in your wallet, in case you're wondering what the answer should be.
Yeah, it's really interesting you say that, because I was watching... I don't know if you've ever seen the TV series Fargo? I was watching the latest season of that, and one of the characters, in her office, she just has this giant picture. It just says no, behind her desk.
Oh, really?
Yeah, and I thought that was fantastic. I love that painting. It was awesome. But anyway, I digress. Cool. So in the next section of the podcast we like to do a bit of a retrospective of your career, David. So I guess the first question is: what are you most proud of in your career? And does that correlate with the work that's been the most impactful?
Well, that's an interesting question. So things I'm sort of proud of. So one thing is actually
my thesis research, it was on the complexity of sort of sequence matching problems over multiple sequences.
And what's gratifying about it is if you go look on something like Google Scholar,
usually the typical pattern for paper citation is there's a little peak a few years after it,
and then it sort of has a long tail, maybe a bump here or there, but this paper I have moved steadily upward
and is at a quite high, you know, higher level now than when I published it. And the reason is,
is that anybody who's working on multiple sequence alignment in bioinformatics
cites it as an excuse that their answers are approximate. Because my problem was a special case of theirs, and it's NP-complete, which means you're unlikely to find, you know, a sub-exponential algorithm for an exact solution. So that's been kind of fun.
I didn't realize; that was sort of happening in the background.
And then I decided to help someone co-teach a bioinformatics course and was looking at
the back of the text and saw that I was cited and then started noticing it more.
But then I'm also very proud of the relational database theory book.
And that was, you know, my, my advisor, Jeff Ullman
had written a lot of books and, you know, it's not necessarily something that I'd advise a young
faculty member to do because you're being evaluated at least in, you know, computer areas,
more on papers and grants, but I wanted to teach the topic anyway. And so essentially I was developing
course notes at the time. And that book, even though it's out of print, it still gets cited a
lot. And if anybody wants it, I've put a PDF scan of it up on my website, so it's had a lot of legs.
Other stuff that I'm proud of: you know, I helped introduce some ideas of stream semantics, and this idea of punctuation, and processing streams out of order. And I think those have all had influence.
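(For a flavor of the punctuation idea, a minimal Python sketch follows; it illustrates the concept rather than the published algorithms. A punctuation is a promise that no more tuples matching a pattern will arrive, which lets a blocking operator, here a per-window count, emit results and discard state even when tuples arrive out of order.)

```python
from collections import defaultdict

counts = defaultdict(int)   # window id -> running count

def on_tuple(window_id, value):
    counts[window_id] += 1  # out-of-order arrival is fine: just accumulate

def on_punctuation(closed_window):
    """Promise: no future tuple belongs to this window."""
    print(f"window {closed_window}: count = {counts.pop(closed_window, 0)}")

# Tuples for windows 1 and 2 interleave out of order...
for w, v in [(1, "a"), (2, "b"), (1, "c"), (2, "d"), (1, "e")]:
    on_tuple(w, v)
# ...but window 1 can be emitted without waiting for end-of-stream.
on_punctuation(1)   # -> window 1: count = 3
```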
And I actually went as a visiting researcher to Microsoft on several occasions and helped with the StreamInsight product they were working on there, although that's sort of been superseded by other technologies they've developed. But maybe five years ago, I first met a guy named Todd Porter, who was at Microsoft then but moved to Meta, who actually worked
more on the development side, supporting their Azure streams or their stream processing products.
And I started talking to him and he'd start saying, well, I'm thinking about the following problem.
And I was able to say, well, you know, I had a paper on that 12 years ago.
And then another problem.
And it seemed like everything he brought up almost, you know, I could point him to where we thought about it.
You know, I guess it was kind of this thing of looking at some of these problems back then, and finally it was getting to the place where they became important enough that they were of interest to industry. So those are some of the things that I think are highlights for me.
Yeah, nice. It must be quite gratifying, I mean, because that's how research should work, in a sense. You do something now, and maybe it's really at the frontier of thinking, and there's a bit of a delay before it gets into industry adoption and those ideas can actually be put into practice. So that must be really nice to see play out in front of you, almost.
Yeah, there was, I saw an interesting example of that.
There was a company called LogicBlox, and their head at the time was a gentleman named Molham Aref. And it was basically a Datalog engine under the covers; they had some surface syntax that was maybe a little bit more flexible. But I was surprised that he was going back and reading these papers from, like, you know, the 80s, on deductive databases and compiling Datalog and so forth, and applying the ideas. Yeah, it turns out that some of these theoretical results have really long shelf lives and can be picked up later.
Yeah, that's awesome. I guess as well, we mentioned Datalog there, and I teased at the top of the show that you have a habit of coining terms. So let's talk about Datalog then. You were involved in that project from the very early stages, I guess, right, to give it its name. So yeah, can you tell us a little bit more about the story with Datalog?
So back early on, it must have been my first job,
so late 70s, early 80s,
logic programming started to be more popular, come into view.
And I was at Stony Brook.
I started working with someone named David Warren,
who actually stayed and worked a lot on Prolog
and started a company with it. But we started there and I finished it when I got to Oregon,
writing a textbook on logic programming. And our strategy there was we were going to sort of build
up both the theory and the technology a little bit at a time. So we started with just propositional logic. So you just have symbols; you don't have any variables or predicates. And you could talk about different things like resolution in that context, and how you might build a simple interpreter. And then we did predicate logic, where now you have variables but no function symbols. And then we did, you know, full predicate logic with function symbols. And the first one had an obvious name, Proplog. The last one was already called Prolog. And so there was this thing in the middle, and we were trying to figure out what to call it.
And according to David Warren, I left one evening thinking about it and came back the next morning and said, oh, we should call it Datalog.
Because these things that you're working with look like database relations.
And so that stuck.
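(For listeners who haven't met Datalog, here's a toy Python rendition of what a Datalog program means operationally: relations are sets of tuples, rules are applied bottom-up to a fixpoint, and there are no function symbols. Real engines use semi-naive evaluation and much more; this is just for flavor.)

```python
# Facts: the edge relation, as a set of tuples.
edge = {("a", "b"), ("b", "c"), ("c", "d")}

# Rules: path(X, Y) :- edge(X, Y).
#        path(X, Z) :- path(X, Y), edge(Y, Z).
path = set(edge)
changed = True
while changed:                       # naive bottom-up fixpoint
    derived = {(x, z) for (x, y) in path for (y2, z) in edge if y == y2}
    changed = not derived <= path    # any genuinely new facts?
    path |= derived

print(sorted(path))   # includes derived facts such as ('a', 'd')
```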
And around that time, my advisor, Jeff Ullman, had moved to Stanford, and I was visiting him. We were talking about issues with evaluating recursive Datalog, or logic-style programs, and I started using the term Datalog with that group. And it's interesting: its first occurrence in print isn't necessarily one of my own publications. Other people picked it up; my book, where it appears, hadn't come out yet. But then, it's funny, I worked in, you know, processing recursive Datalog for a while, but then I got pulled off to work on object databases.
But then maybe 10 years ago, I got involved with Joe Hellerstein at Berkeley and some of his students, Peter Alvaro in particular, and they were looking at convergence of distributed computation, and their model was Datalog. And so I started talking to them again, and I had a couple of other papers in the Datalog domain after that.
Cool, that's fascinating. And here's another question, while we're talking about impactful work. This is another thing you have on your website, about the best paper you never published. And this is a logic for objects, and this concept called, I don't know if I'm pronouncing this correctly, Skolem surrogates?
Yes, Skolem surrogates.
So yeah, can you tell us about this?
Yeah. So I'd been working in Datalog and logic query optimization and processing. And then I got over into objects, partly because of some consulting work that I picked up here in Oregon.
There was a company called Servio Logic.
They're called GemStone now.
But originally they were building a piece of hardware
to work in a nested relational model.
So it was going to be a database machine with a nested relational model.
And at some point, it morphed into an object-oriented database.
So I'd been thinking about that.
And then I got involved with some researchers at MCC,
the Microelectronics and Computer Technology Corporation. So it was this industry
consortium down in Austin, Texas. And one of the things they were trying to do was parallel
query processing. So working with big data, this was about the time of the Japanese fifth
generation project. And so I was down meeting with them
and they were, at least some of them,
were using logic languages as a basis for that
and were interested in my inputs on languages.
And I started thinking about these extensions
you could make with, you know,
could we extend the logic languages
to have more object-oriented features?
And so I'd come up with this idea of what you really needed
was some notion of object identity
so that you could talk about updates or that you had,
and how do you represent that you have two references
to the same item?
And so, from the logic work, I knew about these things called Skolem variables, a trick to kind of get rid of existential quantifiers in certain cases.
And so I said, well, I could use something like that to represent the object identities
in these, in this setting.
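(A rough sketch of the Skolem-surrogate idea, reconstructed for illustration: manufacture an object's identity as a deterministic function, a Skolem function, of the values it was derived from, so that two derivations of the same object agree on one identifier. The function and field names here are made up.)

```python
import hashlib

def skolem_oid(functor, *args):
    """Skolem function: same functor and arguments -> same object identity."""
    key = functor + "|" + "|".join(map(str, args))
    return hashlib.sha1(key.encode()).hexdigest()[:12]

# Two independent rules that each mention "the person named Alice born
# in 1970" produce references to the very same object, which is what
# lets you talk about updates and shared references.
oid1 = skolem_oid("person", "Alice", 1970)
oid2 = skolem_oid("person", "Alice", 1970)
assert oid1 == oid2   # two references, one identity
```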
And so that best paper I never published was presented at a workshop on logic and databases
that Jack Minker and some of his friends at Maryland organized.
And afterwards, you could send your paper in to be considered for a book that would be published of it.
And they declined to include that paper in the book.
They just thought the work was a little bit too early on or whatever.
And the funny thing is, I mean, people can still get it.
There was a tech report of it.
It was still a paper.
It just wasn't in the book. But I've
seen it cited several times as
having been in the book.
Okay, right. They're retrofitting it: it was in the book, right?
Well, you know, people's assumption is: it was presented at the workshop on logic and databases, and here's this book that's kind of the proceedings, so they sort of assume it was in there. But it was an edited volume, so not everything, I guess, got into it. So people have cited it, and there was work by Michael Kifer and others at Stony Brook on extensions to it, F-logic and C-logic. So it had influence, you know; people took off from there. So in some sense it was a seminal paper, but not obviously so, I guess, at the time I wrote it.
Yeah, no, that's awesome. Just to change things up a little bit: we spoke a little about object-oriented databases, and when I first came across databases, they weren't really around in their original form as much anymore. So I wanted to get your take. The only time I've actually encountered them was in, I think, one of Jeff Ullman's books, the database systems implementation one; I believe he's an author on that, I'll just check. And I was like, what the hell are these object-oriented databases they mention here? Because they don't get spoken about as much these days as they maybe did 10, 15, 20 years ago. So yeah, I wanted to get your take on them, looking back, and what position they have today.
Yeah, so it was an interesting development. So I talked about this company I was with that was originally doing a nested relational database, and then, you know, the problem is that we were trying to devise a query language. And when you're creating a query language, and yours is the only product that has that query language, you have a problem, because there's no textbooks on it, there's no training, students aren't learning it in their classes. And so there was a little bit of internal revolt. There'd been some people who had come over from
Tektronix, which, you know, they're best known for oscilloscopes and test equipment, but at that point they had a computer research lab. And they were one of the first groups outside of Xerox PARC to do an implementation of Smalltalk. And so there were people who knew about Smalltalk, this object-oriented language.
And then somebody said, hey, here's a language that already exists. These nested relational structures kind of look like, you know, complex objects. Let's make Smalltalk, or something like it, our language. And so that's where the GemStone system had its origins. Where object databases did find markets was places like computer-aided design and computer-aided software engineering systems.
Gemstone, not so much.
And so, you know, I was involved in various debates with, you know, relational proponents like Mike Stonebraker versus this object stuff.
And, you know, a lot of the relational database companies started reacting to that by basically having their marketing departments upgrade them to object-oriented databases.
Marketing people are a lot cheaper than computer engineers. And so: we're object-oriented because we have these binary large objects, or blobs. So most of the companies didn't persist. Some are still there; I think Objectivity is still there. I believe they have a large contract with the NSA, because, as you might know, being at Neo4j, you can easily build a graph model on top of an object database. And they also, I've heard, were running in a lot of cell phone tower software.
Okay.
And so, you know, what kind of happened to them?
Well, some of the features got absorbed into mainstream databases.
Some of it showed up in more like object-oriented middleware.
Okay.
So like Enterprise JavaBeans is kind of an object model.
There's something at Microsoft called Orleans.
That's sort of a middleware object model.
And actually, it's one of these things where I'm on the second generation of something: I was working with Phil Bernstein at Microsoft Research, trying to add indexing to this.
And so that was something I'd worked on back in the 80s
and was able to bring that up to one of these sort of object middlewares.
And then there are some, you know,
there are a few object-oriented systems still around,
although it's not a big market segment.
Yeah, it's interesting what you're saying about SQL sort of always, via the marketing departments, consuming whatever it is at the time, be it object-oriented databases. The same happened with XML, and to some extent with graphs as well. I mean, there's an extension for SQL now, SQL/PGQ, and there is now the new graph query language standard, GQL, which might prevent graphs from being consumed completely by SQL. But yeah, it's interesting to see how that pattern seems to play out over and over again, to some extent.
Cool. Yeah. Of all these projects and things we've spoken about that you've worked on over the years, David, are there any that stand out as being particularly challenging or rewarding?
I'm kind of ticking through the various PhDs I've advised in my head. So one of the ones that was challenging was done with Veronika Megler, and I was involved in something called the Center for Coastal Margin Observation and Prediction.
So these were people who were basically concerned with the Columbia River estuary.
So Columbia River, for your listeners, is a big east-west mainly river.
Well, it starts up in Canada.
It drains most of the Pacific Northwest.
Okay. It gets quite wet in that part of the world, right?
Yeah. And so for things like salmon survivability, flood control, and so forth, they wanted to understand what was happening in this very complex system where, you know, the fresh water meets the ocean and tides. And so one part of that was various observation stations, some fixed, and they'd also get data from cruises. They actually had these underwater robots: some were kind of like torpedoes; other ones were called gliders, which basically just changed their buoyancy and sort of zigzagged up through the water column.
And then there was a big modeling effort where they were trying to build models that would predict, for example,
you know, there's a salt wedge that comes in with the incoming tide that tends to resuspend organic matter on the bottom because it's driving along the bottom under the freshwater.
There's a freshwater plume out into the ocean, and at the edge of that is where a lot of fish congregate.
I'd worked with these people for many years. And they were committed to open data.
So all the data they had, they wanted to get online as fast as possible.
But the problem was, there were at that point about 30,000 different data sets, and there's no one person who knew all of those. And so, you know, if somebody wanted to come in and use the data, how would you figure out which data sets they could use? And, you know, you could do things like text search, which was a largely solved problem at that point. So you could search on the name of the file or on column names, but that didn't really satisfy.
And so what we wanted is something where you could do approximate search on numeric data.
And so we came up with a system called Data Near Here.
So it's sort of like the idea is, you're on a map and you want to know, are there any service stations near here?
Or are there any restaurants nearby the highway that I'm on?
And so what we wanted mainly people to be able to do is, well, they could say what kind of data they wanted, like salinity or temperature, but then be able to give, you know, both a physical extent and a temporal
extent. Like, I'm interested in turbidity data near the Astoria Bridge from August 2018. And there may be no exact match to that. And so we worked on ways
that you could figure out, well, which data sets are there that are closest to the query?
And one of the problems we had was, you know, how do you trade off time for space? Is something
that's in the right area, but two months away, closer than something
that's at the right time, but two kilometers away. So how do you compare months to kilometers?
Because what we wanted was rank search. One of the problems is if you just did
hard ranges and a Boolean result,
then you had this problem that either, you know,
you had no data sets coming back or you had a thousand coming back.
And so it was really about rank search. And, and so that was,
that was challenging. How do you compare months to kilometers?
And what we ended up doing is sort of taking the person's query as a yardstick. So if they said, well, I'm looking at this one-kilometer-square area and for this period of two weeks, then we said, well, I guess, to them, two weeks is kind of their unit of thinking, and one kilometer. And so then we would use one kilometer equals two weeks to be able to rank these data sets.
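(A minimal sketch of that query-as-yardstick idea, reconstructed from the description above rather than from the production Data Near Here scoring function: the query's own spatial and temporal extents become the units, which makes kilometers and weeks comparable.)

```python
def yardstick_distance(query, ds):
    """Distance from query to dataset, in units of the query's extents:
    one query-width of space counts the same as one query-duration."""
    space_gap = abs(ds["km"] - query["km_center"]) / query["km_extent"]
    time_gap = abs(ds["day"] - query["t_center"]) / query["t_extent"]
    return space_gap + time_gap   # smaller = higher in the ranked results

# "I'm looking at this one-kilometer area over this two-week period."
query = {"km_center": 0.0, "km_extent": 1.0, "t_center": 0.0, "t_extent": 14.0}

datasets = {
    "right place, two months off": {"km": 0.0, "day": 60.0},
    "right time, two km away":     {"km": 2.0, "day": 0.0},
}
for name in sorted(datasets, key=lambda n: yardstick_distance(query, datasets[n])):
    print(name, round(yardstick_distance(query, datasets[name]), 2))
# -> "right time, two km away" (2.0) outranks "right place, two months off" (4.29)
```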
And we had a tool running in production for a while. But the whole enterprise... the person leading it, António Baptista, retired, and it was transferred to a group of Indian tribes here in Oregon to keep it running. But I don't know whether that part's been maintained.
Okay. That's a great name though, Data Near Here. And it's a fascinating thing to think about, how you actually compare kilometers against, I don't know, time. But I guess a lot of it depends on who's asking the question, right, as well?
Yeah, what the information need is.
I found that was sort of the challenge there. Another challenge was that you needed metadata about the data. And so we tried, as much as possible, to rely on metadata that we could harvest from the data itself. So, you know, simple things like maybe units, data ranges: things where we could go and harvest the metadata ourselves and have some assurance about its quality,
rather than relying on people
to come fill out some form later.
It's really hard to get people to be altruistic.
They've gotten the data,
they've used it for their purposes.
The value to them is to go off and do the next study
rather than sitting around documenting the data
so other people can exploit it.
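(A small sketch of that harvest-don't-ask approach: pull the metadata a search system needs, value ranges and row counts, straight out of the file, so its quality doesn't depend on anyone filling out a form later. The column names are made up for illustration.)

```python
import csv

def harvest_metadata(path, numeric_columns=("salinity", "temperature")):
    """Scan a CSV once and record per-column value ranges and a row count."""
    stats = {c: {"min": float("inf"), "max": float("-inf")} for c in numeric_columns}
    rows = 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            rows += 1
            for c in numeric_columns:
                try:
                    v = float(row[c])
                except (KeyError, ValueError):
                    continue           # missing or non-numeric cell: skip
                stats[c]["min"] = min(stats[c]["min"], v)
                stats[c]["max"] = max(stats[c]["max"], v)
    return {"path": path, "rows": rows, "ranges": stats}
```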
Totally agree, David. I mean, we can relate to the same sort of principle. We have the same thing: we'll deal with a support case at work, for example, from a customer, and we always try to be nice to your future self and your colleagues; do a little write-up, a bit of a post-mortem of what you found, right? That happens maybe one in a hundred times. And when it does and you go back to it, you look at it like, what the hell were the conclusions here? But yeah, it's hard to incentivize people to, like you say, be altruistic about it.
So another way to solve that, I mean, one is to generate the information yourself. I had another project, with graduate students Judy Cushing and Minakshi Rao. We were working with Pacific Northwest National Laboratory, up in Richland, Washington. And they had people who were doing computational chemistry, with a number of different computational chemistry codes.
And they were trying to make this accessible to bench chemists.
So you had to be sort of a computational chemist to understand how to configure these things. And so one thing they wanted was to capture information about runs of these codes, so that if somebody later on had a similar molecule to one that had been run, they could look to see, well, which code did you use, what parameter settings, what basis set for modeling the electron cloud. And, you know, we realized that trying to get people after the fact to go fill in these things about their computational run just wasn't going to cut it. So we turned the thing around and said, ah, what we should do is build a system that helps people set
up these runs. So we had this object model of molecules and so forth, and you could set up your
run. It would help you monitor the status of your run because sometimes these things are going awry
and you want to stop them. And so rather than capturing the information about the run afterwards,
we made it easy for them to plug it in ahead of time and then make the run. And I believe that
those ideas actually ended up in a tool they built up there called ECCE; I can't remember exactly what the name stands for. But it's this idea of carrot rather than stick.
If you'll write down this information
about what you're doing,
we'll make it easier for you to do that.
And so it doesn't have to be altruistic anymore.
It benefits you.
We can reapply this elsewhere.
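(A sketch of that carrot-rather-than-stick pattern, with made-up names rather than anything from the actual tool: the launcher people already want to use records the run's configuration as a side effect, so the provenance exists before the run does.)

```python
import json, subprocess, time
from pathlib import Path

def launch_run(code_path, molecule, basis_set, workdir):
    """Set up and start a computational-chemistry run, capturing its
    configuration as a byproduct of the setup the user wanted anyway."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    record = {"code": str(code_path), "molecule": molecule,
              "basis_set": basis_set,
              "started": time.strftime("%Y-%m-%dT%H:%M:%S")}
    # Metadata is written *before* the run launches: no altruism required.
    (workdir / "run_metadata.json").write_text(json.dumps(record, indent=2))
    return subprocess.Popen([str(code_path), "--molecule", molecule,
                             "--basis", basis_set], cwd=workdir)
```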
Yeah, you streamline it, right?
It makes it easier.
The onus isn't on that person to do it anymore.
You collect it naturally as a byproduct
and everyone's happy. Yeah, awesome.
Cool. So, David, the next set of questions I have for you are all about motivation. And the first one is: which people or papers have had the biggest impact on your career?
So actually, some of the things that have influenced me most
have been computer languages.
So I have SQL.
Obviously, I've talked about Datalog.
I've talked about Smalltalk.
And those languages have taken me into new areas.
In terms of getting me into databases,
you can probably trace it back to Catriel Beeri.
In terms of looking at the object-oriented ideas and later some of the works on array databases,
there was actually a colleague named Peter Buneman who was at Penn at the time.
He's at Edinburgh now or retired from Edinburgh.
But he described to me this work that one of his master's students had done where he had gone
around to shops that develop database applications and tried to see, well, where do most of the errors come from? And a lot of them came from this interface
between the database and the programming language.
And the problem is that the type system of the database
is not being carried over into the programming language.
So this is where I repurposed this term from electrical engineering: impedance mismatch. If you have a signal going down two wires of different impedances, part of it will reflect back at the junction. And so I looked at this and it's like, okay, we have this nice set model in the database, and all that structure is reflected back at the juncture: you've got records and iterators or something up at the top level.
And so that got me into kind of one of the motivations to look at,
okay, can we use the same language in the database
as you write your programs in?
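(A tiny concrete illustration of the mismatch, using Python's built-in sqlite3 module: the database holds a typed, set-oriented schema, but what crosses the boundary into the program is an iterator of bare tuples, and re-imposing the structure is the program's job.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, dept TEXT, salary REAL)")
conn.executemany("INSERT INTO emp VALUES (?, ?, ?)",
                 [("Ann", "db", 120.0), ("Bo", "ml", 110.0)])

# The set-of-typed-rows model is "reflected back at the juncture":
for row in conn.execute("SELECT name, salary FROM emp"):
    name, salary = row          # just a tuple; the schema's types are gone
    print(name, salary * 1.05)  # the program re-imposes the meaning itself
```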
And there were other efforts along these lines.
There were database programming languages that tried to put relations in as a type, like Pascal/R, adding the relation type.
There were people over, other people in Scotland, Malcolm Atkinson and others, who were doing persistent programming languages.
The idea of orthogonal persistence.
Any data structure in your programming
language could be persistent. So, you know, that little observation from Peter's student actually inspired a lot of work.
How did I get into stream processing? I had a sabbatical in Wisconsin, and we had this great proposal name I came up with, which was called A Petabyte in Your Pocket.
That's a great name.
And it was this idea of being able to give a person access to all this data.
You know, we talked about everything you ever read and every paper you looked at
and, you know, all the information on your finances and such.
If you were doing that just as an individual thing,
it would take a ton of disk drives to capture all that.
But a lot of that information is shared with others,
and it's out on the Internet. And so really what you want to be able to do is pull that data in.
And when you start looking at that and trying to run queries over remote sources like that,
the problem comes up that there's sort of pauses and delays and information is coming in at different rates.
It's not like doing a query on a local database where you control the disk, know when the data
is going to get there. And so we started looking at things where the data was arriving incrementally.
And we had thought we would work mainly with XML data because it looked at that point like that was going to be
what everybody was exchanging. Maybe not so much, but this idea of data coming in incrementally,
it's just a little step from that to saying, well, it's going to keep coming continuously,
and you want to keep computing on it and so then that presented a lot
of interesting problems. And at the same time there were other groups, at Stanford, and the Aurora group on the East Coast, looking at it. So yeah, it was several steps that got me into the stream processing world.
A Petabyte in Your Pocket: that's a great name. Cool. So I guess we've spoken about a lot of things that have been successful during your career today, David. But obviously research is non-linear, right? You have ups and downs. So my question is, how do you personally deal with that? How do you deal with setbacks and rejections?
Oh, poorly. I still have a hard time making myself read reviews when a proposal or paper is rejected.
Part of it, it gets a little easier.
Rejection is a little bit easier to handle because over the course of a longer career, you've got a batting average going.
And so one failure is not going to move the needle a lot. And, you know, I sort of deal
with a lot of this by what I call unjustifiable optimism. I, you know, I just imagine how, you
know, how things are going to be with really not a lot of basis, maybe in reality, but it helps keep me going. I think I adopted that. My first chairman
when I was at Stony Brook, a guy named Jack Heller, just had these ridiculously optimistic plans about
what was going to happen with the computer science department. You know, we were going to build this
new building and get us all these faculty. You know, just, you know, I looked at that and, you
know, said, there's just no way that's going to happen. But it turns out we got about a third of it. And, you know, we got a new building,
we got some new faculty positions. And so, you know, a third of a huge amount is still substantial.
And so I just like to, I don't know, imagine that things are going to go well in the future.
And that keeps me going. I guess the hardest part right now is that I'm often working with, you know,
students or junior faculty and this earlier in their careers.
And I think rejections are harder to deal with then.
And so, you know, just sort of supporting them and keeping them going.
I did get some advice from my advisor. He once told me that time spent writing proposals is seldom wasted: that even if a given proposal gets rejected, you can find another opportunity to plug it in and work with it. And that's largely been true. You write a proposal and it doesn't get funded, but you get, you know, an invitation to a keynote talk, so you get a place to present it. So yeah, I guess unjustifiable optimism helps.
I love that, David.
I'm going to adopt that as well.
That's brilliant. Yeah, I think also, like you say, later on you have a longer time horizon: you've developed a batting average, and one rejection doesn't necessarily move the needle. But earlier in your career it does. The first one you get, for example, is horrible, right? Like, I remember the first rejection I got, and I didn't deal with it very well, I don't think. But I think you maybe get better at dealing with it over time, trying to detach yourself from the rejection and not take it so personally. Anyway, cool. I guess related slightly to what you were saying there, about iterating on something when it does get rejected, and how you can always find a use for it somewhere else, or it may lead to a separate idea: there's this question around the creative process. How do you approach that, David? Do you have a systematic way of generating ideas and then selecting which things to work on?
I mean, you gave me that question in advance, and I thought about it for a while. And I think one of my techniques, or one of my capabilities, is seeing
patterns. So, you know, if you think about it, database systems don't do anything that you couldn't do without them. They try to do it more reliably, more efficiently in terms of, like, programmer time and computer resources, but you could go write the same stuff in a general-purpose programming language and do it. And so, you know, if you look at Codd,
when he proposed the relational model, well, he was looking at, you know, what do, you know,
business data processing programs look like inside. And he saw that, you know, there were
these common patterns. It was scanning through something, there was taking a subset of it, you know, either columns or rows, it was combining
it with another table. And he was able to say, well, you know, there's these half dozen or eight
operations that you can do that can explain a lot of what's going on. And so he saw a pattern there
and exploited it. And so I think I've, you know, also, you know, when I talk to
people, especially people who are applying computer technology, looking at what they're doing and
seeing a pattern there and seeing if you can exploit it. I mean, one example of that was
actually somebody else, Jim Gray, who observed this pattern: if you look at, you know, a COBOL program, and now you've adopted a relational database, well, the application program shrinks to about half the size.
So the database part has taken away a lot of the code and will do it. But what's left? Well, there's a lot of business logic and user interface stuff still to write. And so I thought, okay, well, there's this pattern of how people write stuff. If you look more closely, you'll often have this sort of intermediate layer where the gurus who understand this particular database have written, you know, the update and access subroutines to use in the application program, and the UI developers and such call those. And then, you know, you only access the database through these approved subroutines. And I'm thinking, ah, okay,
so you've got this logic and programs
sitting above the database.
Well, can't we pull that into the database?
And so I think other people do this.
As you see, you look at applications of databases
and you're saying, oh, well, a lot of people
are trying to use this for geospatial information.
So maybe we should have a GIS extension,
or people are trying to process text.
And so, you know, looking at what people are doing,
seeing patterns in what they're doing and saying,
okay, is that something that we can codify
and make into a common service
and get the benefits from that?
Yeah, no, that's a really nice way... I'll try and use that when I approach things, and try and see patterns. I mean, I guess humans, on some level, all we are is pattern recognition machines, right? On some fundamental level, that's how we learn.
Well, I've also used it in guiding graduate students. Like, you know, this thing about user interfaces:
I had a student working on building graphical interfaces for objects.
And what I kept doing is just saying, okay, you did it for this,
you did it for that, with the hope of that once they did it,
like the third or fourth time, they'd start seeing a pattern
and then be able to abstract back from that
and build something easier.
And I was really surprised with this student,
Belinda Buonafe.
She had showed me an interface
for describing the user interfaces we wanted.
And I'd asked her to make a certain change
and she knocked on my office door about an hour later, and it was implemented. And so it turned out she was using her own tool to build her own interface, so she could update it declaratively rather than actually recoding a bunch. So giving a graduate student similar problems over and over until, out of self-defense, they develop an abstraction and a way to do it: I find this effective.
Yeah, that's awesome.
Good, good. Obviously, at the very top of the show I mentioned that you've collaborated, and we've spoken about it quite a lot throughout the course of the podcast, a lot with industry. And one of the missions of this podcast is to help further bridge the gap between research and industry. So my question is, what do you think about the current interaction between academia and industry, and how would we make it better? Or is it already perfect? Yeah, what's your take on things?
So it's hard for me to speak
broadly about academia; maybe somewhat about computer science and industry. In databases, you know, there's pretty good connections. You get both the industrial research people and
developers and the academics coming to the same conferences. And people move back and forth,
people build prototypes and then make companies out of them, they send their students to work.
And so there's, I think, good crossing back and forth. I've also seen, it may not be
routine, but there are people I encounter in industry who do actually go read papers.
You know, I talked about Molham Aref going back and reading the literature on declarative query processing or deductive query processing. You know, this person,
Todd Porter that I work with or talk to from Meta, you know, he's always, you know,
talking about a paper or, or a blog post
that he just read about something. So, you know, there are people who pay attention to the papers.
So I think that's good. I mean, one thing that kind of messes things up is when you get, I mean, I've been around long enough to see like these peaks in demand, you know,
of companies starting and then, you know, sort of going down. And so there was like one early on
around PCs and departmental computers where all these little software companies were starting.
You didn't get your software just from IBM or Burroughs anymore.
And then there was another one with the dot-com or whatever.
And so what ends up happening is that that sucks a lot of people out of academia, at least temporarily.
And so you feel a little left behind with that. And I mean, some schools are getting better
at being able to let somebody, you know,
stay in some role as a faculty member
while still being involved in a company outside.
But, you know, enrollments in computer science
are often counter-cyclical.
You know, when the industry is not doing well,
people decide, oh, it's a good time to come back
and update my credentials or something.
And then when things are doing really well,
you know, you're competing for faculty
with industry and so forth,
and it's a little hard.
Yeah, it's funny you should mention the cycles, because I was just recently reading the book about Amazon, about the story of it and everything. And there was a section on when AWS started to be founded, and they were mentioning a few of the names, and a lot of the people came from, I think it was, the University of Wisconsin-Madison, and especially universities in the northwest part of the States as well. I was like, that's some serious brain drain on the universities, right? I mean, if you don't replenish it, eventually it's not going to be there.
Cool. I guess just one last question now, David, and that is the future and current trends. From what you observe at the moment, what do you see as the most exciting avenues for future research?
Well, it's hard for me to tell what will excite others, but things that intrigue me, one of them is, you know, what's important in the next generation of stream processing.
And this is another thing where I've talked with my friend Todd at Meta, and we sort of proposed that what it might be is scaling the management of systems. So, you know, originally people talked about single stream queries, and then being able to parallelize those across multiple processors. So people figured out ways to do parallel evaluation
of single queries on very large data streams.
But now, if you look at what's behind Facebook or whatever, you've got, you know, maybe thousands of stream jobs, each with maybe a hundred or a thousand parallel tasks. You can't manage things at the individual data item or process level; there's just too many of them for someone to pay attention to. And so we're sort of saying, well, maybe the direction is that you have to have some coarser groupings over this fine-grained stuff to help you manage.
And one of the things we've been talking about is what we call job chopping. So think of a job that's taking in, say, user activity events and processing those, maybe getting some of it ready to, you know, figure out what ads to show them, and other things going off to the ML people who are trying to train various recommenders or whatever.
And so the idea is just, well, in theory, it's a perpetual job, but what if we just
break it up into one hour segments and run it for an hour and then shut it down at the end of the
hour. And just before that, we'll start up the next instance of it. And it turns out that that makes some things like fault recovery easier. And it's also easier
to do auto scaling and migration if you have these built-in boundaries.
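(A toy rendition of job chopping, assuming a hypothetical broker interface with head_offset() and read() calls; this is a sketch of the idea as described, not Meta's system. Each bounded segment reads a slice of the message stream, and the successor is started a little early at a known cut offset, which is the built-in boundary.)

```python
import threading, time

SEGMENT_SECONDS = 3600    # run each joblet for about an hour
OVERLAP_SECONDS = 120     # warm up the successor two minutes early

def run_segment(start_offset, broker, process):
    """One bounded instance of a 'perpetual' stream job."""
    deadline = time.time() + SEGMENT_SECONDS
    offset, successor, cut = start_offset, None, None
    while cut is None or offset < cut:
        if successor is None and time.time() > deadline - OVERLAP_SECONDS:
            cut = broker.head_offset()   # hand-off point in the stream
            successor = threading.Thread(
                target=run_segment, args=(cut, broker, process))
            successor.start()            # next joblet starts catching up
        batch, offset = broker.read(offset, up_to=cut)
        process(batch)
    # Drained up to the cut: this instance simply exits at its boundary,
    # which is what makes fault recovery, auto-scaling, and migration easy.
```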
So looking at ways of, we had a little presentation that, oh, there's a Northwest Database Society meeting about once a year.
And we had a presentation, we called it Block and Tackle.
So, you know, breaking these things up into blocks and managing them.
I've been accused of pun-driven research, like coming up with a great title and having to write a paper that matches it.
Another thing I think is interesting to me is what I call data productivity.
So there's just data being collected for all sorts of purposes.
Some of my interactions there have been with a traffic archive called Portal at Portland State, where they bring in things like the ramp meters, the loop detectors, bus position information,
stuff that's used operationally, but that if you collect it, you can see patterns or do research. And so this idea of
the value you get out of data over the cost of collecting and maintaining it. And I'm
particularly interested on focusing on the numerator of that ratio.
So how do we improve the value you receive from data?
I mean, there's a lot of work on how do we collect data better and store it more cheaply.
And so, you know, it's hard to put a number on the value of the uses of data,
but I'm pretty sure it correlates strongly with the number of uses of the data.
The more it's used, the better. And there's a lot of what I call 'one and done' data, collected for a purpose and then, you know, never looked at again, or even 'none and done' data, that is collected and never analyzed. And so I'm trying to think about things that make data reuse easier and thereby boost data productivity.
And so one of the things I mentioned briefly at the beginning is alignment.
So if you're trying to take existing data sources that are both time series,
but the times don't match, you know, how are you going to adjust them
if you want to do some kind of analysis or simple plot or whatever?
Even a scatter plot, you have to have your two data sets at the same time points.
And so working on declarative ways of saying, how do I align these two data sets?
It's a little corner of this data productivity thing.
But what are the things we can do to make data more reusable
and thus get more value out of it?
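(A small example of the alignment problem using pandas: two series sampled at different times can't even be scatter-plotted until they share time points. Interpolating both onto a common grid is one simple alignment policy of the many a declarative approach might let you choose.)

```python
import pandas as pd

salinity = pd.Series([31.2, 30.8, 31.5], index=pd.to_datetime(
    ["2018-08-01 00:07", "2018-08-01 01:02", "2018-08-01 02:11"]))
turbidity = pd.Series([4.1, 3.9, 4.6], index=pd.to_datetime(
    ["2018-08-01 00:30", "2018-08-01 01:45", "2018-08-01 02:20"]))

def align_to(series, grid):
    """Time-interpolate a series onto a shared grid of time points."""
    union = series.reindex(series.index.union(grid))
    return union.interpolate(method="time").reindex(grid)

grid = pd.date_range("2018-08-01 01:00", periods=2, freq="h")
aligned = pd.DataFrame({"salinity": align_to(salinity, grid),
                        "turbidity": align_to(turbidity, grid)})
print(aligned)   # same time points for both: now a scatter plot makes sense
```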
Yeah, that's fascinating. Just to jump back to the job chopping really quick there, David. I like that the idea of having these boundaries makes things like managing fault tolerance easier. But how do you avoid downtime of the system? Do you have the next instance on a hot switch-over? Because that's the thing that was jumping out at me: you bring the boundary down at one point, so how do you switch?
Without getting too much into the details of how the scheme would work: in the particular situation that suggested this, they have an underlying message broker system. It's kind of like Kafka, but it's an internal one.
And so that's where the data is coming into the stream queries.
So you, and it has its own state.
So you can start another instance of the job,
you know, at a point in that message stream
while this one finishes up.
So, you know, when it gets time, close to time, you know, maybe two minutes before the
hour or something, you can start this next one up.
The other thing that Todd noticed is often, well, okay, these things are continually processing
the data.
They want to keep up with it, but they're only emitting outputs like every five minutes
or every 15 minutes.
So you don't necessarily
have to have the job, these
joblets running continuously.
If you get them started up
and they can catch up with the data
before their next reporting point,
then, you know, you're good.
You're good, right? Yeah, things are cool.
I mean, there's, you know,
there's a lot of details to figure out,
but it seems to, you know,
have some advantages that make it worth pursuing.
Yeah, awesome.
Cool.
Well, I think that's a wrap then, David.
We can finish up there.
It's been a fascinating conversation.
I've loved it.
And I'm sure the listeners will as well.
So thank you very much for taking the time
to speak with me today.
Yeah, my pleasure.