Disseminate: The Computer Science Research Podcast - High Impact in Databases with... David Maier
Episode Date: November 4, 2024
In this High Impact episode we talk to David Maier. David is the Maseeh Professor Emeritus of Emerging Technologies at Portland State University. Tune in to hear David's story and learn about some of his most impactful work. The podcast is proudly sponsored by Pometry, the developers behind Raphtory, the open source temporal graph analytics engine for Python and Rust. You can find David on: Homepage | Google Scholar
Transcript
Hello and welcome to Disseminate the Computer Science Research Podcast.
Jack here with another episode in our High Impact in Databases series.
I'm delighted to say I'm going to be talking to David Maier today.
But before we do that, shout out to our sponsor, Pometry.
Pometry are the developers behind Raphtory,
the open-source temporal graph analytics engine for Python and Rust.
Raphtory supports time travelling, multi-layer modelling, and comes out of the box with advanced analytics like community evolution, dynamic scoring, and temporal motif mining. It's blazingly fast, scales to hundreds of millions of edges on your laptop, and connects directly to all your data science tooling, including Pandas, PyG, and LangChain. So yeah, go check out what the Pometry guys are doing at www.raphtory.com, where you can dive into their tutorial for their latest 0.8 release.
Anyway, on to the podcast. So yeah, like I said at the top of the show, it's been a great pleasure to welcome David Maier to the show today. To tell you a bit more about David before he tells us his story: David is the Maseeh Professor of Emerging Technologies in the
Department of Computer Science at Portland State University. He's also the author of several kind
of key texts in the field of databases, including The Theory of Relational Databases. And across
his career, he has consulted with various companies to name a few, IBM, Microsoft, and Oracle.
And he's also won the Codd Award. And lastly, he's got a habit of coining terms for things. So yeah, he's famous for coining the term Datalog, which I'm sure many of our listeners will have come across at some point in their lives. So yeah, cool, welcome to the show, David.
Well, thank you for having me.
Awesome stuff. Cool, well, let's get started then. It's customary on the podcast for the guests to tell their story. So yeah, what has your journey been like so far?
And yeah, why did you become a database researcher?
What's the story there?
Well, it's interesting.
I always knew what I wanted to be, but what that was changed.
It was first a fireman, and then I wanted to be a scientist, and then I wanted to be
a professor, probably influenced by the fact that my father was a professor of mathematics.
And so I started studying math.
When I got to college, I was able to take some computer science courses. And so I ended up
with a double major, both in math and computer science. And then for reasons I cannot completely
reconstruct, I decided to apply to graduate school in computer science rather than mathematics.
In retrospect, it was a great decision. So I ended up in Princeton because I'd been reading things in cellular automata and those sort of theoretical topics.
I was interested at the time, and I saw all these names like, you know, Church being there and Turing and von Neumann. And little did I know
that they were all gone. Gödel, he might've still been there when I was there, but
they weren't around and they certainly weren't part of the computer science department.
But anyway, I got there, fell in with my advisor, Jeff Ullman, and actually my first research was quite theoretical.
I mean, I had a math background, so I was coming to that.
And so it was on NP-completeness of some sequence problems. Then someone who had done a postdoc at Toronto, and had started getting interested in databases there with Dennis Tsichritzis, came to Princeton for another postdoc or something. And all of a sudden, like all of
Jeff Ullman's other students were doing relational database theory. And so I was finishing up my
thesis, but at the same time I was getting involved in writing papers with them. And so then that was how I got off into databases.
So it was first the very theoretical stuff, theory of relational databases,
and then started moving into query processing and then into more database systems work
and have sort of stayed more or less around that area since then.
Yeah, so that kind of brings us up to today.
So what are you working on at the moment, David?
Ah, well, so I'm actually an emeritus professor now,
and what I've mostly been spending time on
is helping younger faculty get going.
And so I have one person I'm working with,
Primal Pappachan,
who I know is a listener to your podcast
on fine-grained privacy issues. I'm working with
Banafsheh Rekabdar on some ML problems, multimedia. I talk to people at Oregon State frequently about using large language models to learn schema updates. I've inherited one graduate student who is working on, call it data alignment, which is useful with temporal and spatial data: trying to reuse it and combine it. And then the latest thing I've gotten into is single-photon photography. So using single photon cameras to construct distance maps
by time of flight kind of things.
And it doesn't seem related to databases,
but if you stand back far enough,
it really is kind of a problem
of computing aggregates over data streams
of a certain sort.
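(As a rough illustration of that framing, here's a minimal Python sketch of the flavor of the computation: per-pixel aggregation of photon arrival times, with depth read off the histogram peak. The event format, bin width, and function names are assumptions for illustration, not any particular camera's API.)

```python
# Hypothetical sketch: per-pixel depth from a stream of single-photon
# detection events, treated as a streaming aggregate.
from collections import defaultdict

C = 299_792_458.0        # speed of light, m/s
BIN_WIDTH = 50e-12       # assumed 50-picosecond timing bins

# histograms[(x, y)][bin] = photon count for that arrival-time bin
histograms = defaultdict(lambda: defaultdict(int))

def on_photon(x, y, arrival_time_s):
    """Fold one photon event into its pixel's arrival-time histogram."""
    histograms[(x, y)][int(arrival_time_s / BIN_WIDTH)] += 1

def depth_estimate(x, y):
    """Histogram peak ~ round-trip time; distance = c * t / 2."""
    hist = histograms[(x, y)]
    if not hist:
        return None
    peak_bin = max(hist, key=hist.get)
    round_trip = (peak_bin + 0.5) * BIN_WIDTH
    return C * round_trip / 2.0
```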
And so I've worked in streaming data a lot, so I was able to bring something to the table on that.
Yeah, when you first said that, it seems like a big pivot from what you've been doing before. But when you break it down, there are some transferable skills there to that problem as well.
Cool. I wanted to... you mentioned that you spend a lot of your time helping younger faculty, and I know you have this thing on your website about your advice to mid-career researchers. Can you tell us about that?
Sure. This goes back to when I was a young faculty member at Stony Brook. And some of my cohort were already getting asked
to be on program committees.
And I wasn't.
And I was a little worried about that.
And I saw my advisor at a meeting
and he said, don't worry about it.
It'll come.
And he was right.
And it wasn't that long before
there were more things than I could do,
but you kind of have this temptation, you know, that you've been waiting for this. And then all of a sudden the opportunities come and you start seizing them. And I found that I was getting overwhelmed by that. And there were other things: conference organizing, being on study panels, things like that. And I realized that I had to be able to not accept every opportunity that came by. And I think it's a temptation among young faculty. And so, you know, I started thinking,
well, I need to have, I need to say no to most things. And so I need to have some reasons to
say yes. So, you know, so some of those reasons is, you know, I like to do things that will bring
me into touch with people outside my area. So one of the things I said yes
to was there's a board on mathematical sciences and analytics at the National Academies in the
U.S. And I served on that for multiple years as token database person, but got to meet a lot of
interesting people in other areas. You know, some things just have a lot of interesting people in other areas. Some things just have a
lot more payoff for the amount of investment. So being on a panel at a conference, make up a few
slides, but you get as much face time as if you wrote a paper. And then the other thing was coming
up with reasons not to say yes. So if somebody says, oh, you'd be the best person for it,
well, that's flattering, but it may be that, you know,
the third best person or the 10th best person for it would do an adequate job.
Or they might tell you, oh, if you don't do it, it won't get done.
Well, maybe it's something we should stop doing.
Yeah, that's a great point.
So then, you know, advising:
I've gotten some feedback from young faculty who've read this. And the one thing I said is
at the very end is, you know, making sure you make time for your family as you're going. And
a lot of them wanted that validation that, that that was important.
I myself, when I had young kids, sort of made a decision that I wouldn't, you know, work on work stuff from dinnertime until they went to bed, and I was able to do that for many years. What I hadn't calculated in is, you know, as they got older, bedtime moved from eight to nine to ten. And so trying to get back and do some useful work after ten wasn't great. So at some point, you know, they had homework and I could do some things myself at that time.
Yeah, no, I think that's great advice. So, trying to say no more. I mean, I'm terrible for it as well. It's really hard to say no a lot of the time, right? And you can end up doing five things and getting five B's rather than three A's, right? So you've got to streamline yourself a little bit and be more selective. So I mean, that's really solid.
Yeah, I've helped out some young faculty by just writing the word no on a piece of paper and saying, put this on your bulletin board, or, you know, put it in your wallet, in case you're wondering what the answer should be.
Yeah, it's really interesting you say that, because I was watching... I don't know if you've ever seen the TV series Fargo? I was watching the latest season of that, and one of the characters, in her office, she just has this giant picture. It just says no, behind her desk.
Oh, really?
Yeah, and I thought that was fantastic. I love that painting. It was awesome. But anyway, I digress. Cool. So in the next section of the podcast we like to do a bit of a retrospective of your career, David. So I guess the first question is: what are you most proud of in your career? And does that correlate with the work that's been the most impactful?
Well, that's an interesting question. So things I'm sort of proud of. So one thing is actually
my thesis research, it was on the complexity of sort of sequence matching problems over multiple sequences.
And what's gratifying about it is if you go look on something like Google Scholar,
usually the typical pattern for paper citation is there's a little peak a few years after it,
and then it sort of has a long tail, maybe a bump here or there, but this paper I have moved steadily upward
and is at a quite high, you know, higher level now than when I published it. And the reason is,
is that anybody who's working on multiple sequence alignment in bioinformatics
cites it as an excuse that their answers are approximate. Because my problem was a special case of theirs, and it's NP-complete, which means you're unlikely to find, you know, a sub-exponential algorithm for an exact solution. So that's been kind of fun.
I didn't realize; that was sort of happening in the background.
And then I decided to help someone co-teach a bioinformatics course and was looking at
the back of the text and saw that I was cited and then started noticing it more.
But then I'm also very proud of the relational database theory book.
And that was, you know, my, my advisor, Jeff Ullman
had written a lot of books and, you know, it's not necessarily something that I'd advise a young
faculty member to do because you're being evaluated at least in, you know, computer areas,
more on papers and grants, but I wanted to teach the topic anyway. And so essentially I was developing
course notes at the time. And that book, even though it's out of print, it still gets cited a
lot. And if anybody wants it, I've put a PDF scan of it up on my website, so it's had a lot of legs.
Other stuff that I'm proud of: you know, I helped introduce some ideas of stream semantics, and this idea of punctuation, and processing streams out of order. And I think those have all had influence.
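(For a flavor of the punctuation idea, a minimal Python sketch follows; it illustrates the concept rather than the published algorithms. A punctuation is a promise that no more tuples matching a pattern will arrive, which lets a blocking operator, here a per-window count, emit results and discard state even when tuples arrive out of order.)

```python
from collections import defaultdict

counts = defaultdict(int)   # window id -> running count

def on_tuple(window_id, value):
    counts[window_id] += 1  # out-of-order arrival is fine: just accumulate

def on_punctuation(closed_window):
    """Promise: no future tuple belongs to this window."""
    print(f"window {closed_window}: count = {counts.pop(closed_window, 0)}")

# Tuples for windows 1 and 2 interleave out of order...
for w, v in [(1, "a"), (2, "b"), (1, "c"), (2, "d"), (1, "e")]:
    on_tuple(w, v)
# ...but window 1 can be emitted without waiting for end-of-stream.
on_punctuation(1)   # -> window 1: count = 3
```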
And I actually went as a visiting researcher to Microsoft on several occasions and helped with the StreamInsight product they were working on there, although that's sort of been superseded by other technologies they've developed. But maybe five years ago, I first met a guy named Todd Porter, who was at Microsoft then but moved to Meta, who actually worked
more on the development side, supporting their Azure streams or their stream processing products.
And I started talking to him and he'd start saying, well, I'm thinking about the following problem.
And I was able to say, well, you know, I had a paper on that 12 years ago.
And then another problem.
And it seemed like everything he brought up almost, you know, I could point him to where we thought about it.
You know, I guess it was kind of this thing of looking at some of these problems back then, and finally it was getting to the place where they became important enough that they were of interest to industry. So those are some of the things that I think are highlights for me.
Yeah, nice. It must be quite gratifying, I mean, because that's how research should work, in a sense. You do something now, and maybe it's really at the frontier of thinking, and there's a bit of a delay before it gets into industry adoption and those ideas can actually be put into practice. So that must be really nice to see play out in front of you, almost.
Yeah, there was, I saw an interesting example of that.
There was a company called LogicBlox, and their head at the time was a gentleman named Molham Aref. And it was basically a Datalog engine under the covers; they had some surface syntax that was maybe a little bit more flexible. But I was surprised that he was going back and reading these papers from, like, you know, the 80s, on deductive databases and compiling Datalog and so forth, and applying the ideas. Yeah, it turns out that some of these theoretical results have really long shelf lives and can be picked up later.
Yeah, that's awesome. I guess as well, we mentioned Datalog there, and I teased at the top of the show that you have a habit of coining terms. So let's talk about Datalog then. You were involved in that project from the very early stages, I guess, right, to give it its name. So yeah, can you tell us a little bit more about the story with Datalog?
So back early on, it must have been my first job,
so late 70s, early 80s,
logic programming started to be more popular, come into view.
And I was at Stony Brook.
I started working with someone named David Warren,
who actually stayed and worked a lot on Prolog
and started a company with it. But we started there and I finished it when I got to Oregon,
writing a textbook on logic programming. And our strategy there was we were going to sort of build
up both the theory and the technology a little bit at a time. So we started with just propositional logic. So you just have symbols; you don't have any variables or predicates. And you could talk about different things like resolution in that context, and how you might build a simple interpreter. And then we did predicate logic, where now you have variables but no function symbols. And then we did, you know, full predicate logic with function symbols. And the first one had an obvious name, Proplog. The last one was already called Prolog. And so there was this thing in the middle, and we were trying to figure out what to call it.
And according to David Warren, I left one evening thinking about it and came back the next morning and said, oh, we should call it Datalog.
Because these things that you're working with look like database relations.
And so that stuck.
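(For listeners who haven't met Datalog, here's a toy Python rendition of what a Datalog program means operationally: relations are sets of tuples, rules are applied bottom-up to a fixpoint, and there are no function symbols. Real engines use semi-naive evaluation and much more; this is just for flavor.)

```python
# Facts: the edge relation, as a set of tuples.
edge = {("a", "b"), ("b", "c"), ("c", "d")}

# Rules: path(X, Y) :- edge(X, Y).
#        path(X, Z) :- path(X, Y), edge(Y, Z).
path = set(edge)
changed = True
while changed:                       # naive bottom-up fixpoint
    derived = {(x, z) for (x, y) in path for (y2, z) in edge if y == y2}
    changed = not derived <= path    # any genuinely new facts?
    path |= derived

print(sorted(path))   # includes derived facts such as ('a', 'd')
```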
And around that time, my advisor, Jeff Ullman, had moved to Stanford, and I was visiting him. We were talking about issues with evaluating recursive Datalog, or logic-style programs, and I started using the term Datalog with that group. And it's interesting: its first occurrence in print isn't necessarily one of my own publications. Other people picked it up; my book, where it appears, hadn't come out yet. But then, it's funny, I worked in, you know, processing recursive Datalog for a while, but then I got pulled off to work on object databases.
But then maybe 10 years ago, I got involved with Joe Hellerstein at Berkeley and some of his students, Peter Alvaro in particular, and they were looking at convergence of distributed computation, and their model was Datalog. And so I started talking to them again, and I had a couple of other papers in the Datalog domain after that.
Cool, that's fascinating. And here's another question, while we're talking about impactful work. This is another thing you have on your website, about the best paper you never published. And this is a logic for objects, and this concept called, I don't know if I'm pronouncing this correctly, Skolem surrogates?
Yes, Skolem surrogates.
So yeah, can you tell us about this?
Yeah. So I'd been working in Datalog and logic query optimization and processing. And then I got over into objects, partly because of some consulting work that I picked up here in Oregon.
There was a company called Servio Logic.
They're called GemStone now.
But originally they were building a piece of hardware
to work in a nested relational model.
So it was going to be a database machine with a nested relational model.
And at some point, it morphed into an object-oriented database.
So I'd been thinking about that.
And then I got involved with some researchers at MCC,
the Microelectronics and Computer Technology Corporation. So it was this industry
consortium down in Austin, Texas. And one of the things they were trying to do was parallel
query processing. So working with big data, this was about the time of the Japanese fifth
generation project. And so I was down meeting with them
and they were, at least some of them,
were using logic languages as a basis for that
and were interested in my inputs on languages.
And I started thinking about these extensions
you could make with, you know,
could we extend the logic languages
to have more object-oriented features?
And so I'd come up with this idea of what you really needed
was some notion of object identity
so that you could talk about updates or that you had,
and how do you represent that you have two references
to the same item?
And so, from the logic work, I knew about these things called Skolem variables, a trick to kind of get rid of existential quantifiers in certain cases.
And so I said, well, I could use something like that to represent the object identities
in these, in this setting.
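(A rough sketch of the Skolem-surrogate idea, reconstructed for illustration: manufacture an object's identity as a deterministic function, a Skolem function, of the values it was derived from, so that two derivations of the same object agree on one identifier. The function and field names here are made up.)

```python
import hashlib

def skolem_oid(functor, *args):
    """Skolem function: same functor and arguments -> same object identity."""
    key = functor + "|" + "|".join(map(str, args))
    return hashlib.sha1(key.encode()).hexdigest()[:12]

# Two independent rules that each mention "the person named Alice born
# in 1970" produce references to the very same object, which is what
# lets you talk about updates and shared references.
oid1 = skolem_oid("person", "Alice", 1970)
oid2 = skolem_oid("person", "Alice", 1970)
assert oid1 == oid2   # two references, one identity
```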
And so that best paper I never published was presented at a workshop on logic and databases
that Jack Minker and some of his friends at Maryland organized.
And afterwards, you could send your paper in to be considered for a book that would be published of it.
And they declined to include that paper in the book.
They just thought the work was a little bit too early on or whatever.
And the funny thing is, I mean, people can still get it.
There was a tech report of it.
It was still a paper.
It just wasn't in the book. But I've
seen it cited several times as
having been in the book.
Okay, right. They're retrofitting it: it was in the book, right?
Well, you know, people's assumption is: it was presented at the workshop on logic and databases, and here's this book that's kind of the proceedings, so they sort of assume it was in there. But it was an edited volume, so not everything, I guess, got into it. So people have cited it, and there was work by Michael Kifer and others at Stony Brook on extensions to it, F-logic and C-logic. So it had influence, you know; people took off from there. So in some sense it was a seminal paper, but not obviously so, I guess, at the time I wrote it.
Yeah, no, that's awesome. Just to change things up a little bit: we spoke a little about object-oriented databases, and when I first came across databases, they weren't really around in their original form as much anymore. So I wanted to get your take. The only time I've actually encountered them was in, I think, one of Jeff Ullman's books, the database systems implementation one; I believe he's an author on that, I'll just check. And I was like, what the hell are these object-oriented databases they mention here? Because they don't get spoken about as much these days as they maybe did 10, 15, 20 years ago. So yeah, I wanted to get your take on them, looking back, and what position they have today.
Yeah, so it was an interesting development. So I talked about this company I was with that was originally doing a nested relational database, and then, you know, the problem is that we were trying to devise a query language. And when you're creating a query language, and yours is the only product that has that query language, you have a problem, because there's no textbooks on it, there's no training, students aren't learning it in their classes. And so there was a little bit of internal revolt. There'd been some people who had come over from
Tektronix, which, you know, they're best known for oscilloscopes and test equipment, but at that point they had a computer research lab. And they were one of the first groups outside of Xerox PARC to do an implementation of Smalltalk. And so there were people who knew about Smalltalk, this object-oriented language.
And then somebody said, hey, here's a language that already exists. These nested relational structures kind of look like, you know, complex objects. Let's make Smalltalk, or something like it, our language. And so that's where the GemStone system had its origins. Where object databases did find markets was places like computer-aided design and computer-aided software engineering systems.
Gemstone, not so much.
And so, you know, I was involved in various debates with, you know, relational proponents like Mike Stonebraker versus this object stuff.
And, you know, a lot of the relational database companies started reacting to that by basically having their marketing departments upgrade them to object-oriented databases.
Marketing people are a lot cheaper than computer engineers. And so: we're object-oriented because we have these binary large objects, or blobs. So most of the companies didn't persist. Some are still there; I think Objectivity is still there. I believe they have a large contract with the NSA, because, as you might know, being at Neo4j, you can easily build a graph model on top of an object database. And they also, I've heard, were running in a lot of cell phone tower software.
Okay.
And so, you know, what kind of happened to them?
Well, some of the features got absorbed into mainstream databases.
Some of it showed up in more like object-oriented middleware.
Okay.
So like Enterprise JavaBeans is kind of an object model.
There's something at Microsoft called Orleans.
That's sort of a middleware object model.
And actually, it's one of these things where I'm on the second generation of something: I was working with Phil Bernstein at Microsoft Research, trying to add indexing to this.
And so that was something I'd worked on back in the 80s
and was able to bring that up to one of these sort of object middlewares.
And then there are some, you know,
there are a few object-oriented systems still around,
although it's not a big market segment.
Yeah, it's interesting what you're saying about SQL sort of always, via the marketing departments, consuming whatever it is at the time, be it object-oriented databases. The same happened with XML, and to some extent with graphs as well. I mean, there's an extension for SQL now, SQL/PGQ, and there is now the new graph query language standard, GQL, which might prevent graphs from being consumed completely by SQL. But yeah, it's interesting to see how that pattern seems to play out over and over again, to some extent.
Cool. Yeah. Of all these projects and things we've spoken about that you've worked on over the years, David, are there any that stand out as being particularly challenging or rewarding?
I'm kind of ticking through the various PhDs I've advised in my head. So one of the ones that was challenging was done with Veronika Megler, and I was involved in something called the Center for Coastal Margin Observation and Prediction.
So these were people who were basically concerned with the Columbia River estuary.
So Columbia River, for your listeners, is a big east-west mainly river.
Well, it starts up in Canada.
It drains most of the Pacific Northwest.
Okay. It gets quite wet in that part of the world, right?
Yeah. And so for things like salmon survivability, flood control, and so forth, they wanted to understand what was happening in this very complex system where, you know, the fresh water meets the ocean and tides. And so one part of that was various observation stations, some fixed, and they'd also get data from cruises. They actually had these underwater robots: some were kind of like torpedoes; other ones were called gliders, which basically just changed their buoyancy and sort of zigzagged up through the water column.
And then there was a big modeling effort where they were trying to build models that would predict, for example,
you know, there's a salt wedge that comes in with the incoming tide that tends to resuspend organic matter on the bottom because it's driving along the bottom under the freshwater.
There's a freshwater plume out into the ocean, and at the edge of that is where a lot of fish congregate.
I'd worked with these people for many years. And they were committed to open data.
So all the data they had, they wanted to get online as fast as possible.
But the problem was, there were at that point about 30,000 different data sets, and there's no one person who knew all of those. And so, you know, if somebody wanted to come in and use the data, how would you figure out which data sets they could use? And, you know, you could do things like text search, which was a largely solved problem at that point. So you could search on the name of the file or on column names, but that didn't really satisfy.
And so what we wanted is something where you could do approximate search on numeric data.
And so we came up with a system called Data Near Here.
So it's sort of like the idea is, you're on a map and you want to know, are there any service stations near here?
Or are there any restaurants nearby the highway that I'm on?
And so what we wanted mainly people to be able to do is, well, they could say what kind of data they wanted, like salinity or temperature, but then be able to give, you know, both a physical extent and a temporal
extent. Like, I'm interested in turbidity data near the Astoria Bridge from August 2018. And there may be no exact match to that. And so we worked on ways
that you could figure out, well, which data sets are there that are closest to the query?
And one of the problems we had was, you know, how do you trade off time for space? Is something
that's in the right area, but two months away, closer than something
that's at the right time, but two kilometers away. So how do you compare months to kilometers?
Because what we wanted was rank search. One of the problems is if you just did
hard ranges and a Boolean result,
then you had this problem that either, you know,
you had no data sets coming back or you had a thousand coming back.
And so it was really about rank search. And, and so that was,
that was challenging. How do you compare months to kilometers?
And what we ended up doing is sort of taking the person's query as a yardstick. So if they said, well, I'm looking at this one-kilometer-square area and for this period of two weeks, then we said, well, I guess, to them, two weeks is kind of their unit of thinking, and one kilometer. And so then we would use one kilometer equals two weeks to be able to rank these data sets.
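(A minimal sketch of that query-as-yardstick idea, reconstructed from the description above rather than from the production Data Near Here scoring function: the query's own spatial and temporal extents become the units, which makes kilometers and weeks comparable.)

```python
def yardstick_distance(query, ds):
    """Distance from query to dataset, in units of the query's extents:
    one query-width of space counts the same as one query-duration."""
    space_gap = abs(ds["km"] - query["km_center"]) / query["km_extent"]
    time_gap = abs(ds["day"] - query["t_center"]) / query["t_extent"]
    return space_gap + time_gap   # smaller = higher in the ranked results

# "I'm looking at this one-kilometer area over this two-week period."
query = {"km_center": 0.0, "km_extent": 1.0, "t_center": 0.0, "t_extent": 14.0}

datasets = {
    "right place, two months off": {"km": 0.0, "day": 60.0},
    "right time, two km away":     {"km": 2.0, "day": 0.0},
}
for name in sorted(datasets, key=lambda n: yardstick_distance(query, datasets[n])):
    print(name, round(yardstick_distance(query, datasets[name]), 2))
# -> "right time, two km away" (2.0) outranks "right place, two months off" (4.29)
```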
And we had a tool running in production for a while. But the whole enterprise... the person leading it, António Baptista, retired, and it was transferred to a group of Indian tribes here in Oregon to keep it running. But I don't know whether that part's been maintained.
Okay. That's a great name though, Data Near Here. And it's a fascinating thing to think about, how you actually compare kilometers against, I don't know, time. But I guess a lot of it depends on who's asking the question, right, as well?
Yeah, what the information need is.
I found that was sort of the challenge there. Another challenge was that you needed metadata about the data. And so we tried, as much as possible, to rely on metadata that we could harvest from the data itself. So, you know, simple things like maybe units, data ranges: things where we could go and harvest the metadata ourselves and have some assurance about its quality,
rather than relying on people
to come fill out some form later.
It's really hard to get people to be altruistic.
They've gotten the data,
they've used it for their purposes.
The value to them is to go off and do the next study
rather than sitting around documenting the data
so other people can exploit it.
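(A small sketch of that harvest-don't-ask approach: pull the metadata a search system needs, value ranges and row counts, straight out of the file, so its quality doesn't depend on anyone filling out a form later. The column names are made up for illustration.)

```python
import csv

def harvest_metadata(path, numeric_columns=("salinity", "temperature")):
    """Scan a CSV once and record per-column value ranges and a row count."""
    stats = {c: {"min": float("inf"), "max": float("-inf")} for c in numeric_columns}
    rows = 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            rows += 1
            for c in numeric_columns:
                try:
                    v = float(row[c])
                except (KeyError, ValueError):
                    continue           # missing or non-numeric cell: skip
                stats[c]["min"] = min(stats[c]["min"], v)
                stats[c]["max"] = max(stats[c]["max"], v)
    return {"path": path, "rows": rows, "ranges": stats}
```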
Totally agree, David. I mean, we can relate to the same sort of principle. We have the same thing: we'll deal with a support case at work, for example, from a customer, and we always try to be nice to your future self and your colleagues; do a little write-up, a bit of a post-mortem of what you found, right? That happens maybe one in a hundred times. And when it does and you go back to it, you look at it like, what the hell were the conclusions here? But yeah, it's hard to incentivize people to, like you say, be altruistic about it.
So another way to solve that, I mean, one is to generate the information yourself. I had another project, with graduate students Judy Cushing and Minakshi Rao. We were working with Pacific Northwest National Laboratory, up in Richland, Washington. And they had people who were doing computational chemistry, with a number of different computational chemistry codes.
And they were trying to make this accessible to bench chemists.
So you had to be sort of a computational chemist to understand how to configure these things. And so one thing they wanted was to capture information about runs of these codes, so that if somebody later on had a similar molecule to one that had been run, they could look to see, well, which code did you use, what parameter settings, what basis set for modeling the electron cloud. And, you know, we realized that trying to get people after the fact to go fill in these things about their computational run just wasn't going to cut it. So we turned the thing around and said, ah, what we should do is build a system that helps people set
up these runs. So we had this object model of molecules and so forth, and you could set up your
run. It would help you monitor the status of your run because sometimes these things are going awry
and you want to stop them. And so rather than capturing the information about the run afterwards,
we made it easy for them to plug it in ahead of time and then make the run. And I believe that
those ideas actually ended up in a tool they built up there called ECCE; I can't remember exactly what the name stands for. But it's this idea of carrot rather than stick.
If you'll write down this information
about what you're doing,
we'll make it easier for you to do that.
And so it doesn't have to be altruistic anymore.
It benefits you.
We can reapply this elsewhere.
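(A sketch of that carrot-rather-than-stick pattern, with made-up names rather than anything from the actual tool: the launcher people already want to use records the run's configuration as a side effect, so the provenance exists before the run does.)

```python
import json, subprocess, time
from pathlib import Path

def launch_run(code_path, molecule, basis_set, workdir):
    """Set up and start a computational-chemistry run, capturing its
    configuration as a byproduct of the setup the user wanted anyway."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    record = {"code": str(code_path), "molecule": molecule,
              "basis_set": basis_set,
              "started": time.strftime("%Y-%m-%dT%H:%M:%S")}
    # Metadata is written *before* the run launches: no altruism required.
    (workdir / "run_metadata.json").write_text(json.dumps(record, indent=2))
    return subprocess.Popen([str(code_path), "--molecule", molecule,
                             "--basis", basis_set], cwd=workdir)
```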
Yeah, you streamline it, right?
It makes it easier.
The onus isn't on that person to do it anymore.
You collect it naturally as a byproduct
and everyone's happy. Yeah, awesome.
Cool. So, David, the next set of questions I have for you are all about motivation. And the first one is: which people or papers have had the biggest impact on your career?
So actually, some of the things that have influenced me most
have been computer languages.
So I have SQL.
Obviously, I've talked about Datalog.
I've talked about Smalltalk.
And those languages have taken me into new areas.
In terms of getting me into databases,
you can probably trace it back to Catriel Beeri.
In terms of looking at the object-oriented ideas and later some of the works on array databases,
there was actually a colleague named Peter Buneman who was at Penn at the time.
He's at Edinburgh now or retired from Edinburgh.
But he described to me this work that one of his master's students had done where he had gone
around to shops that develop database applications and tried to see, well, where do most of the errors come from? And a lot of them came from this interface
between the database and the programming language.
And the problem is that the type system of the database
is not being carried over into the programming language.
So this is where I repurposed this term from electrical engineering: impedance mismatch. If you have a signal going down two wires of different impedances, part of it will reflect back at the junction. And so I looked at this and it's like, okay, we have this nice set model in the database, and all that structure is reflected back at the juncture: you've got records and iterators or something up at the top level.
And so that got me into kind of one of the motivations to look at,
okay, can we use the same language in the database
as you write your programs in?
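(A tiny concrete illustration of the mismatch, using Python's built-in sqlite3 module: the database holds a typed, set-oriented schema, but what crosses the boundary into the program is an iterator of bare tuples, and re-imposing the structure is the program's job.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, dept TEXT, salary REAL)")
conn.executemany("INSERT INTO emp VALUES (?, ?, ?)",
                 [("Ann", "db", 120.0), ("Bo", "ml", 110.0)])

# The set-of-typed-rows model is "reflected back at the juncture":
for row in conn.execute("SELECT name, salary FROM emp"):
    name, salary = row          # just a tuple; the schema's types are gone
    print(name, salary * 1.05)  # the program re-imposes the meaning itself
```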
And there were other efforts along these lines.
There were database programming languages that tried to put relations in as a type, like Pascal/R, adding the relation type.
There were people over, other people in Scotland, Malcolm Atkinson and others, who were doing persistent programming languages.
The idea of orthogonal persistence.
Any data structure in your programming
language could be persistent. So, you know, that little observation from Peter's student actually inspired a lot of work.
How did I get into stream processing? I had a sabbatical in Wisconsin, and we had this great proposal name I came up with, which was called A Petabyte in Your Pocket.
That's a great name.
And it was this idea of being able to give a person access to all this data.
You know, we talked about everything you ever read and every paper you looked at
and, you know, all the information on your finances and such.
If you were doing that just as an individual thing,
it would take a ton of disk drives to capture all that.
But a lot of that information is shared with others,
and it's out on the Internet. And so really what you want to be able to do is pull that data in.
And when you start looking at that and trying to run queries over remote sources like that,
the problem comes up that there's sort of pauses and delays and information is coming in at different rates.
It's not like doing a query on a local database where you control the disk, know when the data
is going to get there. And so we started looking at things where the data was arriving incrementally.
And we had thought we would work mainly with XML data because it looked at that point like that was going to be
what everybody was exchanging. Maybe not so much, but this idea of data coming in incrementally,
it's just a little step from that to saying, well, it's going to keep coming continuously,
and you want to keep computing on it and so then that presented a lot
of interesting problems. And at the same time there were other groups, at Stanford, and the Aurora group on the East Coast, looking at it. So yeah, it was several steps that got me into the stream processing world.
A Petabyte in Your Pocket: that's a great name. Cool. So I guess we've spoken about a lot of things that have been successful during your career today, David. But obviously research is non-linear, right? You have ups and downs. So my question is, how do you personally deal with that? How do you deal with setbacks and rejections?
Oh, poorly. I still have a hard time making myself read reviews when a proposal or paper is rejected.
Part of it, it gets a little easier.
Rejection is a little bit easier to handle because over the course of a longer career, you've got a batting average going.
And so one failure is not going to move the needle a lot. And, you know, I sort of deal
with a lot of this by what I call unjustifiable optimism. I, you know, I just imagine how, you
know, how things are going to be with really not a lot of basis, maybe in reality, but it helps keep me going. I think I adopted that. My first chairman
when I was at Stony Brook, a guy named Jack Heller, just had these ridiculously optimistic plans about
what was going to happen with the computer science department. You know, we were going to build this
new building and get us all these faculty. You know, just, you know, I looked at that and, you
know, said, there's just no way that's going to happen. But it turns out we got about a third of it. And, you know, we got a new building,
we got some new faculty positions. And so, you know, a third of a huge amount is still substantial.
And so I just like to, I don't know, imagine that things are going to go well in the future.
And that keeps me going. I guess the hardest part right now is that I'm often working with, you know,
students or junior faculty and this earlier in their careers.
And I think rejections are harder to deal with then.
And so, you know, just sort of supporting them and keeping them going.
I did get some advice from my advisor. He once told me that time spent writing proposals is seldom wasted: that even if a given proposal gets rejected, you can find another opportunity to plug it in and work with it. And that's largely been true. You write a proposal and it doesn't get funded, but you get, you know, an invitation to a keynote talk, so you get a place to present it. So yeah, I guess unjustifiable optimism helps.
I love that, David.
I'm going to adopt that as well.
That's brilliant. Yeah, I think also, like you say, later on you have a longer time horizon: you've developed a batting average, and one rejection doesn't necessarily move the needle. But earlier in your career it does. The first one you get, for example, is horrible, right? Like, I remember the first rejection I got, and I didn't deal with it very well, I don't think. But I think you maybe get better at dealing with it over time, trying to detach yourself from the rejection and not take it so personally. Anyway, cool. I guess related slightly to what you were saying there, about iterating on something when it does get rejected, and how you can always find a use for it somewhere else, or it may lead to a separate idea: there's this question around the creative process. How do you approach that, David? Do you have a systematic way of generating ideas and then selecting which things to work on?
I mean, you gave me that question in advance, and I thought about it for a while. And I think one of my techniques, or one of my capabilities, is seeing
patterns. So, you know, if you think about it, database systems don't do anything that you couldn't do without them. They try to do it more reliably, more efficiently in terms of, like, programmer time and computer resources, but you could go write the same stuff in a general-purpose programming language and do it. And so, you know, if you look at Codd,
when he proposed the relational model, well, he was looking at, you know, what do, you know,
business data processing programs look like inside. And he saw that, you know, there were
these common patterns. It was scanning through something, there was taking a subset of it, you know, either columns or rows, it was combining
it with another table. And he was able to say, well, you know, there's these half dozen or eight
operations that you can do that can explain a lot of what's going on. And so he saw a pattern there
and exploited it. And so I think I've, you know, also, you know, when I talk to
people, especially people who are applying computer technology, looking at what they're doing and
seeing a pattern there and seeing if you can exploit it. I mean, one example of that was
actually somebody else, Jim Gray, who observed this pattern: if you look at, you know, a COBOL program, and now you've adopted a relational database, well, the application program shrinks to about half the size.
So the database part has taken away a lot of the code and will do it. But what's left? Well, there's a lot of business logic and user interface stuff still to write. And so I thought, okay, well, there's this pattern of how people write stuff. If you look more closely, you'll often have this sort of intermediate layer where the gurus who understand this particular database have written, you know, the update and access subroutines to use in the application program, and the UI developers and such call those. And then, you know, you only access the database through these approved subroutines. And I'm thinking, ah, okay,
so you've got this logic and programs
sitting above the database.
Well, can't we pull that into the database?
And so I think other people do this.
As you see, you look at applications of databases
and you're saying, oh, well, a lot of people
are trying to use this for geospatial information.
So maybe we should have a GIS extension,
or people are trying to process text.
And so, you know, looking at what people are doing,
seeing patterns in what they're doing and saying,
okay, is that something that we can codify
and make into a common service
and get the benefits from that?
Yeah, no, that's a really nice way... I'll try and use that when I approach things, and try and see patterns. I mean, I guess humans, on some level, all we are is pattern recognition machines, right? On some fundamental level, that's how we learn.
Well, I've also used it in guiding graduate students. Like, you know, this thing about user interfaces:
I had a student working on building graphical interfaces for objects.
And what I kept doing is just saying, okay, you did it for this,
you did it for that, with the hope of that once they did it,
like the third or fourth time, they'd start seeing a pattern
and then be able to abstract back from that
and build something easier.
And I was really surprised with this student,
Belinda Buonafe.
She had showed me an interface
for describing the user interfaces we wanted.
And I'd asked her to make a certain change
and she knocked on my office door about an hour later, and it was implemented. And so it turned out she was using her own tool to build her own interface, so she could update it declaratively rather than actually recoding a bunch. So giving a graduate student similar problems over and over until, out of self-defense, they develop an abstraction and a way to do it: I find this effective.
Yeah, that's awesome.
Good, good. Obviously, at the very top of the show I mentioned that you've collaborated, and we've spoken about it quite a lot throughout the course of the podcast, a lot with industry. And one of the missions of this podcast is to help further bridge the gap between research and industry. So my question is, what do you think about the current interaction between academia and industry, and how would we make it better? Or is it already perfect? Yeah, what's your take on things?
So it's hard for me to speak
broadly about academia; maybe somewhat about computer science and industry. In databases, you know, there's pretty good connections. You get both the industrial research people and
developers and the academics coming to the same conferences. And people move back and forth,
people build prototypes and then make companies out of them, they send their students to work.
And so there's, I think, good crossing back and forth. I've also seen, it may not be
routine, but there are people I encounter in industry who do actually go read papers.
You know, I talked about Molham Aref going back and reading the literature on declarative query processing or deductive query processing. You know, this person,
Todd Porter that I work with or talk to from Meta, you know, he's always, you know,
talking about a paper or, or a blog post
that he just read about something. So, you know, there are people who pay attention to the papers.
So I think that's good. I mean, one thing that kind of messes things up is when you get, I mean, I've been around long enough to see like these peaks in demand, you know,
of companies starting and then, you know, sort of going down. And so there was like one early on
around PCs and departmental computers where all these little software companies were starting.
You didn't get your software just from IBM or Burroughs anymore.
And then there was another one with the dot-com or whatever.
And so what ends up happening is that that sucks a lot of people out of academia, at least temporarily.
And so you feel a little left behind with that. And I mean, some schools are getting better
at being able to let somebody, you know,
stay in some role as a faculty member
while still being involved in a company outside.
But, you know, enrollments in computer science
are often counter-cyclical.
You know, when the industry is not doing well,
people decide, oh, it's a good time to come back
and update my credentials or something.
And then when things are doing really well,
you know, you're competing for faculty
with industry and so forth,
and it's a little hard.
Yeah, it's funny you should mention the cycles, because I was just recently reading the book about Amazon, about the story of it and everything. And there was a section on when AWS started to be founded, and they were mentioning a few of the names, and a lot of the people came from, I think it was, the University of Wisconsin-Madison, and especially universities in the northwest part of the States as well. I was like, that's some serious brain drain on the universities, right? I mean, if you don't replenish it, eventually it's not going to be there.
Cool. I guess just one last question now, David, and that is the future and current trends. From what you observe at the moment, what do you see as the most exciting avenues for future research?
Well, it's hard for me to tell what will excite others, but things that intrigue me, one of them is, you know, what's important in the next generation of stream processing.
And this is another thing where I've talked with my friend Todd at Meta, and we sort of proposed that what it might be is scaling the management of systems. So, you know, originally people talked about single stream queries, and then being able to parallelize those across multiple processors. So people figured out ways to do parallel evaluation
of single queries on very large data streams.
But now, if you look at what's behind Facebook or whatever, you've got, you know, maybe thousands of stream jobs, each with maybe a hundred or a thousand parallel tasks. You can't manage things at the individual data item or process level; there's just too many of them for someone to pay attention to. And so we're sort of saying, well, maybe the direction is that you have to have some coarser groupings over this fine-grained stuff to help you manage.
And one of the things we've been talking about is what we call job chopping. So think of a job that's taking in, say, user activity events and processing those, maybe getting some of it ready to, you know, figure out what ads to show them, and other things going off to the ML people who are trying to train various recommenders or whatever.
And so the idea is just, well, in theory, it's a perpetual job, but what if we just
break it up into one hour segments and run it for an hour and then shut it down at the end of the
hour. And just before that, we'll start up the next instance of it. And it turns out that that makes some things like fault recovery easier. And it's also easier
to do auto scaling and migration if you have these built-in boundaries.
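(A toy rendition of job chopping, assuming a hypothetical broker interface with head_offset() and read() calls; this is a sketch of the idea as described, not Meta's system. Each bounded segment reads a slice of the message stream, and the successor is started a little early at a known cut offset, which is the built-in boundary.)

```python
import threading, time

SEGMENT_SECONDS = 3600    # run each joblet for about an hour
OVERLAP_SECONDS = 120     # warm up the successor two minutes early

def run_segment(start_offset, broker, process):
    """One bounded instance of a 'perpetual' stream job."""
    deadline = time.time() + SEGMENT_SECONDS
    offset, successor, cut = start_offset, None, None
    while cut is None or offset < cut:
        if successor is None and time.time() > deadline - OVERLAP_SECONDS:
            cut = broker.head_offset()   # hand-off point in the stream
            successor = threading.Thread(
                target=run_segment, args=(cut, broker, process))
            successor.start()            # next joblet starts catching up
        batch, offset = broker.read(offset, up_to=cut)
        process(batch)
    # Drained up to the cut: this instance simply exits at its boundary,
    # which is what makes fault recovery, auto-scaling, and migration easy.
```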
So looking at ways of, we had a little presentation that, oh, there's a Northwest Database Society meeting about once a year.
And we had a presentation, we called it Block and Tackle.
So, you know, breaking these things up into blocks and managing them.
I've been accused of pun-driven research, like coming up with a great title and having to write a paper that matches it.
Another thing I think is interesting to me is what I call data productivity.
So there's just data being collected for all sorts of purposes.
Some of my interactions there have been with a traffic archive called Portal at Portland State, where they bring in things like the ramp meters, the loop detectors, bus position information,
stuff that's used operationally, but that if you collect it, you can see patterns or do research. And so this idea of
the value you get out of data over the cost of collecting and maintaining it. And I'm
particularly interested on focusing on the numerator of that ratio.
So how do we improve the value you receive from data?
I mean, there's a lot of work on how do we collect data better and store it more cheaply.
And so, you know, it's hard to put a number on the value of the uses of data,
but I'm pretty sure it correlates strongly with the number of uses of the data.
The more it's used, the better. And there's a lot of what I call 'one and done' data, collected for a purpose and then, you know, never looked at again, or even 'none and done' data, that is collected and never analyzed. And so I'm trying to think about things that make data reuse easier and thereby boost data productivity.
And so one of the things I mentioned briefly at the beginning is alignment.
So if you're trying to take existing data sources that are both time series,
but the times don't match, you know, how are you going to adjust them
if you want to do some kind of analysis or simple plot or whatever?
Even a scatter plot, you have to have your two data sets at the same time points.
And so working on declarative ways of saying, how do I align these two data sets?
It's a little corner of this data productivity thing.
But what are the things we can do to make data more reusable
and thus get more value out of it?
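(A small example of the alignment problem using pandas: two series sampled at different times can't even be scatter-plotted until they share time points. Interpolating both onto a common grid is one simple alignment policy of the many a declarative approach might let you choose.)

```python
import pandas as pd

salinity = pd.Series([31.2, 30.8, 31.5], index=pd.to_datetime(
    ["2018-08-01 00:07", "2018-08-01 01:02", "2018-08-01 02:11"]))
turbidity = pd.Series([4.1, 3.9, 4.6], index=pd.to_datetime(
    ["2018-08-01 00:30", "2018-08-01 01:45", "2018-08-01 02:20"]))

def align_to(series, grid):
    """Time-interpolate a series onto a shared grid of time points."""
    union = series.reindex(series.index.union(grid))
    return union.interpolate(method="time").reindex(grid)

grid = pd.date_range("2018-08-01 01:00", periods=2, freq="h")
aligned = pd.DataFrame({"salinity": align_to(salinity, grid),
                        "turbidity": align_to(turbidity, grid)})
print(aligned)   # same time points for both: now a scatter plot makes sense
```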
Yeah, that's fascinating. Just to jump back to the job chopping really quick there, David. I like that the idea of having these boundaries makes things like managing fault tolerance easier. But how do you avoid downtime of the system? Do you have the next instance on a hot switch-over? Because that's the thing that was jumping out at me: you bring the boundary down at one point, so how do you switch?
Without getting too much into the details of how the scheme would work: in the particular situation that suggested this, they have an underlying message broker system. It's kind of like Kafka, but it's an internal one.
And so that's where the data is coming into the stream queries.
So you, and it has its own state.
So you can start another instance of the job,
you know, at a point in that message stream
while this one finishes up.
So, you know, when it gets time, close to time, you know, maybe two minutes before the
hour or something, you can start this next one up.
The other thing that Todd noticed is often, well, okay, these things are continually processing
the data.
They want to keep up with it, but they're only emitting outputs like every five minutes
or every 15 minutes.
So you don't necessarily
have to have the job, these
joblets running continuously.
If you get them started up
and they can catch up with the data
before their next reporting point,
then, you know, you're good.
You're good, right? Yeah, things are cool.
I mean, there's, you know,
there's a lot of details to figure out,
but it seems to, you know,
have some advantages that make it worth pursuing.
Yeah, awesome.
Cool.
Well, I think that's a wrap then, David.
We can finish up there.
It's been a fascinating conversation.
I've loved it.
And I'm sure the listeners will as well.
So thank you very much for taking the time
to speak with me today.
Yeah, my pleasure.