Drill to Detail - Drill to Detail Ep.48 'Mondrian OLAP, Apache Calcite and Database Dis-Aggregation' With Special Guest Julian Hyde

Episode Date: January 23, 2018

- Oracle Designer page on Oracle.com
- Bitmap Index page on Wikipedia
- Mondrian project page on Github
- Mondrian OLAP Server page on Wikipedia
- MultiDimensional eXpressions (MDX) page on Wikipedia
- Julian Hyde blog
- Apache Calcite project homepage
- Apache Calcite Introduction and Overview deck
- Streaming SQL presentation at Apex Big Data World 2017, Mountain View, California

Transcript
Starting point is 00:00:00 So welcome back to a new year with the Drill to Detail podcast where we talk about what's interesting in the world of big data and analytics and the people making it happen. So my name is Mark Rittman and I'm pleased to be joined in this episode by Julian Hyde, a name you probably know through his work on the Mondrian Project, and more recently through his involvement with the Apache Calcite Project. Julian, it's great to have yet another Brit who's escaped across the pond to the west coast on the show. So how are you, and what's the weather like over there? And it's great to meet you. Thank you, Mark. Thanks for having me on the podcast.
Starting point is 00:00:42 I'm great. The weather's a bit British today, actually — kind of gloomy and foggy. But that's the exception; it's usually pretty nice over here. So, Julian, just tell us a bit about how you got into the IT industry, how you started off in the UK, and maybe your first job and the work you did at Oracle. Yeah, so I had a computer science degree from Cambridge in England, and immediately went to work on software development at Oracle UK, working in Chertsey on Oracle's CASE tools. So I was the author of the Default Database Designer, which basically took an entity model and generated a database schema from it. I worked on other things such as the C++ generator, and I was involved in some ways with the Oracle Repository project. So that was the first four years of my career. A very happy time, actually. And I still draw, you know, Richard Barker-style entity relationship diagrams when I have a difficult design problem and I'm trying to figure out the data structure. I learned to think about data models in that time. So what's interesting is — I think you must be roughly the same age as me — I think you started your career in this area building the tools that I was using. I actually used to use the CASE tools, and I think it was Designer 2000 back in that sort of time. I was working at the Woolwich, and I worked at Toyota as well, and it's interesting to hear one of the people behind it speak about it. I mean, I suppose with CASE tools and modeling tools: what were they trying to achieve, and did they ever reach their kind of maximum use, do you think? I think there's a cultural difference between the use of CASE tools in Europe and the US. I think CASE tools were driven to a certain extent
Starting point is 00:02:45 by the needs of corporate governance to have a paper trail, a well-documented process. Standards like SSADM would drive the use of the case tools. And the goal of it was to produce, not only produce code, but also to produce documentation along the way. And, you know, given how limited source control systems were at the time, it was good to have a fully documented project as you were developing it. I said there was a cultural difference. So it was kind of interesting that American software engineers I was interacting with were really not interested in following
Starting point is 00:03:32 these kind of processes. But if you were doing, let's say, a database-centric project where the end result was going to be a database schema and perhaps some forms, then a case tool was a very useful tool to help you along the way with that project. Okay, okay. So you worked on that for a while, and then you went actually to Oracle HQ to work on text retrieval for the Oracle database. I mean, tell us about that, and I suppose what got your interest in the database area in the first place, really?
Starting point is 00:04:02 Yeah, I was always inspired by databases, you know, going back to the database course at university, and then being involved in data models in CASE. Text retrieval, by the way — the first iteration of text retrieval was developed in the UK, just downstairs in Chertsey. And then they decided to re-engineer it and rebuild it using a team in the US. They hired Mark Kramer, who had been at, I forget, one of the text companies, and started building it with a US team.
Starting point is 00:04:38 But they recruited some of the UK team to work on it and to have continuity. And so I knew people that were in the Oracle Church that were moving over there, and they invited me to move over as well. And of course, I accepted because it's, you know, to a Brit, being in California is like being on vacation, you know, 365 days a year. So it was wonderful to move over. Interestingly, so we were doing some very interesting stuff. And around this time, Sybase released Sybase IQ, which was kind of the first, well, maybe not the first, but it was a a database with bitmap indexes
Starting point is 00:05:27 and it was creating a lot of splash in the press it was producing some very interesting benchmark results and i think it caught response to that and the response was to add our bitmap indexes oracle 7.3 so it turns out a text index uh unlike a regular database index, you have one entry in the index and it points to one row. A text index, you have a word like red and it points to, it has a list index is going to have a compressed list of row IDs for the rows that it points to. So in the internals, a bitmap index is very similar to a text index. So even though the text indexes were being developed outside of the server technologies division, outside of the kernels, they turned to my team to develop these bitmap indexes, and we had to basically deliver it,
Starting point is 00:06:31 turn it around from a blank sheet of paper to product in nine months, which is pretty rapid for Oracle. Yeah, yeah. That's interesting. I mean, I remember SybaseIQ and I think Sand at the same time and so on. Were they column store databases as well? I don't remember. I mean, I remember, I remember Sybase IQ and I think sand at the same time and so on. They were, they were kind of, were they column store databases as well?
Starting point is 00:06:47 I don't remember. I think it was. I think, yes, yes. You might be right. You might be right. But bitmap indexes were definitely a big part of the.
Starting point is 00:06:56 Yeah. Yeah. I remember, I remember them coming in. I remember them coming in. And I think at the time of the, the tricky bit was that you couldn't update them in the same way that the performance wise and so on,
Starting point is 00:07:04 but certainly I suppose for query performance, they were fantastic. And so you were involved in that. And that obviously is a product area that's gone on over sort of several versions since then. I mean, what led you to, I mean, how far did you go with that? And what led you then to move on, I suppose, from the Oracle world to maybe sort of embrace more open source? Yeah, it was a couple of years before I got involved in open source, but what was the next step? It was ironic in that I was, you know, at that point I was in the Oracle kernel team,
Starting point is 00:07:38 which I, you know, dreamt of being in. But we saw this opportunity that, you know, this team at Oracle saw this opportunity to produce a data warehouse. Well, let's say a data mart, which was a Windows NT had just been released. And so, you know, rather than having to spend $20,000 on or $50,000 on a machine to be able to do data warehousing. Now you could realistically spend $2,000, $3,000 on a Windows machine. Windows NT had a much better scheduler than Windows, and therefore you could do multiprocessing and so forth. It had a real operating system with memory protection and so forth so we set out to build a sequel DBMS that was optimized for analytics on Windows NT and that company
Starting point is 00:08:33 was called broad base and we founded it in 1995 and it went public in 2000. But it was totally fun building a database from the ground up in C++. I guess that must have got you a taste for, I suppose, product and building your own thing. And I suppose kind of starting a company from scratch, especially in that area. But I suppose building a database engine for Windows is always kind of risky, obviously, with SQL Server and that sort of thing. I mean, how did you find it in terms of the commercial side of it and, I suppose, kind of getting a user base and making it successful? Yeah, we were successful at first, but then we found, I mean, you're exactly right regarding SQL Server.
Starting point is 00:09:22 Microsoft started aggressively adding features to SQL Server. And it was becoming very, very difficult to sell head to head against SQL Server, even though we were, you know, maybe five times faster. Microsoft was pretty much giving it away for free if you bought an enterprise edition of Windows. So we realized we needed to climb up the value stack. So rather than selling a database, we were selling an ETL pipeline, an ETL tool, an analytic framework so you could create dashboards and so forth,
Starting point is 00:10:03 and ultimately analytic applications. So this coincided with the birth of the web. People were running servers, generating web logs. They wanted to understand their interactions with their customers. And so on top of the SQL database inside Broadbase, we developed this suite of analytic applications for people to develop to analyze web traffic. Okay. Okay. Okay. What was your involvement with the actual kind of, I suppose, the engineering behind that product? I mean, did you get involved in the, I suppose, the database engine at all or anything, that kind of area? Yeah, I was mainly working on the database engine. So the initial area I was involved was building some of the data structures,
Starting point is 00:10:49 so the bitmap indexes, the column store, and also building some of the high-performance algorithms. So I'm very, very proud of an algorithm I wrote to do hash join, which it was a hybrid hash join implementation where it would partition both sets of data and use Bloom filters to figure out which rows could possibly, could not possibly match rows on the other side, repartition the data to disk until it was small enough
Starting point is 00:11:19 to be able to do a join. So I was doing those kinds of tasks kind of deep in the heart of the database server. Somebody else wrote the initial version of the optimizer, but it very quickly became spaghetti code. And I read the Volcano paper by Gertz Graf and Bill McKenna about creating an optimizer that was cost-based but involved applications of transformation rules. And the idea was that you could register these transformation rules that would match particular patterns of relational algebra and transform the relational algebra to different relational algebra statements and so i started applying that within broadbase as an attempt to
Starting point is 00:12:12 just rationalize this optimizer that had gotten out of control okay i mean how did i mean i remember from those days that when i was when i switched from the rule-based optimizer in oracle to the cost base one the hardest thing was getting predictable performance, you know, and actually, I suppose, really kind of hard coding in there. Or certainly, you know, you knew the path it wanted to take, but it would take different ones because of the nature of a cost-based optimizer. I mean, how did you sort of deal with that and get performance out of it? And what were the kind of things that you might have brought to that, really,
Starting point is 00:12:40 beyond that kind of volcano paper? Yeah, I think that was more, as we were a startup, predictability was less of a concern than simply maintainability of the code. So our customers were in general happy that we produced good plans, and they didn't have huge existing applications running that they would say,
Starting point is 00:13:08 this one query out of 100 is running slower. So we didn't have to face that problem, luckily. But we were, just like a typical startup, we developed a lot of code and we were getting slower and slower and slower because we were maintaining the code. So this optimizer with pluggable rules allowed us to basically just do better software engineering in inside our company and rationalize what we're producing because we wanted to be able
Starting point is 00:13:37 to add new features like um summary tables uh week week we we had an op a a DDL statement called create cube, which basically meant create a structure within which the system would automatically create summary tables, which were a form of materialized view. We wanted to be able to add those kinds of features without breaking the optimizer so that was our main goal um okay so that must have been the i mean you you mentioned cube there and and
Starting point is 00:14:14 some of the techniques and um i suppose some of the sort of things you're talking about there obviously what it would have led on to what you then did with mondrian and the kind of you know the olap world there i mean what led to yourAP world there. I mean, what led to your interest in OLAP and what led to your, I suppose, kind of focus on that more towards the end, really? It was driven by when we first released Broadbase, we imagined that end users, business users, would write SQL.
Starting point is 00:14:43 And this just turned out to be, I think to some extent these days, business users or analysts or data scientists will write SQL. But back in those days, we were deceiving ourselves. End users didn't want to write SQL. And, you know, the most convenient interface for a business user is a multidimensional slice and dice interface where you see a dimensional view of the data, you know, dimensions and measures. You can, you know, double click on a member of a dimension and then see the results. So we realized that our analytic applications had to have that interface. And since we were running on Windows, Microsoft pretty much, you know, free in every Windows installation was Microsoft, it was then called Microsoft SQL Server OLAP services, what became analysis services. It was rebranded as analysis services a couple of years later. And it was originally an Israeli company. I forget the name of it, but it's Panorama, right? And it was,
Starting point is 00:16:00 I believe, written in a portable way. It was probably run on Unix when it was originally written. And it certainly didn't mind which database it was running against. And one of the kind of bizarre, surprising things was that when Microsoft included it with SQL Server, it still didn't mind which database it was running against. And it ran perfectly happily against Broadbase. So we used Microsoft MSOLAP, I guess we called it then, as the kind of multidimensional engine. And then we built analytic applications using ActiveX controls and JSP pages. And the JSP page, sorry not it was ASP pages and back in those days
Starting point is 00:16:46 before Sun and Microsoft had a falling out Microsoft's Java implementation could speak to calm objects and so we we wrote this application the ASP pages were talking to our Java objects and the Java objects were talking to my mso app via com so um yeah it was java was part of the microsoft stack back then um and it it all it all worked pretty nicely yeah i remember i remember everything you're talking about that i worked with at the time and um i mean we had um i had um donald farmer on the uh podcast a few months ago obviously obviously you know from from being one of the pms on on that product i mean that analysis services brought along a whole bunch of things that have just become i suppose just kind of part of the landscape now mdx and
Starting point is 00:17:35 things like that i mean what what was your view on i suppose the mdx language and and generally how they've kind of written it and put it together i mean i know obviously it came from panorama but what was your view on on i suppose how groundbreaking it was or different really it was very groundbreaking so the context back then there were there were two things about i mean you remember the multi-dimensional database manifesto um uh yeah there was there was all this excitement about multi-dimensional databases and they were portrayed as this new thing that were going to, if you believed the hype at the time, they were going to defeat the relational database. And what I guess Panorama brought to it is this, Owen, of course, if you had a multidimensional database, you had to put all the data into the multidimensional database. You know, you couldn't store it relationally.
Starting point is 00:18:29 And MSOLAP ran in three modes, the multidimensional mode, rollout mode, and hybrid mode. In other words, they were saying, yeah, sure, you can do multidimensional data analysis on data in your relational database. It may be, for certain operations, it might be more efficient if you put it into this multidimensional format. In other words, if you convert it into arrays or matrices inside, but it's still just data. So that was one thing, was getting rid of this idea of the multidimensional database as kind of the end point, or the hub of everything. So it fitted
Starting point is 00:19:15 more within a conventional SQL-based data warehouse. The other thing was they actually had a textual query language. And if I recall, all of the other multidimensional databases, you would build those queries programmatically. Or in fact, you'd probably just have to use their user interface, their dedicated tool to build these slice and dice queries. So the idea of MDX, MDX isn't perfect. I think one of the mistakes they made was to make it look too much like SQL. So people imagined that if they knew SQL, they didn't have to learn MDX. But simply having a textual language, something that you could, you know, check into a source
Starting point is 00:19:56 control system and share with your colleagues by pasting it, and it made it much more tractable than the API-based OLAP that people like S-Base were doing at the time. I think there's probably other features in MDX that were paralleled by other multidimensional databases at the time, but the general way that you you define for example calculated members and and so the integration with visual basic for uh the built-in functions was excellent um so i mean it was it was it was revolutionary i mean it was interesting i used to work on oracle space sorry oracle express at the time and it completely it completely it completely kind of you know revolutionized that
Starting point is 00:20:45 market because it was you said mdx was there it was virtually free um and uh it did well but then but then i noticed the product coming out which was mondrian which is obviously when i first heard about you and what what led to i mean obviously you had you had you had your kind of company broad base and then you moved on from there and so on it sold and ipo'd i think which is fantastic but but then how did you get involved in Mondrian and what was the motivation behind that I can see the obvious technology links there and so on but what led to that really so the um there was a business motivation which was that broadbase merged with Karnak Communications, which is a company that was doing ECRM and so forth. And that company had not decided to be Windows only.
Starting point is 00:21:34 They ran on Windows and Solaris, probably HBUX as well. And so we could no longer assume that analysis services was going to be there in the box. So we were told that we needed to make our server running. Linux was not around at that point. But we were told that we needed to make our server run on operating systems other than Windows. And so we considered moving to another multidimensional database, but we realized that would have been a complete rewrite. And I realized, you know, having used analysis services for a while, I understood the MDX language.
Starting point is 00:22:19 We actually had an MDX parser because we had the early versions of analysis services didn't do any kind of security. So we would actually pause the Mdx insert access controls like to make sure that you know, Mark is only allowed to access customers in Europe, that kind of thing. So we would instrument the Mdx and then send it pass it on to Microsoft we had that parser already written in Java and I said to myself well how difficult would it be to just write a model file in XML
Starting point is 00:22:58 that defines the dimensions and measures and how they map onto the fact table and then just make uh just make this thing run in java with no storage whatsoever so it would be completely passed through and i knocked this up in a weekend and presented it to it was think it was august 2001 presented it to my boss on monday morning said, look, here's a start. I know I can't finish this, but if you will let me open source this stuff, I think in a year or possibly six months,
Starting point is 00:23:37 we'll have something that we can ship our products on. So they gave me permission to open source the stuff that we already had and spin off this open source project. And that's what became Mondrian. So it was done with the support of my management in, well, it was then Kana Software. But, you know, I parted company from them, you know, probably like nine months later but mondrian by that point had developed enough momentum that it carried on okay okay so just for anybody here that doesn't know what mondrian is i mean just basically explain what it is and i
Starting point is 00:24:18 suppose what was the kind of essence of of hat work you mentioned there about it was passed through it was kind of generating sequence on you know what was it and how did it differ perhaps from say analysis services in molap mode yeah um so i guess architecturally it's most similar to analysis in rollout mode so what it is i i refer to it as an open source OLAP engine. So, and it's, I think it's fair to say it's by far the most successful open source OLAP engine. And it implements the MDX language, so we would speak to Mondrian, either from Java or ASP, in exactly the same way. As far as our ASP pages were concerned, speaking to Mondrian was exactly the same as speaking to analysis services. You'd send an MDX statement and you'd get a multidimensional cell set back.
Starting point is 00:25:27 And the difference is that there was nothing to install. It didn't have anything on disk. So all you needed to supply were a JAR file containing Mondrian, a JDBC connect string to the database, the relational database where your data was, and the URL of a XML file that contained your model. With those three things, it was sufficient to be able to say, you know, select from sales to do an MDX query on a cube.
Starting point is 00:26:04 Mondrian would then be able to figure out, you know, populate its dimension cache in memory, know where to get the cell values, the measure values from, and it would submit several SQL queries to the relational database to get all of the necessary information to return the results of that multi-dimensional query. So because it didn't have anything on disk, you didn't have to, you know, build a cube or process a cube or anything like that. It was extremely convenient to embed in an application because it would just start up. And of course, the first couple of
Starting point is 00:26:42 queries were slow because it was populating the cache. after that it it you know ran very nice and fast so so when i first heard about mondrian and when i first looked at it i mean my my um i suppose my um i was anticipating it would be slow and uh awful to use because oracle oracle went through a phase of rewriting all their ola all their ola frontends in java and their engines in in bits of it in java but also they moved some of it to roll up as well and they were slow the the you know java would be the kind of you know the worst thing to build things in them but you but mondrian was fast and what was it how did you how did you solve the problem of making it fast how did you first of all i suppose question is why did you choose java
Starting point is 00:27:23 over say say c++ and secondly how did you make it run fast really so um one of the principles um was to make let the database do what the database was good at um you know, if you need to sum up 10 million rows, the database is very well optimized for that. So all of those kinds of aggregations, if you can possibly do it in the database, do it in the database, which means those 10 million rows stay in the database. And then all you get back from the database is one row that is the is the total so monium would then cache that row or cache that cell value so it doesn't have to go back to the database again but really the java has not had to do very much it hasn't had to do much of the number crunching because because the database has done it and and an awful lot of multi-dimensional queries are actually not very computation intensive.
Starting point is 00:28:27 People are using OLAP because they simply want a slice and dice interface to their data warehouse. And so the SQL that's generated is relatively straightforward. So Java is not imposing any kind of a penalty there. The other very strong reason to use Java is it's a very forgiving language. You know, your system isn't going to crash if you make a mistake, which basically is important for an open source project
Starting point is 00:28:54 because you basically want to be able to accept contributions without worrying too much about whether they're going to crash the system. So it allowed us to move faster and be more open to contributions than if we'd... It's also easier language for people to learn and to program in. So it increased the size of our audience. And by the way, when newer languages such as Scala come along, the rationale to stay on Java is kind of similar. Even though Scala is an excellent language,
Starting point is 00:29:30 still the number of people who can program in Scala is smaller than the number of people that can program in Java. So Java's aging these days, but still there's a lot of open source projects still written in Java because it represents a large audience of developers. Okay, so the thing that, from what I could see, you seemed to get right was converting, I suppose, multidimensional queries to relational, you know, going from MDX to relational. That's not easy, and I've seen various attempts that other vendors have done. Again, how did you sort of tackle that problem?
Starting point is 00:30:02 And I suppose, in a way, you know, what was your approach to trying to take one type of query there, make it into SQL, which obviously has paid off later on because you've been doing that all the way through the other projects you've been doing as well? Yeah, so it is difficult, in fact, probably impossible to translate an MDX statement into one SQL statement unless you're using like Oracle extensions to SQL, that kind of stuff.
Starting point is 00:30:29 So in essence, a typical Mondrian query would be three SQL statements, one for the columns axis, one for the rows axis, and one for the cells. And if you decompose it that way, then it's relatively straightforward. And then we have our computation engine. If there's calculated members, we have a computation engine which functions.
Starting point is 00:30:56 Internally, the MDX language is kind of like this virtual spreadsheet where you can reference a cell might be Q2 cells of red widgets, and another cell might be 1997 cells of green widgets. And so when you're referencing one, one of those cells might be a value which is stored in the database or an aggregator of columns stored in the database, or it might be another formula. So we have this computation engine that evaluates these cell formulas.
Starting point is 00:31:31 And if it's an expression, then it will just recursively, using an interpreter, it will recursively evaluate this expression until it comes down to a base cell value. And if you're lucky, most of the time in Mondrian, this won't have to generate a SQL statement to get that cell value, you know, Q2 sales of red widgets. It will find that in Mondrian's cache. And so I jokingly say that
Starting point is 00:32:01 Mondrian only does two things, one of which is it converts relational data into multidimensional data. So it transfers it into this dimensional view. And the other thing it does is caching. And in the case of OLAP, caching is, in my view, essential. Because of the nature of the application. You've got a sequence of queries, perhaps over a 30-minute period. Someone is analyzing a particular set of data. And they don't mind if the first query of the day is slow,
Starting point is 00:32:36 as long as the second and third and fourth queries are fast. And they tend to be looking at the same, not executing the same query, but accessing the same underlying data in those sequence of queries. And so that's, you know, caching makes that kind of application very, it will ask for give me all of the sales of all widgets aggregated by quarter for all years between 1990 and 2005, something like that. And this comes back as this kind of rectangular array of data that is stored in Java's memory. And once it has that, all cells that are from that area will be answered from cache. If you happen to have a cache miss, if you're asking for a cell like, you know, sales in
Starting point is 00:33:37 1989, then it will store up that cache miss and it'll lie. It will basically say the sales in 1989 were zero. So it will make that entire pass over kind of trying to calculate the query. It will complete the pass, it will evaluate all of the MDX expressions, but it will produce a result which is junk because at least one of the values in it was a lie. But what's happened at the end of that is it's generated this list of cache misses. So 1989, it didn't know. 1987, it didn't know.
Starting point is 00:34:16 So now it will take those cache misses, gather them together in a kind of a rectangle, and then submit another SQL query to get to resolve those cache misses. And hopefully when that data has come back, it will be able to make a pass over the data without, you know, generating any cache misses and without telling any lies to itself. And the query will complete. Generally speaking, you know, in one or two passes, um will have everything it needs in the cache and so the computation will terminate so so the other thing that i noticed about um wandering at the
Starting point is 00:34:50 time was was the amount of other projects that were kind of in its ecosystem and you had that jpivot was there and what was the what was the what was the kind of role of was it jp jpivot that was uh the front end for what you were doing yes Yes, it was. So we realized that in order to use it in applications, we needed a front end. We didn't want to build the front end into Mondrian because I wanted to disaggregate the stack. I was just kind of tired of multidimensional databases that came with only one. They could only use one front end.
Starting point is 00:35:27 And the promise of MDX was that, you know, because it's a textual query language, this separation was possible, right? The kind of client-server separation was possible. And so I was contacted very early on by a guy called Andreas Voss, who was an architect at a Frankfurt, Germany company called Tonbella. They were building analytic applications, I think, for financial companies in Germany. And he proposed to build JPivot, which was a JSP application generating pivot tables. So a slice and dice, you know, a tabular interface that you could drill down and slice and dice. Also some charts embedded into it.
Starting point is 00:36:26 He actually flew over to California and we spent the weekend kind of thrashing this out. And we came to this agreement that I would support him building Jpivot and he would support me and Mondrian. So we would make the two products work together. It wasn't going to be an exclusive relationship. So, you know, Jpivot would be able to run on other MDX backends and Mondrian would support other frontends.
Starting point is 00:36:57 But we, you know, agreed to prioritize if he had bugs in Mondrian, I would prioritize fixing them. And it worked extremely well. So these were parallel projects and people very rapidly. I think we actually embedded JPivot inside the Mondrian distribution for a very long time. So it was a jpivot.war file that you could download from the Mondrian site and basically get going very very fast um
Starting point is 00:37:26 this single web archive was sufficient to run a you know you drop it inside tomcat and you could you could you could have an all app server running in in very little time so where did where did pentaho come into it then i mean because there's a again a wider set of tools it was part of uh you know you find you can find mondrian i think it's part of the Pentaho suite now. I mean, what was the story there, really? So Pentaho is a company based in Orlando, Florida, it was a bunch of guys who had done a couple of previous BI startups and been successful and then they noticed that open source business intelligence was you know it was a thing it was happening and so they they decided to create an open source bi company and they realized that that there were a number of they would call them pillars or a number of technologies that constituted
Starting point is 00:38:26 a BI suite and they set about bringing those existing projects which were kind of best-of-breed open-source BI tools bringing those under the umbrella of So Kettle was and is an open source ETL tool. JFreeReport was a reporting tool. Also the Waker project from New Zealand, which is doing machine learning data mining. So they brought all of these things in, and the lead developers of the projects became either Pentaho employees or in my case a kind of
Starting point is 00:39:12 ongoing consultant. And then built I realized that Mondrian was useful but it was not a big enough footprint to be a company in itself. So it suited my purposes to have it as part of this broader suite and also with things like support available for it. So, yeah, my involvement with Pentaho ended about two or three years ago. I mean, Mondrian is, you you know the main emphasis the main focus of Pentaho which is now Hitachi Ventura the main focus of that is on ETL and you know data movement and
Starting point is 00:39:58 so forth and less on user facing analytics I mean they're still part of the product but it's not the main driver of the company anymore. I wrote a book on Mondrian called Mondrian in Action and left Mondrian in a pretty good state. I would still love to get involved kind of reboot Mondrian based on Calcite and
Starting point is 00:40:32 some of the things I've been doing more recently I mean in fact we'll probably discuss this in a few minutes but Calcite was motivated in large part by architectural challenges that I could see in Mondrian. And so in a lot of ways, Calcite is the platform that I wanted in order to build a better version of Mondrian. And with the resurgence of interest in OLAP, projects such as Apache Kylan and Druid are evidence that there is a continuing interest in open source OLAP. to those projects to be able to do things like complex calculations
Starting point is 00:41:25 and dimensional access control, that kind of thing. So I think there would be an interest in it. I've just been mainly focused on working on CalSight and just haven't had the time, the cycles, to invest in Mondrian recently. Yeah. And it was actually Cal was it was a it was cal site that i suppose in a way reminded me of the fact i knew of you and what you've been doing with with kind of mondrian i came across cal site with um i was working with druid and of
Starting point is 00:41:55 course it's a kind of sequel front end to druid it was the imply i think the imply um company i was working with and doing some stuff looking at that um and saw CalSight and I saw you behind it and I could see the themes in there I could see why you'd be involved in that and I could see that you're sort of like your I suppose your mark on that really I mean tell us what CalSight is and how you got involved in it and I suppose just kind of maybe just recap on what those links are back to to what you're doing at Mondrian. Yeah, so I'll briefly describe what Calcite is, and then I'll describe why it came out of my work on Mondrian. So Calcite is, I describe it as a framework for building databases. You can think of it as kind of a toolkit for building databases. What it is,
Starting point is 00:42:47 it has a SQL interface. It has a query planner, but it doesn't actually own any of its own data. It's own algorithms for executing queries. So you can think of it as a virtual database, perhaps. And so you can use it. It's used in Apache Hive, for example, as Hive's cost-based optimizer. So in that case, Hive has its own SQL parser, but it translates those SQL queries to relational algebra. And then Calcite's optimizer applies statistics and transformation rules and converts it to optimized relational algebra, which is then converted into Hive's execution model. Let me just trot out a couple of other examples.
Starting point is 00:44:14 So Apache Drill is similar, but Drill actually uses Calcite SQL parser. And then you've got projects such as, yeah, you mentioned Druid. So the Druid thing started off as a SQL adapter inside Calcite. So you could do a SQL query in Calcite that would then translate it down to Druid's multidimensional data structures. So you're doing SQL on this multidimensional engine. And then the Druid folks took that, you know, with my blessing and, you know, enthusiastic support, they took that Calcite and they pushed it into Druid. So now you've got a kind of SQL out of the box for Druid.
Starting point is 00:45:00 So that's three examples. There's many other examples of how Calcite is being used in various data management projects. The idea of Calcite was to kind of building a relational database is like climbing a big mountain. I've climbed several times personally. And I kind of know that mountaineering only gets interesting when you get up to the snow line. So the idea of Calcite is to get you up to the snow line, build the query plan or the SQL parser and so forth, so you can start adding your own unique value of your particular storage data structures,
Starting point is 00:45:39 your algorithms, or something like that. Okay, okay. So I've heard you talk about you know you know disaggregating the database and that kind of statement there i mean that's a very common theme in i suppose in hadoop and a lot of these distributed kind of engines where you know the data is separate to the query engine and and uh that way you can apply different kind of query engines to it and so on i mean that's a very common theme really, isn't it now within the world we work within? Yes, yes. Hadoop was all about that. I mean, it was also about Hadoop was a
Starting point is 00:46:11 massively distributed system solving a problem that previously a database is a relatively tightly integrated traditional Oracle database. SMP database is a relatively tightly integrated. Traditional Oracle database, you know, SMP database is a relatively tightly integrated piece of software, whereas Hadoop is a, you know, massively parallel system. But yeah, the other thing was about Hadoop, as you say, was it disaggregated the database. So it's not like no longer would you have, you know, a piece of software from a single vendor that provided where you had to start off with an empty box. You had to pump the data in, store it in that vendor's format, access it through the parser provided by that vendor and using all of the algorithms of Hadoop, you can store it wherever you like in whatever data format you like.
Starting point is 00:47:06 You can write code to access it or you can use a SQL parser. And you can also embed your own algorithms in it. So Hadoop is an environment into which you can basically build your own database. But tools like Calcite kind of make it faster for you to build a database. It gives you the tools that you can, it's like a kit that you can then put together the pieces, such as if you want to use Parquet data and if you want to use, I don't know,
Starting point is 00:47:50 a particular engine to execute queries, CalSight is the framework that you can use to put all those pieces together. So I suppose in a way it's quite a similar kind of objective as Wes McKinney had with the Pati Aro project in that you're providing, I suppose, parts of the toolkit that other projects can use to save reinventing it and to get some kind of, I suppose, collaboration and, yeah, I suppose in a way to put best of breed things into these open source projects and, I suppose, in a way kind of use your knowledge you've built up over the years really to try and benefit everybody. Yes, absolutely.
Starting point is 00:48:23 Yeah, I know Wes and we tend to be very simpatico in the way we see the world. I think both of us are on a kind of personal crusade to make things better. Because creating these kind of these hubs that, you know, a common place that allows, you know, multiple applications to come together in one point as a kind of standard. It requires a huge amount of investment to create this.
Starting point is 00:48:56 And this hub only becomes valuable when you have five or six or ten or twenty different applications all speaking to your common data format or your common relational algebra. So getting this to be exothermic so that it's actually producing more value than you're putting into it takes a lot of investment and WES I know has that kind of individual just passion to just go ahead and put the work in even even before it's it's producing any any you know palpable results so so what about um i mean we've been talking about you know writing sequel i suppose it runs in batch and so on i I mean, is there any kind of thoughts about extending this
Starting point is 00:49:46 or extending the work you're doing to, say, streaming SQL? I mean, that's the kind of big new thing at the moment. Yeah, it was a big new thing in 2004 when I co-founded SQL Stream. I mean, the big unopening thing here is relational algebra. So just going back to Arrow briefly, the unifying thing in Arrow is this idea of a memory-efficient format for data. And in the case of Calcite, the central principle is this idea of relational algebra.
Starting point is 00:50:31 So representing your queries algebraically in a way that can be manipulated and optimized. So that's the central thing inside Calcite. And the relational algebra is not just for tape flat tables on it it turns out that streaming applications are doing something which is can be represented in relational algebra. It turns out that document databases and OLAP databases can also, the queries can be represented in relational algebra. And when you're writing all of these kind of systems,
Starting point is 00:51:18 every single one of those systems, the idea of pushing down a filter to a data source makes sense. Pushing down projections to a database, to a source makes sense. Pushing down aggregations, so you're, you know, you're combining rows into subtotals as early in the process as possible makes sense because just using less memory and less network and so forth so um yeah it it makes a lot of sense to apply the relational uh the the the engineering techniques you use to build a relational database it makes sense to apply those techniques to a streaming system. Yeah, I guess similar things come along with kind of, was it Confluent?
Starting point is 00:52:12 That's it, Confluent and KSQL. That's a very similar kind of approach, isn't it? Yeah. Well, possibly, maybe not to you, but certainly it's kind of using SQL and set-based transformations in kind of streaming data. No, I would agree. I mean, there are other... From within the Apache ecosystem, I've been promoting this idea of streaming SQL
Starting point is 00:52:38 and trying to build some consensus about what it should look like. I firmly believe that some future version of the SQL standard, you know, SQL 2020 or something, will include streaming extensions. But for now, what will drive that standard is people, you know, the innovative data technology these days tends to come out of open source
Starting point is 00:53:09 so I've been working with the various streaming projects such as Apache Flink Apache Storm Apache Apex and you know also speaking to Spark and Kafka
Starting point is 00:53:29 to get some kind of consensus about what SQL should look like when it's run to build streaming applications. And there is a consensus emerging. So I'm doing this kind of as part of, you know, as part of Calcite. But the goal is really not to produce. We are producing a reference implementation of streaming SQL. And so there's, you know, extensions to the parser and so forth. And you can execute some streaming queries inside Calcite. But my bigger goal is to try and build consensus,
Starting point is 00:54:08 which basically involves producing examples and doing talks about it and just putting up straw men. So what do you think about this query? What should it produce? And creating this discussion. So there are difficult cases, like what happens when you join a stream to a table? Let's suppose you've got a stream of orders
Starting point is 00:54:31 and you're joining it to a price list table, and the contents of the price list table are changing from minute to minute. And then what happens if I replay that streaming query 10 hours later, and the price of that item that you ordered that was joined to 10 hours ago? Does this query need to produce different results because the contents of the price list has changed since then? Or do we need to introduce some kind of temporal database semantics so that when it when i rerun that query 10 hours
Starting point is 00:55:05 later it can automatically figure out the price what the price list table or what that item in the price list table looked like at the time that order was placed so which we're we're trying to work through those those kind of uh semantic issues okay so so i mean so just to kind of round off really i mean you're you're at you're at Hortonworks now, aren't you? And what is your role? Do you work with Hortonworks or for them? I mean, certainly you're out there now, but what's your kind of role there
Starting point is 00:55:33 and what are you doing there that is linked with Calicow site? Okay. So Hortonworks is one of the main Hadoop vendors. And historically, Hortonworks has been very strongly open source. Just about all of their projects are Apache projects. And we have a huge number of, you know, just about all of our engineers contribute actively to Apache projects day in, day out. And I'm an architect at Hortonworks.
Starting point is 00:56:15 I think there are, you know, five or six architects. So I have responsibility to oversee the whole product set to just make sure that it works together. So oversight is one of my responsibilities there. But also, you know, the reason Hortelmercs approached me was they wanted a cost-based optimizer inside Hive. And they said very clearly, we don't want you to work on on hive we want you to produce calcite we see the benefit of calcite being an independent framework so it was actually called optic back at that time but we renamed it because of clashes with existing products that were out there and so they asked me to establish CalSite as an Apache project that was used inside Hive and many other places, including, by the way, Apache Drill, which is produced by MapR, which is one of Hortonworks' competitors. And this being open source, they don't have a problem with that. I mean, an Apache drill developer will need a particular feature in drill, a particular query optimization, for example, and will contribute that back as an Apache committer. And then that will benefit the next release of Hive. So the Apache
Starting point is 00:57:45 Software Foundation provides this excellent medium for people that work for competing companies to collaborate without conflicts of interest being a continual worry.
Starting point is 00:58:02 So I'm making CalSight successful and that benefits hive and you know we've been able to achieve some uh stunning improvements in hive because of the improved query plans that we're we're generating yeah so so just to kind of round off then i mean how would people find out about cal site how would they find out about um uh what you're doing there and maybe um contribute tools that or just even just download things and play around with it how would they find it and use it really okay you can go to our website cal site dot apache.org um it being a apache project um the the community is is is centered around the developer list. So we just welcome people coming on the developer,
Starting point is 00:58:51 even if you're not a developer, come on the list and ask a question. Of course, you can download it and try it out and kick the tires. And make contributions. If you want to try something, then make a contribution to the project. I speak, I don't have any talks scheduled right now,
Starting point is 00:59:14 but I speak fairly frequently at conferences. I'm also active on Twitter. So I'm Julian Hyde on Twitter. CalSite also has a Twitter handle, Apache CalSite. So you can follow that to get news about CalSite releases and talks and so forth. Okay, and I guess people will come across CalSite most through using it through things like Druid and so on as well. So as well as it being a sort of project on its own, the key thing is it's actually kind of enabling those projects
Starting point is 00:59:45 to be kind of SQL compatible and that sort of thing. Yeah, it's true that CalSight does not have a large number of direct customers. The main audience for CalSight, frankly, is people building data management systems. So you may be using CalSight today without realizing it. Yeah, I've been using it yeah um and and you know my my dream is to allow you know federated systems so you know uh heterogeneous systems where perhaps some of
Starting point is 01:00:15 your data is living in druid and some of your data is living live um well you know that example is already is already happening um and perhaps in the case of Mondrian, some of the data is living in HDFS and some of the data is living in a data grid. So those kind of heterogeneous scenarios are being made possible by Calcite in a way that the kind of the traditional you know single monolithic database would would never be able to achieve because um you know the box it's a closed box and you can't add your new algorithms and so forth to it well that's fantastic i mean it's been great speaking to julian i mean what's what's fascinating is how you've been working on many many things in the background that i've been using as a developer and end user and so on and uh the stuff you've been doing in terms of contribution to the to the open source community is fantastic and uh it's been fantastic speaking to you really so
Starting point is 01:01:13 yeah thank you very much for coming on the show and what i do is i'll put uh notes in the show notes with links to the various projects you've worked on and so on particularly cal site now and um it's been great to have you on here and thank you for your time well thank you for giving me the opportunity mark so i i hope you uh continue to use my stuff and i hope there's some more interesting stuff coming down the pipe that you'll get to use in the future okay cheers and thanks bye Bye. Thank you.
