Drill to Detail - Drill to Detail Ep.2. 'Future of SQL on Hadoop', With Special Guest Dan McClary

Starting point is 00:00:00 Hello and welcome to the second episode of Drill to Detail, a new podcast series hosted by me, Mark Rittman, where I'll be talking to a special guest each week about some of the issues, thoughts and ideas behind the news and what's happening in the big data, analytics and data warehousing industry. Okay, so this episode's guest is Dan McClary, someone I've known for a little while now, back from my Oracle days really, or working with Oracle. Dan was the PM for Big Data SQL, which most of you will have heard of as being Oracle's take, I suppose, on SQL on Hadoop. And Dan recently actually joined Google. So actually, Dan, introduce yourself first of all, and just tell everyone who you are and what you've been doing. Yeah, hi. So thank you, Mark, for having me on. So my route to Google has been a little

Starting point is 00:01:02 twisted. I started off as a researcher many years ago, went and played the startup game for a little bit. And then, as you mentioned, was at Oracle for a number of years working on distributed SQL problems and SQL on Hadoop. And yes, now I've joined Google and look after things related to the SQL language as well as to things like block storage business. So largely, I'm part of the product management team for Google's cloud platform. And I should say right at the beginning that any opinions expressed here are my own and do not reflect those of either Oracle or Google or, you know, any startups that I might have played with in the past. Excellent. So thanks, Dan. And yeah, it's great to have you on here. And really,

Starting point is 00:01:45 I suppose, interested in your take on the SQL on Hadoop market and just generally, I suppose, really, how the kind of big vendors and how the small vendors really are kind of doing in this space. And we'll come on later to, I suppose, some of the things coming out of Google and Yahoo, and I suppose where you see the market and where you see things going in terms of, I suppose, Hadoop for enterprise customers and for my area, which is kind of BI, analytics and so on, really. So, Dan, let's start off then really with something that I suppose most people would know you for, which is the kind of area of SQL on Hadoop.

Starting point is 00:02:18 So, as I said earlier on, you were the PM for Oracle Big Data SQL, which was obviously Oracle's take on this. So Dan, do you want to tell us first of all what you were doing with Big Data SQL and what was the problem it was trying to solve? Let's talk about that first of all, then, really. Sure. I mean, I think, you know, broadly speaking, SQL on Hadoop has become almost an overloaded term at this point, because depending on who the implementer is, who the vendor is, the motivation technically may be somewhat different. I think the common motivation across open source products, small vendors and large vendors is that the economics of how we do large-scale data analysis have shifted or are shifting.

Starting point is 00:03:10 And to some extent, this means that many of our large-scale warehousing systems, many of our analytical systems are moving to more distributed, not necessarily strictly distributed, but more distributed constructs. And I think when we think about SQL on Hadoop as sort of a broader category, what you're seeing is different kinds of initiatives to take advantage of the fact that distributed file systems have become reasonable to operate and cost efficient to maintain. And so then with Oracle, what we were effectively looking at was there are many customers who run large Oracle data warehouses. The cost of turning down such a system can have tremendous impact to a business and could be quite challenging. And then is there a sane and rational way to take advantage of

Starting point is 00:04:08 the growing economic benefits of distributed file systems while still maintaining a declarative language interface for querying all of your data? And so from that perspective, I think the notion is you run Oracle SQL on a data warehouse, now extend it with the Hadoop Distributed File System and be able to harness more power from that distributed environment with reasonable economics and lower risk. With, you know, something like Cloudera or Hortonworks, I think you're very often looking at trying to enable users who have decided that they want to take further steps to sort of distance themselves from whatever their traditional infrastructure is. And that may be, in fact, buying into an entirely new query engine. Thus, we see sort of the rise of things like Impala, things like Hive, and ultimately

Starting point is 00:05:05 you know, both the Tez project and then Phoenix, which they now support at Hortonworks. And then I think for the open source community in general, there is this notion that distributed systems for scalable storage are to some extent solved. If we're simply talking about storing bytes in something that behaves like a file system, it is maybe not painless, but it is somewhat solved. And if that's the case, then what are the tools that need to be built in order to do real query processing and real declarative language-driven analysis of data in those environments? And thus, we see the rise of things like Hive, the rise of things like Drill, potentially. So I think the motivation broadly is the economics of data management are shifting, thus the SQL language and its ability to act on data at scale also needs to shift. Yeah, exactly.

Starting point is 00:05:57 I mean, I think kind of broadly you could argue that, I suppose, with the big vendors supporting SQL on Hadoop, it was a classic kind of, I won't say embrace and extend, but certainly it was always very obvious that, with Big Data SQL, for example, the output of that was always going to be Oracle. You know, you could select against Hadoop, but then the data always came out via Oracle. So you were still kind of locked in, I suppose, to that. But if you were looking to do, I suppose, query offloading,

Starting point is 00:06:24 data warehouse offloading and so on, it was a very good way of doing that. And so certainly for customers that were heavily invested in either Oracle or SQL Server or whatever, it's perfect in that respect, isn't it? But I think the more kind of organic SQL on Hadoop engines that are purely SQL on Hadoop, you know, kind of Hive and Impala and Drill and so on, it's a bit different, really, isn't it? And that's where I think certainly there's been different kind of branches of innovation there, really, haven't there? Yeah.

Starting point is 00:06:51 And in fact, I think when I look at the sort of newer approaches to performing, let's just call it SQL on Hadoop, you look at something like Impala, for example, or you look at Drill, or you look at the sort of parent of many of these, which was the Dremel project at Google, a paper which was written a number of years ago, and that Google actually exposes to the world as a system called BigQuery. One of the things I think is really fundamentally interesting is all of the available research suggests that at a certain scale, distributed query processing requires a shift from more normalized data models, sort of 3NF and beyond, to something that actually includes nested fields and repeated fields. And so, you know, we see this with sort of JSON fields, we see this with nested and repeated fields within, you know, systems like Impala that can have child tables. And one of the really interesting lessons that sort of emerged both from the research and then also from the open source community is when you want to talk about doing really, really broadly distributed SQL processing, you actually have to start thinking about what can we nest,

Starting point is 00:08:07 what can we repeat, simply because the processing will eventually become too challenging if you try and broadcast all of those joins. Yes, yes. I mean, interesting. And so, I mean, one of the things I noticed was that obviously each vendor, IBM, Microsoft, and so on, had a take on this. And it struck me, obviously my background is in Oracle, but it struck me that Big Data SQL was probably a fair bit ahead of what IBM were doing and so on and so forth. I mean, again, I'm conscious of things you can and can't say, but how did you feel, I suppose, the different vendors,

Starting point is 00:08:41 the mega vendors, in terms of how they did things? Were they all broadly the same solution? Or did the vendors take different approaches and any kind of preferences or ideas on those at all? I mean, I think, you know, the commonality among the mega vendors that we see is that it is more important to sort of preserve continued integrity of business process. And then ultimately it is most important to the vendor to maintain the processing logic, to maintain the statement itself, not necessarily the underlying storage or some of the execution underneath it.

Starting point is 00:09:19 I like to think that at least with what we did at Oracle, I think the ambition was understanding that there are open APIs that are becoming real standards, right? I think you can look at the HDFS APIs and say, you know what, these are largely solid. We see them supported not only across Hadoop distributions, but also by cloud vendors. For example, you can use HDFS APIs to talk to Amazon's S3 or to Google's cloud storage. And if that is the case, then any reasonable extension of a mega vendor SQL system needs to respect and take proper advantage of those APIs. And I think that gave Oracle an advantage

Starting point is 00:10:13 in so much as it allows that solution to take good advantage of the innovation that is happening in that community as those APIs become standard and as the underlying functionality in the open source community develops. I think the approaches that Microsoft and IBM have taken are also very interesting. I think they are much more suited to perhaps what the bulk of their customers wish to see, which is simply I'm able to treat that as a reasonable source of byte storage

Starting point is 00:10:52 that is lower cost, and I'm not hugely concerned about integration with a rapidly evolving field. Yeah, interesting. And so I suppose, again, respecting your position before, how much take-up do you see in the market in general, I suppose, really, within traditional data warehouse customers? I tend to see that there's, you know, an understanding of this idea of doing what we do now kind of cheaper, but something I see less of is people using these SQL on Hadoop engines in a kind of innovative way, so using it against, say, Mongo, using it against nested sources. What was your feeling about the take-up and the degree of innovation you were seeing with these engines by these customers?

Starting point is 00:11:52 To be honest, at least, you know, I think the two broad things I noticed and still tend to think, and maybe let's extend it to three broad things. I think the first thing that I noticed is that it's early days for all of this, right, and, you know, many, many organizations are still very much in a tire-kicking phase, which maybe doesn't lead to wild experimentation or innovation. And I think organizations are trying to understand whether or not they are going to consume things like SQL on Hadoop strictly from their existing vendors of preference, or whether or not they're going to break with tradition. There was a Gartner webcast maybe about a year ago in which

Starting point is 00:12:30 they sort of ran a survey with the folks on the phone and sort of asked like, well, where are you likely to get your SQL on Hadoop? Like from an open-source project, from, you know, Cloudera or Hortonworks or from your database vendor. And it was really, it was very split. It was very split between, you know what, we're going to go with whoever our vendor is today, or we're going to try one of these open source projects. I think there's a lot of sorting out to be done. I think the other thing that maybe slows some of the adoption here is, I do think that SQL on Hadoop is actually, as a movement, while it has led to sort of a fascinating amount of sort of open source innovation, I think as a market, it is actually

Starting point is 00:13:12 strongly in competition with the broader shift to cloud. Insomuch as, if you talk about, well, I need to lower my infrastructure costs, I need to get access to querying more data, you can also look to some of the growing cloud databases that exist, cloud data warehouses that exist, or even the sort of managed Hadoop things that you see from Elastic MapReduce, Google Dataproc, things like that. So I think there's an underlying race in terms of where will the infrastructure settle that is maybe undercutting some of the adoption of SQL on Hadoop at large scale.

Starting point is 00:14:03 I agree. And I think that certainly, I mean, there's, yes, so cloud is one thing. And obviously, Oracle, for example, have, you know, Big Data SQL in the cloud as an option there as well. But if you're looking to store a lot of stuff cheaply at a big scale and so on, then there are other vendors out there, I mean, SnowflakeDB is an example out there and so on, where, yeah, really, in fact, whether it's Hadoop or not Hadoop is really irrelevant.

Starting point is 00:14:24 It's just an abstracted kind of elastic store of data. And then certainly, I guess the other thing really is, when you're looking at, say, vendor SQL on Hadoop solutions compared to, say, open source, cost is not insignificant, really. And we'll get on later to talk about, I suppose, how well you think the big vendors will do in this kind of area, but cost is an issue as well, and it's kind of in a way counterintuitive sometimes to pay a lot of money for these vendor solutions when normally people expect this to be kind of free, for open source. It's true. I mean, I think one thing that is a real bright line between what you get in the open source community and what you get from any vendor is that any sufficiently mature vendor iteration will likely have more

Starting point is 00:15:08 capabilities around security, around governance, around metadata management. And I think for larger organizations, that's going to matter quite a lot. Again, business continuity, regulatory compliance, these are real issues for a number of organizations. And it may be such that if you are a traditional RDBMS vendor, and you've already taken care of everything around encryption and session isolation, and you are HIPAA compliant, and you are SOX compliant, and so on and so forth, that while you may not have the most novel extension to query processing, you may have such an entrenched advantage in compliance that you will naturally pick up some amount of customers. So, I suppose, a tangent to this really. I totally agree with what you're saying there. I

Starting point is 00:16:00 think, you know, we sometimes forget how important this stuff is to real customers. One issue I had, I suppose, when Oracle Big Data SQL first came out was almost a philosophical one. There was a good blog post a while ago by a guy called Jeff Needham, who talked about how Hadoop is not just a cheap enterprise data warehouse and, you know, running SQL on Hadoop sometimes is missing the point. And I get the point that there are certain tasks that suit kind of set-based processing. But I suppose, you know, in a way, how much do you feel that in a way the kind of energy

Starting point is 00:16:33 and the kind of movement around SQL on Hadoop is almost missing the point of what Hadoop is about? I mean, do you think it's a valid point or do you think actually that all things will converge on that in the end? I mean, I suppose I do and I don't, which is probably not a great answer. I think that it's easy to look at SQL on Hadoop or using big data systems for data warehousing as not taking full advantage of the advance in technology with respect to how we deal with scale. And, you know, we can certainly look at some of the more interesting architectures that have come around, things like lambda architectures, things like kappa architectures,

Starting point is 00:17:15 and say like, oh, there's so much more you could be doing. However, ultimately, I suppose maybe the best way to think about this is I'm currently working on a blog post and I need to make some figures for it. And I'm analyzing a bunch of data and I could make any number of sort of wonderful, interesting charts. But ultimately, when I sort of went through to try and tell the most effective story with the data, I discovered that most of what I wanted was bar charts. Because bar charts got the message across. And a lot of what we want to do with data comes down to data warehousing kinds of workloads,

Starting point is 00:17:48 you know, SQL kinds of workloads. And maybe I'm just making bar charts. Yeah, exactly. Exactly. So just before we get on to the next bit, I mean, a product I've seen in this area that I've been surprised at how much I've been impressed by is Drill from the Apache sort of project. Any sort of thoughts on Drill,

Starting point is 00:18:05 or I suppose some of the engines that are less traditional in how they do things? I suppose, again, a bit of context for this is one of the things that I've noticed about using SQL on Hadoop is a lot of it is just doing the same thing, but in a cheaper or more scalable way than you used to do with, say, Oracle. You define columns, you define metadata.

Starting point is 00:18:23 Drill seems quite different. Any thoughts on that at all? Yeah, I've been watching Drill with some interest over the last several years. I think Drill's original ambition was to effectively be something like an open-source Dremel.

Starting point is 00:18:37 I think it's moved to a really interesting space in which it is really pushing the bounds of what we think of as SQL. It's certainly sort of far deviant from what we think of as an ANSI 2011 SQL. And I think that is interesting. And I think it is, the other thing I sort of compare it to is some of the things that we've seen in Spark 2.0, where we're starting to see the sort of typed semi-declarative language constructs around data sets. And to me, what it speaks to, and again, this is just me sort of

Starting point is 00:19:14 thinking out loud about it, is that it speaks to a real renewed interest in the power of declarative language. And I think that's really, really compelling. I don't know if Drill will be the thing, but I think what we will see is that increasingly we will see greater expressiveness and flexibility in declarative languages that are perhaps SQL-like. Yeah, I think it's a powerful concept, and I think it's interesting that people are remembering, like, oh yeah, there are many other, you know, types of expressions that I would like to declare and still get the power of having something that optimizes

Starting point is 00:19:56 and executes on my behalf. Exactly, exactly. I mean, for me, you know, SQL is one form of engineering part of your data in Hadoop, really. And certainly having that there is useful. I think the innovation that I'm seeing in things like Drill and also some of the things we can do around, say, query federation with, say, Spark SQL, some of the stuff coming out, some of the vendors around data fabric and so on as well. I mean, it's kind of an interesting sort of area.

Starting point is 00:20:18 I mean, and that kind of in a way leads on to probably the next thing I want to talk to you about. Now, Dan, obviously you were at Oracle, and that's how I know you and how most people probably listening to this know you. But you moved on to Google now, presumably because there were interesting things going on there. In general, one of the things that you start to sort of notice about the whole Hadoop kind of, and big data, area is that everything that we see now was invented 10 years ago at kind of Google, at Yahoo and so

Starting point is 00:20:42 on there. You know, what are you seeing out there at the moment? What are some of the sort of the trends and ideas that you're seeing happening? It's probably kind of, you know, in those areas that we might hit on in the future, really. Well, I mean, I think two things I would say. One is, you know, yes, it's absolutely right to sort of look at sort of the history of Google papers over the last, you know, decade or so and sort of say, hey, look, you know, this is really the sort of origination point of a lot of the ideas that end up in the broader big data ecosystem. I think one of the things I'm noticing is that the time lag between the sort of research publications we produce at Google and their emergence as entities

Starting point is 00:21:27 in the open source ecosystem is shortening, which is really, really interesting. And we're trying to play a bigger role in that as well. And I think when I look at what we're publishing on and what we're helping to expose more broadly in the community, there are two or three things that really stand out to me. One, which I think probably doesn't need a lot of introduction,

Starting point is 00:21:53 is the machine learning work we've done around TensorFlow. The amount of interest the community has had around TensorFlow has been really, really tremendous. And the fact that it's all being done in the open as an open source project, I think is going to... So do you want to just explain what actually is TensorFlow? I mean, I know, but for the audience, what is TensorFlow? And why is that so significant and interesting now?

Starting point is 00:22:17 So TensorFlow is, at a high level, a framework for doing large scale deep learning using Python and precompiled C code. It's exceptionally flexible, exceptionally powerful, and it's a tool set we use to solve a lot of problems at Google. Now, by open sourcing it, we've brought to the larger community not only the ability to sort of quickly define very powerful deep learning models, but at the same time also the infrastructure necessary to run those things at scale, in a distributed fashion, on anyone's cloud or in your data center. That allows you then to build these models at scale and very rapidly, as well as to begin to introspect them and understand

Starting point is 00:23:10 where model performance is varying and how you might build better models. And that's, I think if deep learning is going to become something that becomes part of an analyst or data scientist sort of standard toolkit, it's projects like TensorFlow and the things that the community is sort of building around it that will really help push it, really push it into the hands of more and more, you know, analysts and data scientists and even beyond.

Starting point is 00:23:37 Okay, okay. So it's interesting you say about machine learning and so on. So, I mean, I'm speaking at ODTUG Kscope next week and I'm doing a session on using machine learning on wearables data. So I've been gathering all my data

Starting point is 00:23:51 from my Fitbit and from the bike and from the house and all that kind of stuff, and then bringing it into one place and then applying, you know, Python-based machine learning on it. But one of the things that keeps striking me is when you don't know what you're doing and when you don't get some of the kind of concepts around, you know, data having

Starting point is 00:24:08 to be a measure for every kind of, like, for every kind of row and so on, and the different kind of, I suppose, algorithms and so on, one, you're really lost, and secondly, you're potentially into the kind of realms of being quite dangerous. Do you think machine learning will ever be democratized? Do you think it will ever be something where anyone can do that, or is it always going to be a scientific thing, really? I mean, going back to TensorFlow and so on there, is it going to become mass? I suspect it will, but I suspect it will be consumed in different ways. I think you can look at some of what we do at Google around exposing machine learning to end users as a way in which it might become consumerized, insomuch as you

Starting point is 00:24:47 can, you know, TensorFlow is open source, you can use it, you can build your own models. But at the same time, if you say, I don't really have the time or sufficient understanding of how to build an image classification model, Google then offers up its own Vision API by which you don't have to worry about how you construct your model, how the network should be formed. You can simply say this is my training set, these are the images I send, or simply, I have an image, tell me what's in it. And so I think to some extent there will be the people who want to craft their own models and there will be people who simply want to say, you know, I have data, give me the bar chart, right, the effect of the bar

Starting point is 00:25:30 chart for images or text or speech. And I think machine learning will become democratized in different ways. Yeah, okay. We'll come back to that in a moment, actually. But I guess you joined Google from Oracle, and we all know Oracle is kind of fantastic and so on there. But you mentioned early on about, in a way, kind of SQL, the question will become less about the engine and more about things like the cloud and so on. I mean, do you see, I suppose, initiatives going on in Google and other places really where, you know, I suppose, in a way, will cloud become more of this? And will big data and machine learning be more in the cloud? And, I suppose, how do you see the areas that Google kind of work in as impacting on how this is going in the future, particularly areas you're

Starting point is 00:26:14 working in. I think the shift to cloud will actually become a more pronounced advantage for organizations over the next several years. And the reason I say this is just, you know, the experience I've had sort of looking at how, let's take SQL on Hadoop as a good example of this, at how the benefits of the technologies are truly enhanced by scale, such that if you set up a little pseudo-distributed Hadoop cluster and you run, say, Impala or you run Hive or something like that, you'll get reasonable performance, and then maybe you'll move up to a five-machine cluster, and you'll get much more reasonable performance. And maybe you move up to a full rack of servers

Starting point is 00:27:05 and you get much more reasonable performance. But what we see economically with cloud deployments is that you can have at your instantaneous disposal thousands of cores, tens of thousands of cores potentially and tremendous sorts of throughput of networks and disks. And I think what will ultimately be a huge

Starting point is 00:27:26 advantage for consumers of data or consumers of data analysis is the ability to say, because the density of infrastructure is increasingly concentrated in large cloud providers, I can achieve the real benefits of economy of scale on my queries because I am using vast resources for very short periods of time. Interesting. I mean, yeah. So I don't know if you noticed, there was a couple of interesting blog posts that were published recently that come to this sort of area.

Starting point is 00:28:01 So there was a post by a guy called Marco Arment, who's an Apple blogger, and he posted this article kind of saying that if anything brings down Apple, or certainly leads to the eclipse of Apple, it will be its lack of investment in machine learning. So the background to that really, I suppose, is the fact that Apple refuses, for various reasons, to, you know, in a way kind of capture lots of your personal data and then work on it centrally. It wants to use apps, and that's an area that Microsoft and Google have been investing in a lot, really. And there was also an article, I think it was Steven Levy, posted yesterday about how Google is now becoming effectively a machine learning first company. I mean, do you think, I mean,

Starting point is 00:28:39 the investment they're making, are you seeing that this is very strategic to them? And do you think that there's kind of, I don't know if you saw those blog posts, but do you think there's kind of like, you know, a point there really? Yeah, I saw Steven Levy's blog post and I think, you know, certainly the stance we take as an organization is machine learning is incredibly important to what Google does and increasingly part of more and more of the products that we bring to market. And I think two things to note. One, we and Facebook and many other large organizations have a tremendous amount of data

Starting point is 00:29:13 that we can leverage to user advantage. And that could be everything from query planning to figuring out what's in an image to recommending what restaurant you should eat at. I think the real challenge, and a thing that is always, always top of mind at Google, is that this must be done in a way in which no one's privacy is actually compromised. And I think this is a really interesting challenge that consumers of potential technologies should consider, and also maybe more particularly data scientists and analysts who are exploring machine learning, exploring processing at scale need to keep in mind, that, you know, at some point you are imbued with your users' trust.

Starting point is 00:30:07 And it should mean that you may process every byte of data that you have very, very efficiently and to great effect, but you should never be able to put it in a situation in which it might be compromised. And ultimately, when you look at places like Google, you should not be able to see it. I agree. And I think going back to your original kind of, you know, your job, I knew you from Big Data SQL at Oracle. Certainly, my experience has been that a lot of projects, a lot of big data projects I've worked on, when the customer gets to that point, when they suddenly realize the amount of data that's under their custodianship and the responsibility they have, that is where I found that Oracle solution with the ability to kind of

Starting point is 00:30:45 apply security over it was important. But I think generally perceptions are really important, and at the moment I think there's a general kind of benign feeling that if they get value out of this data, it's worth doing. But the opinion can shift quite significantly, and certainly I think that people need to be very mindful of security, of privacy, and the perception of that really as well. So absolutely, I agree on that. So one question before we get on to the last part was, why did Google and everyone publish all these white papers? So if you look at the whole Hadoop kind of movement, it really is effectively, certainly the open source movement has been re-implementing everything Google has been documenting and

Starting point is 00:31:30 so on. Why do they publish all this stuff and why do they kind of in a way lay out how they do things in such and such detail? Well, I think there are probably a number of motivations for this. I mean, one very obvious motivation is that there are a number of people at Google who have very strong research backgrounds and are very interested in contributing to the scientific literature because it's part of what's important to them. I think the other part actually can be traced all the way back to Google's mission statement in terms of trying to organize the world's information, make it useful and accessible to everyone. The work that we've done on systems like Dremel, systems like Spanner, systems like Dataflow,

Starting point is 00:32:09 which I think is turning into a really exciting Apache project called Beam. I think we at Google view this as part of the world's information, and we need to make it useful and accessible to everyone. And while we can't necessarily give everyone a Dremel in their own data center, we can make services like BigQuery available. But in the lag time, we can make available that information about how we see SQL at scale working or how we see dataflow processing working at scale. And we can make it accessible to the world by publishing research papers. I agree. And certainly, I mean, I've worked, well, I've been at Google

Starting point is 00:32:45 before, and I've spoken to people there. And it's always struck me how kind of altruistic some of this stuff is. I mean, obviously, Google is Google, it's a company and so on. But certainly, I would note the fact that this stuff has been published and shared. And I know from my own experience that, you know, I gain more out of sharing things and the world gains more out of it as well. So I can sort of see why. So actually, on a sort of tangential point to this, I suppose in a way carrying on: Dan, you worked at Oracle for a while and you've observed, I suppose, the big vendors operating in this Hadoop space. And I guess there must be a kind of contradiction or tension in there between wanting to, like yourself, build the best implementation of a SQL-on-Hadoop engine, or get Oracle to work with Hadoop, and the fact that, you know, the big vendors that have a commercial kind of model

Starting point is 00:33:45 that are now kind of working in this space that was all about, in a way, doing things cheaper and at scale. And that also applies even to things like consultancies. So, you know, is there a market for high-end consultancy in the Hadoop market, and so on? So I suppose the question to you is, you know, how relevant do you think the old-world mega vendors are in the Hadoop world? Do you think they've got a point to them, or do you think they'll be eclipsed over time? So, I mean, I think, you know, the virtue of a mega vendor, right, when you look at Microsoft, you look at Oracle, you look at IBM, is that there's a great diversification in the products and services that they can make available to their

Starting point is 00:34:25 customers. I think to that extent there will always be some amount of relevance that can be maintained. I think the question is where. And certainly when I was at Oracle, cannibalization of a business unit was something that, you know, was thought about quite a lot. You know, there were obviously entrenched threats from the NoSQL market, entrenched threats from the sort of larger Hadoop market and the big data market. And I think if I had, you know, candid advice I could give to any mega vendor, it would be first and foremost: stop selling hardware. Because the density of hardware concentration used for enterprise computing across the planet is consolidated. I mean, if you look at data centers that Amazon is building, data centers that Microsoft is

Starting point is 00:35:12 building, data centers that even companies like Oracle are beginning to build, the notion that we want to go out and sell a hard drive or sell a file server is becoming an increasingly difficult economic argument to make, insomuch as capital expenditures are necessary for many businesses, but some of these don't necessarily make sense. I think the other piece of it we talked a little bit about earlier, insomuch as the greatest business value, I think, for the mega vendors is in fact owning execution, owning the query. And in part, that also provides the greatest business continuity for existing users. I would hope that that would be where things shift. I think it probably varies based on what sort of revenue streams a given vendor sees from their hardware

Starting point is 00:36:05 lines versus their software products. And building data centers is hard. Building data centers is a really, really hard task. And the amount of investment, not only to provide the facilities, but then also to provide the people who understand how to maintain site reliability at scale, is a real challenge. I think some of the mega vendors are reacting to it very well. I think Microsoft does a very good job of this. I think companies like Oracle are learning how to do this. Yeah, it's tricky, definitely. I mean, certainly from my experience, you know, I've been in sales engagements with Oracle and so on in the past, and certainly going in there and the

Starting point is 00:36:49 first conversation you have with a customer is trying to sell them, in their case, you know, a Big Data Appliance, is an unusual conversation to have, because typically, you know, the person you're speaking to does not want to talk about hardware. They want to talk about the vision and the idea around things. And certainly my analogy at the time was that it's like going into a hi-fi audio shop, and you want to hear how good the music is, and you want to hear what it's going to sound like to have this fantastic music playing, but then the actual salesman is trying to sell you a very high-end walnut cabinet with Monster cables and so on. And yeah, there's a point to that, but it's probably not what the customer wants to hear at that point.

Starting point is 00:37:26 And, you know, really the margin on hardware is minimal. And so it was an unusual kind of, I suppose, angle to have. And you must have experienced that quite a lot with Big Data SQL, where at the start there was this dependency on it having to be with the Big Data Appliance and with Exadata. And obviously part of that, I'd imagine, and you probably can't say, part of that probably is for technical reasons, because of the InfiniBand, but part of it probably is because, you know, it supports wider objectives. But in the end you managed to get it to be, or you managed to get

Starting point is 00:37:57 it to be kind of freed from those restrictions. I mean, was that quite a battle there? That was the last thing you did really, wasn't it, before you left? It was the very last thing I did at Oracle, yes, was to get that product to a point where it could be available effectively to a much broader group of customers. I think it in part represents, and I think maybe the main motivator for me, and if there were battle lines drawn, the main battle line, was effectively that there is greater value in doing this for all users than there is in protecting a specific business area. Because long-term, I think technology companies are, at least in the modern age, most successful when they put their users first.

Starting point is 00:38:47 It is interesting, though. We think about the comments on hardware, but you'd asked about whether or not there's still a market for high-end consultancies around data. And I think the answer is perhaps maybe more than ever, because as we get further and further away from having to buy the nice, you know, having to first buy the nice walnut cabinet to hear what the music's like, having, you know, qualified and talented individuals that can help organizations get to value, get to the song they wanted to hear, is increasingly relevant.

Starting point is 00:39:21 That's interesting, yeah, because certainly my experience has been that the people like that would go and join Google or, you know, Facebook and so on. So, I mean, it's an interesting one. I think how consultancies and how, you know, integrators work in this market is interesting. So as more stuff moves to things like machine learning, as more stuff moves to Hadoop, there's going to be some kind of, you know, low-end work and so on, although the cloud obviously will take that away. But I think what a consultancy is, and how you operate in that kind of area, and how you would add value to someone like Google or to Facebook, it's kind of interesting, and

Starting point is 00:39:56 whether it becomes fewer people but more skilled or whatever, you know, I don't know on that. But it's interesting, and it's encouraging what you say there. So, Dan, one other area on that: one thing I've always noticed is that every one of the mega vendors or the vendor solutions for SQL on Hadoop is, you know, it works, obviously, from their product to SQL, say, to Hadoop. So your, you know, Big Data SQL was Oracle to that. The Microsoft one is like that. Do you think there's a market or do you think there's a need for solutions that kind of in a way link together different proprietary database engines through to Hadoop? A more kind of fabric-style thing, or is that a problem that doesn't really need to be solved? I mean, did you think about that at all when you were at Oracle?

Starting point is 00:40:37 I think the federation piece is extremely, I think when we think about the future of query processing, there are two ways to think about it. You can either take the stance that all data will be consolidated in one kind of system, or I think the more rational view to take is that data will be increasingly federated across many different kinds of storage and processing systems and will occasionally need to be processed in concert. And for that reason, I think federation and federation sort of beyond the language level is increasingly important. It's, you know, it is something we aim to solve at Oracle in terms of being able to use a single SQL dialect to query beyond sort of the Oracle database and HDFS, but reach out to SQL databases. It's a problem that is relatively well solved at Google. We have many, many internal systems that collect data. We can effectively federate across all of these for our own purposes. And I think it's important to understand that ultimately, we'll talk about finding value from data.

Starting point is 00:41:46 What we want to be able to do is interrogate the data wherever it may exist with a single construct that best suits our workflow. And so if SQL is the right workflow for me, excellent. If a declarative Scala API is better for you, so be it. But I need to get to all the data in the way that best enables me. Exactly, exactly. And I think certainly from my perspective, something I've been saying for a while is that I think that all analytic workloads in the end

Starting point is 00:42:13 will move to this kind of platform. I think that in time, certainly the bulk of it anyway, although there'll be this interchange, as you kind of said there, this ability to actually put it in one place and then apply different engines, different languages, and so on to it is important as well. But then, I suppose, in a way going beyond the basics of that, and I noticed there have been some startups recently, I think the guy at Vertica sort of did one, where, I suppose, in a way they're looking at, say, automagic, I suppose, discovery of

Starting point is 00:42:38 the kind of meaning of data in there, and schema and that sort of thing, and adding smarts to it. I mean, I think certainly at the moment we're plumbing it all together, but going beyond that, you know, if you were to look forward, I don't know, five, ten years, and you saw the kind of analytic platform of the future, you know, running on probably the descendants of this technology, what do you think it would be? What would you be aiming for if you were doing this, the kind of next-gen analytic and integration platform? I think ultimately metadata and sort of catalog management will become almost a separate entity such that you may have a

Starting point is 00:43:32 service that is, well, we can actually look at, you know, the Hive metastore as an early version of this, right, in which we see one catalog which can contain information about data stored in many different places. It may contain metadata information about data stored in HBase or data stored in HDFS or data stored in another NoSQL database. Even now, I think through some of the various APIs, you can actually store data about other RDBMSs in there. I think we'll start to begin to see that

Starting point is 00:43:58 as being a much more distinct and separate piece of the process. The other thing, and I think it's actually incredibly important, and I think it's why I'm so excited about the Apache Beam project, is that, increasingly, as streaming workloads become more interesting to organizations, I think we will move to a situation in which we stop talking about the difference between batch processing and stream processing and simply say data exists as a flow. And you can either choose to process it in a fixed batch, you can choose to process it as a window, you can process it in any particular way. And I think that sounds a little wild and outlandish at first until you sort of

Starting point is 00:44:46 think about transaction logs or redo logs that we would see in Oracle. Insomuch as, effectively, if you had an infinite redo log, you would be able to slice and dice that as needed to either say, here is a batch of data that I want to process, or process the next thing that comes in, or process the next five-minute window. And I think we're going to increasingly see systems built around those concepts. I think a lot of the work that's going on in the Kafka space is really beginning to push this way. The guys at Confluent, I think, are doing an interesting job evangelizing some of these concepts. Again, the work that Google's doing with the Beam community is very much along the same lines. I think we're going to see that as a fundamental shift in the sort of underlying treatment of data sources. That's interesting. And I mean, I suppose one of the things that I've always been kind of saying is that, you know, whilst Hadoop and this kind of world is going to eat into analytic workloads, it sounds very much like it could almost start to eat into what we consider to be normal transaction processing now. It sounds like, you know, you're

Starting point is 00:45:47 saying there that, yeah, that's kind of interesting. And yeah, I mean, that sounds a much, I suppose, bigger goal really, doesn't it, than just doing data warehousing better? Yeah, yeah. I mean, I think doing data warehousing better is the beginning of it, right? Because I think there are still many of the promises of the original data warehousing movement many years ago that organizations are still working to realize. But I think at some point we will end up with enough data and enough desire to look at it in different ways that we begin to change the fundamental model of, well, it's not so much a table as it is a table-shaped stream. Interesting. Interesting. Yeah, definitely.
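
The "infinite redo log" idea discussed above, where the same ordered log can be read as a fixed batch or as a series of time windows, can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the `Event` type, its `ts` and `value` fields, and the helper names `fixed_batch` and `tumbling_windows` are hypothetical and are not taken from Beam, Kafka, or Oracle; real systems also handle late and out-of-order data, which this sketch ignores.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class Event:
    """One entry in the log (hypothetical shape, for illustration only)."""
    ts: int     # event time in seconds
    value: int  # payload


def fixed_batch(log: Iterable[Event], start: int, end: int) -> List[Event]:
    """Batch view: every event whose timestamp falls in [start, end)."""
    return [e for e in log if start <= e.ts < end]


def tumbling_windows(
    log: Iterable[Event], size: int
) -> Iterator[Tuple[int, List[Event]]]:
    """Window view: group a time-ordered log into consecutive windows.

    Yields (window_start, events) pairs. Windows with no events are
    skipped. Assumes the log is already ordered by timestamp.
    """
    current_start = None
    bucket: List[Event] = []
    for event in log:
        window_start = (event.ts // size) * size
        if current_start is None:
            current_start = window_start
        if window_start != current_start:
            # The previous window is complete; emit it and start a new one.
            yield current_start, bucket
            current_start, bucket = window_start, []
        bucket.append(event)
    if bucket:
        yield current_start, bucket


# The same log, read two ways.
log = [Event(0, 1), Event(120, 2), Event(299, 3), Event(300, 4), Event(610, 5)]
first_five_minutes = fixed_batch(log, 0, 300)
per_window_totals = [
    (start, sum(e.value for e in events))
    for start, events in tumbling_windows(log, 300)
]
```

Because `tumbling_windows` is a generator, it could in principle consume an unbounded source one event at a time, which is the sense in which a batch is just one particular slice of the flow.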

Starting point is 00:46:24 Well, I'm going to obviously approach Gwen at some time and see if she wants to come on the show as well. It'd be interesting to see, I think, certainly from her, this is Gwen Shapira,

Starting point is 00:46:33 who probably both of us know, working at Confluent. And certainly there are a lot of parallels really between some of the stuff going on there and what you're doing there and general kind of processing

Starting point is 00:46:42 of data and so on really. So, I mean, Dan, that's been fantastic, and thank you very much for your time on this. It's been really interesting to catch up with you. I guess it's probably sunny over there, is it, where you are? It's pouring with rain over here. It's the middle of summer and, yeah. Yes, you've got English summer, I've got California summer, which is just what you would think: 70 degrees Fahrenheit. I know. So you're at Google. It's sunny. I mean, you should be over here. It's raining and

Starting point is 00:47:08 so on, really. But Dan, that's fantastic. And it's been really good to speak to you. Thanks very much for your insights there. It's very interesting what you're doing. And yeah, great to speak to you. So thank you very much. And thank everyone for listening. And yeah, thanks a lot. Brilliant. Thank you. All right, Mark. Thanks very much. It's been a pleasure. Cheers. Thanks.


Mark Rittman is joined by Dan McClary, ex-Oracle Big Data SQL PM and now working on Google's Storage Division on big data projects, to talk about the future of SQL-on-Hadoop....
