Drill to Detail - Drill to Detail Ep.2. 'Future of SQL on Hadoop', With Special Guest Dan McClary
Episode Date: September 27, 2016
Mark Rittman is joined by Dan McClary, ex-Oracle Big Data SQL PM and now working in Google's Storage Division on big data projects, to talk about the future of SQL-on-Hadoop....
Transcript
Hello and welcome to the second episode of Drill to Detail, a new podcast series hosted by me, Mark Rittman,
where I'll be talking to a special guest each week about some of the issues, thoughts and ideas behind the news
and what's happening in the big data analytics and data warehousing industry.
Okay, so this episode's guest is Dan McClary, someone I've known for a little while now, back from my Oracle days really,
or working with Oracle. Dan was the PM for Big Data SQL, which most of you will have heard of as being Oracle's take, I suppose,
on SQL on Hadoop. And Dan recently joined Google. So actually, Dan, introduce
yourself first of all, and just tell everyone who you are and what you've been doing.
Yeah, hi. So thank you, Mark, for having me on. So my route to Google has been a little
twisted. I started off as a researcher many years ago, went and played the startup game for a little bit.
And then, as you mentioned, was at Oracle for a number of years working on distributed SQL problems and SQL on Hadoop.
And yes, now I've joined Google and look after things related to the SQL language as well as things like the block storage business.
So largely, I'm part of the product management team for Google's Cloud Platform. And I should say right
at the beginning that any opinions expressed here are my own and do not reflect those of either
Oracle or Google or, you know, any startups that I might have played with in the past.
Excellent. So thanks, Dan. And yeah, it's great to have you on here. And really,
I'm, I suppose, interested in your take on the SQL-on-Hadoop market and just generally, I suppose,
how the big vendors and the small vendors are doing in this
space. And we'll come on later to, I suppose, some of the things coming out of Google
and Yahoo, and where you see the market and where you see things going in terms of,
I suppose, Hadoop for enterprise customers
and for my area, which is kind of BI, analytics and so on, really.
So, Dan, let's start off then with something that I suppose most people would know you for,
which is the area of SQL on Hadoop.
So, as I said earlier on, you were the PM for Oracle Big Data SQL, which was obviously Oracle's take on
this. So Dan, do you want to tell us first of all what you were doing with Big Data
SQL and what was the problem it was trying to solve? Let's talk about that first of all.
Sure. I mean, I think, broadly speaking, SQL on Hadoop has become almost an
overloaded term at this point, because depending on who the implementer is,
who the vendor is, the motivation technically may be somewhat different.
I think the common motivation across open source
products, small vendors and large vendors is that the economics of how we do large-scale data analysis have shifted or are shifting.
And to some extent, this means that many of our large-scale warehousing systems,
many of our analytical systems are moving to more distributed,
not necessarily strictly distributed, but more distributed constructs.
And I think when we think about SQL on Hadoop as sort of a broader category, what you're seeing is different kinds of initiatives to take advantage of the fact that distributed file systems have
become reasonable to operate and cost efficient to maintain. And so then with Oracle, what we were
effectively looking at was there are many customers who run large Oracle data warehouses.
The cost of turning down such a system can have tremendous impact to a business
and could be quite challenging. And then is there a sane and rational way to take advantage of
the growing economic benefits of distributed file systems while still maintaining a declarative
language interface querying all of your data? And so from that perspective, I think the notion is
you run Oracle SQL on a data warehouse, now extend it with the Hadoop Distributed File System and be able to harness more power from that distributed environment with reasonable economics and lower risk.
With, you know, something like Cloudera or Hortonworks, I think you're very often looking at trying to enable users
who have decided that they want to take further steps
to sort of distance themselves from whatever their traditional infrastructure is.
And that may be, in fact, buying into an entirely new query engine.
Thus, we see the rise of things like Impala,
things like Hive, and ultimately,
you know, both the Tez project and then Phoenix, which they now support at Hortonworks.
And then I think for the open source community in general there is this notion that
distributed systems for scalable storage are to some extent solved. If we're simply talking
about storing bytes in something that behaves like a file system, it is maybe not painless, but it is somewhat solved. And if that's
the case, then what are the tools that need to be built in order to do real query processing and
real declarative language-driven analysis of data in those environments? And thus, we see the rise
of things like Hive, the rise of things like Drill, potentially. So I think the motivation broadly is that the economics of data management are shifting,
and thus the SQL language and its ability to act on data at scale also needs to shift.
Yeah, exactly.
I mean, I think broadly you could argue that, I suppose, the big vendors supporting
SQL on Hadoop, and Hadoop, was a classic kind of, I won't say embrace and extend,
but certainly it was always very obvious that,
with Big Data SQL, for example, the output of that was always going to be Oracle.
You know, you could select against Hadoop,
but then the data always came out via Oracle.
So you were still kind of locked in, I suppose, to that.
But if you were looking to do, I suppose, query offloading,
data warehouse offloading
and so on, it was a very good way of doing that. And so certainly for customers that were
heavily invested in either Oracle or SQL Server or whatever, it's perfect in that respect,
isn't it? But I think the more organic SQL-on-Hadoop engines that
are purely SQL on Hadoop, you know, kind of Hive and Impala and Drill and so on, it's a bit different, really, isn't it?
And that's where I think certainly there's been different kind of branches of innovation there, really, haven't there?
Yeah.
And in fact, I think when I look at the sort of newer approaches to performing,
let's just call it SQL on Hadoop, you look at something like Impala, for example,
or you look at Drill, or you look at the sort of parent of many of these, which was the Dremel project at Google, described in a paper written a number of years ago,
and which Google actually exposes to the world as a system called BigQuery.
One of the things I think is really fundamentally interesting is that all of the available research suggests that at a certain scale, distributed query processing requires a shift from more normalized data models, sort of 3NF and beyond, to something that actually includes nested fields and repeated fields. And so, you know, we see this with sort of JSON fields, we see this with nested and repeated fields within, you know, systems like Impala that can have child tables.
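[Editor's illustration, not part of the recording.] The contrast between a normalized model and one with nested, repeated fields can be sketched in a few lines of plain Python. The table names, field names and sample values below are hypothetical, and the code runs on one machine; it only illustrates why a nested record needs no lookup against a second table, which in a distributed engine is the data that would otherwise have to be broadcast or shuffled.

```python
from collections import defaultdict

# Normalized layout: two flat tables linked by order_id (hypothetical sample data).
orders = [
    {"order_id": 1, "customer": "acme"},
    {"order_id": 2, "customer": "globex"},
]
line_items = [
    {"order_id": 1, "sku": "A", "amount": 10.0},
    {"order_id": 1, "sku": "B", "amount": 5.0},
    {"order_id": 2, "sku": "A", "amount": 7.5},
]

def totals_with_join(orders, line_items):
    """Total per customer via a hash join: every line item needs the orders lookup table."""
    customer_by_order = {o["order_id"]: o["customer"] for o in orders}
    totals = defaultdict(float)
    for item in line_items:
        totals[customer_by_order[item["order_id"]]] += item["amount"]
    return dict(totals)

# Nested layout: each order carries its line items as a repeated field.
nested_orders = [
    {"order_id": 1, "customer": "acme",
     "items": [{"sku": "A", "amount": 10.0}, {"sku": "B", "amount": 5.0}]},
    {"order_id": 2, "customer": "globex",
     "items": [{"sku": "A", "amount": 7.5}]},
]

def totals_nested(nested_orders):
    """Total per customer with no join: each record is self-contained."""
    totals = defaultdict(float)
    for order in nested_orders:
        totals[order["customer"]] += sum(i["amount"] for i in order["items"])
    return dict(totals)
```

Both functions return the same per-customer totals; the difference is that the nested version can process each record in isolation, which is roughly the property the speaker is pointing at for very widely distributed query processing.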
And one of the really interesting lessons that sort of emerged both from the research and then
also from the open source community is when you want to talk about doing really, really broadly
distributed SQL processing, you actually have to start thinking about what can we nest,
what can we repeat, simply because the processing will eventually become too challenging if you try
and broadcast all of those joins.
Yes, yes. I mean, interesting. And so, I mean, one of the
things I noticed was that obviously each vendor, IBM, Microsoft, and so on, had a take on this.
And it struck me, obviously my background is in Oracle,
but it struck me that Big Data SQL was probably a fair bit ahead
of what IBM were doing and so on and so forth.
I mean, again, I'm conscious of things you can and can't say,
but how did you feel, I suppose, the different vendors,
the mega vendors, in terms of how they did things?
Were they all broadly the same solution? Or did the vendors take different approaches? Any kind of preferences or ideas
on those at all?
I mean, I think, you know, the commonality among the mega vendors that we see
is that what is more important is sort of the continued integrity of business process. And then ultimately it is most important to the vendor
to maintain the processing logic,
to maintain the statement itself,
not necessarily the underlying storage
or some of the execution underneath it.
I like to think that at least with what we did at Oracle,
I think the ambition was understanding that there
are open APIs that are becoming real standards, right? I think you can look at the HDFS APIs and
say, you know what, these are largely solid. We see them supported not only across Hadoop
distributions, but also by cloud vendors.
For example, you can use HDFS APIs to talk to Amazon's S3 or to Google's cloud storage.
And if that is the case, then any reasonable extension of a mega vendor SQL system needs to respect and take proper advantage of those APIs.
And I think that gave Oracle an advantage
in so much as it allows that solution
to take good advantage of the innovation
that is happening in that community
as those APIs become standard
and as the underlying functionality in the open source community develops. I think the approaches
that Microsoft and IBM have taken are also very interesting. I think they are much more suited to
perhaps what the bulk of their customers wish to see,
which is simply I'm able to treat that as a reasonable source of byte storage
that is lower cost, and I'm not hugely concerned about integration
with a rapidly evolving field.
Yeah, interesting.
And so I suppose, again, respecting your position before, how much take-up in the market in general, I suppose, really, within traditional data warehouse customers, how much take-up do you see? I tend to see that there's an understanding of this idea of doing what we do now, kind of, cheaper. But something I
see less of is people using these SQL-on-Hadoop engines in a kind of innovative way, so
using it against, say, Mongo, using it with nested sources. What was your feeling about that,
the take-up and the degree of innovation you were seeing with these engines
by these customers?
To be honest, at least, you know, I think the two broad things I noticed and still tend to think,
and maybe let's extend it to three broad things.
I think the first thing that I noticed is that it's early days for all of this, right? And, you know, many, many organizations are still very much in a tire-kicking phase,
which maybe doesn't lead to wild experimentation or innovation.
And I think organizations are trying to understand whether or not
they are going to consume things like SQL on Hadoop
strictly from their existing vendors of preference,
or whether or not they're going to break with tradition. There was a Gartner webcast maybe about a year ago in which
they sort of ran a survey with the folks on the phone and sort of asked like, well, where are you
likely to get your SQL on Hadoop? Like from an open-source project, from, you know, Cloudera or
Hortonworks or from your database vendor. And it was really, it was very split. It was very split
between, you know what, we're going to go with whoever our vendor is today, or we're going to
try one of these open source projects. I think there's a lot of sorting out to be done. I think
the other thing that maybe slows some of the adoption here is, I do think that SQL on Hadoop,
as a movement, while it has led to sort of a
fascinating amount of open source innovation, I think as a market, it is actually
strongly in competition with the broader shift to cloud.
In so much as if you talk about, well, I need to lower my infrastructure costs, I need to get access to querying more data, you might equally turn to some of the growing cloud databases that exist, cloud data warehouses that exist, or even the sort of managed Hadoop things
that you see from Elastic MapReduce,
Google Dataproc, things like that.
So I think there's an underlying race
in terms of where will the infrastructure settle
that is maybe undercutting
some of the adoption of SQL on Hadoop at large scale.
I agree.
And I think that certainly, I mean, yes, so cloud is one thing.
And obviously, Oracle, for example, have, you know,
Big Data SQL in the cloud as an option there as well.
But if you're looking to store a lot of stuff cheaply at
a big scale and so on, then there are other vendors out there,
I mean, SnowflakeDB is an example and so on,
where, yeah, really, in fact, whether it's Hadoop or not Hadoop is really irrelevant.
It's just an abstracted kind of elastic store of data. And then certainly, I guess, the other thing
really is, when you're looking at, say, vendor SQL-on-Hadoop solutions compared to, say, open source,
cost is not insignificant, really. And, well, we'll get on later to talk about, I suppose,
how well you think the big vendors will do in this kind of area, but cost is an issue as well, and
it's kind of, in a way, counterintuitive sometimes to pay a lot of money for
these vendor solutions when normally people expect this to be kind of free, for open source.
It's true. I mean, I think one thing that is a real bright line between what you get
in the open source community and what you get from any vendor is that any sufficiently mature vendor iteration will likely have more
capabilities around security, around governance, around metadata management. And I think for larger
organizations, that's going to matter quite a lot. Again, business continuity, regulatory compliance,
these are real issues for a number of organizations. And it may be such that if you are a traditional
RDBMS vendor, and you've already taken care of everything around encryption and session
isolation, and you are HIPAA compliant, and you are SOX compliant, and so on and so forth, that
while you may not have the most novel extension to query processing, you may
have such an entrenched advantage in compliance that you will naturally pick up some amount of
customers.
So, I suppose, a tangent to this really. I totally agree with what you're saying there. I
think, you know, we sometimes forget how important this stuff is to real customers. One issue I had, I suppose, when Oracle Big Data SQL first came out
was almost a philosophical one. There was a good blog
post a while ago by a guy called Jeff Needham, who talked about how Hadoop is not just a cheap
enterprise data warehouse and, you know, running SQL on Hadoop sometimes is missing
the point. And I get the point that there are certain tasks
that suit kind of set-based processing.
But I suppose, you know, in a way,
how much do you feel that in a way the kind of energy
and the kind of movement around SQL on Hadoop
is almost missing the point of what Hadoop is about?
I mean, do you think it's a valid point
or do you think actually that all things will converge on that in the end?
I mean, I suppose I do and I don't, which is probably not a great answer.
I think that it's easy to look at SQL on Hadoop or using big data systems for data warehousing as not taking full advantage of the advance in technology with respect to how we deal with scale.
And, you know, we can certainly look at some of the more interesting architectures that have come around,
things like lambda architectures, things like kappa architectures,
and say like, oh, there's so much more you could be doing.
However, ultimately, I suppose maybe the best way to think about this is I'm currently working on a blog post and I need to make some figures for it.
And I'm analyzing a bunch of data and I could make any number of sort of wonderful, interesting
charts.
But ultimately, when I sort of went through to try and tell the most effective story with
the data, I discovered that most of what I wanted was bar charts.
Because bar charts got the message across.
And a lot of what we want to do with data comes down to data warehousing kinds of workloads,
you know, SQL kinds of workloads.
And maybe I'm just making bar charts.
Yeah, exactly.
Exactly.
So just before we get on to the next bit, I mean, a product I've seen in this area that
I've been surprised at how much I've been impressed by is Drill from the Apache sort
of project.
Any sort of thoughts on Drill,
or I suppose some of the engines that are less traditional
in how they do things?
I suppose, again, a bit of context for this is
one of the things that I've noticed about using SQL on Hadoop
is a lot of it is just doing the same thing,
but in a cheaper or more scalable way
than you used to do with, say, Oracle.
You define columns, you define metadata.
Drill seems quite different.
Any thoughts on that at all?
Yeah, I've been watching
Drill with some interest over
the last several years. I think
Drill's original ambition was to
effectively be something like an open-source
Dremel.
I think it's moved to a really
interesting space in which
it is really
pushing the bounds of what we think of as SQL. It's
certainly sort of far deviant from what we think of as an ANSI 2011 SQL. And I think that is
interesting. And I think it is, the other thing I sort of compare it to is some of the things that
we've seen in Spark 2.0, where we're starting to see the sort of typed semi-declarative language
constructs around data sets. And to me, what it speaks to, and again, this is just me sort of
thinking out loud about it, is that it speaks to a real renewed interest in the power of declarative
language. And I think that's really, really compelling.
I don't know if Drill will be the thing, but I think what we will see is that increasingly
we will see greater expressiveness and flexibility in declarative languages that are perhaps
SQL-like.
Yeah, I think it's a powerful concept, and I think it's
interesting that people are remembering, like, oh yeah, there are many other, you know, types of
expressions that I would like to declare and still get the power of having something that optimizes
and executes on my behalf.
Exactly, exactly. I mean, for me, you know, SQL is one form of engineering
part of your data in Hadoop, really. And certainly having that there is useful.
I think the innovation that I'm seeing in things like Drill
and also some of the things we can do around, say,
Query Federation with, say, Spark SQL,
some of the stuff coming out,
some of the vendors around Data Fabric and so on as well.
I mean, it's kind of interesting sort of area.
I mean, and that kind of in a way leads on to probably
the next thing I want to talk to you about.
Now, Dan, obviously you were at Oracle,
and that's how I know you
and how most people probably listening to this know you, but you've moved on to
Google now, presumably because there were interesting things going on there. In general, one of the
things that you start to notice about the whole Hadoop and big data area is
that everything that we see now was invented 10 years ago at Google, at Yahoo and so
on. You know, what are you seeing out there at the moment?
What are some of the trends and ideas that you're seeing happening?
It's probably, you know, in those areas that we might hit on in the future, really.
Well, I mean, I think two things I would say.
One is, you know, yes, it's absolutely right to look at the history of Google papers
over the last decade or so and sort of say,
hey, look, you know, this is really the sort of origination point of a lot of the ideas that end up in the broader big data ecosystem. I think one of the things I'm noticing is
that the time lag between the sort of research publications we produce at Google and their emergence as entities
in the open source ecosystem is shortening,
which is really, really interesting.
And we're trying to play a bigger role in that as well.
And I think when I look at what we're publishing on
and what we're helping to
expose more broadly in the community,
there are two or three things that really stand out to me.
One, which I think probably doesn't need a lot of introduction,
is the machine learning work we've done around TensorFlow.
The amount of interest the community has had around TensorFlow
has been really, really tremendous.
And the fact that it's all being done in the open as an open source project, I think is going to...
So do you want to just explain what actually is TensorFlow?
I mean, I know, but for the audience,
what is TensorFlow?
And why is that so significant and interesting now?
So TensorFlow is, at a high level,
it is a framework for doing large scale deep learning using
Python and precompiled C code. It's exceptionally
flexible, exceptionally powerful, and it's
a tool set we use to solve a lot of problems at Google.
Now, by open sourcing it, we've brought to
the larger community not only the ability to sort of quickly define very powerful deep learning models, but at the same time also the infrastructure necessary to run those things at scale, in a distributed fashion, on anyone's cloud or in your data center. That allows you then to build
these models at scale and very rapidly, as well as to begin to introspect them and understand
where model performance is varying and how you might build better models. And that's,
I think if deep learning is going to become something that becomes part of an analyst or
data scientist sort of standard toolkit, it's projects like TensorFlow and the things that the community
is sort of building around it
that will really help push it,
really push it into the hands of more and more,
you know, analysts and data scientists
and even beyond.
Okay, okay.
So it's interesting you say about machine learning
and so on.
So, I mean, I'm speaking at ODTUG Kscope next week
and I'm doing a session on using machine learning
on wearables data.
So I've been gathering all my data
from my Fitbit and from the bike
and from the house and all that kind of stuff,
bringing it into one place
and then applying, you know,
Python-based machine learning on it.
But one of the things that keeps striking me
is that when you don't know what you're doing
and when you don't get some of the concepts around, you know, data having
to be a measure for every kind of row and so on, and the different kinds of,
I suppose, algorithms and so on, one, you're really lost, and secondly, you're
potentially into the realms of being quite dangerous. Do you think machine learning will ever
be democratized? Do you think it will ever be something where anyone can do that, or is it always going to be a scientific
thing, really? I mean, going back to TensorFlow and so on, is it going to become mass?
I suspect it will, but I suspect it will be consumed in different ways. I think you
can look at some of what we do at Google around exposing machine learning to end users as
a way in which it might become consumerized, in so much as you
can, you know, TensorFlow is open source, you can use it, you can build your own models.
But at the same time, if you say, I don't really have the time or sufficient understanding
how to build an image classification model, Google then offers up its own Vision API, by
which you don't have to worry about how you construct your model, how the network should be formed.
You can simply say this is my training set.
These are the images I send or simply I have an image.
Tell me what's in it.
And so I think to some extent there will be the people who want to craft their own models and there will be people who simply want to say, you know, I have data, give me the bar chart, right? The effect of the bar
chart for images or text or speech. And I think machine learning will become democratized in
different ways.
Yeah, okay. We'll come back to that in a moment, actually. But, I guess,
you joined Google from Oracle, and we all know Oracle is kind of fantastic and so on.
But you mentioned early on that, in a way, with SQL, the question will become less about the engine and more about things like the cloud and so on.
I mean, do you see, I suppose, initiatives going on in Google and other places where, you know, in a way, cloud will become more of this?
And will big data
and machine learning be more in the cloud? And, I suppose, how do you see the areas that Google
works in as impacting on how this is going in the future, particularly the areas you're
working in?
working in i i i think the shift to cloud will actually become a more pronounced advantage for
organizations over the next several years and and and. And the reason I say this is just, you know, the experience I've had sort of looking at how, let's take SQL and Hadoop as a good example of this, at how the benefits of the technologies are truly enhanced by scale,
such that if you set up a little pseudo-distributed Hadoop cluster
and you run, say, you run Impala or you run Hive or something like that,
you'll get reasonable performance,
and then maybe you'll move up to a five-machine cluster,
and you'll get much more reasonable performance.
And maybe you move up to a full rack of servers, and you get much more reasonable performance. And maybe you move up to a full rack of servers
and you get much more reasonable performance.
But what we see economically with cloud deployments
is that you can have at your instantaneous disposal
thousands of cores,
tens of thousands of cores potentially
and tremendous sorts of throughput
of networks and disks.
And I think what will ultimately be a huge
advantage for consumers of data or consumers of data analysis is the ability to say,
because the density of infrastructure is increasingly concentrated in large cloud
providers, I can achieve the real benefits of economy of scale on my queries
because I am using vast resources for very short periods of time.
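[Editor's illustration, not part of the recording.] The "vast resources for very short periods of time" argument is easy to see as arithmetic. The sketch below assumes a perfectly parallel query and simple per-core-hour billing; the workload size and the price are hypothetical numbers chosen only for illustration, and real queries will not scale this cleanly.

```python
def wall_clock_and_cost(core_hours_needed, cores, price_per_core_hour):
    """Return (wall_clock_hours, cost) for an idealized, perfectly parallel job."""
    wall_clock_hours = core_hours_needed / cores
    cost = cores * wall_clock_hours * price_per_core_hour
    return wall_clock_hours, cost

CORE_HOURS = 1000.0          # hypothetical amount of work in the query
PRICE_PER_CORE_HOUR = 0.05   # hypothetical price, not a real cloud rate

# A small fixed cluster versus a short burst on many cores.
small_hours, small_cost = wall_clock_and_cost(CORE_HOURS, 8, PRICE_PER_CORE_HOUR)
burst_hours, burst_cost = wall_clock_and_cost(CORE_HOURS, 8000, PRICE_PER_CORE_HOUR)
```

Under these assumptions the bill is the same in both cases, because the same number of core-hours is consumed, but the burst finishes a thousand times sooner. That is the economy-of-scale point being made: the benefit only exists if thousands of cores are available on demand for a few minutes.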
Interesting.
I mean, yeah.
So I don't know if you noticed, there was a couple of interesting blog posts that were
published recently that come to this sort of area.
So there was a post by a guy called Marco Arment, who's an Apple blogger,
and he posted this article saying that if anything brings down Apple, or certainly leads to
the eclipse of Apple, it will be its lack of investment in machine learning. So the background
to that really, I suppose, is the fact that Apple refuses, for various reasons, to, you know, in a way
capture lots of your personal data and then work on it centrally. It wants to use apps.
And that's an area that Microsoft and Google have been investing in a lot, really. And there was
also an article, I think it was Steven Levy yesterday, posted about how Google is now
becoming effectively a machine learning first company. I mean, do you think,
with the investment they're making, are you seeing this as very strategic to them? And do you think that,
I don't know if you saw those blog posts, but do you think there's, you know, a point there really?
Yeah, I saw Steven Levy's blog post and I think, you know, certainly
the stance we take as an organization is machine learning is incredibly important to what Google
does and increasingly part of more and more of the products that we bring to market.
And I think two things to note.
One, we and Facebook and many other large organizations
have a tremendous amount of data
that we can leverage to user advantage.
And that could be from everything from query planning
to figuring out what's in an image
to recommending what restaurant you should eat at.
I think the real challenge, and a thing that is always, always top of mind at Google,
is that this must be done in a way in which no one's privacy is actually compromised. And I think this is a really interesting challenge that consumers of potential technologies should
consider, and also maybe more particularly data scientists and analysts who are exploring
machine learning, exploring processing at scale, need to keep in mind: that, you know, at some point you are entrusted with your users' trust.
And it should mean that you may process every byte of data that you have very,
very efficiently and to great effect, but you should never be able to put it in a
situation in which it might be compromised. And ultimately, when you look at places like Google,
you should not be able to see it.
I agree. And I think going back to your original, you know, job,
I knew you from Big Data SQL at Oracle. Certainly, my experience has been that on a lot
of big data projects I've worked on, when the customer gets to that point,
when they suddenly realize the amount of data that's under their custodianship and the
responsibility they have, that is where I found that the Oracle solution, with the ability to kind of
apply security over it, was important. But I think generally perceptions are really important, and
at the moment I think there's a general kind of benign feeling
that if people get value out of this data, it's worth doing. But the opinion can shift quite
significantly, and certainly I think that people need to be very mindful of security, of privacy, and the perception of that really as well.
So absolutely, I agree on that.
So one question before we get on to the last part was, why did Google and everyone publish all these white papers?
So if you look at the whole Hadoop kind of movement, it really is effectively,
certainly the open source movement has been re-implementing everything Google has been documenting and
so on. Why do they publish all this stuff and why do they kind of in a way lay out how
they do things in such and such detail?
Well, I think there are probably a number of motivations for this. I mean, one very
obvious motivation is that there are a number of people at Google who have very strong research backgrounds and are very interested in contributing to the scientific
literature because it's part of what's important to them. I think the other part actually can be
traced all the way back to Google's mission statement in terms of trying to organize the
world's information, make it useful and accessible to everyone. The work that we've done on systems like Dremel,
systems like Spanner, systems like Dataflow,
which I think is turning into a really exciting Apache project called Beam.
I think we at Google view this as part of the world's information,
and we need to make it useful and accessible to everyone.
And while we can't necessarily give everyone a Dremel in their own data center,
we can make services like BigQuery available. But in the lag time, we can take that information
about how we see SQL at scale working, or how we see dataflow processing working at scale,
and we can make it accessible to the world by publishing research papers.
I agree. And certainly, I mean, I've worked, well, I've been at Google
before, and I've spoken to people there. And it's always struck me how kind of like altruistic some
of this stuff is. I mean, obviously, Google is Google, it's a company and so on there. But
certainly, I would note the fact that this stuff had been published
and shared. And I know from my own experience that certainly, you know, I gain more out of
sharing things and the world gains more out of it really as well. So I can sort of see why really. So actually on a sort of tangential
point to this really, I suppose in a way carrying on. So Dan, you worked at Oracle for a while and
you've observed the kind of, I suppose, the big vendors operating in this kind of Hadoop space
here really. And I guess probably there must be a kind of, I suppose, a contradiction or tension in there really between wanting to kind of, I suppose, like yourself, want to build the best kind of implementation of a SQL on Hadoop engine or to get Oracle to work with Hadoop with the fact that, you know, the big vendors that have a commercial kind of model
that are now kind of working in this space that was all about in a way um doing things at cheaper
and at scale and from also that applies even things like consultancies so um you know is there
a market for for high-end consultancy in the hadoop market and and so on so i suppose the
question to you is you know how relevant do you think the old world um mega vendors are in the hadoop world do you think they're going to be do you think they've got a point to you is, you know, how relevant do you think the old world mega vendors are in the Hadoop world?
Do you think they're going to be do you think they've got a point to it?
Or do you think they'll be or do you think it'll be eclipsed over time, really?
So, I mean, I think, you know, the virtue of a mega vendor, right, when you look at Microsoft, you look at Oracle, you look at IBM,
is that there's a great diversification in the products and services that they can make available to their
customers. I think to that extent there will always be some amount of relevance
that can be maintained. I think the question is where, and certainly when I was
at Oracle, cannibalization of a business unit was something that, you know, was thought about quite a lot.
You know, there were obviously entrenched threats from the NoSQL market,
entrenched threats from the sort of larger Hadoop market and the big data market.
And I think if I had, you know, candid advice I could give to any mega vendor, it would be first and foremost, stop selling hardware.
Because the density of hardware concentration used for enterprise computing across the planet is consolidated.
I mean, if you look at data centers that Amazon is building, data centers that Microsoft is
building, data centers that even companies like Oracle are beginning to build, the notion
that we want to go out and sell a hard drive or sell a file server is becoming an increasingly difficult
economic argument to make in so much as capital expenditures are necessary for many businesses,
but some of these don't necessarily make sense. I think the other piece of it we talked a little
bit about earlier in so much as the greatest business value, I think,
for the mega vendors is in fact owning execution, owning the query. And in part, that also provides
the greatest business continuity for existing users. I would hope that that would be where
things shift. I think it probably varies based on what sort of revenue streams a given vendor sees from their hardware
lines versus their software products. And building data centers is hard. Building data centers is a
really, really hard task. And the amount of sort of investment not only to provide the facilities,
but then also to provide the people who understand how to maintain site reliability at scale, is a real challenge. I think
some of the mega vendors are reacting to it very well. I think Microsoft does a
very good job of this. I think companies like Oracle are learning how to do this.
Yeah, it's tricky, definitely. I mean, certainly from my experience, you know,
I've been in sales
engagements with Oracle and so on in the past, and certainly going in there, when the
first conversation you have with a customer is trying to sell them, in their case, you know, a Big
Data Appliance, it's an unusual conversation to have, because typically,
you know, the person you're speaking to does not want to talk about hardware.
They want to talk about the vision and the idea around things. And certainly my analogy at the time was, it's like going into a kind
of hi-fi audio shop, and you want to hear how good the music is, and you want to hear
what it's going to sound like to have this fantastic music playing, but then the
actual salesman is trying to sell you a very high-end walnut cabinet with Monster cables
and so on there. And yeah, there's a point to that, but it's probably not what the customer wants to hear at that point.
And, you know, really the margin on hardware is minimal.
And so it was an unusual kind of, I suppose, angle to have.
And you must have experienced that quite a lot with Big Data SQL,
where at the start there was this dependency on it
and it having to be with the kind of big data appliance and with Exadata.
And obviously part of that, I'd imagine, and you probably can't say, part of that probably is for technical
reasons, because of the InfiniBand, but part of it probably is because it supports
wider objectives really. But in the end you managed to get
it to be kind of freed from those restrictions. I mean, was that quite a battle there? That was
your last thing you did really, wasn't it, before you left?
It was the very last thing I did at Oracle, yes: get that product to a point where
it could be available effectively to a much broader group of customers. I think
it in part represents, and I think maybe the main motivator for me,
and if there were battle lines drawn,
the main battle line, was effectively that there is greater value in doing this for all users
than there is protecting a specific business area. Because long-term, I think technology
companies are, at least in the modern age, most successful when they put their users first.
It is interesting, though.
We think about the comments on hardware, but you'd asked about whether or not there's still
a market for high-end consultancies around data.
And I think the answer is perhaps maybe more than ever, because as we get further and further
away from having to
buy the nice, you know, having to first buy the nice walnut cabinet to hear what the music's like,
having, you know, qualified and talented individuals that can help organizations get
to value, get to the song they wanted to hear, is increasingly relevant.
That's interesting, yeah, because certainly my experience has been that the people like that would go and join Google or, you know, Facebook and
so on. So it's an interesting one. I mean, I think how consultancies and, you know, how integrators work in this market is interesting. So as more stuff moves to
things like machine learning, as more stuff moves to Hadoop, there's going to be some kind of, you
know, low-end work and so on, although the
cloud obviously will take that away. But I think what a consultancy is, and
how you operate in that kind of area, and how you would add
value to someone like Google or to Facebook, is kind of interesting, and
whether it becomes fewer people but more skilled or whatever, you know, I don't
know on that. But it's interesting, and it's encouraging that you say that,
really. So, Dan, one other area on that: one thing I've always noticed is that every one of the mega vendors' solutions for SQL on Hadoop, you know, works, obviously, from their product, say, to Hadoop.
So, you know, Big Data SQL was Oracle to that.
The Microsoft one is like that.
Do you think there's a market or do you think there's a need for solutions that kind of in a way link together different proprietary database engines through to Hadoop?
A more kind of like fabric style thing or is that a problem that doesn't really need to be solved?
I mean, did you think about that at all when you were at Oracle?
I think the federation piece is extremely important. I think when we think about the future of query processing, there are two ways to think about it.
You can either take the stance that all data will be consolidated in one kind of system, or I think the more rational view to take is that data will be increasingly federated across many different kinds of storage and processing systems and will occasionally need to be processed in concert. And for that reason, I think federation and federation sort of beyond the language level
is increasingly important.
It's, you know, it is something we aimed to solve at Oracle in terms of being able to
use a single SQL dialect to query beyond sort of the Oracle database and HDFS, but reach out to
SQL databases. It's a problem that is relatively well solved at Google. We have many, many internal
systems that collect data. We can effectively federate across all of these for our own purposes.
And I think it's important to understand that ultimately, when we talk about finding value from data,
what we want to be able to do is interrogate the data wherever it may exist with a single construct that best suits our workflow.
And so if SQL is the right workflow for me, excellent.
If a declarative Scala API is better for you, so be it.
But I need to get to all the data in the way that best enables me.
Exactly, exactly.
And I think certainly from my perspective,
something I've been saying for a while is that
I think that all analytic workloads in the end
will move to this kind of platform.
I think that in time, certainly the bulk of it anyway,
although there'll be this interchange, as you kind of said there,
but this ability to actually kind of put it in one place
and then apply different engines, different languages,
and so on to it is important as well. But then, I suppose, in a way going beyond the basics of
that, I noticed there have been some startups recently, I think the guy at Vertica sort
of did one, where they're looking at, say, automagic sort of discovery of
the meaning of data in there, and schema and that sort of thing, and adding
smarts to it. I mean, certainly at the moment we're plumbing it all together, but going beyond that,
if you were to look forward, I don't know, five, ten years, and you saw
the kind of analytic platform of the future, running on probably the kind of descendants
of this technology, what do you think it would be? What would you be aiming
for if you were doing this, really, the kind of next-gen analytic and integration platform?
I think ultimately metadata and sort of
catalog management will become almost a separate entity such that you may have a
service for this. We can actually look at, you know, the Hive
metastore as an early version of this, right, in which we see one catalog which
can contain information about data stored in many different places.
It may contain metadata information about data stored in HBase or data stored in HDFS
or data stored in another NoSQL database.
Even now, I think through some of the various APIs,
you can actually store data about other RDBMSs in there.
I think we'll start to begin to see that
as being a much more distinct
and separate piece of the process.
The other thing, and I think it's actually incredibly important, and it's why I'm so excited
about the Apache Beam project, is that, increasingly, as streaming workloads become more
interesting to organizations, I think we will move to a situation in which we stop talking about the difference between batch processing and stream processing and simply say data exists as a flow.
And you can either choose to process it in a fixed batch.
You can choose to process it as a window.
You can process it in any particular way. And I think that sounds a little wild and outlandish at first until you sort of
think about transaction logs or redo logs that we would see in Oracle. In so much as effectively,
if you had an infinite redo log, you would be able to slice and dice that as needed to either say,
here is a batch of data that I want to process or process the next thing that comes in or process the next five-minute window.
And I think we're going to increasingly see systems built around those concepts.
I think a lot of the work that's going on in the Kafka space is really beginning to push this way. The guys at Confluent, I think, are doing an interesting job evangelizing some of these concepts.
Again, the work that Google's doing with the Beam community is very much along the same lines. I think we're going to see that as a fundamental shift in the sort of underlying treatment of data sources.
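[Editor's sketch, not part of the conversation: the "infinite redo log" idea above can be illustrated in a few lines of Python. This is only a minimal sketch under stated assumptions, not how Beam, Kafka, or Oracle redo logs actually work. The names event_log, fixed_batch, next_record, and tumbling_windows are hypothetical, and the log is simulated as one event per minute.]

```python
from itertools import groupby, islice


def event_log():
    """Hypothetical unbounded log: yields (timestamp_seconds, sequence_number) forever."""
    seq = 0
    while True:
        yield (seq * 60, seq)  # assumption: one event arrives every 60 seconds
        seq += 1


def fixed_batch(log, size):
    """Treat the flow as a batch: take the next `size` records and stop."""
    return list(islice(log, size))


def next_record(log):
    """Treat the flow as a stream: process just the next thing that comes in."""
    return next(log)


def tumbling_windows(log, width_seconds):
    """Treat the flow as fixed-width time windows (for example, five minutes).

    Consecutive events whose timestamps fall in the same window are grouped.
    The generator is lazy, so it works on a log that never ends.
    """
    for index, events in groupby(log, key=lambda e: e[0] // width_seconds):
        yield index * width_seconds, [seq for _, seq in events]


log = event_log()
batch = fixed_batch(log, 3)            # "here is a batch of data I want to process"
record = next_record(log)              # "process the next thing that comes in"
windows = tumbling_windows(log, 300)   # "process the next five-minute window"
first_window = next(windows)
```

[All three calls read from the same underlying log; only the way the flow is sliced differs, which is the point being made about batch and stream processing.]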
That's interesting.
And I mean, I suppose one of the things that I've always been kind of saying is that, you know, whilst Hadoop and this kind of world is going to eat into analytic workloads, it sounds very much like it could almost start to eat into what we consider to be normal transaction processing now. It sounds like that's what you're
saying there. That's kind of interesting. And yeah, I mean, that sounds a much
bigger goal really, doesn't it, than just doing data warehousing better?
Yeah, I mean, I think doing data warehousing better is the beginning of it, right? Because I think there are
still many of the promises of the original data warehousing movement many years ago that organizations are still working to realize.
But I think at some point we will end up with enough data and enough desire to look at it in different ways that we begin to change the fundamental model of, well, it's not so much a table as it is a table-shaped stream.
Interesting.
Interesting.
Yeah, definitely.
Well, I'm going to obviously approach Gwen at some time and see if she wants to come on the show as well.
It'd be interesting to see, I think, certainly from her, this is Gwen Shapira, who probably both of us know, working at Confluent.
And certainly there's a lot of parallels really between some of the stuff going on there and what you're doing there, and the general kind of processing of data and so on really.
So, I mean, Dan, that's been fantastic, and thank you very much for your time on this. It's been really
interesting to catch up with you. I guess it's probably sunny over there, is it, where you are?
It's pouring with rain over here. It's the middle of summer, and yeah.
Yes, you've got English summer. I've got California summer, which is
just what you would think: 70 degrees Fahrenheit.
I know. So you're at Google. It's sunny. I mean, you should be over here. It's raining and
so on, really. But Dan, that's fantastic. And it's been really good to speak to you.
Thanks very much for your insights there. It's very interesting what you're doing. And
yeah, great to speak to you. So thank you very much. And thank everyone for listening. And
yeah, thanks a lot. Brilliant. Thank you.
All right, Mark. Thanks very much. It's been a pleasure.
Cheers. Thanks.