Drill to Detail - Drill to Detail Ep.3 'Apache Kudu and Cloudera's Analytic Platform' with Special Guest Mike Percy
Episode Date: October 4, 2016
Mark Rittman is joined by Cloudera's Mike Percy, to talk about Apache Kudu, Analytics on Hadoop, and Cloudera's work in this area...
Transcript
Hello and welcome to Drill to Detail, the podcast series about the world of big data,
business intelligence and data warehousing, and the people who are out there leading the
industry. I'm your host, Mark Rittman, and Drill to Detail goes out twice a month with
each hour-long episode featuring a special guest, either from one of the vendors whose products we
talk about all the time, or someone who's out there implementing projects for customers or
helping them understand how they work and how they all fit together. You can subscribe for free at
the iTunes Store podcast directory, and for show notes and details of past episodes, visit the
Drill to Detail website at www.drilltodetail.com,
where you'll also find links to previous episodes and the odd link to something newsworthy that we'll probably end up discussing in an upcoming show.
So in this episode, we've got Mike Percy from Cloudera.
And I'm particularly pleased to have Mike on the show because Mike's actually a software engineer that works on the Kudu project there
and as probably some of you might have heard, Kudu is a new technology, a new project sponsored by Cloudera, I think, but it's now been donated open source and so on,
that is going to be, to my mind, one of the kind of key analytic platform pieces.
Mike, do you want to introduce yourself first of all, and just tell us what you do
at Cloudera and how you got involved in this?
Sure. Thanks a lot, Mark. Thanks for having me on.
So, yeah, my name is Mike. I am a software engineer at Cloudera, as you said.
I'm also a PMC member, which is Project Management Committee member and committer on the Apache Kudu project.
Apache Kudu is a columnar open source distributed data store.
And so we can go into that a little bit more later.
However, in terms of my own background,
prior to Cloudera, I was at Yahoo for several years
working on machine learning and sort of building a machine learning distributed system using the Hadoop stack.
And I've been at Cloudera for a little over four years.
Fantastic.
So, Mike, you said you've been involved in this project called Kudu.
And I came across this, I think it was probably late last year, when it was mentioned in a few press releases and news articles from Cloudera and so on about a couple of new technologies that they'd launched around analytics. One of them was called RecordService, which is about security, but the one I was really interested in was this thing called Kudu. And, you know, looking at the news articles and the write-up at the time, it was positioned as this kind of, in a way, best of both worlds
of fast analytics and fast loading and so on.
I suppose, in a way, to help anybody who's not really heard of Kudu
or understood what it's about, what problem does Kudu solve?
And why did Cloudera and you guys get involved in this, really?
Why was it really kind of done?
Yeah, sure. So the previous options for storage in the Hadoop ecosystem were HDFS and HBase.
And HDFS is based on something called GFS, the Google File System. And HBase is based,
from a design perspective, on something called BigTable, which is a technology that was invented at Google.
Both of them were invented in the early 2000s and papers were published about them.
So HDFS is a file system and it's a special kind of file system.
It's distributed. And the way it's architected and designed, it's really made for
writing very, very large files in a batch, sort of a batch scenario. So say you would ingest a large
amount of data every hour, or maybe every 10 minutes, that would go into its own HDFS file.
And you wouldn't really want to try to access that before you were done loading it.
So there's some latency built in there, as you can see.
However, it's really efficient at scanning large amounts of data.
So it's really efficient to have a bunch of jobs read data from HDFS and sort of scan through all of it.
And so this is good for stuff like machine learning where you're building models from a lot of data.
And so the opposite sort of end of the spectrum
in terms of throughput and latency
is something called HBase.
And so what HBase does is instead of being a file system,
it looks like, well, basically a sort of NoSQL store.
It's basically a table abstraction
on top of HDFS. And it allows for mutating individual rows. You can seek to a single row,
insert, update, and delete, all that stuff. But it's actually not really designed to be
as efficient as HDFS at scanning lots of data, at sort of high throughput in general
on the read side. So there was this gap, essentially, where, well, what if you want
to be able to have random access to something that, you know, feels like a database, so something
more like HBase, but you really want very fast
scans. You really want to be able to churn through lots and lots of data in a parallel,
high throughput manner. There was really nothing there that could really fit this particular use
case, which is a huge use case. And so that's why we built Kudu, to fill that gap
and to do a very good job at those things.
Okay. So yeah, I mean, certainly my experience on using, I suppose, Cloudera, you know,
the Cloudera stack on projects was that certainly if we were going to be loading,
bulk loading data into kind of, you know into kind of a data lake or something,
putting it into HDFS was kind of fine.
But when we wanted to do these kind of incremental, especially things like updates and deletes and so on to data,
what we tended to do was to use HBase and then put a Hive table over the top of it, which worked okay.
But, first of all, it was fairly cumbersome, you know; at that point I don't think the Hive-on-HBase, kind of, you know, JARs or whatever
actually shipped at the time with CDH. But also the problem we found was that, like you say,
if you try and do aggregations on those Hive-on-HBase tables, they're quite slow and so on.
And I think also other things we found were, when we used, say, things like Parquet, which is a kind of column store, it had its own limitations as well.
And so when I saw what Kudu was trying to do, I thought that was kind of interesting, really.
And so I guess looking at things, there are already things out there like, say, Parquet, that column store.
Again, what does Kudu kind of bring to the party there, really?
I mean, how does it improve on that and do things better?
This goes back to sort of the initial conception of what Kudu would be. So, you know, really Kudu was first conceived when Parquet was first conceived,
around the same time, and really a lot of the same people were involved. So Todd Lipcon is a software engineer at Cloudera.
And Kudu is his original idea, really.
And so he essentially was helping design Parquet as well
with some folks from Twitter,
as well as Nong Li, another software engineer at Cloudera at the time.
And so essentially, as they were designing Parquet, they're like, OK, this is going to be great.
We're going to have extremely fast throughput in a schema-oriented file, and this is going to work great with HDFS.
But the next thing that people are going to say is, well, you know, what if I want to update one of those records? Because people are used to analytical databases that, you know, do this for
you, right? You can have really fast scans because it's column oriented. It's really good for
analytics. However, you know, you can also go and insert, update, and delete. Well, you can't do
that in a Parquet file. Not really. If you want to mutate one row in a Parquet file, you have to rewrite the whole file,
which is obviously very inefficient; if you have, you know, hundreds of
megabytes or even gigabytes in a Parquet file, then it's going to take forever. So really,
that was the idea behind Kudu that sort of set up the spark as, hey, you know,
people are going to want updatable Parquet.
Let's figure out how we can build an updatable Parquet.
And that was the beginning of Kudu.
So another problem I found with Parquet was, I did a project where we were streaming stuff
into Hadoop in real time.
I was loading into Parquet and I was finding that I had a problem there
where, because of the compression it uses and so on,
Parquet didn't really suit streaming data very well.
I mean, how do you deal, I suppose, with the fact that data is streaming
and you've got to perhaps compact it and so on?
How does that kind of work?
So there are multiple layers of memory and background tasks inside Kudu.
And so really the main two parts are, well, there are four parts.
So I'll start with the initial inserts: there's something called the MemRowSet, which is essentially where everything goes that's being inserted into a particular shard. Kudu is a distributed system.
And when you create a table in Kudu, so, you know, really Kudu feels like,
if you use it with Impala, for example, then it really feels like MySQL circa 2001.
So for those of us who remember back that far,
there was the MyISAM storage engine for MySQL,
and that was really the standard.
And with that, you can do all kinds of SQL stuff.
It was actually really fast,
but it didn't support transactions and triggers and stuff like that.
And actually, Kudu really feels like that.
So it really feels a lot like you're using MySQL circa 2001.
But instead of being a row-oriented database, it's a columnar database.
And instead of being on one node, it can be sharded across many nodes.
You can have a Kudu cluster that's a thousand nodes.
And so the tables that you create are distributed across all of these nodes. So when you say create table with some schema and you specify your partitioning in your create table statement,
then that eventually creates things we call tablets.
And those tablets are essentially one partition is one tablet.
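[To make the tablet idea concrete, here is a minimal sketch, assuming you go through Impala's SQL interface via the impyla Python client. The table, columns, and hostname are hypothetical, and the Kudu DDL syntax has varied across Impala releases (older ones used a DISTRIBUTE BY ... INTO n BUCKETS form), so treat this as illustrative. Each of the 16 hash partitions below becomes one tablet.]

```python
# Hypothetical sketch: creating a hash-partitioned Kudu table through Impala
# using the impyla client library. Each partition becomes one Kudu tablet.
from impala.dbapi import connect

conn = connect(host='impala-daemon.example.com', port=21050)  # assumed host
cur = conn.cursor()

cur.execute("""
    CREATE TABLE web_events (
        event_id BIGINT,
        user_id BIGINT,
        url STRING,
        PRIMARY KEY (event_id)
    )
    PARTITION BY HASH (event_id) PARTITIONS 16
    STORED AS KUDU
""")
```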
Certainly my experience has been that,
I guess when I first heard about Kudu,
I was thinking, is it another SQL engine?
Is it a kind of storage format and so on?
And I think kind of the thing that worked for me
was going through some of the,
I suppose, the initial tutorials, and seeing that effectively it's like a storage engine, isn't it, for Hive or something, though I know it's obviously Impala.
And I guess the bits with the tablets there, it struck me as that's where part of the HBase
heritage is, or certainly it reminded me of kind of how HBase worked as well. So one of the
things again that is worth maybe talking about is
what's the relationship between, say, Kudu and Impala?
So they do seem to be kind of
fairly closely linked initially.
Is that always going to be
the close link between the two?
I mean, first of all,
what's the link between Kudu and Impala?
And do you intend that to extend to other areas
as well, other kind of tools as well?
Sure.
Yeah, that's, you know, that's something that people initially sort of scratched their heads at.
You know, so like, what is, what is this thing?
And so I guess I'll, I could make two comparisons.
Continuing with my MySQL analogy,
MyISAM is a storage engine for MySQL.
I think these days people use InnoDB,
which supports transactions but is maybe slower.
So Kudu is comparable to a storage engine like that, like MyISAM.
However, the way that we've designed these APIs,
any system can use it.
So it's not specific to Impala. It's not like MyISAM, which is very specific to MySQL. Kudu actually exposes APIs
through client libraries. So there's a Java client, there's a C++ client, and there is a Python client. And these are client libraries that you could use to talk to this data store.
And that's actually what systems that integrate with Kudu use if they want to implement SQL.
Or if you just want to insert a record without going through SQL, you can write an application to do that as well. So it exposes APIs that look like insert, update, delete,
create table, but these are programmatic APIs.
And so systems today that integrate with Kudu,
SQL systems include Impala, Spark.
So Spark SQL can talk to Kudu.
Drill, Apache Drill is a system that can scan and load data into Kudu.
And so essentially what Kudu does is it provides basic APIs.
It also provides sort of advanced APIs so that these SQL engines that were sort of traditionally, you know, monolithically built into databases like MySQL or like Oracle,
are now, you know, in the Hadoop ecosystem, we're splitting them out.
We're sort of, I don't want to use the word shard, we're tiering them, we're layering them out
so that you can have at this base layer, Kudu, which is your storage engine.
And then the next layer up, you can have two things.
You can have Spark and Impala. And maybe if you're building a machine learning system,
or if you want to do like programmatic analytics, or you're doing something that requires some
tricky kinds of a combination of joins and something else, and maybe business logic that you want to execute in parallel,
then you could write a Spark job that talks to Kudu.
And then for stuff for your reporting
or for your sort of ad hoc analysis,
you could use something like Impala
that gives you a nice SQL shell
to just run SQL on Kudu.
And they both work.
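[As a rough illustration of those programmatic APIs, here is a minimal sketch using the Kudu Python client. The master hostname is an assumption, and the table matches the hypothetical web_events example above; none of this is from the episode itself.]

```python
# Hypothetical sketch: talking to Kudu directly through its Python client,
# with no SQL engine involved.
import kudu

client = kudu.connect(host='kudu-master.example.com', port=7051)  # assumed
table = client.table('web_events')
session = client.new_session()

# Programmatic equivalents of SQL INSERT and UPDATE on a single row.
session.apply(table.new_insert({'event_id': 1, 'user_id': 42, 'url': '/home'}))
session.apply(table.new_update({'event_id': 1, 'url': '/checkout'}))
session.flush()  # the buffered operations are sent to the tablet servers here
```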
Okay, okay. I mean, it's certainly for me that the biggest kind of revelation was seeing that I could
do an insert statement in Impala now.
So, you know, when you come from...
And update.
Yeah, I know.
When you come from the kind of the data warehouse world that I do, you know, you're so used
to being able to do single row inserts and you can't do those in Hive and so on.
And the other thing, I was talking to you earlier before we started the recording, the other thing I've been using
is StreamSets as well, which is the kind of ETL tool type thing that you can install
now as a kind of service into Cloudera, you know, Hadoop. And I was really pleased to see straight away that it had, you know, Kafka as an input
and it can load into this as well. So one question for you, I suppose, really, maybe a devil's
advocate question: why did Cloudera choose to do this rather than, say, just work more on
HDFS? Why go and create another storage technology
rather than just maybe extend HDFS? What was the thinking behind that?
Well, it's sort of an impedance mismatch, I think. You know, HDFS's original goals
were to be really good at batch, and to really efficiently and economically store bytes on commodity hardware, in a way that wasn't really designed to be modified.
In fact, as I'm sure you're aware, HDFS is an append-only file system.
You can't even change bytes in the middle of a file. While HBase is able to build on top of that and essentially get a store that is mutable,
it was really a design decision to not modify HBase because we felt that HBase is actually really good at what it does. It's really fast for random inserts and updates and deletes and also random seeks.
It's also extremely efficient at loading.
And it's got a lot of things going for it for the use cases that are really ideal for it. So we didn't want to basically come in and say,
well, in order to implement efficient encodings
on a column store,
we're going to now impose schemas on top of HBase.
And in order to get really efficient columnar scans,
we're going to break everything out
so that every column is a column family.
There are a lot of things
that we would have had to break in HBase.
And so Kudu is really an attempt at saying, you know what, let's just go back to first principles.
Let's figure out the problem that we're trying to solve.
And then rather than try to force some other system to sort of conform to what our goals are, let's just go straight for these goals.
And I think that that was the right choice.
Do you see Impala being something that will be commonly used to load data into Kudu? Or do you see it mainly being programmatic,
and the insert statements and things and so on in Impala are more of a kind of side issue? I mean,
I think I noticed, as I was trying to get the JDBC drivers set up today,
I don't think the JDBC drivers for Impala yet support inserts, do they?
They're kind of read-only.
What's the vision in terms of loading?
Is it generally going to be programmatic loading? Or can you imagine an ETL tool, for example, a non-Hadoop one,
using insert statements through Impala?
Is that the vision for it really, or is it more programmatic?
So I think in terms of Kudu integration with Impala,
there's still work to be done. And so Kudu's inserts work really well. However, you know, there's quite a bit of roadmap
still, like maybe a few months of roadmap, for Impala to get better integrated with Kudu.
So that's how I would put it. There are definitely holes,
because Impala's support of insert and update and delete is actually
very new for Impala.
So I think in a few months, that'll
get better. So it's not intended to
just be a user tool. It's intended to be
programmatic as well.
That said, for the inserts and sort of loading directly into Kudu,
if you really just want to get the fastest possible inserts,
then probably going through the NoSQL APIs, like through Spark or something,
today is probably the fastest option.
Although if you have your data on HDFS already,
then something that you can do with Impala on the shell
is you can execute something like create table as select star.
I forget the exact syntax, but it's like you can create a table
from a select star from some other table.
So if you have some data like CSV files on HDFS,
then in a one-liner, Impala will set up a job
that will very efficiently load data into Kudu
from that other data source.
So you can easily get started and try loading data into Kudu,
and then try it out on your actual data.
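[A hedged sketch of the kind of one-liner being described here; as the speaker says, the exact syntax escapes him, and it has changed across versions. In recent Impala releases it looks roughly like the CREATE TABLE ... AS SELECT below, again via impyla, with hypothetical table names.]

```python
# Hypothetical sketch: bulk-load data already visible to Impala as an
# HDFS-backed table (e.g. over CSV files) into a new Kudu table in one CTAS.
from impala.dbapi import connect

cur = connect(host='impala-daemon.example.com', port=21050).cursor()
cur.execute("""
    CREATE TABLE web_events_kudu
    PRIMARY KEY (event_id)
    PARTITION BY HASH (event_id) PARTITIONS 16
    STORED AS KUDU
    AS SELECT event_id, user_id, url FROM csv_events_on_hdfs
""")
```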
Yeah, you mentioned the Impala side, and clearly the Impala kind of team and project is separate to Kudu, and they work at a different kind of cadence and so on.
But what's the long term?
I mean, I guess, taking an even further step back,
why did Cloudera do this, and why did you do this?
And what's the longterm vision for this really? I think the long-term vision from my perspective and from the project's perspective
is really to be the best possible analytics system out there.
So Kudu's goals are sort of two-pronged, really. We certainly want Impala to continue sort of evolving its support for Kudu, and it will.
And that's definitely on their roadmap. And we want Kudu to be the first thing that you would think of when you want to do analytics
in a big data distributed environment.
And so what that means for us
is integrating with all of the systems.
So the beauty of the Hadoop ecosystem
is that kind of everything works with everything.
And so you can use Hive, you can use Impala,
you can use Spark; they can all talk to HDFS. And so different workloads that have different requirements can use these different
systems, but they all get to use the same data. It's like the whole data lake or the data hub
idea. It's based on lots of tools that can talk to the same data store. So that is what we are, that is what we're going for.
And so that's why we're integrating with all these different systems.
I think one, you know, missing piece is Hive.
And I know we want to get that done.
I'm not sure when Hive is going to integrate with Kudu,
but I'm sure it'll happen, at least in the medium term.
As far as Kudu's long-term vision, we are going to implement a bunch of features that are coming up. So 1.0 is out right now.
And so Kudu is now prime time, ready for running in production. And actually, there are already production users of Kudu.
Large deployments, like 200-node clusters of Kudu,
running in production,
and people are happy with it.
Next up, it looks like security is really,
you know, something that a lot of people want.
And so right now, Kudu doesn't do authentication and authorization.
What that means is that, like Hadoop a couple of years ago,
you should really firewall it off.
But in the next few months, we'll start adding security features
and soon you'll be able to have your users authenticated and whatnot
through Sentry, for example. And then ultimately, we're also going to add multi-row transactions
to Kudu and we'll add that support, integrate it with all the query engines like Impala and Spark.
And also we plan to support multi-data center
replication and operation.
Is the vision for Kudu squarely within analytics, or do you see it
potentially supporting OLTP-type workloads?
You know, it's
easy to imagine that we could add row-oriented storage into the Kudu backend, and it's not really that hard.
But it's actually a pretty tall order to go and implement an OLTP database, so that's really not what we're focusing on right now.
I mean, in the future, I suppose,
once we feel like we've nailed analytics,
then I think it might make sense to take a look
and see where the low-hanging fruit is
for a sort of row-oriented access
and if we could support some basic OLTP workloads.
But as I'm sure you're aware,
there's a lot of stuff that goes into these databases,
like query planning and all kinds of optimizations
that might make sense one way for analytics
and make sense another way for OLTP workloads.
And so it wouldn't only depend on Kudu.
It would also depend on these query engines
that have their optimizers
to also support efficient OLTP access.
And I don't know if anybody's working on that right now.
So if no one's working on it,
then I think we're a ways away from it.
I guess there's two things.
It's can you do it and should you do it, really, isn't there?
I mean, why would you do transactions
in a column store database?
I mean, it's kind of crazy.
So I guess another question really is
where does this relate to Spark?
So the way I think about integration with Kudu and Spark is that, well, number one, it works.
But if you haven't used Spark a lot, then I would say for people who want to get an idea of where it would make sense, Spark provides a programmatic API to do all this data processing stuff.
And so for people who have written MapReduce jobs,
it's like a lot of work to really,
you have to create this mapper and this reducer.
And so Spark basically comes in
and really simplifies that MapReduce API
and really expands it.
And so it makes it really much more expressive
and lets you do grouping and really have a lot of flexibility
in sort of the order in which things occur
and how the parallelism works out.
And plus it adds some nice performance features related to caching.
So I think that the way I would tend to architect these kinds of systems is you have Kudu and
potentially HDFS and also potentially HBase as data sources, depending on where you want
your data to live and sort of what legacy jobs you have. And then for running your reports, I think mostly people want to use SQL to, you know,
sort of generate reports.
Plus, you know, tools like Impala work really well for integration with BI tools like Tableau,
for example. So where Spark comes in is where you want to do custom programming at scale
against some data store.
And so that's where it really shines.
And so, like, I keep coming back to machine learning,
but I think other great examples are, like, fraud detection
or, like, sessionization in like websites and figuring out click-through
rates on ads. So all these things, you can do some of it with SQL. And if you're a SQL wizard,
maybe you can do most of it with SQL. But some of it's sort of tricky enough that it might be better or more easily expressed,
more simply expressed, as a program.
And so that's where picking up Spark is really nice.
And I think Spark SQL as well, essentially just feels like you can add your SQL statement
inside your Spark program, and maybe part of what you want to do is really better expressed
in this sort of Spark language or Spark syntax,
but some of what you want to express might be easier to do as a SQL statement.
And so Spark makes it easy to sort of mix those things together
and use them both in the same program.
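[Here is a minimal sketch of that mixing of programmatic and SQL styles in one Spark job, assuming the kudu-spark connector jar is on the Spark classpath; the master address, table, and columns carry over from the hypothetical examples above.]

```python
# Hypothetical sketch: combine the DataFrame API and Spark SQL against Kudu.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kudu-sketch").getOrCreate()

events = (spark.read
          .format("org.apache.kudu.spark.kudu")   # kudu-spark connector
          .option("kudu.master", "kudu-master.example.com:7051")
          .option("kudu.table", "web_events")
          .load())

# The programmatic part: logic that is awkward to say in pure SQL...
ad_clicks = events.filter(events.url.startswith("/ads/"))

# ...and the part that is easier to express as a SQL statement.
ad_clicks.createOrReplaceTempView("ad_clicks")
spark.sql("""
    SELECT user_id, COUNT(*) AS clicks
    FROM ad_clicks
    GROUP BY user_id
    ORDER BY clicks DESC
    LIMIT 10
""").show()
```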
So obviously Cloudera are quite big backers of Kudu. Is it being picked up by other vendors? I mean, I noticed that it's been donated to Apache as well. I mean, I suppose the question really is, how much of Kudu do you think is going to get picked up and used universally, like, say, Spark as an example, or is it going to be...
The vendors benefit from having differentiation. And so they don't want to all be like, you know,
different colors of the same flavor. They want to have their own sort of things that they focus on.
So, you know, Cloudera, I know, really hopes that other vendors will pick this up. And,
you know, and that's why we from the beginning planned on making this an open source project,
because, you know, and we did, and then
we donated it to the Apache Software Foundation. And we've since graduated from
the incubator, so Apache has signed off on our development practices. We have open development
practices. We accept patches from all over the place. And we really welcome sort of other vendors, Hadoop vendors and non-Hadoop vendors
and really individuals to contribute.
So, you know, I think that's the best answer I can give you is we really hope they will.
I think that there's some resistance right now.
But personally, I feel that ultimately they won't be able to say no, because there's really no other good alternative,
in my opinion, to Kudu unless you go proprietary.
Yeah, exactly.
I mean, certainly if you look at Hortonworks,
look at MapR and so on,
everyone's got their own kind of favorite SQL engine
and so on there.
And I've seen that obviously Impala is ostensibly open source, and I'm sure it is, but I suppose it's
not being picked up by other vendors because they want to have some differentiation there.
But other technologies that have come out of different vendors have been adopted,
because, as you say, if you take, say, YARN from Hortonworks, or take what you're
doing here with Kudu, you know, it fills a gap that is not filled elsewhere, really. And so therefore, you know, that's kind of interesting. So I suppose,
you said earlier on that by the time this goes out it'll actually be
version one. So does that mean it's a supported product from
Cloudera at that point, really?
So Cloudera lags the upstream, or what we call upstream, the Apache releases, a little bit,
usually by like a month or a few months. And so I think that's what's going to happen here as
well. Cloudera will essentially semi-support 1.0, but the officially
supported version is going to be just a few more months.
So how would somebody get started on Kudu?
I mean, when I first looked at it, there was a developer VM you could download and so on.
I mean, if somebody was listening to this and thinking, how do I get started with it?
Where would they start? What would they do and where would they go to?
So the best place to go to get started is the Kudu website.
It's easy to remember.
It's Apache Kudu.
And so that makes it the website kudu.apache.org.
And so that's, I think, the first place I would go to.
Just click on community.
And we have a Slack channel, which is just like a chat room.
And it's a public sort of auto invite thing.
So you just click on the link, it'll invite you, and then you can join the Slack channel.
And you can ask questions.
We also have a user list, a user mailing list.
And there's documentation on the website, sort of getting started with Kudu, how to install it. There's also a VM,
if you want to just spin up a virtual machine and sort of try it out already installed. That's
another way to kind of just give it a whirl.
So just as a last kind of thing, really, I was interested, I mean, looking back at some of the stuff you've done in terms of presentations and YouTube and so on, I can see you've been involved in analytics as a topic
for a while, really. And I'm just interested to get, from your perspective, where do you see
analytics on Hadoop going in the next few years? Where would you
see the technology and the opportunities going, really, for, say, analytics, and I guess
data science, really, going forward?
Yeah, I think that's a really good question.
I think it's really broadening in terms of the use cases
that can be executed on Hadoop.
So today people are already doing a lot of like large scale BI
and sort of analytics and data science.
These terms are starting to get really muddled
together because I think people are starting to essentially have multiple teams using sort of
the same data sets and there are more and more people that are really getting into the advanced
analytics side and data science. And so people are using Hadoop for all kinds of stuff. It's
really incredible. For example, doing fraud detection at credit card companies,
I think I mentioned that one, doing it in banks, doing cheat detection. Like, you know,
lots of companies have these big massively multiplayer online games and sort of these arena-based games.
And those guys are doing all kinds of number crunching using Hadoop, because they have so
much data coming in that, you know, they have to do it in parallel. And so they're using things like Spark and Impala to do it.
And so more and more people are essentially creating new use cases that can run on Hadoop.
And I think what we're trying to do with Kudu and at Cloudera is really simplify the whole process.
So there's something called the Lambda architecture that people were talking about a couple years ago and has sort of started to become mainstream now.
Where essentially you do this combination of sort of some streaming analytics and then periodically you do a big batch job that then sort of like
corrects any errors in your streaming analytics and sort of you do this union merge at the end.
And this kind of an approach is, while it works, it's like super complicated and hard to maintain
and, you know,
really hard to get your head around in the first place. And so the more that we can say, you know
what, you only have to use one data store that has really efficient streaming inserts and really
efficient scans, you know, then all of a sudden you have to worry less about this
Lambda architecture. It doesn't necessarily always fully go away
if you really need up to the second
sort of running counters or something like you might do.
But the faster and the more flexible
we can make the backend storage,
the better it is for the user, right?
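[A hedged sketch of that point, reusing the hypothetical setup from earlier: rows written by a streaming process become visible to scans as soon as they are flushed, so there is no separate batch layer to reconcile.]

```python
# Hypothetical sketch: a streaming writer and an immediate read against the
# same Kudu table, the pattern that replaces much of a Lambda architecture.
import kudu

client = kudu.connect(host='kudu-master.example.com', port=7051)  # assumed
table = client.table('web_events')
session = client.new_session()

# "Streaming" writer: apply events as they arrive.
for event_id, user_id, url in [(2, 7, '/home'), (3, 7, '/ads/banner')]:
    session.apply(table.new_insert(
        {'event_id': event_id, 'user_id': user_id, 'url': url}))
session.flush()

# A reader can scan the same table right away; no union or merge step needed.
scanner = table.scanner()
scanner.open()
print(len(scanner.read_all_tuples()), 'rows visible immediately')
```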
Yeah, yeah, definitely.
I mean, certainly from my perspective,
I think it was Michael Stonebraker a while ago
kind of said that in his view,
I think all analytic workloads will move to Hadoop.
And I think that's completely true.
I think that anybody who's doing anything
that is kind of doing classic data warehouse
or analytic work, or in this case, in the new world, I suppose,
of kind of streaming jobs and fraud detection and so on.
You would do that on Hadoop,
or really whatever the successor is
of the various parts of Hadoop in time.
And for me, that ability to land everything,
the fact that you can land everything
in one place economically now,
the fact that you could do it in the past in theory,
but you could never afford to do it,
and you couldn't really pay the cost of all the kind of
schema-on-write stuff where you had to work it all out in advance and so on there. I think
the fact you can land it all now in a fairly sort of vague form and apply different engines to it
and so on, that has kind of won the argument, really. And certainly things like Kudu are
filling in those gaps, really, things that we've had in the past from the data warehouse world, you know, inserts and deletes and so on,
and the ability to land stuff in streaming form and query at the same time, you know, it's filling in those things there.
But certainly, and that's working kind of well.
I think an area that is ripe for innovation is things like semantic discovery
and I suppose in a way
data governance as well.
And certainly from my side,
Hadoop has had,
I say Hadoop in a general sense here,
has had a bit of an easy ride so far
in terms of data governance
and all that kind of stuff.
I mean, I know it's not your area really,
but do you see that as well?
Do you see that probably
the next kind of big sort of thing
is getting, I suppose, data quality and data governance and so on?
I think so, but maybe for different reasons than it used to be.
So I think one of the reasons that data governance is still important has a lot to do with, like, regulations around the financial industry, or maybe like HIPAA, you know, privacy in the United States; we have regulations around healthcare records and stuff.
And so I think data governance in terms of,
like, did we retain this record
or can we make sure that privacy was maintained on this record
is something that will never go away.
Yeah, it's interesting, isn't it?
I mean, certainly clients I've had in the past, I've had the issue where, you know, they buy in entirely
to the whole kind of data lake idea and landing stuff in there, and there's customer data
in there as well, and then they kind of think about the fact, well, what did we get
permission from the customer to do? And so you've got that whole thing of, and that's not
Hadoop's fault, you know, it's now, oh, now suddenly we can use this data at a very kind of granular level. You know, we're
doing things like micro-segmentation, so we're now going to give you an offer based actually
on your transactions, not on which segment you fit into. And there's practical
issues there around, well, how do you go and redact information out of there as well.
But certainly I think an area where the technology we're using now could help is if you think about, say, semantic
discovery. You've got all this data landing into kind of a data lake; at some point you're going to
have to apply schemas to it, really. And it's interesting to see tools like Drill, for example,
that can read the schema in there, in Parquet and so on. But certainly, again, there are
products I'm seeing coming out there, Tamr, for example, and so on,
that address how you make sense of the data in there, really?
And how do you do the kind of provenance and so on there?
I mean, these are, in a way, they're grown-up problems.
They're problems of success, in a way.
And we need to be careful we don't get bound
into the same kind of inertia we had in the data warehouse world.
But certainly, I think the platform is being built out now,
and that's kind of won the argument.
But then going forward, it's like, well,
now this is becoming a mainstream technology.
How do we then deal with the fact that there are rules out there
and there are kind of needs to audit stuff as well and so on,
whilst not forgetting this is not a transactional system.
This is a kind of information system.
Well, there are tools that are being built.
So there are a couple of things that Cloudera is doing
to try to resolve this problem.
One of them is called Cloudera Navigator.
And essentially what this thing does is it looks at all
of your log files and integrates with different
APIs throughout the different systems in Hadoop.
Part of what makes this problem harder than maybe traditional systems is that you're talking like maybe 10 or more different components that are sort of accessing the same data.
And so this system gives you sort of like audit logs and data provenance information and stuff like that.
So it helps with the data governance.
And really, you know, that's their mission is to improve data governance over time.
And so I think that's one aspect that Cloudera is trying to address.
The other thing is something called RecordService.
And so that's sort of a newer project that's still, in many ways, getting off its feet,
or getting on its feet.
Anyway, I'm not sure I got the idiom right. And so essentially, you know, what it tries to do is provide a single access layer, from a network API perspective, to all the data.
And if you can really enforce a single access layer across all the disparate systems in your data lake, then it's easier to sort of do this auditing and do this enforcement.
So there are a couple of different approaches.
Yeah, I mean, it was certainly when I saw the two things launched at the same time,
Kudu and RecordService, I thought, that's very interesting.
I think, you know, you can see Cloudera there trying to be, I suppose,
the Hadoop vendor for analytics, really.
And that was kind of interesting.
Just, I'm conscious of time for you as well.
But the other thing is cloud as well. I mean, one of the things that I often
think is that, in a way, Hadoop is this kind of store that you can store data
into; it's effectively, you know, unlimited storage, and it's effectively free, you know, in terms
of how little it costs, which in a way is kind of describing cloud. And
taking kind of cloud and Hadoop going
forward and so on, you know, where do you see cloud coming into this, really?
I mean, you can obviously host Hadoop in cloud, but do you think cloud is going to change things in
other ways as well, really? Or what's your thoughts on that?
I think cloud changes things in a couple of ways. Number one, it takes the sort of traditional Hadoop
argument that, you know, let's get away from specialized hardware, let's get away from
like vertically scaled databases, let's go to horizontally scaled databases as a sort of paradigm
shift. And so, you know, it makes it really much more economical,
you know, to buy a bunch of cheap drives and let them fail, and buy a bunch of, you know,
1U or 4U units, and just let them fail as needed, instead of buying, you know, like a refrigerator appliance or something. And so what the cloud does is it makes it even more economical than before,
right? So if you want to run a big number
crunching job, you know, you just need to go to Amazon and say, hey, can I rent
essentially a data center full of machines for, you know, eight hours or something? And you can. And so,
likewise, even if you want to run them long term, you can periodically pause them and bring them
back up. So I think what this is really doing is, in a lot of ways, you know, there's still
the traditional database vendors, you know, that are starting to get into big data, starting
to work on sharding their systems more.
But in many ways, they still want special networks, special hardware, their own hardware,
maybe even FPGA chips and stuff.
And so what this means is that they'll never run in the cloud.
I think the other thing that the cloud brings into the equation, as you mentioned, is the economics make it easier for you to feel the pain if maybe you're storing more data than you really need to be storing.
And then you start to, you know, need to take a harder look at how long the data is lasting.
And, you know, are we really using this data? And you sort of do more upfront, careful planning than you have to do with Hadoop.
Yeah, definitely.
I mean, I think for me, it strikes me that in a way,
if you take a level of abstraction from above Hadoop and above cloud,
they're just kind of elastic storage, aren't they, really,
that is effectively unlimited and can compute things on there as well.
And certainly, to my view, you're going to get a blurring between on-premise and cloud and so on here.
And, yeah, I think the other thing I see often in kind of companies I've worked with is where Hadoop clusters are spun up just to do a certain job.
So there might be a kind of a bit of kind of predictive modeling
or some kind of like models being built or something.
And you'll see, you know, I've seen, I think,
Cloudera Director being used for this,
where a Hadoop cluster is spun up just to do this job, and it's then taken down again at the end and
so on. So I think certainly for development it's kind of good, but also for these kind of one-off jobs
that might require a lot of nodes, where those nodes aren't kind of persistent, really, then that's kind
of useful. So I noticed, I think, a while ago that Impala was running on Amazon
EMR, so maybe at some point in the future,
we might see kind of Kudu on there as well.
So that would be kind of interesting.
So I think we've run out of time now.
So Mike, just mention again,
what are the URLs again to get hold of Kudu?
Where would people go to get hold of this?
Sure, just go to kudu.apache.org;
that's the open source version of Kudu.
And then if you want a repackaged version that has RPMs and stuff like that,
then you can go to Cloudera.com.
And under downloads, you can find Kudu there.
And so sort of those two places.
Or, you know, Cloudera Manager is also a good way to install Kudu. And you can
find out how to install Kudu using Cloudera Manager from either cloudera.com, or I think
it's also mentioned in the Kudu documentation. So, Mike, thanks very much, and appreciate
you coming on the show.
Thanks a lot, Mark. I appreciate the opportunity.