Drill to Detail - Drill to Detail Ep.3 'Apache Kudu and Cloudera's Analytic Platform' with Special Guest Mike Percy
Episode Date: October 4, 2016
Mark Rittman is joined by Cloudera's Mike Percy, to talk about Apache Kudu, Analytics on Hadoop, and Cloudera's work in this area...
Transcript
Hello and welcome to Drill to Detail, the podcast series about the world of big data,
business intelligence and data warehousing, and the people who are out there leading the
industry. I'm your host, Mark Rittman, and Drill to Detail goes out twice a month with
each hour-long episode featuring a special guest, either from one of the vendors whose products we
talk about all the time, or someone who's out there implementing projects for customers or
helping them understand how they work and how they all fit together. You can subscribe for free at
the iTunes Store podcast directory, and for show notes and details of past episodes, visit the
Drill to Detail website at www.drilltodetail.com,
where you'll also find links to previous episodes and the odd link to something newsworthy that we'll probably end up discussing in an upcoming show.
So in this episode, we've got Mike Percy from Cloudera.
And I'm particularly pleased to have Mike on the show because Mike's actually a software engineer that works on the Kudu project there
and as probably some of you might have heard, Kudu is a new technology, a new project sponsored by Cloudera, I think, but it's now been donated open source and so on,
that is going to be, to my mind, one of the kind of key analytic platform pieces.
Mike, do you want to introduce yourself first of all, and just tell us what you do
at Cloudera and how you got involved in this?
Sure. Thanks a lot, Mark. Thanks for having me on.
So, yeah, my name is Mike. I am a software engineer at Cloudera, as you said.
I'm also a PMC member, which is Project Management Committee member and committer on the Apache Kudu project.
Apache Kudu is a columnar open source distributed data store.
And so we can go into that a little bit more later.
However, in terms of my own background,
prior to Cloudera, I was at Yahoo for several years
working on machine learning and sort of building a machine learning distributed system using the Hadoop stack.
And I've been at Cloudera for a little over four years.
Fantastic.
So, Mike, you said you've been involved in this project called Kudu.
And I came across this, I think it was probably late last year, when it was mentioned in a few press releases and news articles from Cloudera and so on about a couple of new technologies that they'd launched around analytics. One of them was called RecordService, which is about security, but the one I was really interested in was this thing called Kudu. And, you know, looking at the news articles and the write-up at the time, it was positioned as this kind of, in a way, best of both worlds
of fast analytics and fast loading and so on.
I suppose, in a way, to help anybody who's not really heard of Kudu
or understood what it's about, what problem does Kudu solve?
And why did Cloudera and you guys get involved in this, really?
Why was it really kind of done?
Yeah, sure. So the previous options for storage in the Hadoop ecosystem were HDFS and HBase.
And HDFS is based on something called GFS, the Google File System. And HBase is based,
from a design perspective, on something called BigTable, which is a technology that was invented at Google.
Both of them were invented in the early 2000s and papers were published about them.
So HDFS is a file system and it's a special kind of file system.
It's distributed. And the way it's architected and designed, it's really made for
writing very, very large files in a batch, sort of a batch scenario. So say you would ingest a large
amount of data every hour, or maybe every 10 minutes, that would go into its own HDFS file.
And you wouldn't really want to try to access that before you were done loading it.
So there's some latency built in there, as you can see.
However, it's really efficient at scanning large amounts of data.
So it's really efficient to have a bunch of jobs read data from HDFS and sort of scan through all of it.
And so this is good for stuff like machine learning where you're building models from a lot of data.
And so the opposite sort of end of the spectrum
in terms of throughput and latency
is something called HBase.
And so what HBase does is instead of being a file system,
it looks like, well, basically a sort of NoSQL store.
It's basically a table abstraction
on top of HDFS. And it allows for mutating individual rows. You can seek to a single row,
insert, update, and delete, all that stuff. But it's actually not really designed to be
as efficient as HDFS at scanning lots of data, at sort of high throughput in general
on the read side. So there was this gap, essentially, where, well, what if you want
to be able to have random access to something that, you know, feels like a database, so something
more like HBase, but you really want very fast
scans. You really want to be able to churn through lots and lots of data in a parallel,
high throughput manner. There was really nothing there that could really fit this particular use
case, which is a huge use case. And so that's why we built Kudu, to fill that gap
and to do a very good job at those things.
Okay. So yeah, I mean, certainly my experience on using, I suppose, Cloudera, you know,
the Cloudera stack on projects was that certainly if we were going to be loading,
bulk loading data into kind of, you know into kind of a data lake or something,
putting it into HDFS was kind of fine.
But when we wanted to do these kind of incremental, especially things like updates and deletes and so on to data,
what we tended to do was to use HBase and then put a Hive table over the top of it, which worked okay.
But, first of all, it was fairly cumbersome, you know; at that point I don't think the Hive-on-HBase, kind of, you know, JARs or whatever
actually shipped at the time with CDH. But also the problem we found was that, like you say,
if you try and do aggregations on those Hive-on-HBase tables, they're quite slow and so on.
And I think also other things we found were, when we used, say, things like Parquet, which is a kind of column store, it had its own limitations as well.
And so when I saw what Kudu was trying to do, I thought that was kind of interesting, really.
And so I guess looking at things, there are already things out there like, say, Parquet, that column store.
Again, what does Kudu kind of bring to the party there, really?
I mean, how does it improve on that and do things better?
This goes back to sort of the initial conception of what Kudu would be. So, you know, really Kudu was first conceived when Parquet was first conceived,
around the same time, and really a lot of the same people were involved. So Todd Lipcon is a software engineer at Cloudera.
And Kudu is his original idea, really.
And so he essentially was helping design Parquet as well
with some folks from Twitter,
as well as Nong Li, another software engineer at Cloudera at the time.
And so essentially, as they were designing Parquet, they're like, OK, this is going to be great.
We're going to have extremely fast throughput in a schema-oriented file, and this is going to work great with HDFS.
But the next thing that people are going to say is, well, you know, what if I want to update one of those records? Because people are used to analytical databases that, you know, do this for
you, right? You can have really fast scans because it's column oriented. It's really good for
analytics. However, you know, you can also go and insert, update, and delete. Well, you can't do
that in a Parquet file. Not really. If you want to mutate one row in a Parquet file, you have to rewrite the whole file,
which is obviously very inefficient; if you have, you know, hundreds of
megabytes or even gigabytes in a Parquet file, then it's going to take forever. So really,
that was the idea behind Kudu that sort of set up the spark as, hey, you know,
people are going to want updatable Parquet.
Let's figure out how we can build an updatable Parquet.
And that was the beginning of Kudu.
So another problem I found with Parquet was, I did a project where we were streaming stuff
into Hadoop in real time.
I was loading into Parquet and I was finding that I had a problem there
where, because of the compression it uses and so on,
Parquet didn't really suit streaming data very well.
I mean, how do you deal, I suppose, with the fact that data is streaming
and you've got to perhaps compact it and so on?
How does that kind of work?
So there are multiple layers of memory and background tasks inside Kudu.
And so really the main two parts are, well, there are four parts.
So I'll start with the initial inserts: there's something called the MemRowSet, which is essentially where everything goes that's being inserted into a particular shard. Kudu is a distributed system.
And when you create a table in Kudu, so, you know, really Kudu feels like,
if you use it with Impala, for example, then it really feels like MySQL circa 2001.
So for those of us who remember back that far,
there was the MyISAM storage engine for MySQL,
and that was really the standard.
And with that, you can do all kinds of SQL stuff.
It was actually really fast,
but it didn't support transactions and triggers and stuff like that.
And actually, Kudu really feels like that.
So it really feels a lot like you're using MySQL circa 2001.
But instead of being a row-oriented database, it's a columnar database.
And instead of being on one node, it can be sharded across many nodes.
You can have a Kudu cluster that's a thousand nodes.
And so the tables that you create are distributed across all of these nodes. So when you say create table with some schema and you specify your partitioning in your create table statement,
then that eventually creates things we call tablets.
And those tablets are essentially one partition is one tablet.
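[To make the tablet idea concrete, here is a minimal sketch, assuming you go through Impala's SQL interface via the impyla Python client. The table, columns, and hostname are hypothetical, and the Kudu DDL syntax has varied across Impala releases (older ones used a DISTRIBUTE BY ... INTO n BUCKETS form), so treat this as illustrative. Each of the 16 hash partitions below becomes one tablet.]

```python
# Hypothetical sketch: creating a hash-partitioned Kudu table through Impala
# using the impyla client library. Each partition becomes one Kudu tablet.
from impala.dbapi import connect

conn = connect(host='impala-daemon.example.com', port=21050)  # assumed host
cur = conn.cursor()

cur.execute("""
    CREATE TABLE web_events (
        event_id BIGINT,
        user_id BIGINT,
        url STRING,
        PRIMARY KEY (event_id)
    )
    PARTITION BY HASH (event_id) PARTITIONS 16
    STORED AS KUDU
""")
```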
Certainly my experience has been that,
I guess when I first heard about Kudu,
I was thinking, is it another SQL engine?
Is it a kind of storage format and so on?
And I think kind of the thing that worked for me
was going through some of the,
I suppose, the initial tutorials, and seeing that effectively it's like a storage engine, isn't it, for Hive or something, though I know it's obviously Impala.
And I guess the bits with the tablets there, it struck me as that's where part of the HBase
heritage is, or certainly it reminded me of kind of how HBase worked as well. So one of the
things again that is worth maybe talking about is
what's the relationship between, say, Kudu and Impala?
So they do seem to be kind of
fairly closely linked initially.
Is that always going to be
the close link between the two?
I mean, first of all,
what's the link between Kudu and Impala?
And do you intend that to extend to other areas
as well, other kind of tools as well?
Sure.
Yeah, that's, you know, that's something that people initially sort of scratched their heads at.
You know, so like, what is, what is this thing?
And so I guess I'll, I could make two comparisons.
Continuing with my MySQL analogy,
MyISAM is a storage engine for MySQL.
I think these days people use InnoDB,
which supports transactions but is maybe slower.
So Kudu is comparable to a storage engine like that, like MyISAM.
However, the way that we've designed these APIs,
any system can use it.
So it's not specific to Impala. It's not like MyISAM, which is very specific to MySQL. Kudu actually exposes APIs
through client libraries. So there's a Java client, there's a C++ client, and there is a Python client. And these are client libraries that you could use to talk to this data store.
And that's actually what systems that integrate with Kudu use if they want to implement SQL.
Or if you just want to insert a record without going through SQL, you can write an application to do that as well. So it exposes APIs that look like insert, update, delete,
create table, but these are programmatic APIs.
And so systems today that integrate with Kudu,
SQL systems include Impala, Spark.
So Spark SQL can talk to Kudu.
Drill, Apache Drill is a system that can scan and load data into Kudu.
And so essentially what Kudu does is it provides basic APIs.
It also provides sort of advanced APIs so that these SQL engines that were sort of traditionally, you know, monolithically built into databases like MySQL or like Oracle,
are now, you know, in the Hadoop ecosystem, we're splitting them out.
We're sort of, I don't want to use the word shard, we're tiering them, we're layering them out
so that you can have at this base layer, Kudu, which is your storage engine.
And then the next layer up, you can have two things.
You can have Spark and Impala. And maybe if you're building a machine learning system,
or if you want to do like programmatic analytics, or you're doing something that requires some
tricky kinds of a combination of joins and something else, and maybe business logic that you want to execute in parallel,
then you could write a Spark job that talks to Kudu.
And then for stuff for your reporting
or for your sort of ad hoc analysis,
you could use something like Impala
that gives you a nice SQL shell
to just run SQL on Kudu.
And they both work.
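[As a rough illustration of those programmatic APIs, here is a minimal sketch using the Kudu Python client. The master hostname is an assumption, and the table matches the hypothetical web_events example above; none of this is from the episode itself.]

```python
# Hypothetical sketch: talking to Kudu directly through its Python client,
# with no SQL engine involved.
import kudu

client = kudu.connect(host='kudu-master.example.com', port=7051)  # assumed
table = client.table('web_events')
session = client.new_session()

# Programmatic equivalents of SQL INSERT and UPDATE on a single row.
session.apply(table.new_insert({'event_id': 1, 'user_id': 42, 'url': '/home'}))
session.apply(table.new_update({'event_id': 1, 'url': '/checkout'}))
session.flush()  # the buffered operations are sent to the tablet servers here
```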
Okay, okay. I mean, it's certainly for me that the biggest kind of revelation was seeing that I could
do an insert statement in Impala now.
So, you know, when you come from...
And update.
Yeah, I know.
When you come from the kind of the data warehouse world that I do, you know, you're so used
to being able to do single row inserts and you can't do those in Hive and so on.
And the other thing, I was talking to you earlier before we started the recording, the other thing I've been using
is StreamSets as well, which is the kind of ETL tool type thing that you can install
now as a kind of service into Cloudera, you know, Hadoop. And I was really pleased to see straight away that it had, you know, Kafka as an input
and it can load into this as well. So one question for you, I suppose, really, maybe a devil's
advocate question: why did Cloudera choose to do this rather than, say, just work more on
HDFS? Why go and create another storage technology
rather than just maybe extend HDFS? What was the thinking behind that?
Well, it's sort of an impedance mismatch, I think. You know, HDFS's original goals
were to be really good at batch, and to really efficiently and economically store bytes on commodity hardware, in a way that wasn't really designed to be modified.
In fact, as I'm sure you're aware, HDFS is an append-only file system.
You can't even change bytes in the middle of a file. While HBase is able to build on top of that and essentially get a store that is mutable,
it was really a design decision to not modify HBase because we felt that HBase is actually really good at what it does. It's really fast for random inserts and updates and deletes and also random seeks.
It's also extremely efficient at loading.
And it's got a lot of things going for it for the use cases that are really ideal for it. So we didn't want to basically come in and say,
well, in order to implement efficient encodings
on a column store,
we're going to now impose schemas on top of HBase.
And in order to get really efficient columnar scans,
we're going to break everything out
so that every column is a column family.
There are a lot of things
that we would have had to break in HBase.
And so Kudu is really an attempt at saying, you know what, let's just go back to first principles.
Let's figure out the problem that we're trying to solve.
And then rather than try to force some other system to sort of conform to what our goals are, let's just go straight for these goals.
And I think that that was the right choice.
Do you see Impala being something that will be commonly used to load data into Kudu? Or do you see it mainly being programmatic,
and the insert statements and things and so on in Impala are more of a kind of side issue? I mean,
I think I noticed, as I was trying to get the JDBC drivers set up today,
I don't think the JDBC drivers for Impala yet support inserts, do they?
They're kind of read-only.
What's the vision in terms of loading?
Is it generally going to be programmatic loading? Or can you imagine an ETL tool, for example, a non-Hadoop one,
using insert statements through Impala?
Is that the vision for it really, or is it more programmatic?
So I think in terms of Kudu integration with Impala,
there's still work to be done. And so Kudu's inserts work really well. However, you know, there's quite a bit of roadmap
still, like maybe a few months of roadmap, for Impala to get better integrated with Kudu.
So that's how I would put it. There are definitely holes,
because Impala's support of insert and update and delete is actually
very new for Impala.
So I think in a few months, that'll
get better. So it's not intended to
just be a user tool. It's intended to be
programmatic as well.
That said, for the inserts and sort of loading directly into Kudu,
if you really just want to get the fastest possible inserts,
then probably going through the NoSQL APIs, like through Spark or something,
today is probably the fastest option.
Although if you have your data on HDFS already,
then something that you can do with Impala on the shell
is you can execute something like create table as select star.
I forget the exact syntax, but it's like you can create a table
from a select star from some other table.
So if you have some data like CSV files on HDFS,
then in a one-liner, Impala will set up a job
that will very efficiently load data into Kudu
from that other data source.
So you can easily get started and try loading data into Kudu,
and then try it out on your actual data.
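[A hedged sketch of the kind of one-liner being described here; as the speaker says, the exact syntax escapes him, and it has changed across versions. In recent Impala releases it looks roughly like the CREATE TABLE ... AS SELECT below, again via impyla, with hypothetical table names.]

```python
# Hypothetical sketch: bulk-load data already visible to Impala as an
# HDFS-backed table (e.g. over CSV files) into a new Kudu table in one CTAS.
from impala.dbapi import connect

cur = connect(host='impala-daemon.example.com', port=21050).cursor()
cur.execute("""
    CREATE TABLE web_events_kudu
    PRIMARY KEY (event_id)
    PARTITION BY HASH (event_id) PARTITIONS 16
    STORED AS KUDU
    AS SELECT event_id, user_id, url FROM csv_events_on_hdfs
""")
```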
Yeah, you mentioned the Impala side, and clearly the Impala kind of team and project is separate to Kudu, and they work at a different kind of cadence and so on.
But what's the long term?
I mean, I guess, taking an even further step back,
why did Cloudera do this, and why did you do this?
And what's the longterm vision for this really? I think the long-term vision from my perspective and from the project's perspective
is really to be the best possible analytics system out there.
So Kudu's goals are sort of two-pronged, really. We certainly want Impala to continue sort of evolving its support for Kudu, and it will.
And that's definitely on their roadmap. And we want Kudu to be the first thing that you would think of when you want to do analytics
in a big data distributed environment.
And so what that means for us
is integrating with all of the systems.
So the beauty of the Hadoop ecosystem
is that kind of everything works with everything.
And so you can use Hive, you can use Impala,
you can use Spark; they can all talk to HDFS. And so different workloads that have different requirements can use these different
systems, but they all get to use the same data. It's like the whole data lake or the data hub
idea. It's based on lots of tools that can talk to the same data store. So that is what we are, that is what we're going for.
And so that's why we're integrating with all these different systems.
I think one, you know, missing piece is Hive.
And I know we want to get that done.
I'm not sure when Hive is going to integrate with Kudu,
but I'm sure it'll happen, at least in the medium term.
As far as Kudu's long-term vision, we are going to implement a bunch of features that are coming up. So 1.0 is out right now.
And so Kudu is now prime time, ready for running in production. And actually, there are already production users of Kudu.
Large deployments, like 200-node clusters of Kudu,
running in production,
and people are happy with it.
Next up, it looks like security is really,
you know, something that a lot of people want.
And so right now, Kudu doesn't do authentication and authorization.
What that means is that, like Hadoop a couple of years ago,
you should really firewall it off.
But in the next few months, we'll start adding security features
and soon you'll be able to have your users authenticated and whatnot
through Sentry, for example. And then ultimately, we're also going to add multi-row transactions
to Kudu and we'll add that support, integrate it with all the query engines like Impala and Spark.
And also we plan to support multi-data center
replication and operation.
Is the vision for Kudu squarely within analytics, or do you see it
potentially supporting OLTP-type workloads?
You know, it's
easy to imagine that we could add row-oriented storage into the Kudu backend, and it's not really that hard.
But it's actually a pretty tall order to go and implement an OLTP database, so that's really not what we're focusing on right now.
I mean, in the future, I suppose,
once we feel like we've nailed analytics,
then I think it might make sense to take a look
and see where the low-hanging fruit is
for a sort of row-oriented access
and if we could support some basic OLTP workloads.
But as I'm sure you're aware,
there's a lot of stuff that goes into these databases,
like query planning and all kinds of optimizations
that might make sense one way for analytics
and make sense another way for OLTP workloads.
And so it wouldn't only depend on Kudu.
It would also depend on these query engines
that have their optimizers
to also support efficient OLTP access.
And I don't know if anybody's working on that right now.
So if no one's working on it,
then I think we're a ways away from it.
I guess there's two things.
It's can you do it and should you do it, really, isn't there?
I mean, why would you do transactions
in a column store database?
I mean, it's kind of crazy.
So I guess another question really is
where does this relate to Spark?
So the way I think about integration with Kudu and Spark is that, well, number one, it works.
But if you haven't used Spark a lot, then I would say for people who want to get an idea of where it would make sense, Spark provides a programmatic API to do all this data processing stuff.
And so for people who have written MapReduce jobs,
it's like a lot of work to really,
you have to create this mapper and this reducer.
And so Spark basically comes in
and really simplifies that MapReduce API
and really expands it.
And so it makes it really much more expressive
and lets you do grouping and really have a lot of flexibility
in sort of the order in which things occur
and how the parallelism works out.
And plus it adds some nice performance features related to caching.
So I think that the way I would tend to architect these kinds of systems is you have Kudu and
potentially HDFS and also potentially HBase as data sources, depending on where you want
your data to live and sort of what legacy jobs you have. And then for running your reports, I think mostly people want to use SQL to, you know,
sort of generate reports.
Plus, you know, tools like Impala work really well for integration with BI tools like Tableau,
for example. So where Spark comes in is where you want to do custom programming at scale
against some data store.
And so that's where it really shines.
And so, like, I keep coming back to machine learning,
but I think other great examples are, like, fraud detection
or, like, sessionization in like websites and figuring out click-through
rates on ads. So all these things, you can do some of it with SQL. And if you're a SQL wizard,
maybe you can do most of it with SQL. But some of it's sort of tricky enough that it might be better or more easily expressed,
more simply expressed, as a program.
And so that's where picking up Spark is really nice.
And I think Spark SQL as well, essentially just feels like you can add your SQL statement
inside your Spark program, and maybe part of what you want to do is really better expressed
in this sort of Spark language or Spark syntax,
but some of what you want to express might be easier to do as a SQL statement.
And so Spark makes it easy to sort of mix those things together
and use them both in the same program.
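[Here is a minimal sketch of that mixing of programmatic and SQL styles in one Spark job, assuming the kudu-spark connector jar is on the Spark classpath; the master address, table, and columns carry over from the hypothetical examples above.]

```python
# Hypothetical sketch: combine the DataFrame API and Spark SQL against Kudu.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kudu-sketch").getOrCreate()

events = (spark.read
          .format("org.apache.kudu.spark.kudu")   # kudu-spark connector
          .option("kudu.master", "kudu-master.example.com:7051")
          .option("kudu.table", "web_events")
          .load())

# The programmatic part: logic that is awkward to say in pure SQL...
ad_clicks = events.filter(events.url.startswith("/ads/"))

# ...and the part that is easier to express as a SQL statement.
ad_clicks.createOrReplaceTempView("ad_clicks")
spark.sql("""
    SELECT user_id, COUNT(*) AS clicks
    FROM ad_clicks
    GROUP BY user_id
    ORDER BY clicks DESC
    LIMIT 10
""").show()
```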
So obviously Cloudera are quite big backers of Kudu. Is it being picked up by other vendors? I mean, I noticed that it's been donated to Apache as well. I mean, I suppose the question really is, how much of Kudu do you think is going to get picked up and used universally, like, say, Spark as an example, or is it going to be...
The vendors benefit from having differentiation. And so they don't want to all be like, you know,
different colors of the same flavor. They want to have their own sort of things that they focus on.
So, you know, Cloudera, I know, really hopes that other vendors will pick this up. And,
you know, and that's why we from the beginning planned on making this an open source project,
because, you know, and we did, and then
we donated it to the Apache Software Foundation. And we've since graduated from
the incubator, so Apache has signed off on our development practices. We have open development
practices. We accept patches from all over the place. And we really welcome sort of other vendors, Hadoop vendors and non-Hadoop vendors
and really individuals to contribute.
So, you know, I think that's the best answer I can give you is we really hope they will.
I think that there's some resistance right now.
But personally, I feel that ultimately they won't be able to say no, because there's really no other good alternative,
in my opinion, to Kudu unless you go proprietary.
Yeah, exactly.
I mean, certainly if you look at Hortonworks,
look at MapR and so on,
everyone's got their own kind of favorite SQL engine
and so on there.
And I've seen that obviously Impala is ostensibly open source, and I'm sure it is, but I suppose it's
not being picked up by other vendors because they want to have some differentiation there.
But other technologies that have come out of different vendors have been adopted,
because, as you say, if you take, say, YARN from Hortonworks, or take what you're
doing here with Kudu, you know, it fills a gap that is not filled elsewhere, really. And so therefore, you know, that's kind of interesting. So I suppose,
you said earlier on that by the time this goes out it'll actually be
version one. So does that mean it's a supported product from
Cloudera at that point, really?
So Cloudera lags the upstream, or what we call upstream, the Apache releases, a little bit,
usually by like a month or a few months. And so I think that's what's going to happen here as
well. Cloudera will essentially semi-support 1.0, but the officially
supported version is going to be just a few more months.
So how would somebody get started on Kudu?
I mean, when I first looked at it, there was a developer VM you could download and so on.
I mean, if somebody was listening to this and thinking, how do I get started with it?
Where would they start? What would they do and where would they go to?
So the best place to go to get started is the Kudu website.
It's easy to remember.
It's Apache Kudu.
And so that makes it the website kudu.apache.org.
And so that's, I think, the first place I would go to.
Just click on community.
And we have a Slack channel, which is just like a chat room.
And it's a public sort of auto invite thing.
So you just click on the link, it'll invite you, and then you can join the Slack channel.
And you can ask questions.
We also have a user list, a user mailing list.
And there's documentation on the website, sort of getting started with Kudu, how to install it. There's also a VM,
if you want to just spin up a virtual machine and sort of try it out already installed. That's
another way to kind of just give it a whirl.
So just as a last kind of thing, really, I was interested, I mean, looking back at some of the stuff you've done in terms of presentations and YouTube and so on, I can see you've been involved in analytics as a topic
for a while, really. And I'm just interested to get, from your perspective, where do you see
analytics on Hadoop going in the next few years? Where would you
see the technology and the opportunities going, really, for, say, analytics, and I guess
data science, really, going forward?
Yeah, I think that's a really good question.
I think it's really broadening in terms of the use cases
that can be executed on Hadoop.
So today people are already doing a lot of like large scale BI
and sort of analytics and data science.
These terms are starting to get really muddled
together because I think people are starting to essentially have multiple teams using sort of
the same data sets and there are more and more people that are really getting into the advanced
analytics side and data science. And so people are using Hadoop for all kinds of stuff. It's
really incredible. For example, doing fraud detection at credit card companies,
I think I mentioned that one, doing it in banks, doing cheat detection. Like, you know,
lots of companies have these big massively multiplayer online games and sort of these arena-based games.
And those guys are doing all kinds of number crunching using Hadoop, because they have so
much data coming in that, you know, they have to do it in parallel. And so they're using things like Spark and Impala to do it.
And so more and more people are essentially creating new use cases that can run on Hadoop.
And I think what we're trying to do with Kudu and at Cloudera is really simplify the whole process.
So there's something called the Lambda architecture that people were talking about a couple years ago and has sort of started to become mainstream now.
Where essentially you do this combination of sort of some streaming analytics and then periodically you do a big batch job that then sort of like
corrects any errors in your streaming analytics and sort of you do this union merge at the end.
And this kind of an approach is, while it works, it's like super complicated and hard to maintain
and, you know,
really hard to get your head around in the first place. And so the more that we can say, you know
what, you only have to use one data store that has really efficient streaming inserts and really
efficient scans, you know, then all of a sudden you have to worry less about this
Lambda architecture. It doesn't necessarily always fully go away
if you really need up to the second
sort of running counters or something like you might do.
But the faster and the more flexible
we can make the backend storage,
the better it is for the user, right?
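[A hedged sketch of that point, reusing the hypothetical setup from earlier: rows written by a streaming process become visible to scans as soon as they are flushed, so there is no separate batch layer to reconcile.]

```python
# Hypothetical sketch: a streaming writer and an immediate read against the
# same Kudu table, the pattern that replaces much of a Lambda architecture.
import kudu

client = kudu.connect(host='kudu-master.example.com', port=7051)  # assumed
table = client.table('web_events')
session = client.new_session()

# "Streaming" writer: apply events as they arrive.
for event_id, user_id, url in [(2, 7, '/home'), (3, 7, '/ads/banner')]:
    session.apply(table.new_insert(
        {'event_id': event_id, 'user_id': user_id, 'url': url}))
session.flush()

# A reader can scan the same table right away; no union or merge step needed.
scanner = table.scanner()
scanner.open()
print(len(scanner.read_all_tuples()), 'rows visible immediately')
```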
Yeah, yeah, definitely.
I mean, certainly from my perspective,
I think it was Michael Stonebraker a while ago
kind of said that in his view,
I think all analytic workloads will move to Hadoop.
And I think that's completely true.
I think that anybody who's doing anything
that is kind of doing classic data warehouse
or analytic work, or in this case, in the new world, I suppose,
of kind of streaming jobs and fraud detection and so on.
You would do that on Hadoop,
or really whatever the successor is
of the various parts of Hadoop in time.
And for me, that ability to land everything,
the fact that you can land everything
in one place economically now,
the fact that you could do it in the past in theory,
but you could never afford to do it,
and you couldn't really pay the cost of all the kind of
schema-on-write stuff where you had to work it all out in advance and so on there. I think
the fact you can land it all now in a fairly sort of vague form and apply different engines to it
and so on, that has kind of won the argument, really. And certainly things like Kudu are
filling in those gaps, really, things that we've had in the past from the data warehouse world, you know, inserts and deletes and so on,
and the ability to land stuff in streaming form and query at the same time, you know, it's filling in those things there.
But certainly, and that's working kind of well.
I think an area that is ripe for innovation is things like semantic discovery
and I suppose in a way
data governance as well.
And certainly from my side,
Hadoop has had,
I say Hadoop in a general sense here,
has had a bit of an easy ride so far
in terms of data governance
and all that kind of stuff.
I mean, I know it's not your area really,
but do you see that as well?
Do you see that probably
the next kind of big sort of thing
is getting, I suppose, data quality and data governance and so on?
I think so, but maybe for different reasons than it used to be.
So I think one of the reasons that data governance is still important has a lot to do with, like, regulations around the financial industry, or maybe like HIPAA, you know, privacy in the United States; we have regulations around healthcare records and stuff.
And so I think data governance in terms of,
like, did we retain this record
or can we make sure that privacy was maintained on this record
is something that will never go away.
Yeah, it's interesting, isn't it?
I mean, certainly clients I've had in the past, I've had the issue where, you know, they buy in entirely
to the whole kind of data lake idea and landing stuff in there, and there's customer data
in there as well, and then they kind of think about the fact, well, what did we get
permission from the customer to do? And so you've got that whole thing of, and that's not
Hadoop's fault, you know, it's now, oh, now suddenly we can use this data at a very kind of granular level. You know, we're
doing things like micro-segmentation, so we're now going to give you an offer based actually
on your transactions, not on which segment you fit into. And there's practical
issues there around, well, how do you go and redact information out of there as well.
But certainly I think an area where the technology we're using now could help is if you think about, say, semantic
discovery. You've got all this data landing into kind of a data lake; at some point you're going to
have to apply schemas to it, really. And it's interesting to see tools like Drill, for example,
that can read the schema in there, in Parquet and so on. But certainly, again, there are
products I'm seeing coming out there, Tamr, for example, and so on,
that address how you make sense of the data in there, really?
And how do you do the kind of provenance and so on there?
I mean, these are, in a way, they're grown-up problems.
They're problems of success, in a way.
And we need to be careful we don't get bound
into the same kind of inertia we had in the data warehouse world.
But certainly, I think the platform is being built out now,
and that's kind of won the argument.
But then going forward, it's like, well,
now this is becoming a mainstream technology.
How do we then deal with the fact that there are rules out there
and there are kind of needs to audit stuff as well and so on,
whilst not forgetting this is not a transactional system.
This is a kind of information system.
Well, there are tools that are being built.
So there are a couple of things that Cloudera is doing
to try to resolve this problem.
One of them is called Cloudera Navigator.
And essentially what this thing does is it looks at all
of your log files and integrates with different
APIs throughout the different systems in Hadoop.
Part of what makes this problem harder than maybe traditional systems is that you're talking like maybe 10 or more different components that are sort of accessing the same data.
And so this system gives you sort of like audit logs and data provenance information and stuff like that.
So it helps with the data governance.
And really, you know, that's their mission is to improve data governance over time.
And so I think that's one aspect that Cloudera is trying to address.
The other thing is something called RecordService.
And so that's sort of a newer project that's still, in many ways, getting off its feet,
or getting on its feet.
Anyway, I'm not sure I got the idiom right. And so essentially, you know, what it tries to do is provide a single access layer, from a network API perspective, to all the data.
And if you can really enforce a single access layer across all the disparate systems in your data lake, then it's easier to sort of do this auditing and do this enforcement.
So there are a couple of different approaches.
Yeah, I mean, it was certainly when I saw the two things launched at the same time,
Kudu and RecordService, I thought, that's very interesting.
I think, you know, you can see Cloudera there trying to be, I suppose,
the Hadoop vendor for analytics, really.
And that was kind of interesting.
Just, I'm conscious of time for you as well.
But the other thing is cloud as well. I mean, one of the things that I often
think is that, in a way, Hadoop is this kind of store that you can store data
into; it's effectively, you know, unlimited storage, and it's effectively free, you know, in terms
of how little it costs, which in a way is kind of describing cloud. And
taking kind of cloud and Hadoop going
forward and so on, you know, where do you see cloud coming into this, really?
I mean, you can obviously host Hadoop in cloud, but do you think cloud is going to change things in
other ways as well, really? Or what's your thoughts on that?
I think cloud changes things in a couple of ways. Number one, it takes the sort of traditional Hadoop
argument that, you know, let's get away from specialized hardware, let's get away from
like vertically scaled databases, let's go to horizontally scaled databases as a sort of paradigm
shift. And so, you know, it makes it really much more economical,
you know, to buy a bunch of cheap drives and let them fail, and buy a bunch of, you know,
1U or 4U units, and just let them fail as needed, instead of buying, you know, like a refrigerator appliance or something. And so what the cloud does is it makes it even more economical than before,
right? So if you want to run a big number
crunching job, you know, you just need to go to Amazon and say, hey, can I rent
essentially a data center full of machines for, you know, eight hours or something? And you can. And so,
likewise, even if you want to run them long term, you can periodically pause them and bring them
back up. So I think what this is really doing is, in a lot of ways, you know, there's still
the traditional database vendors, you know, that are starting to get into big data, starting
to work on sharding their systems more.
But in many ways, they still want special networks, special hardware, their own hardware,
maybe even FPGA chips and stuff.
And so what this means is that they'll never run in the cloud.
I think the other thing that the cloud brings into the equation, as you mentioned, is the economics make it easier for you to feel the pain if maybe you're storing more data than you really need to be storing.
And then you start to, you know, need to take a harder look at how long the data is lasting.
And, you know, are we really using this data? And you sort of do more upfront, careful planning than you have to do with Hadoop.
Yeah, definitely.
I mean, I think for me, it strikes me that in a way,
if you take a level of abstraction from above Hadoop and above cloud,
they're just kind of elastic storage, aren't they, really,
that is effectively unlimited and can compute things on there as well.
And certainly, to my view, you're going to get a blurring between on-premise and cloud and so on here.
And, yeah, I think the other thing I see often in kind of companies I've worked with is where Hadoop clusters are spun up just to do a certain job.
So there might be a kind of a bit of kind of predictive modeling
or some kind of like models being built or something.
And you'll see, you know, I've seen, I think,
Cloudera Director being used for this,
where a Hadoop cluster is spun up just to do this job, and it's then taken down again at the end and
so on. So I think certainly for development it's kind of good, but also for these kind of one-off jobs
that might require a lot of nodes, where those nodes aren't kind of persistent, really, then that's kind
of useful. So I noticed, I think, a while ago that Impala was running on Amazon
EMR, so maybe at some point in the future,
we might see kind of Kudu on there as well.
So that would be kind of interesting.
So I think we've run out of time now.
So Mike, just mention again,
what are the URLs again to get hold of Kudu?
Where would people go to get hold of this?
Sure, just go to kudu.apache.org;
that's the open source version of Kudu.
And then if you want a repackaged version that has RPMs and stuff like that,
then you can go to Cloudera.com.
And under downloads, you can find Kudu there.
And so sort of those two places.
Or, you know, Cloudera Manager is also a good way to install Kudu. And you can
find out how to install Kudu using Cloudera Manager from either cloudera.com, or I think
it's also mentioned in the Kudu documentation. So, Mike, thanks very much, and appreciate
you coming on the show.
Thanks a lot, Mark. I appreciate the opportunity.