Drill to Detail - Drill to Detail Ep.13 ‘Apache Drill, MapR + Bringing Data Discovery to Hadoop & NoSQL’ with Special Guest Neeraja Rentachintala

Episode Date: December 13, 2016

Mark Rittman is joined by MapR's Neeraja Rentachintala to talk about Apache Drill, Apache Arrow, MapR-DB, extending Hadoop-based data discovery to self-describing file formats and NoSQL databases, and... why MapR backed Drill as their strategic SQL-on-Hadoop platform technology.

Transcript
Starting point is 00:00:00 So, welcome to Drill to Detail, the podcast about big data, analytics and data integration, and I'm your host, Mark Rittman. So, if you've come to any of the presentations I've given in the past year at the meetups and conferences and so on, you've probably heard me talk enthusiastically about something called Apache Drill, and how I've called it in some cases, you know, the future of SQL on Hadoop,
Starting point is 00:00:35 a very interesting sort of product. And so I'm very pleased in this episode to be joined by Neeraja from MapR, whose blogs I've read in the past and who has written and presented on Drill. And she's actually the Senior Director of Product Management at MapR, responsible for product strategy, roadmap and requirements for MapR's SQL initiatives, including Apache Drill.
Starting point is 00:01:00 So Neeraja, do you want to just introduce yourself properly and tell us how you came to be doing this and what you do, really? Yeah, sure. First of all, thanks, Mark, for having me here. Yes, so my name is Neeraja Rentachintala. I am with MapR. I have been with MapR for about three years now. And my responsibilities from a product standpoint at MapR are mainly two things. One is our SQL strategy. So we do offer a variety of SQL products on the MapR platform, and one of the strategic areas we focus on is Apache Drill. So that's SQL strategy. The second aspect is I'm also responsible for our high-performance NoSQL database that we have on the MapR platform, called MapR-DB. So yeah, I'm responsible both for
Starting point is 00:01:53 SQL as well as for NoSQL products. Prior to MapR, I was at Informatica, working on a product called Informatica Data Services. This is part of the enterprise data integration suite that Informatica has. And before that, I was at Microsoft as part of the SQL Server business intelligence products, SQL Server Reporting Services as well as Power BI. Before that, I was at Oracle and Expedia.com. So most of the time, I have spent my career focusing on enterprise application integration, data integration, business intelligence, analytics, those kinds of things. Wow, interesting, interesting. So, I mean, I came across your name and what you've been doing when I started looking at Drill and the MapR platform.
Starting point is 00:02:43 And I remember thinking at the time, this is somebody I'd really like to speak to. It sounds like you've worked with some very interesting products, and you're looking after a very interesting area within MapR at the moment. So for anybody that is new to Drill or this kind of area, just explain at a high level: what is Apache Drill, and what does it do differently to, say, Hive and other previous SQL-on-Hadoop initiatives?
Starting point is 00:03:10 Okay. So I think at the highest level, the way I like to describe Drill is: first of all, it's an open source interactive SQL query engine that can provide data exploration as well as BI ad hoc queries on big data,
Starting point is 00:03:27 right? So essentially, at the core of it, it is a distributed in-memory SQL engine with which you can get low-latency performance on large-scale data sets, right? So interactive performance on big data is the key thing. However, what's different about Drill, right? There are so many interactive SQL products out there. What's really different about Drill is that, along with the performance, it also gives a lot of flexibility, right? We can go into that a little bit later. But at the basic level, Drill allows you to do SQL queries without having to define any schemas upfront. So you just point to the data wherever it is, files, HBase, whatever your data source is, and you can start doing queries in minutes. So you don't have to spend weeks or
Starting point is 00:04:19 months of modeling time. You're able to look into the data and understand the data immediately. So really, working with the new types of data that are common in Hadoop and the big data world, and getting value from them quickly, is kind of the focus, or the differentiation, for Drill. Yeah, yeah. And that's certainly, to me, I mean, listening to that and reading that on a data sheet, a lot of vendors will say, you know, their products are quick to set up, easy to use and so on. But it certainly struck me, the way you're doing this and the way that, and we'll get onto this later on, you're leveraging, I suppose, the inbuilt metadata in a lot of the data types coming to Hadoop these days. It is very interesting. And so for me there were two parts: the fact it was so easy to download and install, it literally was download a zip file, unzip it and work with it, but also this concept you're talking about of being able to analyze data immediately without

Starting point is 00:05:20 having to define, you know, metadata and that sort of thing. That's pretty revolutionary, and I don't think people get the significance of that at the start. So talk us through what that means then. How do you do analysis faster in terms of getting access to the metadata and structures and so on, really? Yeah, so I think if you look into how BI and analytics have evolved over time, right? So there is clearly kind of an evolution
Starting point is 00:05:49 towards doing things self-service, right? So traditional reporting is basically the data warehousing team putting together a data model, and the BI users would consume pre-built reports. And probably a decade ago or so, there was a lot of innovation on the BI side of the world. Tableau, QlikView and the like came in and said, okay, there is data, but as a business user I need flexibility, I need self-service capabilities, so that I can
Starting point is 00:06:18 visualize the data in whatever way I want, right? No fixed, predefined reporting format. So there was a lot of innovation to make the BI world more self-service. But I think on the data side, things are still kind of the same way, right? In the sense that you need to do the ETL, you need to prepare your data to be ready for analytics. There is nothing wrong with that, and there is complete value in it. But in the context of big data, it's more challenging, in the sense that people are collecting huge amounts of data into their data lakes or data hubs. And this data is transactional data, semi-structured data sets, clickstream, sensor, all sorts of data sets.
Starting point is 00:07:12 And there is a need, we have seen clearly from customers, to actually see into the data first, before they decide what kind of data to operationalize, right? So they need to understand the data, discover the data. So I think from a business value perspective, opening up the data early in the cycle to users for data exploration and data discovery purposes, so that they can figure out what to do with the data, I think that's the business motivation for the product. Yeah, definitely. I mean, before getting involved with Drill and some of the technologies you're working with there, I was working with data discovery tools from the likes of Endeca and so on, where the idea was that you could analyze data in place, no data left behind and so on. But they typically involved loading it into their own kind of NoSQL or key-value store, analyzing it and so on. And the thing that struck me with Drill was the fact that you basically point it towards a data source and it tries to use the inbuilt metadata that's in there, really. And if you think about some of the data types we'd have now, JSON and so on and so forth, there's a lot of structure in there that you can leverage.
Starting point is 00:08:18 And so Drill leverages that inbuilt metadata, is that correct? Right, so I think the way to think about that is, there is obviously a lot of collateral we have where we talk about Drill as a schema-less kind of system. So there is a thing to say about that, right? When we say schema-less, what we are talking about is really that there is no central repository of schema. But if you take these formats you just mentioned, JSON, there are formats where there is actually some level of schema in the file itself, right? So if there is a schema available
Starting point is 00:08:59 in the underlying file format, Drill can discover that on the fly. Drill also can query something like Hive tables, where you already have schema definitions. So if you have underlying sources that have schema, Drill doesn't tell you, okay, go and define a separate special model for me in a repository again. Drill can discover this on the fly, right? And then there are obviously some kinds of file formats where you
Starting point is 00:09:25 don't have the schema at all, or you have a very partial schema. HBase is a great example, where you have column families and table names, but beyond that you don't have anything. So those are the kinds of things where Drill can discover schema on the fly.
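To make that concrete, here is a minimal sketch of the two situations Neeraja describes: a self-describing JSON file queried with no upfront modeling, and an HBase table whose values come back as bytes because only partial schema exists. The paths, table names and field names are illustrative assumptions, not details from the episode; the syntax (backtick-quoted paths, dotted access into nested fields, CONVERT_FROM) is standard Drill.

```sql
-- Pointing Drill straight at a raw JSON file: no metastore, no upfront model.
-- Path and field names are made up for illustration.
SELECT t.`user`.`name`  AS user_name,
       t.`event_type`   AS event_type
FROM dfs.`/data/clickstream/events.json` t
WHERE t.`event_type` = 'purchase'
LIMIT 10;

-- HBase carries only partial schema (table names and column families),
-- so Drill returns raw bytes; CONVERT_FROM decodes them on the fly.
SELECT CONVERT_FROM(row_key, 'UTF8')            AS user_id,
       CONVERT_FROM(u.`profile`.`city`, 'UTF8') AS city
FROM hbase.`users` u
LIMIT 10;
```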
Starting point is 00:10:03 Yeah, I mean, Drill for me changed quite radically how I approach doing, I suppose, initial discovery on Hadoop projects. You know, in the past, as you said, there's been this kind of irony that, in all this talk of schema-on-read and flexible schemas and so on, if you're going to query things through Hive, for example, there's a kind of classic modeling exercise that goes on. You have to understand the tables and columns and the structures and so on, maybe use a SerDe or whatever to translate JSON elements into columns. And that's quite a complicated, time-consuming and fragile process. But now, typically, as you say, you download Drill, point it towards some files, some Parquet files or a Hive Metastore and so on. And rather than it being centralized metadata, as you said, and curated and so on, you can make use of what's there. But also, particularly if someone has gone and set up Hive, for example, and the Hive Metastore and so on, you can just use that as well. So you can dip into these different sources of data and the metadata they provide, without having to model it all again centrally in your own data warehouse, for example. Exactly. I think when you look into
Starting point is 00:10:54 most of the customers, right, if you look into our customer use cases, most of the time they are using both these things together, right? So the truth is, I think Hive is used by probably 90-plus percent of the use cases. The primary focus is batch and ETL processing. So they have data where they do batch processing, they do ETL processing, and you have the Hive metastore populated and all the groundwork done. So if your groundwork is done, then the value Drill can offer is to let you reuse that metadata and do interactive queries on those tables, right? So both of these are complementary solutions: Hive is batch ETL, Drill is focused
Starting point is 00:11:38 on interactive queries. The additional thing is, the data that is not processed by Hive yet, some JSON data, some log data, Drill lets you look into that as well.
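A hedged sketch of what that reuse looks like in practice: Drill's hive storage plugin exposes tables already defined in the Hive metastore, and the same query can join them against raw files that Hive hasn't processed yet. The table name and file path below are hypothetical.

```sql
-- The web_logs schema comes straight from the Hive metastore via Drill's
-- hive storage plugin; the JSON file is raw data Hive hasn't touched yet.
SELECT w.`session_id`,
       e.`event_type`
FROM hive.`web_logs` w
JOIN dfs.`/landing/raw_events.json` e
  ON w.`session_id` = e.`session_id`
LIMIT 20;
```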
Starting point is 00:12:18 Yeah, and for me, the most surprising point was performance. I kind of expected Drill to be slow when I first used it, querying JSON documents and so on, especially given it was downloaded and running on my local machine, but the speed of it was fantastic. And do you want to just, you talked about Hive earlier on, in terms of the two tools complementing each other, first of all, what's the performance profile of Drill, and how do people use it in combination with Hive, and where's the sweet spot between the two, that sort of thing? Yeah, so I think from a use case standpoint, from a sweet spot perspective, right, there is a very distinct separation between Hive and Drill. So I think Hive traditionally has been, and continues to be, used for batch and ETL processing, right? And we do have customers who use Drill for ETL and for batch processing, but we typically recommend them not to do that, right? One reason is, if you have a MapReduce job which is like a five-hour kind of job, MapReduce, or a framework like Hive or Pig on top of MapReduce, is really designed for those kinds of things, right?
Starting point is 00:13:15 The primary difference in architecture is in-memory processing, right? Drill assumes that queries are going to be fast, so it tries to do things as much as possible in memory, using a pipelined kind of execution model. And unlike MapReduce or Spark or any other basically ETL-oriented technology, it doesn't spend time writing to disk for checkpointing and recovery purposes, right? So it's trying to do things as much as possible in memory, and it goes to disk only if things don't fit in memory. So the entire execution pipeline is designed for performance. And there are also other things along with in-memory: it's a distributed engine, right, so you can add nodes to improve performance, and there are things like columnar execution. So there are a variety of other factors contributing to performance, but the primary difference is in-memory versus continuously
Starting point is 00:14:16 writing to disk for recovery and checkpointing purposes. Okay, so the way you describe it there, with the in-memory processing and doing it all within memory, a bit like Tez, for example. What is, succinctly, the difference in architecture and design that Drill has compared to, say, Tez or Impala and so on? Fundamentally, what's the difference in terms of how you built it compared to those? Yeah, so fundamentally I think there are different categories, right? If you take Drill and Impala, from a performance profile standpoint I would actually put them both in the same category. The other category is Tez or MapReduce. So I think those are in a different category. So Impala does pretty much similar things.
Starting point is 00:15:26 Both are in-memory things. The difference is just the data model: Drill is more optimized for schema-less, hierarchical data sets, so there is a fundamentally different underlying in-memory data model. But from a core architecture perspective, both are distributed in-memory architectures, right? The difference really is how you do the execution, right? This is, as I mentioned, a pipelined execution versus, there are improvements in Tez over MapReduce, but it is still leveraging disk.
Starting point is 00:16:06 There is a fundamental architecture difference in how you schedule the different tasks. Okay, okay. So, Drill, I've been working recently with BigQuery, for example, on the Google platform. And my understanding is that Drill had some kind of inspiration from Dremel, for example, from Google. Is that the case? Is there some kind of common thinking there, or inspiration from that kind of project within Google? Yeah, certainly, right.
Starting point is 00:16:33 I think the inspiration is really about scale. So the biggest thing you need to solve in the context of big data is how you process terabytes and terabytes of data, and also offer optimizations for when you don't have to scan that much data, things like partitioning and pruning the partitions. So there are a variety of optimization strategies. So I think the inspiration from Dremel is really around the distributed parallel execution, right? That's really kind of the motivation.
Starting point is 00:17:14 I think Drill is still different from Dremel in the respects we talked about: discovering schemas on the fly, being able to work with nested data and things like that. Those are still differentiators for Drill even from Dremel, but the scale is kind of the common aspect.
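As a small illustration of the pruning Neeraja mentions: Drill exposes the directory levels beneath a queried path as implicit dir0, dir1, ... columns, and filters on them let the planner skip whole directories instead of scanning everything. The directory layout below is an assumed example, not one from the episode.

```sql
-- Assume logs laid out as /data/logs/<year>/<month>/*.parquet.
-- Filtering on the implicit dir0/dir1 columns prunes the scan to the
-- matching directories rather than reading the full data set.
SELECT COUNT(*) AS events
FROM dfs.`/data/logs`
WHERE dir0 = '2016'
  AND dir1 = '12';
```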
Starting point is 00:17:49 And the scale is interesting, because the typical way, I guess, that most people encounter Drill is through downloading it as a single, it's a drillbit, I think you call it, downloading it onto your machine and running with it. But how does it then run in clusters, and how does it scale up beyond that? What's the mechanism behind that, really? Yeah, so when you're downloading Drill onto your local machine, right, the core daemon in Drill is called the drillbit. This is the service which takes requests from the user. It parses the SQL query, optimizes it, executes it, and works with the data sources to get the data and give you back the results, right? So when you're downloading Drill onto your machine,
Starting point is 00:18:14 you are basically downloading one drillbit. The moment you are putting it into a cluster configuration, you are essentially adding more drillbits, right? So when I deploy Drill in a Hadoop cluster, I am essentially deploying a drillbit on, preferably, the different data nodes in the cluster, right? So each data node is running a drillbit instance. So this is the core service, and the way it scales is, there is no master-slave architecture, right? It is a completely distributed architecture. So what this means is, when
Starting point is 00:18:52 a client is submitting queries, your Tableau or MicroStrategy submitting queries, it can submit to any drillbit on the cluster. So each drillbit is identical, and each is capable of parsing the query and optimizing it, and, once it's optimized, splitting the query into query fragments and distributing them among the other drillbits available on the cluster. Right. So execution is distributed, and all our optimizations, everything, is parallel. And another aspect, I think common across other processing engines, is data locality. When you're trying to process
Starting point is 00:19:32 in a highly distributed environment, you want to make sure the processing is done on the same nodes where the data resides. So understanding the location of the data, and scheduling processing on those nodes, is a part of scale as well, actually.
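One way to see that symmetric, no-master cluster model from SQL itself: each drillbit registers in Drill's system tables, so a query like the sketch below can be submitted to any node and will list the whole cluster. The column list reflects the sys.drillbits table in Drill releases of this era; treat it as indicative rather than authoritative.

```sql
-- List the drillbits participating in the cluster; because the
-- architecture is symmetric, any drillbit can answer this query.
SELECT hostname,
       user_port,
       control_port,
       data_port
FROM sys.drillbits;
```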
Starting point is 00:20:05 Okay, okay. So I suppose a taking-a-step-back question, really. Why did MapR, because obviously you work for MapR, and Apache Drill is an open source project, why did MapR think that Drill was the SQL solution it wanted to go for? And why is it such a big part of your strategy, and why has MapR invested time, money and people in this project, really? Yeah, so I think when MapR kind of pioneered the Drill project, the main SQL technology out there was Hive, right? And Impala was probably fairly new at that time as well, maybe a year old, I don't remember the exact time frame, but fairly new in the market. So I think when we initiated this project, I would say the main motivation is, first of all, we have seen that the needs for SQL in our customer environments are evolving, right? So there is a class of customers that were offloading basically from Teradata and all
Starting point is 00:20:58 sorts of data warehouses into Hadoop for scale and to be able to reduce their costs, right? Cost and scale are the big reasons. But at the same time, there is a class of customers who adopted Hadoop not just for scale and cost reasons, but because they are able to actually bring in new types of data that they couldn't before, right? So these, again, are clickstreams, logs, sensors, IoT kinds of data, telco customers. So many customers are bringing in new types of data, and for them, having this
Starting point is 00:21:34 whole entire thing of, okay, to do anything with the data you have to start with a modeling exercise or ETL processing, was almost a showstopper, in the sense that they could do only limited things with that kind of paradigm. So I think the thinking really was: this is a new world, new data types and new kinds of scale requirements, and you can't just recreate your relational paradigm here. You need to think about SQL in a different way, right? So I would say that's kind of the main motivation: different types of data sources, structured, semi-structured, and you need to think about them from a different perspective. And the beauty is, you have to bring it to the same set of users, right? These users are, again, your BI users, Excel users, Tableau users.
Starting point is 00:22:27 So how do you do this, bridging the gap between your SQL paradigm and, at the same time, a NoSQL kind of data world? So I think that's kind of where the motivation came from. Yeah, yeah. I mean, certainly, I'll be honest, it was the thing that got me interested in MapR again. I think, what a great choice of SQL engine, really. I think you've absolutely nailed it there, saying that a massive motivating driver for using Hadoop is not just cost, it's the flexibility, it's the ability to bring in new data sources and so on. And yet we've done things in the same way up until now, you know,

Starting point is 00:23:05 modeling things very formally and so on. So I'd like to get on to the wider platform in a bit. But just before we go on to that, actually, the person who put me in contact with you was my old colleague, Robin Moffatt, who I think wrote a blog post for MapR on connecting the Oracle BI tools to Drill. So what's the support like currently for Drill within the BI tools world? And how does it work? I mean, how do you reconcile the flexible schema, I guess, you get from Drill with the more formal metadata layers you get in BI tools? I mean, how does that tend to work?
Starting point is 00:23:40 Yeah, so Drill, first of all, provides ANSI SQL, right? It uses Apache Calcite as the SQL parsing layer, and we have done extensions to Calcite to support parallel optimizations. But the main syntax in Drill is obviously ANSI SQL, right? So we have pretty good support for it. Beyond ANSI SQL, we have done extensions so that you can work with nested data. So there are SQL extensions such as FLATTEN and KVGEN.
Starting point is 00:24:14 There are all sorts of functions available so you can flexibly work with nested data. So with respect to BI tool integration, we do have JDBC and ODBC drivers, right? So that's what bridges from a connectivity perspective. And the flow that happens is, if you have simply structured data, there is nothing much you have to do, right? You just take it and expose it, and you can access it via JDBC or ODBC. But you might have nested data, like eight-level nested JSON documents, and you need to reach into that nested data. So as part of our ODBC driver, we have something called Drill Explorer, which allows you to look into the data that is available through Drill,
Starting point is 00:25:03 and you can create views on top of this nested data. So I can use something like a FLATTEN function to create a relational representation of the underlying data. So you create this view, and then you connect via ODBC in Tableau, and this view shows up just like any other table.
Starting point is 00:25:29 Right. And then you are immediately working with it. So I think the key point is, the bridge is happening either automatically, or it is happening because you have done the data exploration and you have created views. A view is just logical, so there is no ETL or physical representation, but you are creating these views, which give the kind of relational representation the BI tools expect. Fantastic. Fantastic. So, okay.
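A hedged sketch of the kind of view Neeraja describes, using Drill's FLATTEN to give nested JSON a flat, table-like shape a BI tool can consume over ODBC or JDBC. The dfs.tmp workspace is writable by default in Drill; the file path and field names are illustrative assumptions.

```sql
-- Each order document holds a nested `items` array. FLATTEN emits one row
-- per array element, so the view looks like an ordinary flat table to a
-- BI tool such as Tableau connecting over ODBC.
CREATE VIEW dfs.tmp.`orders_flat` AS
SELECT t.`order_id`   AS order_id,
       t.item.`sku`   AS sku,
       t.item.`qty`   AS qty
FROM (
  SELECT `order_id`, FLATTEN(`items`) AS item
  FROM dfs.`/data/orders.json`
) t;
```

From the BI tool's side, connecting then just means selecting orders_flat like any other table; the flattening happens at query time, with no ETL step.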
Starting point is 00:26:03 So Drill was my kind of intro back into MapR, but looking at your website, looking at the materials, working with the platform and so on, you've got quite an interesting platform with what you call, is it the MapR Converged Data Platform? And there's a lot of interesting technology and a lot of interesting value-adds in there that I haven't seen before, things like microservices and so on and so forth. Just again, for anybody new to MapR, maybe paint a picture of what MapR's wider platform is like for Hadoop and for big data analytics and so on.
Starting point is 00:26:34 Just paint a picture, first of all, of what's there, and again, differentiate it really from the other things we've used in the past. Yeah, so at the highest level, I would say the MapR platform, the MapR Converged Data Platform, is something you would use for analytics, right? Obviously, whether it is traditional BI reporting kinds of analytics, or new types of exploration, machine learning kinds of analytics. So this is a platform for analytics. But at the same time, the Converged Platform is something you would use to build mission-critical operational applications. So what MapR offers is a unified platform on which you can run analytics and operational applications. So that's kind of our differentiation.
Starting point is 00:27:21 So it's not simply an analytics platform, it is your next-generation data platform. And with respect to the components, we do have a distributed storage system called the MapR File System. And then we have a high-performance NoSQL database, which is what you would use to build actual apps, business applications; it's called MapR-DB. And then there is a global publish-subscribe messaging system called MapR Streams, so it's a streaming
Starting point is 00:27:53 system. So that's kind of the three products: storage, database and streaming. And all these are built on a core platform that provides a variety of services such as reliability, consistency, multi-tenancy, security. So all the products are built on the same platform, and they inherently get the benefits of that platform. So I think our vision is really bringing these different data models into one platform, so that it can serve both analytics and application needs. Okay. So my understanding as well with MapR, a difference was that you were more open, I suppose, to using more proprietary things you developed yourself and combining that with the platform.
Starting point is 00:28:42 I mean, what's MapR's philosophy in terms of use of open source versus use of your own technology? Again, to try and put a picture in people's minds, how would your platform differ from, say, pure open source, really? Where do you add value, and what's different about it? Yeah, so I think the thing that I mentioned, the converged platform itself, is our differentiation, right? Which is the file system, DB and streams all together in a single platform, right? So rather than literally deploying three clusters, like a Cassandra cluster, a Kafka cluster, a Hadoop cluster for different needs, you have a single MapR cluster, right? So I think that's kind of our innovation. And our philosophy is, for each of these products, if you take our file system, there is an NFS, POSIX-compliant interface. At the same time, it is also exposed via the HDFS interface.
Starting point is 00:29:42 And MapR-DB is exposed via the HBase interface. The streaming system is exposed with the Kafka API. So the idea is to really leverage our platform, but also be able to leverage the innovation happening in open source. So that has been the philosophy. And we do invest both in the platform as well as the top layer, the open source layer. So Drill fits exactly in there.
Starting point is 00:30:05 Exactly, exactly. I find it very interesting. Yeah, I mean, so you mentioned early on that you're also responsible for MapR-DB as well. So tell us about that. What's that? And again, what is different about that? What's the innovation and so on there?
Starting point is 00:30:20 Yeah, so, first of all, when MapR originally started, I think several years back, the value prop was basically to provide a high-scale, highly reliable, multi-tenant storage platform. It's not just secondary storage for analytics; it is also a read-write file system that you can use as primary storage, right? And it is exposed via HDFS, so you can run Hadoop MapReduce. So making Hadoop enterprise-grade was kind of our original positioning. And in the next phase, basically, we added a different data model on the same platform. The data model is a key-value store. So the benefit of MapR-DB, MapR-DB is essentially a NoSQL database, the benefit is it's extremely scalable,
Starting point is 00:31:15 highly performant, and most importantly, customers use MapR-DB whenever they have extremely critical, mission-critical SLA requirements. So originally, I mean, we do have HBase in our distribution. Several customers had hit operational issues with HBase, especially around things like having to deal with compactions and stuff like that, which was impacting their SLAs. So I think one of the primary promises with MapR-DB is that it's integrated into
Starting point is 00:31:53 MapR, and most importantly, it is basically a NoSQL database that can serve your mission-critical needs with no spikes in latencies. Okay, interesting, interesting. So with Drill itself, there are other projects around as well that are kind of interesting in this space, and I think I've seen an article you've written in the past on Apache Arrow, for example, and there's Druid as well and so on.
Starting point is 00:32:19 I mean, Apache Arrow in particular, again, maybe just introduce what it is, but there's an article you wrote on it. What interests you about Arrow, and where does Arrow fit into this, really, with your thinking and maybe the plans for the future? Yeah, so Arrow is basically an in-memory data format, right? It's an in-memory data interchange format. So why Arrow, right? Big data analytics have evolved over time, right? If you look into the last two, three years, there was a lot of investment around improving
Starting point is 00:33:01 columnar data formats like Parquet or ORC. And then there is a lot of work going on in Drill and Spark and Impala, the query engines. So the bottom line is, an efficient columnar representation, on disk as well as in memory, is kind of core to query performance in analytics workloads. So the main thing is, Drill is actually one of the first big data query engines which is columnar on disk but is also columnar in memory. And I'm not sure if you are familiar, but the Arrow format takes its roots from the in-memory representation that was developed as part of the Apache Drill project. It's called value
Starting point is 00:33:46 vectors. So this is basically kind of the foundation for Drill: the in-memory representation of Drill is value vectors, which now has been modularized into a separate project called Arrow. Because now most of the systems, right, if you take Spark or Impala or any other query engines, or even the APIs like Python or R, they're all moving towards columnar processing, right? So it made sense to take the in-memory columnar format that Drill had, modularize it, and make it a separate open source project. So to that extent, Arrow actually takes its roots from Drill, and you can see that basically the PMC members,
Starting point is 00:34:32 committers, they're all similar people, but it's a broader forum now. Arrow has more people than Drill, more people beyond the Drill team contributing to it. So we're really supportive of that initiative, and we sort of have Arrow already, but eventually, as the project matures, we totally are going to integrate it and contribute to it. Okay, interesting. So, just to round off on the SQL side,
Starting point is 00:35:06 what about security and access control and so on? I mean, one of the challenges I hear from customers is, I suppose, the obvious things about how we secure things, how we do role-based access control, but also, I suppose, the number of projects that are out there, you know, Ranger and RecordService and Sentry and so on. What are your thoughts on how we apply security across SQL and Hadoop in a way that makes sense to enterprises?
Starting point is 00:35:31 But I suppose, particularly with your focus on enterprises, what are your thoughts on security around SQL and Hadoop? Yeah, so I think our vision, right, if you look into Drill, we went through things like metadata and stuff like that. Drill is always a distributed kind of world, right? So there is no central repository for anything, whether it is for metadata or for security permissions. So the same concept is applicable for security as well. So first of all, I think security is so critical, right? The moment you are exposing big data to a larger audience, BI users, data scientists, security is such a critical thing.
Starting point is 00:36:17 And the approach we took in Drill for security is a decentralized approach, just like metadata, a decentralized approach to security. So Drill can do both column-level and row-level security, and the way to do that is with Drill views, right? So simple SQL views, but the views are not catalog objects that you store and put permissions on; the views themselves are files on the file system. So when you create a view in Drill using CREATE VIEW AS, you literally see a .view.drill file on the file system. So the benefit of that is, when views are represented as your kind of platform constructs, such as files, you can use the same permission
Starting point is 00:37:06 model that you have on the platform to secure these views as well. You don't need to invent a separate technology to store the objects or the security permissions. So leveraging the platform security is kind of our model: a decentralized approach. And Drill has support for impersonation. So when a user comes in, we take the user identity and pass it along to the underlying platform. So it's a combination of views, impersonation, and of course it has support for things like Kerberos, SASL and basic authentication for different types of authentication mechanisms.
Starting point is 00:37:46 But the main difference is, Drill doesn't need a Ranger or a Sentry to secure it. It has its own decentralized security model.
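A hedged sketch of that decentralized model: a view narrows what a class of users may see, and because the view is just a .view.drill file on the file system, ordinary file permissions plus impersonation enforce access. The dfs.views workspace, paths, fields and the filter are illustrative assumptions; the workspace would need to be configured as writable.

```sql
-- Column- and row-level security expressed as a plain Drill view.
-- Creating it writes an emea_sales.view.drill file into the workspace;
-- securing it is then ordinary file permissions (e.g. chmod/chown on
-- that file), with impersonation carrying the querying user's identity
-- through to the underlying storage.
CREATE VIEW dfs.views.`emea_sales` AS
SELECT `order_id`,
       `amount`          -- sensitive columns simply omitted from the view
FROM dfs.`/secure/sales.parquet`
WHERE `region` = 'EMEA'; -- row-level restriction
```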
Starting point is 00:38:23 Okay, okay. So taking a look forward to where things are going, really. If you look at the MapR platform going forward and the stuff you're doing, again, looking at stuff you've written and stuff your company's written, there are some interesting things that you're doing there, things like microservices. I'm putting you on the spot here a little bit, really, but what are microservices in the context of how MapR talks about them, and why is
Starting point is 00:38:44 this something, again, that you guys are investing in and talking about, really? I'm not sure if it's your area, but certainly I've seen it mentioned quite a lot in MapR publicity material. What is that about, really? Yeah, so I think if you look into the MapR Converged Platform, one thing I mentioned before was it's not just an analytics platform.
Starting point is 00:39:10 It is analytics and operational apps, right? So this is actually an application development platform as well. So the way to think about it is, microservices is an architecture pattern, right? Rather than having monolithic apps, the application architectures are now evolving. They're more purpose-built. They're more self-contained. They're more elastic, depending on the needs. They're more scalable.
Starting point is 00:40:08 And these microservices are not triggered, like, every day at 3 o'clock in the morning with a timer; they are triggered by events. When a particular transaction happens in one system, there are five other systems doing five different types of processing, right? So these microservices are basically interconnected using a publish-subscribe kind of system, a messaging system. So why is it relevant for MapR? Because the first thing is, MapR has MapR Streams, which is a publish-subscribe kind of messaging system, which interconnects these microservices, right? So that is kind of one model. So people basically develop applications on MapR, on something like MapR-DB, they build an application, and they have hundreds of such applications that are interconnected via MapR Streams.
Starting point is 00:40:45 So the concept of microservices is in that idea: you are able to build apps, but you are also able to build connected apps, using MapR Streams. So that's why it is relevant for us. Okay, okay. So, I mean, I'm working on a system similar to that in a way, with publish-subscribe and real-time feeds going in and so on, and that's running in a cloud environment. So, as everybody moves to the cloud and adopts that, and Hadoop-as-a-service and so on, where does MapR see this going? And what might, for example, a MapR
Starting point is 00:41:19 analytics platform in a few years' time, maybe running in the cloud as services, what might it look like, really, to people coming into it then? What's your vision on that, really? Yeah, so first of all, MapR is available in the cloud today, right? So if you go to Amazon EMR, alongside the Amazon-based distribution, you can actually click on MapR and you can provision it and use it, right? So we actually do have hundreds of customers using MapR in the cloud as well.
Starting point is 00:41:55 So from that point of view, we do have a cloud offering, right? Certainly, I think when it comes to cloud, there is obviously more investment that we are doing. Mainly, I would say, it's around improving the experience of using the cloud, right? Whether it is being able to provision using cloud or container infrastructures such as Mesos or Kubernetes, the provisioning aspect. But also, the moment you're going to the cloud, you need to be able to handle elasticity, right?
Starting point is 00:43:01 Being able to scale up and down depending on the load you have. So how you manage resources from an application framework standpoint, that's an important area we are investing in, obviously. And also, once you are in the cloud and you are in this highly distributed environment, MapR as a platform has several features around that. So we do have things like global multi-master replication, which is, as your local systems are collecting some data, that data is immediately copied into some centralized data lake without any latency, right? So things like multi-master replication. So there are many features in the product for being able to handle global scale. And there is ongoing work happening with respect to making sure there are more intelligent, data-aware kinds of replication strategies and scheduling strategies. Yeah, so really, I think when it
Starting point is 00:43:34 comes to cloud, it's all about how easily you are able to do things and how efficiently you are able to manage your resources and processing. So there is a lot of foundation in the product already. We continue to make progress on that. Okay, what about Drill? I mean, in terms of the Drill project and so on, what's on the roadmap? What can we expect to see happening in the next, you know, six months to a year with Drill, really, in the future? Yeah, so with Drill, I think I can talk about it from probably two perspectives.
Starting point is 00:44:42 One is Drill as a product, how it is evolving, and then Drill in the context of MapR, right? So for Drill as a product, I think there is huge traction with respect to how customers are adopting Drill. One interesting thing I'm personally seeing is there is more desire to build analytics-as-a-service kinds of applications. So we have several customers who collect a lot of data, and they not only use it for in-house BI and reporting purposes, but they are also making it available as a product, data as a product, to their end users, right? So these are the kinds of requirements that come in, right? Mostly around very critical latencies, hundreds to thousands of users of concurrency. You need to be able to handle multi-tenancy, right? So things like, I have 15 tenants, one tenant is doing batch-type
Starting point is 00:45:19 of queries, one is doing dashboards, the third one is doing ad hoc queries, the fourth one is running an operational app. So how do you prioritize these workloads, how do you isolate these workloads? So there is a lot of work around multi-tenancy management. So I would say, from an enterprise perspective, performance, scale and resource management are kind of the core investments. And then obviously there are ongoing things with respect to SQL improvements, JSON support improvements, things like that. I think those are ongoing kinds of improvements as well.
Starting point is 00:46:13 Yeah, yeah. From a MapR standpoint, I think this is kind of an exciting time for us, because we have pretty good products, so we are spending a lot of time on integrating Drill with MapR Streams, basically doing ad hoc queries on streaming data, and then Drill on top of MapR-DB. So essentially, the way we see Drill in the context of MapR is, it's a unified SQL layer for MapR, right? For different types of use cases, BI or application development or any type of use case, it really is your unified access layer, your SQL access layer. So to that extent, there is a lot of deeper integration with the MapR Converged Platform.
Starting point is 00:46:40 Excellent. I mean, certainly, it took me a while to get you to actually do an interview, because you've been so busy with product releases and so on. So obviously there are a lot of very interesting things coming along there. You mentioned Drill against streaming data. So is that almost like continuous query, or some kind of different model, to be able to query a data stream while it's still in motion and still running? So those are two different types of use cases.
Starting point is 00:47:32 So MapR Streams is basically a publish-subscribe messaging system, and I mentioned that one of the aspects of MapR Streams is it can also be a system of record. So it's not ephemeral storage where things come in, data gets processed and it goes away; there is a use case like that, but people can also use it as a system of record. So they don't need to take that data out after three days; they can keep it forever, as a system of record. So Streams is a system of record in MapR. So what that means is, you have all your real-time data coming into MapR Streams, and as a business analyst, I can literally go to MapR Streams and start doing exploration and ad hoc queries.
Starting point is 00:48:06 I don't need to wait for it to come into some Parquet format or some MapR-DB workload. I can just go to Streams and start querying it, right? So that's one use case: basically, data exploration and ad hoc queries on top of real-time data. And the second use case is what you mentioned, which is more of a continuous query: as the data comes in, do something with it and load it into a downstream database or surface it in an application. So that's the second use case. We are starting with the first one, but
Starting point is 00:48:37 both are use cases for Drill and Streams. Fantastic, fantastic. So, I'm conscious of your time, Neeraja, so just to summarize there, where can people find information about Apache Drill, and where can people find information about MapR's initiatives in this area and the platform in general? Broadly, where would these be? Yeah, so first of all, Drill has a website, drill.apache.org, right? So that is the website for Drill, and there is excellent documentation.
Starting point is 00:49:15 So there are, I think, 20 or so tutorials that you can use to get started with Drill. There is also a sandbox that you can download to experiment with Drill on Hadoop. You can also download Drill onto your local machine and start working with it. So I think a great resource is the website. And the community is extremely active. So you can get onto the user@drill.apache.org or dev@drill.apache.org mailing lists. You can sign up, join the community. It is a very active community, so you can get a lot of help there.
Starting point is 00:50:00 From a MapR perspective, again, on the MapR website we have a lot of product information, use case information, data sheets, demos around Drill. So the MapR website is a great place to go for understanding the use cases and integrations available. So I highly recommend that as well. Okay. And you're being very modest there, because you actually have a blog on the MapR website as well, which is fantastic, and you do video whiteboard walkthroughs as well and that sort of thing. So anybody listening that is after more information about this, your blog is fantastic, really, and it's a great intro to the platform and to Drill and the concepts behind it as well. So I'd recommend that as well for anybody interested in this. Well, thank you, thank you very much for coming on the call. It's really good to speak to you, it's great to actually finally speak to you after reading so much about you, what you've done in the past and the concepts you talk about. Thank you very much. And, yeah, stay in touch, and good luck with the project and good luck with what you're doing at MapR. Yeah, thanks a lot. Thanks for giving me this opportunity. It's been a great pleasure talking to you. Thank you. Okay, thank you.
