Drill to Detail - Drill to Detail Ep.13 ‘Apache Drill, MapR + Bringing Data Discovery to Hadoop & NoSQL’ with Special Guest Neeraja Rentachintala
Episode Date: December 13, 2016
Mark Rittman is joined by MapR's Neeraja Rentachintala to talk about Apache Drill, Apache Arrow, MapR-DB, extending Hadoop-based data discovery to self-describing file formats and NoSQL databases, and... why MapR backed Drill as their strategic SQL-on-Hadoop platform technology.
Transcript
So, welcome to Drill to Detail, the podcast about big data, analytics and data integration,
and I'm your host, Mark Rittman.
So, if you've come to any of the presentations I've given in the past year
at the meetups and conferences and so on,
you've probably heard me talk enthusiastically
about something called Apache Drill
and how I've called it, in some cases,
the future of SQL on Hadoop:
a very interesting sort of product.
And so I'm very pleased in this episode
to be joined by Neeraja from MapR,
whose blogs I've read in the past,
and I've seen some of what she's written and presented on Drill.
And she's actually the Senior Director of Product Management at MapR,
responsible for product strategy, roadmap and requirements
for MapR's SQL initiatives, including Apache Drill.
So Neeraja, do you want to just introduce yourself properly,
and tell us how you came to be doing this, and what you do, really?
Yeah, sure. First of all, thanks, Mark, for having me here.
Yes, so my name is Neeraja Rentachintala. I am with MapR; I have been with MapR for about three years now. And my responsibilities, from a product standpoint at MapR, are mainly two things.
One is our SQL strategy. So we do offer a variety of SQL products on the MapR platform, and one of
the strategic areas we focus on is Apache Drill. So, SQL strategy. The second aspect is, I'm also
responsible for our high performance NoSQL
database that we have on the MapR platform, called MapR-DB. So, yeah, I'm responsible both for
SQL as well as for NoSQL products. Prior to MapR, I was at Informatica, working on a
product called Informatica Data Services. This is part of the enterprise data integration suite that Informatica has.
And before that, I was at Microsoft, as part of the SQL Server business intelligence products:
SQL Server Reporting Services, as well as Power BI.
Before that, I was at Oracle and Expedia.com. So mostly, I have spent my career focusing on enterprise application integration, data integration, business intelligence and analytics.
So those kind of things.
Wow, interesting, interesting.
So, I mean, I came across your name and what you've been doing when I started looking at Drill and the MapR platform.
And I remember thinking at the time, this is somebody I'd really like to speak to.
It sounds like you've worked with some very interesting products
and you're looking after a very interesting area within MapR at the moment.
So for anybody that is new to Drill or this kind of area,
just explain, at a high level:
what is Apache Drill, and what does it do differently to, say, Hive and
other previous SQL-on-Hadoop
initiatives?
Okay.
So I think at the highest
level, right, so the way I like to describe
Drill is, first of all, it's an open
source interactive
SQL query engine that
can provide data exploration
as well as BI ad hoc queries on big data,
right? So essentially, at the core of it, it is a distributed in-memory SQL engine
with which you can get low latency performance on large scale data sets, right? So the interactive
performance and big data is the key thing. However, what's different about Drill, right? So
there are so many interactive SQL products out there. So what's really different about Drill is
along with the performance, it also gives a lot of flexibility, right? So we can go into that a
little bit later. But at the basic level, Drill allows you to do SQL queries without having to define any schemas upfront.
So you just point to the data wherever it is, files, HBase, whatever your data source is,
and you can start doing queries in minutes. So you don't have to spend probably like weeks or
months of modeling time. You're able to look into the data and understand the data immediately. So really,
working with the new types of data that are common in Hadoop and the big data world, and getting value
from them quickly, is kind of the focus, or the differentiation, for Drill.
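To make that concrete: assuming the default dfs storage plugin, and with a hypothetical file path and field names, a first query against raw JSON needs no DDL at all.

    -- Query a raw JSON file directly: no CREATE TABLE, no schema definition
    SELECT t.visitor.id AS visitor_id,
           t.event_type
    FROM dfs.`/data/clickstream/events.json` t
    LIMIT 10;

Drill works out the structure, including the nested visitor record, from the file itself as it reads it.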
Yeah, yeah. And that's certainly... to me, listening to that and reading that on a kind of data sheet: a lot of vendors will say, you know, the products they've got are quick to set up, they're easy to use and so on.
But it certainly struck me, the way you're doing this and the way that, and we'll get onto this later on, the way that you're kind of leveraging, I suppose, the inbuilt metadata in a lot of the data types coming to Hadoop these
days. You know, it is very interesting. And so, I mean, for me there were two parts: the fact it was so
easy to download and install (it literally was: download a zip file, unzip it and work with it),
but also this concept you're talking about, about being able to analyze data immediately without
having to define, you know, metadata and that sort of thing. That's pretty revolutionary,
and I don't think people get the significance of that at the start.
So talk us through what that means then.
How do you do analysis faster in terms of kind of getting access
to the metadata and structures and so on, really?
Yeah, so I think if you look into how the BI and analytics
have evolved over time, right?
So there is clearly kind of an evolution
towards doing things self-service, right?
So the traditional reporting is basically
the data warehousing team putting together a data model
and the BI users would consume kind of pre-built reports.
And probably a decade ago or so,
there was a lot of
innovation on the BI side of the world. Tableau, QlikView: they all came in and said, okay, there is
data, but as a business user I need flexibility, I need self-service capabilities, so that I can
visualize the data in whatever way I want, right? Not a fixed, predefined reporting format. So there was a lot of innovation that happened
to make the BI world more self-service. But I think on the data side, things are still kind of
the same way, right? In the sense that you need to do the ETL, you need to prepare your data
to be ready for analytics. There is nothing wrong with that, and there is complete value in it.
But in the context of big data, though, it's more challenging in the sense people are collecting
these huge amounts of data into their data lakes or data hubs.
And this data is transactional data, semi-structured data sets, clickstream, sensor, all sorts
of data sets.
And there is a need, we have seen clearly from customers, to actually see into the data first before they decide what kind of data to operationalize, right? So they need to
understand the data, discover the data. So I think, from a business value perspective, it's about opening up the data early in the cycle to users, for data exploration and data discovery purposes, so that they can figure out what to do with the data.
I think that's the motivation, the business motivation, for the product.
Yeah, definitely. I mean, before getting involved with Drill and some of the technologies you're working with there, I was working with data discovery tools from the likes of Endeca and so on,
where the idea there was that you could analyze data in place, no data left behind and so on.
But they typically involved loading it into their own kind of NoSQL or key-value store, analyzing it there and so on.
And the thing that struck me with Drill was the fact that you basically point it towards a data source and it tries to use the inbuilt metadata that's in there, really.
And if you think about some of the data types we'd have now, JSON and so on and so forth, there's a lot of structure in there that you can leverage.
And so Drill leverages that inbuilt metadata, is that correct? Right, so I think the way to think about that is
I think there is obviously a lot of collateral that we have
where we talk about Drill as a schema-less kind of system.
So I think there is a thing to say about that, right?
So when we say schema-less, what we are talking about is really that
there is no central repository of schema.
But if you take these formats you just mentioned, JSON: there are formats where there
is actually some level of schema in the file itself, right? So if there is a schema available
in the underlying file format, Drill can discover that on the fly. Drill also can query something like Hive tables,
where you already have schema definitions.
So if you have underlying sources that have schema,
Drill doesn't tell you,
okay, go and define a separate special model for me
in a repository again.
Drill can discover this on the fly, right?
And then there are obviously some kinds of file formats where you
don't have the schema at all, or you have a very partial schema. HBase is a great example, where you
have column families and table names, but beyond that you don't have anything. So those are
the kinds of things where it can discover schema on the fly.
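To sketch that HBase case (the table and column-family names here are hypothetical, assuming an enabled hbase storage plugin): Drill sees the table and its column families, but the values come back as binary, so a typical query decodes them explicitly.

    -- Drill only knows the table name and column families up front;
    -- cell values are VARBINARY, so decode them in the query
    SELECT CONVERT_FROM(row_key, 'UTF8')          AS cust_id,
           CONVERT_FROM(t.personal.name, 'UTF8')  AS full_name
    FROM hbase.customers t
    LIMIT 5;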
Yeah. I mean, Drill for me actually changed
quite radically how I approach doing, I suppose, initial discovery on Hadoop projects. You
know, in the past, as you said, there's been this kind of irony that, in this talk of
schema-on-read and flexible schemas and so on, if you're going to query things through Hive, for
example, there's a kind of classic, you know, modeling exercise that goes on. You have to
understand the tables and columns and the structures and so on, maybe use a SerDe or
whatever to translate, you know, JSON elements into columns. And that's quite a
complicated and time-consuming and fragile process. But now, you know, typically,
as you say, you download Drill, point it towards some files, point it towards
some Parquet files or a Hive metastore and so on.
And, you know, rather than it being centralized metadata, as you said, and curated and so on, you can make use of what's there.
But also, particularly if someone has gone and set up Hive, for example, a Hive metastore and so on, you can just use that as well. And so you can dip into these different sources of data, and the metadata they provide, without having to model it all again centrally in
your own kind of, you know, data warehouse, for example.
Exactly. I think when you look into most of the customers, if you look into our customer use cases, most of the time
they are using both these things together, right? So the truth is, I think Hive is used by probably 90-plus percent of the use cases.
The primary focus is batch and ETL processing.
So they have data where they do batch processing, they do ETL processing,
and you have Hive metastore populated and all the groundwork is done.
So if your groundwork is done, then the value that Drill can offer is
to let you reuse that metadata and do interactive queries on those tables,
right? So both of these are complementary solutions: Hive is batch and ETL, Drill is focused
on interactive queries. The additional thing is the data that is not processed by Hive yet, some JSON data, some log data:
Drill lets you look into that as well.
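A sketch of that complementary pattern, with hypothetical table and file names and assuming the hive storage plugin is configured: one Drill query can reuse a table already defined in the Hive metastore and join it to raw JSON that no ETL job has touched yet.

    -- 'orders' has its schema in the Hive metastore;
    -- the JSON log file has no schema defined anywhere
    SELECT o.order_id, o.status, l.referrer
    FROM hive.`default`.orders o
    JOIN dfs.`/raw/logs/2016-12-01.json` l
      ON o.session_id = l.session_id;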
Yeah, and for me, the most surprising point was performance.
And I kind of expected Drill to be slow when I first used it,
when querying JSON documents and so on,
especially given the fact it was just downloaded and run on my local machine and so on. But the speed of it was
fantastic. And do you want to just... you talked about Hive earlier on: in terms of the two tools complementing each other, first of all, what's the performance profile like
for Drill, and how do people use it in combination with
Hive? And where's the kind of sweet spot between the two, and that sort of thing?
Yeah, so I think from a use case standpoint, from a sweet spot perspective, right,
there is a very distinct separation between Hive and Drill.
So I think Hive traditionally has been, and continues to be, used for batch and ETL processing,
right? And we do have customers who use Drill for ETL, and who use Drill for batch processing,
but we typically recommend them not to do that, right? One reason is: if you have a MapReduce job which is, basically, a five-hour kind of job,
then MapReduce, or a framework like Hive or Pig on top of MapReduce, is really designed for those kinds of things, right?
The primary difference in architecture is in-memory processing, right?
Drill assumes that queries are going to be fast, so it tries to do things as much as possible in memory, using a pipelined kind of execution model.
And unlike MapReduce, or Spark, or any other basically ETL-oriented technology, it doesn't spend time writing to disk for checkpointing and recovery purposes.
It's trying to do things as much as possible in memory, and it goes to disk only if things
don't fit in memory, right? So the entire execution pipeline is designed for performance. And there are
also other things along with in-memory: it's a distributed engine, right, so you can add nodes
to improve performance, and there are things like columnar execution. So there are a variety of other
factors contributing to performance, but the primary difference is whether you are in-memory, or you are continuously
writing to disk for recovery and checkpointing purposes.
Okay. So the way you describe it there, with the in-memory processing, it's doing it all within memory, a bit like Tez, for example. Can you describe succinctly the difference in architecture and the difference in design that Drill has, compared to, say, things like Tez or Impala and so on? Fundamentally, what's the difference in terms of how you built it
compared to those?
Yeah, so fundamentally, I think there are different categories, right? So if you
take Drill and Impala, I would put them,
from a performance profile standpoint, both in the same category, right?
The other category is Tez, or MapReduce. Those are in a
different category. So Impala does pretty much similar things.
Both are in-memory engines.
It's just the data model that differs,
because Drill is more optimized for schema-less, hierarchical data sets.
So there is a fundamentally different underlying in-memory data model.
But from a core architecture perspective,
both are distributed in-memory architectures, right?
The difference, really, is how you do the execution, right?
It is, as I mentioned, pipelined execution versus... there are improvements in Tez over MapReduce, but it is still leveraging disk.
There is a fundamental architecture difference in how you schedule the different tasks.
Okay, okay.
So Drill, I've been working recently with BigQuery, for example, on the Google platform.
And my understanding is that Drill had some kind of inspiration from Dremel, for example, from Google.
Is that the case? Is there some kind of common thinking there,
or inspiration from that kind of project within Google?
Yeah, certainly, right.
I think the inspiration is really about scale.
So the biggest thing that you need to solve in the context of big data,
the core thing, is how do you process terabytes and terabytes of data,
and also offer optimizations so that you don't have to scan all of that data:
things like partitioning, and pruning the partitions.
So there are a variety of optimization strategies.
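As one example of that kind of optimization: when files are laid out in subdirectories, Drill exposes the directory levels as implicit dir0, dir1, ... columns, and a filter on them prunes whole directories from the scan. The year/month layout here is hypothetical.

    -- Files stored as /data/events/<year>/<month>/...;
    -- only the 2016/12 subdirectories are actually scanned
    SELECT COUNT(*)
    FROM dfs.`/data/events`
    WHERE dir0 = '2016' AND dir1 = '12';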
So I think the inspiration from Dremel is really around
the distributed parallel execution, right? That's really kind of the motivation.
I think Drill is still different from Dremel, in the respects we talked about: discovering
schemas on the fly, being able to work with nested data and things like that. Those are still
kind of differentiators for Drill, even from Dremel. But the scale is kind of the common aspect.
And the scale is interesting, because, again, the typical way I guess most people encounter Drill is through
downloading it as a single... it's a drillbit, I think you call it... and downloading it onto
your machine and running with it. But how does it then run in clusters, and
how does it scale up beyond that, really? What's the mechanism behind that, really?
Yeah, so when you're downloading Drill onto your local machine, right,
the core daemon in Drill is called the drillbit.
So this is the service which takes requests from the user.
It is parsing the SQL query, optimizing it, executing it,
working with the data sources to get the data,
and giving you back the results, right?
So, the drillbit: when you're downloading Drill onto your machine,
you are basically downloading one drillbit.
The moment you put it into a cluster configuration,
you are essentially adding more drillbits, right?
So when I deploy Drill in a Hadoop
cluster, I am essentially deploying a drillbit on, preferably, each of the different data nodes in the
cluster, right? So each data node is running a drillbit instance. So this is kind of the core service.
And the way it scales is: there is no master-slave architecture,
right? It is a completely distributed architecture. So what this means is, when
a client is submitting queries, your Tableau or MicroStrategy submitting
queries, it can submit to any drillbit on the cluster. So each drillbit is
identical, and each is capable of parsing the query and optimizing it.
And once it's optimized, splitting the query into query fragments and distributing them among the other drillbits available on the cluster.
Right. So execution is distributed.
And likewise all our optimizations:
everything is parallel. And another aspect, I think common across
other processing engines, is data locality. When you're trying to process
in a highly distributed environment, you want to make sure
the processing is done on the same nodes where the data resides.
So understanding the location of the data, and scheduling
processing on those nodes, is a part of scale as well, actually.
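Because every drillbit is identical, you can inspect the cluster from SQL itself. For instance, Drill's sys tables list the drillbits that have registered (exact column names may vary a little between Drill versions):

    -- One row per drillbit; any of them can accept, plan and coordinate a query
    SELECT hostname, user_port, control_port, data_port
    FROM sys.drillbits;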
suppose a kind of a taking a step back question really um so what why did why
did a map are because obviously you know what you work for my power and and and a
patchy drill is an open source project why did why did map our think that drill
was the kind of the the
sequel on the sequel solution it wanted to go for and and you know what why is it such a big part
Yeah, so I think when MapR pioneered the Drill project, the main SQL technology out there was Hive, right?
And Impala was probably fairly new at that time as well, maybe like a year old, I don't
remember the exact time frame, but fairly new in the market. So I think, when we
initiated this project, I would say the main motivation was, first of all,
we have seen that the needs for SQL in our customer environments are evolving, right?
So there is a class of customers that were offloading basically from Teradata and all
sorts of data warehouses into Hadoop for scale and to be able to reduce their costs, right?
Cost and scale are the big reasons.
But at the same time, there is a class of customers,
they adopted Hadoop, not just for scale and cost reasons,
but they are able to actually bring in new types of data
that they couldn't before, right?
So these things, again, are click streams, logs, sensors,
IoT kinds of data, telco customers. So many customers are bringing in new types of data, and for them, having this
whole thing of, okay, to do anything with the data you have to start with a modeling
exercise or ETL processing, was almost a showstopper, in the sense that they could do only
limited things with that kind of paradigm. So I think the thinking really was: this is a new world,
new data types and new kinds of scale requirements, and you can't just reinvent your relational
paradigm here. You need to think about SQL in a different way,
right? So I would say that's kind of the main motivation: different types of data sources,
structured and semi-structured, and you need to think about them from a different perspective. And the beauty is,
you have to bring it to the same set of users, right? These users are, again, your BI users, Excel users, Tableau users. So
how do you do this bridging of the gap between your SQL paradigm and, at the same time,
a NoSQL kind of data world? So I think that's where kind of the motivation for Drill came from.
Yeah, yeah. I mean, certainly, I'll be honest, it was the thing that got me
interested in MapR again. And I think, you know, what a great choice of
SQL engine, really. I think you've absolutely nailed it there, saying that a whole,
you know, a massive motivation, a driver, for using Hadoop is not just cost; it's the flexibility,
it's the ability to bring in new data sources and so on. And yet we've done things in the
same way up until now, you know,
modeling things very formally and so on.
So I'd like to get on to the kind of the wider platform in a bit.
But just before we go on to that, actually: the person who put me in contact with you was my old colleague,
Robin Moffatt, who I think wrote a blog post for MapR on connecting the Oracle BI tools to Drill.
So what's the support like currently for Drill within the kind of the BI tools kind of world?
And how does it kind of work?
I mean, how do you kind of reconcile the flexible kind of schema, I guess, you get from Drill
with the more formal kind of metadata layers you get in BI tools?
I mean, how does that tend to work?
Yeah, so Drill, first of all, provides ANSI SQL, right?
So it uses Apache Calcite as the SQL parsing layer,
and we have done extensions to Calcite to support parallel optimizations.
But the main syntax in Drill is obviously ANSI SQL, right?
So we have pretty good support for it.
Beyond ANSI SQL, we have done extensions so that you can work with nested data.
So there are ANSI SQL extensions such as FLATTEN and KVGEN.
There are all sorts of functions available, so you can flexibly work with nested data.
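As a small sketch of those extensions, on a hypothetical file whose documents carry a sensors map: KVGEN turns a map with arbitrary keys into an array of key/value records, and FLATTEN turns that array into rows.

    -- KVGEN + FLATTEN: one output row per entry in each document's 'sensors' map
    SELECT s.kv.key      AS metric,
           s.kv.`value`  AS reading
    FROM (
      SELECT FLATTEN(KVGEN(t.sensors)) AS kv
      FROM dfs.`/data/devices.json` t
    ) s;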
So, with respect to the BI tool integration, we do have JDBC and ODBC drivers, right? So that's
what bridges it, from a connectivity perspective. And so the kind of flow that happens is: if you
have simply structured data, there is nothing much you have to do, right? You just take it and expose
it, and you can access it via JDBC and ODBC. If you have nested data, which is like eight-level nested JSON documents,
and you need to reach into the nested data.
So as part of our ODBC driver, we have something called Drill Explorer,
which allows you to look into the data that is available through Drill,
and you can create views on top of this nested data.
So I can use something like a flatten function
to create kind of a relational representation
of the underlying data.
So you create this view,
and then you connect from ODBC in Tableau,
and this view shows up as just like any other table.
Right. And then you are immediately working with it.
So I think the key point is: the bridging is happening either automatically, or the bridging is happening because you have done the data exploration and you have created views.
A view is just logical. So there is no ETL or physical representation,
but you are creating these views,
which present a relational representation of the underlying data to the BI tools.
Fantastic, fantastic.
So, okay. So Drill was my kind of intro back into MapR.
But looking at your website, looking at the materials, working with the platform and so on: you've got quite
an interesting platform, with what you call the MapR Converged Data Platform. And there's a lot of interesting technology, and a lot of interesting value-adds,
in there that I haven't seen before, things like microservices and so on and so forth.
So again, just for anybody new to MapR,
maybe paint a picture for what MapR's wider platform is like for Hadoop
and for big data analytics and so on.
Just paint a picture, first of all, of what's there
and, again, differentiate it really from the other things we've been used to in the past.
Yeah, so at the highest level, I would say the MapR platform, the MapR Converged Data Platform, is something you would use for analytics, right?
Obviously, whether it is traditional BI reporting kinds of analytics, or new types of exploration, machine learning kinds of analytics.
So this is a platform for analytics.
But at the same time, the converged platform is something you would use to build
mission-critical operational applications. So what MapR offers is a unified platform
on which you can run analytics and operational applications. So that's kind of our differentiation.
So it's not simply an analytics platform, but it is your next generation data platform.
And so, with respect to the components:
we do have a distributed storage system, called the MapR File System.
And then we have a high-performance NoSQL database.
So this is what you would use to build actual apps,
applications, business applications.
It's called MapR-DB. And then
there is a global publish-subscribe messaging system called MapR Streams. So it's a streaming
system. So that's kind of the three products: storage, database and streaming. And all of these are
built on a core platform that provides a variety of
services, such as reliability, consistency, multi-tenancy and security. So all the products
are built on the same platform, and they inherently get the benefits of that platform.
So I think our vision, our differentiation, is really about bringing these different data models into one platform, so that it can serve both analytics and application needs.
Okay.
So my understanding as well with MapR was that a difference was that you were, you know, more open, I suppose, to using more proprietary things
you developed yourself, and combining that with the platform.
I mean, what's MapR's philosophy in terms of, you know, use of open source versus use of your own technology? You know, again, to try
and put a picture in people's minds: how would your platform differ from, say,
pure open source, really? Where do you add value, and what's different, really, about it?
Yeah, so I think the thing that I mentioned, the converged platform itself, is our differentiation, right? Which is the file system, DB and streams, all together in a single platform, right? So you are literally deploying not three clusters, a Cassandra cluster, a Kafka cluster, a Hadoop cluster, for different needs; you have a single MapR cluster, right? So I think that's kind of our innovation.
As for our philosophy: for each of these products,
if you take our file system,
there is an NFS, POSIX-compliant interface;
at the same time, it is also exposed via the HDFS interface.
And MapR-DB is exposed via the HBase interface.
The streaming system is exposed with the Kafka API.
So the idea is really leverage our platform,
but also be able to leverage kind of the innovation happening in the open source.
So that has been the philosophy.
And we do invest both in the platform
as well as the top layer, the open source layer.
So drill exactly fits in there.
Exactly, exactly.
I find it very interesting.
Yeah, I mean, so you mentioned early on that you're also responsible
for MapRDB as well.
So tell us about that.
What's that?
And again, what is different about that?
What's the innovation and so on there?
Yeah, so, MapR-DB: first of all, when MapR originally started, I think several years back,
the kind of value prop we started with was basically to provide a high-scale, highly reliable,
multi-tenant storage platform. It's not just secondary storage for analytics, but it is also
a read-write file system that you can use as primary storage, right? And it is
exposed via HDFS, so you can run Hadoop MapReduce. So making Hadoop enterprise-grade was kind of our
original positioning. And in the next wave, the next phase, we basically added, on the same platform, a different data model. The data model is a key-value store.
So, the benefit of MapR-DB: MapR-DB is essentially a NoSQL database.
The benefit is it's extremely scalable
and highly performant, and, most importantly, the users,
the customers, use MapR-DB whenever
they have extremely critical, mission-critical SLA
requirements. So originally, I mean, we do have HBase in our distribution.
Several customers had hit issues, kind of operational issues, with HBase,
especially around things like having to deal with compactions and stuff like that,
which was impacting their SLAs.
So I think one of the primary things, the primary promises, with MapR-DB is that it's integrated into
MapR, and, most importantly, it is basically a NoSQL database that can serve your mission-critical
needs with no spikes in latencies.
Okay, interesting, interesting.
Yeah, interesting.
So with Drill itself, I mean, what –
so there's other projects around as well that are kind of interesting in this space,
and I think I've seen an article you've written in the past on Apache Arrow, for example,
and there's Druid as well and so on.
I mean, Apache Arrow in particular, again, maybe just introduce what it is sort of thing,
but there's an article you wrote on it.
What interests you about Arrow and where does kind of Arrow fit into this really with your thinking and maybe kind of the plans in the future?
Yeah, so Arrow is basically an in-memory data format, right? An in-memory data interchange format.
So, why Arrow, right?
Big data analytics has evolved over time, right?
If you look into the last two, three years, there was a lot of investment around improving
columnar data formats like Parquet or ORC.
And then there is a lot of work going on in query engines like Drill and Spark and Impala.
So the bottom line is, an efficient columnar representation, on disk as well as in memory,
is kind of core to achieving performance in analytics workloads.
So the main thing is, Drill is actually one
of the first big data query engines which is columnar on disk, but it is also
columnar in memory. And I'm not sure if you are familiar, but the Arrow format
takes its roots from the in-memory representation that was developed as part of the Apache Drill project. It's called value vectors.
So this is basically kind of the foundation for Drill: the in-memory representation of Drill
is value vectors, which now have been modularized into a separate project called Arrow. Because now most
of the systems, right, if you take Spark or Impala or any
other query engines, or even the APIs, like Python or R, they're all moving
towards columnar processing, right? So it made sense to take the in-memory
columnar format that Drill had, modularize it, and make it a separate
open source project. So, to that extent, Arrow actually takes its roots from Drill,
and you can see that the PMC members,
the committers, are largely the same people,
but it's a broader forum now.
Arrow has more people than Drill,
more people beyond the Drill team,
that are contributing to Arrow.
So we're really supportive of that initiative. And we sort of have Arrow already, but eventually, as the
project matures, we totally are going to integrate it and contribute to it.
Okay, interesting. So, just to round off on the SQL side:
what about security and access control and so on?
I mean, one of the challenges I hear from customers is,
I suppose, the obvious things about how do we secure things,
how do we do role-based access control,
but also, I suppose, the number of projects that are out there,
you know, Ranger and Record Service and Sentry and so on.
What are your thoughts on how we apply security across SQL-on-Hadoop
in a way that makes sense to enterprises?
I suppose, particularly with your focus on enterprises,
what are your thoughts on security around SQL-on-Hadoop?
Yeah, so I think our vision, I would say... if you look into Drill, we went through things like metadata and stuff like that; Drill is always a distributed kind of world, right? There is no central repository for anything, whether it is for metadata or for security permissions. So the same concept is applicable for security as well.
So first of all, I think security is so critical, right?
The moment you are exposing big data to a larger audience, BI users, data scientists, security is such a critical thing.
And the approach we took in Drill for security is a decentralized approach, just like with metadata:
a decentralized approach to security.
So, Drill can do both column-level and row-level security, and the way to do that
is with Drill views, right? Simple SQL views. But the views are not catalog objects that
you store and secure and put permissions on; the views themselves
are files on the file system. So when you create a view in Drill using CREATE VIEW, you literally
see a .view.drill file on the file system. So the benefit of that is, when views are
represented as platform constructs, such as files, you can use the same permission
model that you have on the platform to secure these views as well. You don't need to reinvent
a separate technology to store the objects' information, to store security permissions.
So leveraging the platform security is kind of our model: a decentralized approach. And Drill has support
for impersonation. So when a user comes in, we take the user
identity and pass it along into the underlying platform. So it's a combination
of views and impersonation, and of course it has support
for things like Kerberos, SASL and basic authentication, for different types
of authentication mechanisms.
But the main difference is, Drill doesn't need a Ranger or a Sentry to secure it.
It has its own decentralized security model.
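A minimal sketch of that model, with hypothetical workspace, file and column names: the view itself encodes the row and column restrictions, and because it lands on disk as a .view.drill file, ordinary file system permissions control who can use it.

    -- Expose only non-sensitive columns, and only one region's rows;
    -- the resulting emea_customers.view.drill file is secured with normal
    -- file permissions, and impersonation carries the querying user's
    -- identity down to the underlying data
    CREATE VIEW dfs.views.emea_customers AS
    SELECT customer_id, city, segment
    FROM dfs.`/secure/customers.json`
    WHERE region = 'EMEA';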
Okay, okay.
So, taking a look forward to where things are going, really:
I mean, if you look at, I suppose, the MapR platform going forward and the stuff you're doing, again, looking at stuff you've
written and stuff your company's written, there are some interesting things that you're doing there.
I mean, things like microservices. I'm putting you on the spot here a little bit, really, but, you know,
what are microservices, in the context of how MapR talks about them, and why is
this something, again, that you guys are investing in and talking about, really?
I'm not sure if it's your area,
but certainly I've seen it mentioned quite a lot
in MapR's publicity material.
What is that about, really?
Yeah, so I think if you look into the MapR Converged Platform,
so one thing I mentioned before was
it's not just an analytics platform.
It is analytics and operational apps, right?
So this is actually an application development platform as well.
So the way to think about it is microservices is like an architecture pattern, right?
Rather than having monolithic apps, now the application architectures are evolving.
They're more purpose-built.
They're more self-contained.
They're more elastic, depending on the needs.
They're more scalable.
And these microservices are not triggered like every day, 3 o'clock in the morning with a timer, but these are triggered by events.
When a particular transaction happens into one system, there are five other systems doing five other different types of
processing, right? So these microservices are basically interconnected using a publish-subscribe
kind of system, a messaging system. So why is it relevant for MapR? The first thing is,
MapR has MapR Streams, which is a publish-subscribe kind of messaging system,
which interconnects these microservices, right? So that is kind of one model. So people basically
develop applications on MapR, on something like MapR-DB; they build an application,
and they have hundreds of such applications that are interconnected via MapR Streams.
So the concept of microservices is in that idea: you are able to build apps, but you are also able to build connected apps, using MapR Streams.
So that's why it is kind of relevant for us.
Okay, okay. So, I mean, I'm working
on a system similar to that, in a way, with publish-subscribe and real-time feeds going in and so on.
I mean, that's kind of running in a cloud environment. So, as everybody moves
to the cloud, and everybody adopts that, and Hadoop-as-a-service and so on, you know,
where does MapR see this going? And what might, for example, a MapR
analytics platform in a few years' time, maybe running in the cloud as services, what might
it look like, really, to people coming into it then? What's your kind of
vision on that, really?
Yeah, so first of all, MapR is available in the cloud today, right? So if you go to
Amazon EMR, alongside the Amazon-based distribution,
you can actually click on MapR,
and you can provision it and you can use it, right?
So we actually do have hundreds of customers
using MapR in the cloud as well.
So, from that point of view,
we do have a cloud offering, right?
So certainly, I think when it comes to cloud, there is obviously more investment that we
are doing.
Mainly, I would say, it is around improving the experience of using the cloud, right?
So whether it is being able to provision using cloud or container infrastructures, such as Mesos or Kubernetes,
that sort of provisioning aspect.
But also the moment you're going to cloud, you need to be able to handle elasticity, right?
Being able to scale up and down depending on the load you have.
So, basically, how you manage resources from an application framework standpoint, that's an important area that we are investing in, obviously.
And there are also, once you are in the cloud and you are in this highly distributed environment... MapR as a platform has several features around that. We do have things like global multi-master replication, which means that, as your local systems are
collecting some data, immediately that data is copied into some centralized data lake
without much latency, right? So, things like multi-master replication. There are so many
features in the product for being able to handle that global scale. So there is ongoing
work happening with respect to making sure that there are more intelligent, data-aware
kinds of replication strategies and scheduling strategies. Yeah, so really, I think when it
comes to cloud, it's all about how easily you are able to do things
and how efficiently you are able to manage your resources and processing.
So there is a lot of foundation in the product already.
We continue to make progress on that.
Okay, what about Drill?
I mean, in terms of the Drill project and so on, what's on the roadmap?
What can we expect to see happening in the next, say, six months or a year with Drill, really, in the future?
Yeah, so with Drill, I think I can talk about it from probably two perspectives.
One is Drill as a product, and how it is evolving; and then Drill in the context of MapR, right?
So, for Drill as a product: I think there is huge traction with respect to how
customers are adopting Drill. One interesting thing I'm personally seeing is there is
more desire to build analytics-as-a-service kinds of applications. So we have several customers:
they collect a lot of data, and they not only use it for in-house BI and reporting purposes, but they are also making it available as a product, data as a product, to their end users, right?
So these are the kinds of requirements that come in, right? Mostly around
very critical latencies, and hundreds to thousands of users of concurrency. You need to be able to
handle multi-tenancy, right? So, things like: I have 15 tenants; one tenant is doing batch types
of queries, one is doing dashboards, the third one is doing
ad hoc queries, the fourth one is running an operational app. So how do you prioritize
these workloads, how do you isolate these workloads? So there is a lot of work around
multi-tenancy management. So I would say, from an enterprise perspective, performance, scale and resource management
are kind of the core investments.
And then obviously there are ongoing things
with respect to SQL improvements,
JSON support improvements, things like that.
I think those are ongoing kinds of improvements as well.
From a MapR standpoint, I think this is
kind of an exciting time for us, because we have pretty good products. So we are spending a lot of
time on integrating Drill with MapR Streams, basically doing ad hoc queries on streaming data, and then Drill on top of MapR-DB. So essentially, the way we see Drill
in the context of MapR is, it's a unified SQL layer for MapR, right? So for different types of use
cases, BI or application development, or for any type of use case really, it is kind of your unified access layer, your SQL access layer.
So, to that extent, there is a lot of deeper integration
with the MapR Converged Platform.
Excellent. I mean, certainly, it took me a while to get you to actually
do an interview, because you've been so busy with product releases and so on.
So, yeah, obviously there's a lot of very interesting things
coming along from there.
You mentioned Drill against streaming data.
So is that almost like continuous query, or some kind of different model, to be able to
query a data stream while it's still in motion and still running?
So those are two different types of use cases.
So MapR Streams is basically a publish-subscribe messaging system. And I
mentioned that one of the aspects of MapR Streams is that it can also be a system of record.
So it's not ephemeral storage, where things come in, data gets processed, and it goes away. So
there is a use case like that, but people can also use
it as a system of record. So they don't need to take that data out after three days; they can keep
it forever, as a system of record. So Streams is a system of record in MapR. So what that means is,
you have all your real-time data coming into MapR Streams, and as a business analyst, I can literally go to MapR Streams
and start doing exploration and ad hoc queries.
I don't need to wait for it to come into some Parquet format
or some MapR-DB workload.
I can just go to Streams and start querying it, right?
So that's one use case:
basically, data exploration and ad hoc queries on top of
real-time data. And the second use case is what you mentioned, which is more of a
continuous query: as the data comes in, do something with it and load it into a downstream database, or
surface it into an application. So that's the second use case. We are starting with the first one, but
both are use cases for Drill and Streams.
Fantastic, fantastic. So, I'm conscious
of your time, Neeraja. So, just to summarize: where can people find information
about Apache Drill, and where can people find information about MapR's initiatives
in this area, and the platform in general?
Broadly, where would these be?
Yeah, so first of all, for Drill there is a website, drill.apache.org, right?
So that is the website for Drill.
And there is excellent documentation.
There are, I think, 20 or so tutorials that you can use to get started with Drill. There is also a sandbox that you can download
to experiment with Drill on Hadoop.
You can also download Drill onto your local machine
and start working with it.
So I think a great resource is the website.
And the community is extremely active.
So you can get onto user@drill.apache.org or dev@drill.apache.org; you can
sign up and join the community. It is a very active community, so you can get a lot of help there.
From a MapR perspective, again, on the MapR website we have a lot of product information, use case information, data sheets and demos around
Drill. So the MapR website is a great place to go for understanding the use cases and
integrations available. So I highly recommend that as well.
Okay. And you're being very modest there,
because you actually have a blog on the MapR website as well, which is fantastic. And you do
kind of video whiteboard walkthroughs as well, I think, and that sort of thing.
So anybody listening that is after more information about this:
your blog is fantastic, really, and it's a great intro to the platform, and to Drill, and to
the concepts behind it as well. So I'd recommend that as well, for anybody interested in this.
Well, thank you. Look, thank you very much for coming on the call. It's really good to speak
to you. It's great to actually finally speak to you, after reading so much about you and what you've done in the past, and the concepts you talk about.
Thank you very much. And yeah, I mean, stay in touch, and good luck with the project, and good luck with what you're doing at MapR.
Yeah, thanks a lot. Thanks for giving me this opportunity. And it's a great pleasure talking to you.
Thank you. Okay, thank you.