Drill to Detail - Drill to Detail Ep.51 'Druid, Imply and OLAP Analysis on Event-Level Datasets' With Special Guest Fangjin Yang

Episode Date: March 13, 2018

Mark Rittman is joined by Special Guest Fangjin Yang to talk about the history of Druid, a high-performance, column-oriented, distributed data store originally developed by the team at Metamarkets to provide fast ad-hoc access to large amounts of event-level marketing data, and his work at Imply to commercialise Druid and build a suite of supporting query and data management tools.

Druid project homepage
Druid - A Real-Time Analytical Data Store (pdf)
Druid - Learning about the Druid Architecture
Imply.io homepage
Druid, Imply and Looker 5 bring OLAP Analysis to BigQuery’s Data Warehouse

Transcript
Starting point is 00:00:00 Hello and welcome to another episode of Drill to Detail, the podcast series about innovation around big data, analytics and data warehousing in the cloud. My name is Mark Rittman and I'm pleased to be joined in this episode by FJ, co-founder and CEO at Imply, and co-author of Druid, an open-source, high-performance, column-oriented distributed data store used by Imply and lots of other query tools you might have heard of recently, such as Superset, that need fast OLAP-style query support. So FJ, welcome to the show, and why don't you introduce yourself properly to the listeners and let us know who you are. Yeah, hey guys, hey everyone listening. My name is FJ. As Mark mentioned, I am one of the co-founders at a startup based in San Francisco called Imply
Starting point is 00:00:58 and I'm also one of the original authors behind the Druid open source project. So for folks that may have never heard of Druid before, it's an analytics database that's primarily designed for event data. As for my background, I worked on Druid initially at a company called Metamarkets, which was then acquired by Snapchat. And prior to that, I'm from Canada originally,
Starting point is 00:01:22 where I went to the University of Waterloo. Okay, fantastic. So FJ, the way we know each other, and the way I knew about Imply, was that I'd been struggling back at the office in the UK trying to get Druid to work as a back end for, I think, Superset at the time, and some general work I was playing around with, and it was quite hard to work with. And I posted on Twitter at the time, you know, Druid is fantastic apart from the actual setting it up and running it and so on, and somebody, it might have been you or one of your colleagues, just posted a very terse reply back on Twitter saying try Imply. So I had a look at your website and looked at what you were doing, and what you've done there is you've built out a tool, or really a whole company I suppose,
Starting point is 00:02:06 called Imply that kind of makes Druid easy to work with, and you put a front end on it. And that's probably a very simple introduction to what you do, but tell us a bit about what Imply is and, I suppose, what you're trying to do at a very high level with Druid, and we'll get into detail in a moment. Yeah, so I'll start with the Druid part first. So Druid is really designed to ingest large volumes of event data. This is data typically generated by users interacting with products. It might be generated by systems themselves, and really any other sort of interaction, which is just generating time series data. Druid is a very powerful technology.
Starting point is 00:02:52 Obviously it's something that was designed to work first, and we've been gradually spending more and more time making it easier to get started with. Imply is sort of the continuation of that work. So it's designed to package Druid in a really nice way. It's designed to be really easy to get started with. But even more importantly, there's an entire application layer we've built on top of Druid to really surface what the engine is good at. Druid is really used for a lot of very rapid slice and dice queries. So you
Starting point is 00:03:22 stream data directly into the system, and then issue rapid slice and dice OLAP queries on top of that data. And then Imply provides the application management, visualization, deployment and security around that core engine. Okay, okay. So, and again, one of the reasons that I came across Druid was the place I'm working at the moment, Qubit,
Starting point is 00:03:43 we do a lot of landing of very granular data, and we currently land it into Google BigQuery, and that is fantastic at storing data and processing it in large quantities, but we had a need at the time to get data out, very small amounts of data, with very fast response times, and so that's where Druid came in as a thing I was thinking of. And looking back at the story of Druid, it's kind of interesting. I mean, tell us, I suppose, going back to where you said about Metamarkets there, tell us a bit about the role you did there and what led Metamarkets to, I suppose, come up with the need for this and
Starting point is 00:04:19 this solution. Right. So Metamarkets is an advertising analytics startup. So I was one of the first employees at the company. And when we first started, what we were trying to do was basically create an application that was designed for all sorts of users, not just analysts, but people that may not have a background in data science or may not have a background in engineering. So what we wanted these people to do was access an application and then be able to very rapidly perform slice and dice OLAP analytics on advertising data. What was interesting about programmatic advertising data is there's a lot of it. So you could be a very small company in the space and still be generating tons of data. So for everyone out there who's not
Starting point is 00:05:05 familiar with programmatic advertising, the idea is that if you imagine a website like Facebook, in the milliseconds it takes for Facebook to load, there's actually a very complex process happening behind the scenes, which is understanding kind of who you are and your demographic. And advertisers basically, programmatically, try to outbid each other to display you an ad based on who you are. So it's a very sophisticated, interesting process, and kind of the state of the art of where advertising is today, where a whole ton of data is getting processed. And more importantly than that, there's tons of folks in the background,
Starting point is 00:05:47 basically processing that information and then trying to display you an ad. And this whole process generates a ton of data. So Metamarkets had this pretty niche product, which was designed for programmatic advertising data, and then providing a lot of users within the company access to that data.
Starting point is 00:06:04 So the scale, the complexity of the data, and the volume, and the rate it was coming in, they were all challenges. And initially, you know, we looked at a bunch of different systems. We looked at relational databases, we looked at key value stores, we looked at various solutions in the Hadoop ecosystem, to try and find an engine that could basically power the UI that we were thinking about. And there was really nothing in the space that really addressed our needs.
Starting point is 00:06:30 So, as engineers do, you start writing code, and Druid was the result. Yeah, I guess, I mean, I can see obviously why you wouldn't go down the relational database route, because you have a lot of overhead there around transactions and all that kind of stuff that wouldn't be appropriate in this kind of instance. But you say that none of the key value stores or kind of NoSQL databases were appropriate. You know, why was that? Surely out of all of them there'd be one that would have been appropriate for you?
Starting point is 00:06:57 Right, so at the time, 2011 was when Druid started. And at that time, there were a couple of different ways. I mean, even today, key value stores are not a very good fit for analytics. The reason is because there are two methods that key value stores get used. So one is you pre-compute out basically every query that you think your user is going to make. And the idea is that your primary key is going to exactly match kind of the filter or the query that the user makes, and then your result is an exact match for that query. So the idea is basically your key value store is like a giant in-memory cache. And that works to a certain extent if your data is not very complex, if you don't have a lot of attributes in your data,
Starting point is 00:07:42 if there's not many columns of data, you can pre-compute out the total query set. But for a lot of real-world data sets, especially ones that may have hundreds or thousands of attributes in their data, that query set can grow exponentially in size. And as your data changes, as you add and remove columns, you have to keep recomputing things. So that's one problem. That approach was used for a while, but it's not really used that much anymore. The thing that's a little bit more common nowadays is to do what's known as a range scan. So if you use things like Cassandra or other systems as kind of a time series store, you basically do a range scan on your primary key. There's kind of two trade-offs there. One is that a lot of OLAP queries include a lot of filters.
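To make that growth concrete, here's a rough back-of-the-envelope sketch in Python; the column names and cardinalities are invented purely for illustration:

```python
from itertools import combinations

# Hypothetical event columns and their distinct-value counts.
cardinality = {"country": 200, "device": 10, "browser": 50,
               "campaign": 1_000, "publisher": 5_000}

# Pre-computing a key-value entry for every "group by" a user might ask for
# means one entry per value combination, for every subset of columns.
total_entries = 0
for r in range(1, len(cardinality) + 1):
    for subset in combinations(cardinality.values(), r):
        entries = 1
        for c in subset:
            entries *= c
        total_entries += entries

print(f"{total_entries:,}")  # roughly 5.6e11 entries, for just five columns
```

With hundreds of columns, the number of column subsets alone becomes astronomical, which is the recomputation problem being described here.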
Starting point is 00:08:29 They filter on certain dimensions. They filter on certain values. They filter on attributes. They filter on time ranges. And when you leverage a key value store, there's no real way of matching the exact data that's required for your query. That level of intelligence is not there within the key value store. So you oftentimes have to scan a lot more data
Starting point is 00:08:49 than what is actually required for the query. Another problem with the key value store is basically this idea that compute and storage are actually pretty separated. So every time you issue a query to your key value store, it filters the data, and that data gets actually pulled into some intermediate compute buffer where the numbers are actually crunched.
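As a loose illustration of that separation, in toy Python rather than any real key value store's API, a time-range scan hands every full row in the range back to a client-side buffer, and only then does the filtering and aggregation happen:

```python
def kv_range_scan(store, start_ts, end_ts):
    # store: dict mapping timestamp -> full row dict; the store itself can
    # only answer "give me everything between these two keys".
    return [row for ts, row in store.items() if start_ts <= ts < end_ts]

def sum_clicks_for_country(store, start_ts, end_ts, country):
    buffer = kv_range_scan(store, start_ts, end_ts)  # all rows, all columns shuffled over
    matching = [r for r in buffer if r["country"] == country]  # filter after the fetch
    return sum(r["clicks"] for r in matching)
```

An engine that can push the country filter and the clicks aggregation down to where the data lives avoids most of that shuffling.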
Starting point is 00:09:11 So what this means is that for every query you issue, there's actually a ton of data that gets shuffled around into these compute buffers before results can get crunched. And at scale, it's just something that becomes pretty limiting, and there's a lot of impact on performance. Yeah, I mean, I can see what you're saying there. And I guess one of the reasons why I was interested in Druid
Starting point is 00:09:37 and some of the things you were doing was a lot of what you're talking about is kind of OLAP and multidimensional OLAP. And certainly, you know, Druid was the first of these kind of stores that talked about things like aggregation on load and so on there. I mean, I'm surprised that nobody else has come across that as a need. But it sounds like you were one of the first pioneers in that area to build this out on a kind of a big data scale, really. Yeah, I think OLAP is one of these terms that was popular maybe 10 or 20 years ago,
Starting point is 00:10:07 but it's becoming popular again in recent times. And, you know, part of our goal is obviously trying to bring focus back onto what exactly OLAP is. Because if you think about OLAP, really, it is, you know, looking at subsections of your data, and doing a lot of aggregations. It's not just searching for some particular set of data, and you're not doing a select-all query. Most of the time you're looking at a subset of your data and aggregating a bunch of numbers together.
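For anyone newer to the term, here is a minimal picture of that "filter a subset, then aggregate" pattern, sketched with pandas over an invented events table:

```python
import pandas as pd

events = pd.DataFrame({
    "country": ["UK", "UK", "US", "US"],
    "device":  ["mobile", "desktop", "mobile", "mobile"],
    "revenue": [1.20, 3.40, 2.10, 0.90],
})

# A typical OLAP question is not "show me all rows" but
# "for one slice of the data, aggregate a measure by a dimension".
mobile = events[events["device"] == "mobile"]        # slice: filter a dimension
print(mobile.groupby("country")["revenue"].sum())    # dice and aggregate
```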
Starting point is 00:10:35 And those numbers are important for your business. So I think we were one of the first systems to actually really start tackling OLAP in the sense of building a distributed system around it. But there have been previous generations of systems that were not distributed. They were kind of like single server, and you kind of scale up by having larger and larger servers.
Starting point is 00:10:54 But I think part of the attraction of Druid today is its ability to do really, really fast OLAP at scale. So we had Julian Hyde on the show a little while ago, and he's obviously the person behind the Mondrian project. Now that again is on a large scale, but my understanding is that it's, I suppose, an OLAP sort of catalog or metadata layer over relational technology, whereas what Druid is is very different: it's the storage layer, isn't it really? Yeah, so I don't know everything about Mondrian.
Starting point is 00:11:33 I can see it as being more like a classic OLAP tool. What makes Druid different, I think, is a couple of different things. So one is, if you actually dig into the architecture, it's basically a column store kind of combined with a search system. Obviously a search system is not unique, and a column store is not unique, but I think the combination of the two ideas is something that's pretty unique, and something that's, I think, well suited to OLAP. I think the other thing that really makes Druid pretty unique is its ability to handle streaming data. So Druid is like an OLAP engine that's really, really targeted towards time series data.
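To give a feel for that column-store-plus-search-system combination, here's a toy sketch of the two ideas side by side: an inverted index per dimension value for filtering, plus a flat column array for aggregation. This is the concept only, not Druid's actual internals, which use compressed bitmap indexes rather than plain Python sets:

```python
from collections import defaultdict

rows = [
    {"country": "UK", "clicks": 3},
    {"country": "US", "clicks": 5},
    {"country": "UK", "clicks": 2},
]

# Column store: each column is a flat array, cheap to scan and aggregate.
clicks_col = [r["clicks"] for r in rows]

# Search-style inverted index: (dimension, value) -> set of row ids.
index = defaultdict(set)
for row_id, r in enumerate(rows):
    index[("country", r["country"])].add(row_id)

# A filtered aggregation touches only the matching row ids and one column.
matching = index[("country", "UK")]
print(sum(clicks_col[i] for i in matching))  # 5
```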
Starting point is 00:12:10 So events are getting generated by a multitude of products and users. Yeah, so when I've been playing around with Druid, one of the concepts that I first came across was segments, and how they work in memory, and I suppose dealing with and managing segments is quite a key part of the admin tasks if you're working with Druid directly. I mean, maybe explain to us what segments are and how they fit into the architecture, and why they're so critical to the way things work. Right, so the way that Druid works is, time is a very special attribute in Druid. And our first level sharding is basically always done on time. So you think of the way that most people use Druid, they basically have a stream of events or a stream of time series data that they feed
Starting point is 00:12:58 directly into Druid. And Druid, what it does internally, is it partitions or shards this data based on timestamp first. And each time partition slice is called a segment. Segments are usually bounded by some period of time. Typically, you see a segment containing a day's worth of data or an hour's worth of data, depending on scale. And within a segment, data is stored in a column orientation, and for all the columns which you typically filter on, there are actually search indexes for those columns.
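A tiny sketch of that first-level time partitioning, conceptual only, with the granularity handling and the events invented for the example:

```python
from collections import defaultdict
from datetime import datetime, timezone

def segment_key(ts_millis, granularity="DAY"):
    # Truncate an event timestamp to the start of its segment interval.
    dt = datetime.fromtimestamp(ts_millis / 1000, tz=timezone.utc)
    return dt.strftime("%Y-%m-%d" if granularity == "DAY" else "%Y-%m-%dT%H")

segments = defaultdict(list)
for event in [{"ts": 1520899200000, "url": "/home"},      # 2018-03-13T00:00Z
              {"ts": 1520985600000, "url": "/pricing"}]:  # 2018-03-14T00:00Z
    segments[segment_key(event["ts"])].append(event)

print(sorted(segments))  # ['2018-03-13', '2018-03-14'], one day-bounded segment each
```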
Starting point is 00:13:33 So that's the overall idea of how Druid works, that the first level partitioning is done on time, it creates segments, there might be additional levels of partitioning within the segments themselves. Okay, so the other thing that was very obvious to me was the fact that Druid would handle streaming data coming in. But there was this distinction between batch loading and streaming loading and so on.
Starting point is 00:13:53 I mean, and so tell us about that, and I suppose the benefits of that, and what that means in terms of how data is loaded into Druid, because we can talk in a moment about how that's a lot easier with Imply as well. Right. So the way I guess you can think of Druid working as a system is it pulls a source of raw data into itself, and it takes that raw data and puts it into a format that's highly optimized for analytic, OLAP, aggregation-heavy queries. So I think there's a
Starting point is 00:14:24 lot of similarities between how Druid works and how a typical search system works. So you think about a search system: you have raw data, you feed it to the search system, it creates indexes of that raw data, and those indexes are designed for very fast full-text search on that data.
Starting point is 00:14:38 This is how like Elasticsearch and Splunk and many other search systems work. Druid is different there. So there's similarities and differences. The similarities are it connects to a source of raw data, and then it takes that raw data, indexes it, and puts it into a column format that's very good for aggregations and slice-and-dice analytics. So there's two ways of getting data in.
Starting point is 00:15:00 One is you kind of stream in data. So Druid can support exactly-once consumption of data from message buses like Kafka or Kinesis. It can also support out-of-the-box integrations with a variety of stream processors, such as Apache Flink or Spark Streaming, Storm, and many others. The streaming data is nice in that you can basically see events occurring immediately after they occur. It also supports a batch mode, which is, you know, if you have static files on a file system somewhere, and you have essentially years of static files and kind of want to load them into Druid in one go, it supports that as well. So the system is inherently able to understand both streaming data coming in from some sort of streaming system, and also a batch load of data that's represented, at least in raw form, by files in some file system somewhere.
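For the streaming path, Druid's Kafka indexing service is driven by a "supervisor" spec submitted over HTTP. The sketch below shows roughly what that looks like; the field layout is abbreviated and from memory, and the host, topic and datasource names are invented, so treat it as the shape of the thing rather than a working config:

```python
import json
import requests  # assumes the third-party requests package is installed

# Rough shape of a Kafka supervisor spec (heavily elided; real specs also
# carry a timestampSpec, dimensionsSpec, granularitySpec and tuningConfig).
supervisor_spec = {
    "type": "kafka",
    "dataSchema": {"dataSource": "clicks"},
    "ioConfig": {
        "topic": "clicks",
        "consumerProperties": {"bootstrap.servers": "kafka:9092"},
    },
}

# The Overlord process manages ingestion; 8090 is its usual default port.
resp = requests.post("http://overlord:8090/druid/indexer/v1/supervisor",
                     data=json.dumps(supervisor_spec),
                     headers={"Content-Type": "application/json"})
print(resp.status_code)
```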
Starting point is 00:16:07 Okay, that's all quite technical and quite abstract. But, you know, certainly what I've been finding is that Druid is, I suppose, the default back-end technology for a lot of SaaS apps that have analytics built into the application, where they need to have very fast response time. What I would, in the old days, have called kind of OLAP-style response time, but on the sort of scale you'd get with these large-scale, multi-tenant SaaS applications. Yeah, so the project is extremely well adopted
Starting point is 00:16:25 today. It tends to be adopted by pretty large enterprises, just because there's a little bit of overhead to setting it up, and people really need to have a scale or data-complexity issue before they seek out Druid. In terms of what it is, yeah, at a very high level, how people use it is they have some sort of analytics application that they're trying to power, or they want to give out to, you know, a bunch of users using it at the same time. This analytics application is dealing with an event stream of very high volume. So this could be server logs, it could be cybersecurity data, it could be network flows, it could be the output of your digital business, or it could be user-product interactions. So those are some of the canonical data sets that end up in Druid.
Starting point is 00:17:12 And the users, they tend to be a broad variety of different functions within the company. So there might be product owners, there might be people doing sales or marketing, understanding how their campaigns are doing. It might be executives, or it might be the engineers or operational folks within a company themselves. Yeah, it's really designed for a very multi-tenant environment and a very large, complex streaming data set. Okay.
Starting point is 00:17:37 Okay. So certainly Druid got my attention at the time, but then actually implementing it, as you say, there can be some technical tasks involved in doing that, and that's when I came across Imply. So, just remind us again, what is the company Imply trying to do, and what's the relationship between what you're doing and the Druid open source project? Right, so I guess similar to what Elastic does for Elasticsearch and what Confluent does for Apache Kafka, Imply is a company that has Druid at its core. The idea is that we, at Imply, do spend a lot of time building out the open source and working with the community to drive the project forward.
Starting point is 00:18:19 The vast bulk of contributions that go into Druid, and the roadmap, is really being driven by Imply. Similar to other companies that have an open source project at the heart of the company, Imply has an open core model. So we package Druid, we make it a lot more enterprise ready by adding management, operations, and security features that the enterprise requires. There's folks that want to build applications on top of Druid and plug different UIs on top of Druid. We have a UI that basically works out of the box.
Starting point is 00:18:52 That UI is designed for loading data very easily and also visualizing the data very easily, so exposing what Druid is best at. That's part of our story: we have a relationship with this open source project and we can continue to drive it forward, but we also have this enterprise product, something that works both on premise and in the cloud, that we give out to our customers.
Starting point is 00:19:15 Yeah, I mean, certainly there are other companies out there that have taken an open source core and then added maybe management tools around it, or some value-add there, but certainly what you guys have done is, to me, a lot more than that, really, in that you've been building out a kind of front-end tool, you've solved some of the problems around the management side, and, you know, it's more than just adding, I suppose, an admin tool to the end. And so that again is part of what my interest was here. So tell us about this: on your website and in your materials you talk a lot about event analytics, and you said earlier on that Druid was really optimized for, you know, loading in time series data, so
Starting point is 00:19:53 maybe just define what event analytics are, and, you know, what problem are you solving for that market that hadn't been done before, really. Yeah, so when I say events, I mean kind of like event streams. Some people in the community actually call this log data. I don't really call it log data, because that has a heavy, I guess, association with server logs, but it's sort of a wider set of data that can be thought of as events. So what I think of as an event, it's basically a discrete data point which represents some sort of occurrence within a system or within a product. So the idea is that, you know, a lot of popular event streams we see are on the web nowadays: whenever someone interacts with
Starting point is 00:20:40 a web property or interacts with some sort of web product, basically everything you're doing, whether you're looking at the page or you're clicking on the page, all those actions are generating discrete data points, which can then be further analyzed. So if you're making purchases online, you're doing views, or you're just clicking around on some web property, those are all generating events. At the same time, server logs are also a popular source of events, so these are events about what's happening in your servers: the CPU, the latencies, the logs that are kind of being generated there. And, yeah, events also occur in the form of network flows, so, you know,
Starting point is 00:21:19 every packet that's getting sent through a network is a discrete event. With cybersecurity data as well, like intrusions or threats or anomalies against the system, those are really just network flows that need to be analyzed further. In the digital media world where Druid first started, in the advertising world, it would be, you know, people looking at an ad, and then whether they click on the ad or not, that would be a very important thing to have occurring in the background. Okay, so would there be, I mean, going back to my example where I was using BigQuery before, but trying to get faster response time than that, are there types of data or
Starting point is 00:21:55 types of, I suppose, you know, use cases or things that wouldn't suit Druid? You know, is it the case that Druid is just a better store for data for analytics than a thing like BigQuery, or is there a particular niche or usage that suits it best, really? Right. So BigQuery is like a data warehouse, and that's not really what Druid is. So if you think about the data warehousing products
Starting point is 00:22:19 out there, BigQuery, Redshift, Snowflake, and many others, they kind of play in the same space, even Oracle's Exadata. And what data warehouses are especially good at is the flexibility. So, if you think about what a standard data warehouse does, you can
Starting point is 00:22:38 have back-office analysts basically write SQL queries that are like a thousand lines in length, right? They can involve complex joins between many different data sets to kind of get the response that you want. And that's all well and good, and it's very, very important. Those queries might take 30 minutes to respond. They might take some significant time before you get a response.
Starting point is 00:22:56 And that's what canonical data warehousing is about. But there's another set of use cases with data where recency and immediacy is important. So, you know, you want to see events occur. You want to be able to analyze events right after they occur. And also you might want to start doing slice and dice queries on data to understand why something is happening in real time. So this is sub-second queries on data that just occurred less than a second before. So that immediacy and recency matters: if you have some sort of application that's user-facing, having users wait for minutes
Starting point is 00:23:29 while they're accessing a UI is not a good experience. So those workflows, which are, I think, a lot more operational in nature, those are what Druid is good for, whereas things that are much more standard data warehousing related are something that BigQuery would be much better at. But I think those are sort of two sides of the same coin. So data warehousing is something
Starting point is 00:23:52 that back office analysts may care about a lot more, and then the operational characteristics, the immediacy, and immediately seeing visualizations update, that's something that other people in an organization care about. Yeah. What about, so let's get on to Pivot now, and I'll come back to Imply
Starting point is 00:24:11 Manager in a moment. So Pivot is, well, what surprised me when I saw your tool Pivot, which is part of the suite of things you built, is that it's an OLAP tool, and it's got dimensions, it's got facts, it's got kind of measures, metrics, sorry. It has a very strict kind of data model of dimension attributes and metrics. I mean, that's interesting, because I haven't seen that out of the latest generation of kind of BI tools. You know, OLAP and very kind of defined
Starting point is 00:24:38 dimensional models are not something I've seen in much use. What led you down that route, really? And why take that approach with Pivot? Yeah, so I think a lot of data is looking more and more unstructured nowadays, and unstructured data has its benefits and its drawbacks. One of the benefits is obviously you don't have to define a hard schema, and that allows you more flexibility, and it's a little bit easier to just get your data in. Druid has two methods of basically intaking data. It can do a mostly schemaless model, where you just kind of define your attributes
Starting point is 00:25:19 and Druid kind of figures out what those attributes are and what you can do with them. And there's another way of operating in Druid where you basically define a schema to Druid, and that schema has a notion of dimensions and measures, using the more standard OLAP terminology. Once you define that schema, Druid can be much more intelligent
Starting point is 00:25:38 about what indexes it creates for dimensions and how it represents measures, and there are direct performance, compression, storage, et cetera benefits once you start defining a schema. So the more of a schema that you define, the more optimizations we can apply. So in Pivot, because we're really,
Starting point is 00:25:56 Pivot is a UI obviously targeted at end users in organizations, in a variety of different roles. The idea is we want them to have kind of the best performance and the best experience possible. So we try and force people to actually define a schema, just so we can start applying all those optimizations to get the best performance and storage. Yeah.
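As a rough illustration of declaring dimensions versus measures at ingest time (a hand-written fragment in the spirit of a Druid dataSchema, not a complete or version-exact spec):

```python
# Dimensions are indexed, filterable attributes; metrics are pre-aggregated
# measures, which is what enables rollup when data is ingested.
data_schema_fragment = {
    "dimensionsSpec": {
        "dimensions": ["country", "device", "campaign"],
    },
    "metricsSpec": [
        {"type": "count", "name": "events"},
        {"type": "doubleSum", "name": "revenue", "fieldName": "revenue"},
    ],
}
```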
Starting point is 00:26:16 So who would you say has been the target user of Pivot, then? I'd actually say where people have liked it the most are people that have operational roles, so people that are really responsible for explaining why something is happening within a data set. So we've seen, you know, people that are working in traditional IT, who just take the data, and they get asked questions like, hey, why is this happening? Or can you explain this trend or the other trend? And they're able to use a tool like Pivot to really rapidly slice and dice data
Starting point is 00:26:52 to get to the results, or get to the answer that they're looking for. Okay, okay. And there's also Imply SQL in there as well. And again, that's one thing that led me on to maybe speak to Julian Hyde, actually. Presumably that is a SQL kind of layer built using Apache Calcite, is that correct?
Starting point is 00:27:10 Tell us about that really. Yes. Yeah, definitely. So Calcite, I think, has become the standard for a lot of these open source tools to build their SQL layers with. So what we've done is we actually worked together with Julian, and we integrated Calcite into Druid.
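For a sense of what that SQL layer looks like from a client's point of view, Druid brokers expose an HTTP SQL endpoint; the broker host and the clicks datasource below are placeholders:

```python
import json
import requests  # assumes the third-party requests package is installed

# Druid SQL: POST a query to the broker and get JSON rows back.
# 8082 is the broker's usual default port; __time is Druid's time column.
query = """
SELECT country, SUM(revenue) AS revenue
FROM clicks
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY country
ORDER BY revenue DESC
"""

resp = requests.post("http://broker:8082/druid/v2/sql",
                     data=json.dumps({"query": query}),
                     headers={"Content-Type": "application/json"})
print(resp.json())
```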
Starting point is 00:27:26 So Imply SQL really is just Apache Calcite, and we built a nice UI as a front end, basically, for Calcite. Okay. I mean, I was really impressed at the quality of the stuff you built there. Typically, you know, in a startup, and particularly with things like OLAP, the front end is fairly basic, but I was very impressed with the quality of the SQL bit you did, and also, you know, the front end. But then the thing that really struck me was Imply Manager. And that goes back to, I guess, the original kind of problem that led me to find your product in the first place.
Starting point is 00:27:57 You know, the cluster management, the loading and so on. Tell us about Imply Manager and what it does, and what it adds beyond what you get normally with just the open source version of Druid. Yeah, so I guess for anyone that's ever worked with distributed systems before, working with an open source system, the setup is the most painful part, just because there's a lot of different pieces, and also the pieces need to be connected
Starting point is 00:28:23 and really be able to work together. So, from what I've seen, it's easy to get kind of a quick start up and running on your own laptop. But people really struggle with setting up the system across multiple machines, because now there's tuning involved, there's configuration involved, there's understanding how you set this thing up for a particular type of hardware. So, you know, we were spending a lot of time basically helping people with the same set of questions, and as engineers we decided to really productize the work that we were doing. So we created a system that is really designed to make deployment of a complex distributed system pretty easy. So through a few clicks, you can kind of select the hardware
Starting point is 00:29:05 that you want this stuff to run on, and then just be able to go and deploy everything. The idea is that, you know, if you want to be able to update your software without taking any downtime for your users, you could write some pretty complex scripts to do that, or you just use our product, and in kind of one click the software will just update from one version to another in a rolling update fashion without
Starting point is 00:29:30 any downtime for users. Imply Manager is encapsulating that vision of really making it so that you don't have to have expertise in distributed systems to be able to set one up in a highly available fashion. So currently you support building clusters in Amazon Web Services. Are there any plans to bring that out to cover maybe things like Google Compute Engine,
Starting point is 00:29:51 other kind of cloud environments other than AWS? Yeah, absolutely. So it's absolutely on our roadmap to take a lot of the management software and basically work in any environment, including kind of on-premise. Okay, okay. And I guess you can also use Imply Manager as a way of just managing Druid, if you're going to use it for other things as well. So, you know, you can build your clusters in there, you can power, you know, you can power Pivot with it,
Starting point is 00:30:18 but you can also power other things as well. So it's a general management environment for Druid as well. Is that correct? Yeah, absolutely. So that's the goal: it's a general environment for all of Druid, for everything kind of related to Imply. You can use your own application, you can actually write your own extension to the system to add your own functionality, or you can use a lot of the stuff that we've pre-built already. We're pretty flexible on what you want to do. So what is your licensing model like? I mean, is everything kind of freemium? I mean, what's the distinction between the open source part
Starting point is 00:30:52 and what you charge for? And generally, how does that kind of process work? Yeah, so I guess similar to many other organizations out there that have an open source project, the engine is basically open source. So Druid as an engine is open source. Imply builds a lot of the application around the engine: so the UI to load data, manage data, secure the data, and also visualize it.
Starting point is 00:31:14 So all that is obviously proprietary to our business. But the engine remains open. We're actually taking Druid to the Apache Software Foundation. So if anything, it's going to be more open, and we encourage sort of more businesses to get involved and contribute to the project. Okay. And so taking things forward, I mean, I suppose there's plans you might have for improving and extending Druid, and there's plans for your product line as well. I mean, what's the next problem to be solved with Druid, for example? What's the thing that you're kind of working on now to try and tackle, to make it even more adopted and so on? Right. So today, actually, where a lot of our efforts are focused is on that ease-of-use component of Druid.
Starting point is 00:31:59 So we really want to make Druid as easy and as simple to deploy as possible. So I think the biggest challenge with Druid today is probably just getting that data ingestion pipeline set up. So, I have my raw data in a file system or in Kafka: how do I get it into this system with minimum management and minimum thinking? So we're going to make that really nice and smooth, and make sure that, you know, errors or imperfections
Starting point is 00:32:19 in the data are also going to get surfaced. And then after that, just kind of extending the functionality of Druid: being able to better handle nested data, being able to expand features such as full-text search, and more time-series-oriented use cases. So a lot of those are just things that are going to be coming up. On Imply's side, we're planning to announce very shortly the general availability of Imply Cloud, which is our cloud product where the management software currently lives. Probably before this podcast actually goes out. But there's a ton of stuff we're developing on the application side as well.
Starting point is 00:32:59 A lot of our workflows are really around slicing and dicing data, and we're just going to build more and more tools for doing that. Okay. I mean, something that would be interesting as well would be, certainly when I trialled your software, your Imply Manager tool, it's very good for kind of helping me manage my own cluster of AWS-hosted Druid nodes, but also maybe you guys hosting it and offering it as a service would be interesting as well. You know, the next step on is to take away even that bit of complexity really,
Starting point is 00:33:31 and just handle that for the customer. I guess on your roadmap, maybe that's what you're thinking about as well. Yeah, that's definitely something that we're looking at doing. I think the biggest thing there is always getting around the security requirements of everything. Yeah, I guess so. But certainly that
Starting point is 00:33:49 would be interesting. So tell us then, just to wrap up then, tell us how people would find out about Druid, but more importantly, how would they find out about Imply? And how would they get maybe a trial or somehow get to sort of kick the tires a little bit with your product? Right. Yeah.
Starting point is 00:34:05 So I would say that if people want to get started with Druid, they can go to Druid.io. That's the web page; that is the community page. And obviously, you're working with a more stock open source project there. If you want to get started with Imply, which is something a little bit more catered toward enterprises and has that whole suite of software around it,
Starting point is 00:34:24 you can go to Imply.io. I'm obviously biased, but I think people should get started with Imply, just because the scripts and everything are just going to make it much easier to get started with. Yeah, I would totally agree with that, actually. Certainly it was a revelation, you know, trying to get Druid running through your software compared to trying to do it myself. Now, I got it working with what I was trying to do, but certainly it was a lot easier to get the, I suppose, the experience of what Druid is like, and see the whole thing kind of packaged up in a way that was easy to use.
Starting point is 00:34:53 So I would say that if anybody listening is considering looking at Druid, do it through your tool. It's a much nicer experience. You can get to see whether it's suitable for your data and your use cases, and not get caught up in scripting loads and all that kind of thing, really. So I would endorse that as well. Yeah, absolutely. So there's a ton of information online about Druid and how to
Starting point is 00:35:14 get started with it, and obviously Imply as well has user forums, Druid has user forums, so if you get stuck, just feel free to go to those forums to find out more and get help. Excellent. Well, FJ, it's been great speaking to you. I really appreciate you taking the time to come on the show and talk to us about Imply and Druid. Thank you very much, and take care and good luck for the future. Yeah, thank you so much for having me. Thank you.
