Drill to Detail - Drill to Detail Ep.52 'Lyft, Ride-Share Analytics and ETL Developer Productivity ' With Special Guest Mark Grover
Episode Date: April 9, 2018

Mark Grover LinkedIn Profile and Github Profile
"Hadoop Application Architectures"
Drill to Detail Ep. 7 'Apache Spark and Hadoop Application Architectures'
Lyft Engineering Blog
"Software Engineer... to Product Manager" blog by Gwen Shapira
"Introduction to the Oracle Data Integrator Topology" from the Oracle Data Integrator docs site
Apache Airflow and Amazon Kinesis homepages
"Experimentation in a Ridesharing Marketplace" by Nicholas Chamandy, Head of Data Science at Lyft
"How Uber Eats Works with Restaurants"
"Deliveroo has built a bunch of tiny kitchens to feed more hungry Londoners" - Wired.co.uk
Transcript
Hello and welcome back to the Drill to Detail podcast after the Easter break, and I'm your host
Mark Rittman. So one of the most popular episodes we've had over the last two years was with Mark
Grover, one of the committers on the Apache Sentry and Apache Spot open source projects, a contributor to Apache Spark, and co-author of the O'Reilly book
Hadoop Application Architectures along with Gwen Shapira from Confluent who also came on the show
back in 2017 to talk about Apache Kafka streaming data pipelines and of course her book.
So back at the time Mark worked at Cloudera, but he has since moved on to Lyft,
and he's kindly agreed to come back on the show and tell us a bit about the technology he's using
there right now, and some of the things he sees that the big data world hasn't solved yet,
and where it might be headed. So Mark, welcome back to the show, and great to have you on here,
and maybe just do a quick introduction and what you've been doing so far.
Yeah, thanks Mark. Yeah, it's great to be back here again. I am now a product manager at
Lyft. Lyft is a big player in the ride-sharing space here in the US and now in Canada as well.
Like you said, I was previously an engineer at Cloudera, and over the past seven, eight years
at various different companies, I've built and seen many different architectures with big data
tools, and I've contributed and been a committer on many different projects in the open source space for big data as well. I have a keen interest in
sharing what I've seen in the larger data community through blog posts and books and
podcasts and that's exactly why I'm here and I'm super excited to be here. That's great thanks Mark
so just before we get into the detail of what you've been doing at Lyft just tell us a bit about
your involvement in the Apache Spark project.
So what interested you about that problem space and what was your contribution really to that project?
Yeah, Spark was quite the rage when it started.
It started as a replacement for MapReduce for batch computes.
So essentially you could do your ETL or any ad hoc batch compute in Spark.
But then it opened up two new use cases
with the same engine.
It opened up a very popular streaming use case,
which allowed you to do near real-time processing
with Spark streaming.
So you could do fraud detection or anomaly detection.
And it opened up a machine learning use case.
So you could use the same engine
to train and score models.
It tried to open up a few other use cases around graph processing.
But I would say at the high level, batch streaming and machine learning became the very common use cases of Spark.
When I focused on Spark, I was mostly spending my energy around Spark streaming.
So both contributing as a platform to make it more resilient, more secure, and so on,
but helping a lot of organizations
who are Cloudera customers
to build near real-time applications.
And there's a chapter
in the Hadoop Application Architectures book
on fraud detection, for example,
which is a total manifestation
of the kind of work that people were doing
with Spark Streaming, and the kind of work I involved myself with day-to-day around Spark Streaming.
Excellent. So you've moved on now from Cloudera, and you've moved on to Lyft. So what interested you about going into the ride-sharing industry, obviously albeit in an IT role, and Lyft in particular, to the point where you'd move from, say, a platform vendor to actually kind of working in a more focused role with a customer?
Yeah, that's a great question. A question I spent a lot of time thinking about when I was thinking
of this transition too. And I think there are two parts to it. One part is my changing of roles from
an engineer to product manager. And then the second part is me changing companies. And by the
way, this is all specific to me, right? So it's only my reflection of how I think about things. So let's talk about changing roles first, going from an engineer to a product
manager. I think going to a product manager, it has allowed me to step back and think about
what we need to build, how does it fit into the longer term vision, and why is it that we need
to build it? And I got a lot of that independence as an engineer too,
but it's even been more refreshing now
to be a product manager
and still be thinking about execution,
but at the same time,
have a little more time to think strategically
about what we need to build long-term
and how does that map to short-term
and why is it that we need to build it? So the second part of that was changing companies, which was like
moving from Cloudera to Lyft. And I actually learned a lot from Cloudera and I definitely
apply much of what I learned at Cloudera every day at my job at Lyft. And my decision to move
to Lyft was a hard one, but it was mostly around scaling my impact and being able to contribute end-to-end
in the lifecycle of data at a company. So at Cloudera, I went and helped a customer with
the big data platform, but here at Lyft, I'm not only contributing with a platform,
but thinking about end-to-end things, like how do we make this platform the most productive
platform? How do we make Lyft the most data-driven company in using
this platform and things like that? Okay, so I think it was Gwen Shapira,
actually, who wrote a good blog post a while ago about going from an engineer to a product manager.
And it's an interesting transition, isn't it, in terms of your focus and what you look at in terms
of the longer-term vision, as you say, and the value.
But also, I mean, one of the things I found particularly unexpected about being a PM myself
is the fact that you don't really have any, I suppose, direct line, you know, you don't
manage people, you don't have the ability to actually tell people what to do.
It's a lot through influence, really, isn't it?
It's more down to the strength of your vision.
Do you think that's the case?
That is definitely the case, yeah.
So there's a lot of influencing happening,
and you influence through data,
through what the industry standards are looking to be,
what are the trends that are happening,
and it's a very interesting role.
Excellent.
Okay, so let's go into what you're doing at Lyft then.
So first, what does the technology platform look like there? You know,
what kind of tools do you use or certainly what kind of, I suppose, problems are you solving?
Maybe start by walking us through the data platform, that side of it really, first of all.
Yeah, that sounds great. So to walk folks through it, the mission of Lyft is to improve people's lives
with the world's best transportation, which literally translates to having drivers offering rides and making money, and having passengers requesting
those rides. The catch is all of this happens, all the drivers and the passengers have to
be at the right place at the right time. So if we were to try to map this larger goal of Lyft to how data can help Lyft meet this goal, there are three
big categories of users that data could serve and make more data-driven at Lyft, right? So the first
category of users are analysts or data scientists. For example, if you wanted to see the growth of drivers and passengers on a per region basis and be able to forecast how much we need to invest and just do sort of business related analytical decisions, that would be the first category of users.
So long and short of that is analysts and data scientists.
Second category of users is operations. So there may be a general manager of a particular region, let's say San Francisco,
who wants to make sure that the drivers are correctly incentivized to drive at particular
times of the day when, for example, a baseball game may have finished, and that they have
actionable insights from that. So not only do we show them the data around where the drivers should
be, but we also
give them recommendations: hey, we recommend you deploy this particular incentive so drivers are
then driving over to this area. Then the third category of users is experimenters, and all
engineers in the company and all product managers in the company fall in this category. But these are users
who can use the data platform to experiment with changes
to the app and then decide based on data whether they want to do a full-on rollout or not. And
when people think experimentation, they think, you know, changing button colors, but it can be
much more nuanced too. You could test out a new pricing algorithm in a certain region with certain
group of users and see if you want to roll that out to the entire region or the entire country. So to recap the three kinds of users that the data platform at
Lyft is serving are analysts slash data scientists. The second category is operations. And the third
category is experimenters. Okay. So imagine in that second category, you know, you've got the
operational side, you know, you're in a uniquely competitive situation because a driver using, say, Lyft in their car, you know, the Lyft app, would have other ride sharing platforms there as well.
And so you're competing against kind of other platforms they can go with.
And it's all kind of, in a way, you've got to make a decision there and then because, you know, that's when the opportunity is to earn some money from getting a ride.
You know, you've got to be on time as well. So you must have a uniquely competitive situation
and an opportunity there to differentiate through analytics, really. Absolutely. And actually,
that leads to an interesting point that I want to make and want to share. I think
when people think of data and using data to power organizations, many times we only think of
providing the data. So let's use the example that you were using, Mark. So a general manager has to
see what is the state of the health of the drivers and passengers right now in the moment in the
region, right? One way for us to do that is to just provide data. Be like, oh, there's like a collection of passengers here that need rides.
And then there's an oversupply of drivers over here in the other region.
But at this point, all we have done is given them just the data.
And now they have to act upon this data and they have to figure out what they need to do in order to tackle that supply and demand difference in various different
regions. But imagine if we elevated this conversation to not just be providing data,
but to be recommending actions, right? So take it one more level higher. So instead of saying,
here's an extra supply in this region, and here's some extra demand in this region. We say, based on some data
and what you've done in the past, like learn essentially from what they've done in the past
and actually recommend, here's an incentive that we recommend you deploy to these category of
drivers, which will make hopefully like this percentage of drivers from this part of oversupply
to the over demand area at the moment,
right? So raising that level of conversation around actions instead of just data is something
that we think about constantly. Okay, okay. And I suppose, I mean, other examples I've seen,
I mean, one of your competitors has, you know, a food delivery service as well. And that you could
imagine that the activity that has been seen using a ride-sharing app and the concentrations of people and people travelling to restaurants and so on,
that maybe even prompted the idea of branching out into a different area,
say things like food delivery or logistics.
And in the UK, we have a company called Deliveroo
that delivers food from restaurants into houses.
But actually now, when they realize
there's demand in an area where there are no restaurants, they actually work with the
restaurants to open pop-up restaurants, pop-up shops, in there, so they actually
tell the restaurants where to open additional branches, which is fascinating, isn't it?
Yeah, that's crazy. I think there's a lot of room for improvement, a lot of work to do, and a lot of
impact in the space. Yeah, so let's get into a bit more technical stuff.
Let's go into the ETL analytics use case.
So what does the architecture look like there?
I mean, do you use things like Kudu at all?
I mean, what kind of platforms and solutions do you have there?
Yeah, great question.
So for the ETL analytics use case, I'll start from the very ingest side of things. So we have a custom ingest
code that is used in your phone app. So every time you have some activity that's of
interest, we log that. And then we also have instrumentation in our services. So the service,
for example, that provides you the ETA or the pricing, also logs certain information about the ETA
and the price it provides.
So we have instrumentation in cell phones and in services.
All this instrumentation flows through a custom logger
to a message queue.
And from the message queue,
there's a streaming system that will take this data
and put that into S3.
From there on, we have SQL systems.
So the three SQL systems that exist today,
we have Redshift, we have Hive and Presto, which I put in the same category. I'll explain later why.
And then the third system is Druid. And from each of these three systems, we can connect
various different tools for BI or analytics,
and we use those tools to do that analysis. Okay, okay. So I mean, all those names I recognize
there, but they're not particularly names I recognize from the Cloudera side. So how does
that stack, you know, each of the different levels there? How does that compare to what you were used
to using? And, you know, maybe tell us about some of the pros and cons of why they chose that. I
mean, start maybe with the kind of storage layer, HDFS, and you mentioned S3 there.
Yeah, that's a good question.
So let's start again from the top there, from the ingest side.
We have a custom collector; on the Cloudera side,
many times people use Flume.
And I think it may just have been a difference here of when this code was developed and the familiarity of
the engineers with Flume and so on. I don't think this is a big difference. I think we're going to
find other interesting, bigger differences as we go downstream here. So the next one is the
PubSub system. So at Lyft, we use Kinesis and we are investing in Kafka because of some limitations
we have come through Kinesis.
And I think that decision was, again, because of the heavy reliance of Lyft on AWS components,
especially on the operational software side and services side, so the services that serve ride sharing.
And then Kinesis was an easier choice there. But I think we have run into some limitations of Kinesis that's making us go to Kafka. And Cloudera architectures are also using Kafka. So we are pretty convergent there.
Tell us about Kinesis. So tell me about Kinesis. I've heard of that. But coming more from the
Google side and Cloudera, what is Kinesis really? What does it try to solve and describe it for us?
Yeah, very good question. So Kinesis is similar to Kafka.
It's a PubSub system, message queue.
You throw a bunch of producers on one side
and have a bunch of consumers on the other side
that read from it.
So it's pretty competitive to Kafka.
And I think to dive deep into the differences
between Kinesis and Kafka,
I think one limitation we're hitting is the number of reads we can do on the consumer side of Kinesis, how that scales in comparison to Kafka.
And we've seen Kafka scale better than Kinesis in that case.
Okay. So basically solving the same problem, but in your experience so far, Kafka has been a more scalable solution for doing that.
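The pub/sub model that Kafka and Kinesis share can be pictured with a toy in-memory sketch in Python. This is not either system's real API; the topic, record, and consumer-group names are invented for illustration. The core idea it shows is that producers append to a log and each consumer group reads independently at its own offset.

```python
from collections import defaultdict

class MiniLog:
    """Toy append-only log illustrating the pub/sub model shared by
    Kafka and Kinesis: producers append, consumer groups read by offset."""
    def __init__(self):
        self.topics = defaultdict(list)   # topic -> list of records
        self.offsets = defaultdict(int)   # (topic, group) -> next offset to read

    def produce(self, topic, record):
        self.topics[topic].append(record)

    def consume(self, topic, group, max_records=10):
        start = self.offsets[(topic, group)]
        batch = self.topics[topic][start:start + max_records]
        self.offsets[(topic, group)] += len(batch)  # commit the new offset
        return batch

log = MiniLog()
log.produce("ride_events", {"event": "ride_requested", "region": "SF"})
log.produce("ride_events", {"event": "ride_accepted", "region": "SF"})

# Two independent consumer groups each see the full stream.
etl = log.consume("ride_events", group="etl")
fraud = log.consume("ride_events", group="fraud")
```

In the real systems the interesting differences are operational, partition/shard counts, retention, and (as Mark notes) how consumer read throughput scales, which a sketch like this deliberately leaves out.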
Absolutely. Great. Okay. So from there on, how do you get data consumed from Kafka into your
analytics system? And there are different solutions to that. Even at Cloudera, some folks used
Flume and some folks used Spark streaming. Outside of Cloudera, if you talk to Gwen,
for example, at Confluent, they could have many users
using Kafka Connect.
At Lyft, we use Flink.
And I think
this is probably a part of
ingestion pipeline that hasn't been super
standardized right now.
The reason being, you could use, there are two sets
of tools you could use here in order to do ingestion.
First one is a dummy ingestion
tool. It simply picks up data, doesn't do a whole lot of transformation. It can do, maybe in the context
of an event, it may mask something or create a new record based on existing fields in the events,
but it doesn't do larger windowing or anything more intelligent than that.
So like Flume, indeed.
Yeah, exactly.
Yeah.
So that category is similar to Flume and Kafka Connect.
And that's one category of tools you can use.
But then folks have started using richer, heavier-weight streaming systems in order to do stuff like this, right?
So you could use Spark Streaming or Flink in order to ingest along the way.
Maybe you will change a few things here and there.
Maybe you will do some windowing.
Maybe you'll do some deduplication in the context of that window and so on and so forth.
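The windowed deduplication mentioned here can be sketched in plain Python. This is not Flink's or Spark Streaming's API, just a simplified illustration of the idea: drop events whose ID has already been seen inside a time window, and assume events arrive in timestamp order.

```python
def dedupe_in_window(events, window_seconds):
    """Drop duplicate event IDs seen within a time window, a simplified
    version of the deduplication a streaming engine like Flink can do
    during ingestion. Events are (timestamp, event_id) pairs assumed
    to arrive in timestamp order."""
    seen = {}   # event id -> timestamp of the last copy we kept
    out = []
    for ts, event_id in events:
        last = seen.get(event_id)
        if last is None or ts - last > window_seconds:
            out.append((ts, event_id))
            seen[event_id] = ts
    return out

events = [(0, "a"), (5, "b"), (10, "a"), (70, "a")]
result = dedupe_in_window(events, window_seconds=60)
# (10, "a") is dropped as a duplicate inside the 60-second window;
# (70, "a") is kept because the window has passed.
```

A real streaming job would also have to handle out-of-order events and expire old state, which is exactly the extra machinery that makes the heavier engines attractive.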
And I think there is a fine line between deciding which one are you going to use and that line is not clear in general.
And so I see more and more people using a larger heavyweight streaming system
because of the flexibility it can offer in the long term. And between that, I think the choice
between Spark streaming and Flink is still, again, something that's not super standardized yet.
And Lyft has investments in Flink and has a few Flink committers. And so that's why Lyft chose
Flink. I think there are also a lot of technical differences between Spark streaming and Flink
when we get to the guts of it,
but we will save that for another conversation later.
And then, okay, so let's move on from there
and talk about analytical databases.
So at Lyft, I was saying we have Redshift
and then we have Hive and Presto,
and then we have Druid.
And Redshift is mostly a sign of historical existence at Lyft.
Think of it as competitive to Teradata or Netezza, your heavy MPP systems. And that was essentially for easier setup, quick and dirty analysis back in
the day. But these things stick, you know. And the next system is Hive and Presto. And
a Cloudera architecture would be very similar. The only difference there would be Presto versus Impala.
And I do continue to see a lot of contention between that.
Airbnb, for example, uses Presto too, so does Lyft.
And the third system, really quick, would be we use Druid for our sort of cubing,
really fast analytical store, which connects to Superset.
And in the Cloudera space, that would be Kudu.
I'm happy to dig into whichever ones you want to dig into,
but that's the sort of differences. So, I mean, there's a ton of stuff there I'd be interested
to talk to you about. So, first of all, no use of Drill? I mean, Drill, again, is one of those ones
in the Presto and Impala space. Is that something you guys have looked at, or
is it just not one that you've chosen to use at all? Yeah, not one we've chosen to use. I think Impala versus Presto versus Drill is, yeah, it's
that category. Okay, so looking at, you mentioned Druid versus Kudu there, and
so we actually had the guys from Impala on the show a couple of weeks ago, not
Impala, sorry, from Imply, and actually one of the people behind Druid was on the show,
and we were talking about it. And, you know, Druid is kind of, I think it's a fantastic product
and it's got quite a high barrier to entry,
I think, as far as I'm concerned,
in terms of getting data into it.
And there's a lot of kind of manual management
of kind of servers and so on there.
But the performance is super fast.
And like you say, you can cube stuff,
you can load data in aggregate.
I mean, what's your experience been like with Druid?
And is it something you'd recommend using further, or what, really? Yeah, good question. And actually, I did see and
listen to Fangjin's talk around Druid here at Drill to Detail. And I really enjoyed it. That's
the best part for me of Drill to Detail, to listen to all these amazing talks. And I know Fangjin
personally, and he came by at Lyft as well to give a tech talk on
Druid and help us understand how do we make this decision, right?
So the first thing to note is that Druid and Kudu solve similar problems, but they don't solve them in the
same way.
So an example is Druid is a storage engine as well as execution engine.
So you store your data in Druid, and Druid will also handle how you actually process your data. Whereas
Kudu is just a storage engine, which means if you have to process your data that's stored in Kudu,
you have to have an execution engine or a processing engine with it. And the most commonly
used engine is Impala, right?
So when you're comparing Druid and Kudu,
that's not quite apples to apples
unless you compare Druid and Kudu plus Impala
or something like that.
Yeah, exactly, exactly.
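The narrow workload that a cubing store like Druid handles end to end, grouping by a few dimensions and rolling up a metric, can be pictured in a few lines of plain Python. This is not Druid's query API; the column names and data are invented for illustration.

```python
from collections import defaultdict

def rollup(rows, dims, metric):
    """Group rows by a small set of dimensions and sum a metric, the
    narrow but very fast slice of SQL that a cubing store like Druid
    targets (no joins, no spilling to disk)."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dims)   # group-by key
        totals[key] += row[metric]
    return dict(totals)

rides = [
    {"region": "SF",  "hour": 17, "rides": 120},
    {"region": "SF",  "hour": 17, "rides": 80},
    {"region": "SEA", "hour": 17, "rides": 40},
]
by_region = rollup(rides, dims=["region"], metric="rides")
```

What Druid adds on top of this shape of query is pre-aggregation at ingest time and a columnar, indexed storage layout, which is where the interactive speed comes from.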
So what about, I mean, because I've been,
as an aside, I've been trying to persuade people
where I'm working now to consider Druid
as a universal front end for lots and lots of, say,
big data, but also using it for kind of like, like you say, things like Superset,
for ad hoc queries and so on. I mean, with Druid, how well have you found that it handles
both large amounts of data and fast queries and generally how much admin is involved in
maintaining it really? Yeah. So I don't have a whole lot of context
into the admin. I have not seen it to be as heavy as perhaps you may be suggesting. I think
Druid is obviously a newer product as compared to many other products in the ecosystem. So it's
understandable for it to be a little further earlier in the maturity curve, but I don't have any anecdotes or data
to prove that it's really operation heavy.
Our experience has been really good.
It's very fast for analytical sort of,
you know, here are my few dimensions
that I want to group by.
Here's some attributes
that I want to roll up to and aggregate
and powering a dashboard.
You know, there's no SQL syntax in Druid today, but you could power a
dashboard using the Druid syntax and have that be super interactive. And I think one of the main
reasons you may ask, like, oh, you already have Presto, others may have Impala, like why do we need yet another system engine
in order to do this stuff?
And I think there are,
so Impala and Presto,
all these things and Drill
have SQL syntax that you use to query the system, right?
So it supports joins
and it has code, for example,
that will spill to disk if your join is too big.
It's fast compared to Hive, but in many cases, it's not fast enough. In cases where you want
to do simple aggregations and want to do group-bys on a certain very small number of attributes
and you don't need the full power of SQL, there's room for much more interactive analysis. And
that's the gap that Druid, for
example, is trying to fill. Yeah, excellent. So what about, and this might lead on to the next
thing I want to talk about, which is the kind of ETL side, but what about things like orchestration
and workflow around ETL? I mean, have you tried things like, well, Oozie is the obvious one, but have
you looked at things like Airflow at all and those sort of things? Yeah, absolutely. In fact, at Lyft, we use Airflow.
I've seen outside of Cloudera, Airflow being a pretty popular orchestration tool.
Oozie had become the standard and is still used pretty heavily.
Before Oozie, there was Azkaban from LinkedIn. There was Luigi and so on.
And I think all of these mostly standardized to Oozie.
And I think perhaps the only choice that folks have to make now is Airflow versus Oozie.
And I've been working with Airflow now, and I'm a big fan of Airflow, having used Oozie for a long time too.
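What all of these orchestrators, Airflow, Oozie, Azkaban, Luigi, fundamentally provide is dependency-ordered execution of tasks in a DAG. Here is a toy sketch of that scheduling core in plain Python, not any of their real APIs, with invented task names:

```python
def topo_order(dag):
    """Return tasks in an order that respects dependencies, the
    scheduling core of orchestrators like Airflow or Oozie.
    `dag` maps each task to the set of tasks it depends on."""
    order, done = [], set()

    def visit(task, path=()):
        if task in done:
            return
        if task in path:                      # dependency cycle guard
            raise ValueError(f"cycle involving {task}")
        for dep in dag.get(task, set()):
            visit(dep, path + (task,))
        done.add(task)
        order.append(task)

    for task in dag:
        visit(task)
    return order

# A tiny ETL DAG: ingest first, then transform, then load and report.
dag = {
    "ingest": set(),
    "transform": {"ingest"},
    "load": {"transform"},
    "report": {"load", "transform"},
}
plan = topo_order(dag)
```

The real tools layer scheduling, retries, backfills, and monitoring on top of this ordering, which is where most of their value (and their differences) lies.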
Okay.
And I think this is a great example here.
We're talking a lot about all these kind of crazy words, Oozie and all this kind of stuff.
And I guess one thing I've found, certainly over the last year and a half when I've been working on a customer site, is that the things we were obsessing over were a lot more about how do we solve more generic kinds of problems around things like ETL and so on.
I mean, are you finding it's less about technology and more about kind of approaches now, taking ETL as an example?
Oh, my God, 100%. Yes. I think as the ecosystem is evolving, first, open source had this problem real bad because the barrier to entry for creating your own SQL engine is very low, right?
So you saw a lot of frameworks and engines pop up.
So if you were to take the example of ETL, there was MapReduce, then came Spark, then came like Tez, and you could run Hive on Tez.
You could run maybe even Spark on Tez, I don't
know how that flew, but then you could do Hive on Spark, and now there's Flink, right? So there's a
lot of initial conversation about this stuff. But as the ecosystem evolves, I think that conversation
is becoming more and more standard. So then I see myself that less conversations are happening around what framework to use, but more conversations are now happening around, okay, I've chosen a framework, but how do I use this framework to be the most productive, most effective in my organization? And so I can also give an example of what an ETL workflow look like.
And even if assuming you have standardized on a particular ETL engine, there's a whole slew of things that are still unresolved that an ETL developer has to do.
Shall we talk about that, Mark?
Yeah, this is similar to discussions I'm having at work at the moment, where you have very, very clever data
engineers that are writing
very small pieces
of code and
so on. But I suppose in a way, how do
they become more productive and how do they
think about making them...
Yeah, the productivity question. What do you
think about that?
Yeah, that's something I've been thinking a lot about.
So let's say,
I'll use Lyft as an example. And I'll use an example that many are familiar with. Let's say you work at Lyft, and you just want to do analysis on the number of Lyft rides that happen on a daily
basis, right? So your boss comes to you, like, can you provide me this analysis? So the first thing
you have to do as an ETL developer is search for which is the right table to use for rides, right?
There may be three or four different tables that may have rides in the name.
You don't know when they were last updated.
You don't know who owns them.
If you have a question, which table do I use?
And this is the problem of discovery, right?
Discovering data and references.
Okay, let's say somehow you figured out that this is a table
that you should use for calculating
number of rides in a day.
Then you have to look at this data and have a sense of what are the various columns, what
is the profile of those columns.
So if it's a string column, how many times does null appear as a percentage on a daily
basis?
If it's an int column, what are the min and max?
What are the count distinct?
How many different rows or different values
does this column have?
So you could perhaps do a count distinct
and understand how much cost it's gonna take.
So this category of questions is profiling.
So the first one is discovery.
Once you have discovered, now you want to profile
and understand the shape of the data.
Now, once you understand the shape of the data, you have to go and develop your SQL query.
Okay, and you're likely developing this in a fast iterative way in your local machine.
And for this, it's important that even if the backing data set is a terabyte or a petabyte, that you have interactive
response so you can develop your SQL query really fast. If you missed a semicolon or spelled the
column name wrong, you get the results really quickly and you can move fast, right? So this is
your iterative development phase. After this, you make your query go through some integration tests.
And this is usually a staging cluster where you may want
to bake your query for a certain
amount of time before you move it to production.
You may also hook on
a tool like Dr. Elephant to
analyze the performance characteristics
of this query and make sure that
it's not consuming too much CPU
and it just is a reliable query
in general. And once it passes
the staging test,
you may promote it to production.
So if we were to recap,
there are these five steps, I think.
There's a discovery, prototyping,
iterative development,
staging slash integration test,
and productionalization,
which an ETL developer
has to go through when developing an ETL pipeline or an ETL developer has to go through when developing
an ETL pipeline. And the point I'm trying to make here is if we ignore the first two steps,
which is discovery and profiling, because they're big enough to talk about on their own,
and just focus on the last three steps, which is development, staging, and production.
If you simply wrote a SQL statement, you would have to modify
that SQL statement as it goes from development to staging to production.
The reason being in development, your test data set is probably an anonymized data set
that's lying on your local machine that you're using to develop this.
And it may not obviously be in the same location where your data is in production.
In staging, maybe you're using a sampled-down data set.
So you only have one for every hundred records in the production data set.
And you're doing that for just speed of execution, right?
And this sample data set may be actually in a different schema or a different place.
And of course, in production, you have your data in the standard place.
So within the context of each of these environments,
data may be different. The kind of properties that you may set for the execution of this query
may be different. And it's not good for you to modify this code as you move from one context
to the other. So what we really need, and what's making these engineers unproductive today is the lack of one, is a more declarative, functional way of expressing
the SQL statement. So this functional ETL would then decipher the scope it's in. It figures out,
oh, I'm in the dev environment right now. And so I'm going to substitute all these schema names
to a particular local location that has been set for this anonymized data. Or, oh, I am in the
staging scope
and that I'm going to change all these schemas
to be the sampled data sets.
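A minimal way to picture this scope-aware, declarative style is a query template whose schema is resolved per environment at run time. The environment names, schema names, and query below are invented for illustration; real setups would typically do this with a templating engine and a config store rather than a dict.

```python
# A sketch of scope-aware SQL: the query is written once, and the
# schema is resolved from the environment (dev / staging / prod)
# rather than hard-coded into the statement.
SCHEMAS = {
    "dev": "local_anonymized",     # anonymized data on your machine
    "staging": "sampled_1_in_100", # sampled-down data for fast tests
    "prod": "warehouse",           # the real production data
}

QUERY = "SELECT ds, count(*) AS rides FROM {schema}.rides GROUP BY ds"

def render(query, env):
    """Resolve the schema placeholder for the current scope."""
    return query.format(schema=SCHEMAS[env])

dev_sql = render(QUERY, "dev")
prod_sql = render(QUERY, "prod")
```

The point is that the ETL developer never edits the query as it moves between environments; only the scope changes.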
And Maxime Beauchemin, the creator of Airflow and Superset,
also talks about this in a blog post.
And I think that's really the productivity conversation
and the tools that we need to build.
It's not just about the frameworks anymore,
but it's about like,
here's the workflow of an ETL developer, and here are the inefficiencies there today. What can we do in order to remove those inefficiencies?
Okay. So, interestingly, you know,
if any of my kind of old, old fogey kind of friends who worked in kind of Oracle data
warehousing and ETL tools and so on years ago were listening to this, they would go,
you've just described, you've just set out the requirements for an ETL
tool. And, you know, it's interesting how things
go in a circle, because some of the things you picked up
on there, you said things like template-driven or declarative
definition of kind of ETL flows that then get translated into the actual code.
That is what tools like in my old days,
tools like Oracle Data Integrator would do,
where you would have this graphical,
more symbolic mapping of source to target,
and then you could choose, for example,
the actual language or the platform you're going to deploy to.
It might be in, say, PySpark or it might be in whatever.
And so that's what they would do.
And so I suppose, you know, I suppose why is that kind of tool not used in a modern environment?
I mean, that's probably an interesting question as well.
I mean, I never hear of those tools ever being considered in a place like Lyft, for example.
Yeah, that's a good question.
I don't have a good answer to that.
I would say that many of these tools are reinventing the wheel, right?
And much of that has been for good, right?
So the invention of Hadoop and the SQL systems and Airflow and all that, I think, have been great innovations. But there are things that were figured out before. An example is the one you just used, Mark: Oracle Data Integrator. We don't have to fully embody that and have a really bulky, old-school tool, but we definitely can learn from the best practices, especially around productivity and effectiveness, from the past, and incorporate some of them in the newer ecosystem of big data.
Yeah, it's interesting. I think it's a classic thing. I've got two teenage kids and the one
thing they never want to do is ever take advice from their dad about anything. So it's one of
those things where I suspect there's an element there where people have to find things out for
themselves. And so I'm not putting you in that position, but sometimes you want to discover things yourself rather than be lectured by old people about, oh, back in our day we used to do this.
But certainly template-driven, I mean, we have, again, where I am now, a system where there are templates that generate SQL from some Clojure script and so on. And effectively, it's a rebuilding of a template ETL engine.
I guess the reasons they're not used in your sort of environments is, I'd imagine, you
know, things like, I think the whole thing of having a graphical interface that you have
to kind of step through things very kind of like, you know, in a certain way, that just
does not lend itself to being used in environments where people are used to using like GitHub
and they're used to using kind of, you know, code and all that.
So I think there's a cultural difference there for a start.
And a lot of these tools are based on Windows. And again, where I am now, there isn't a single Windows machine in the building, so in a practical sense that doesn't help. Cost is the other thing.
Yeah, the other thing there is also the pricing.
Yeah, exactly. All these new hipsters want open source tools, myself included, and I think the historical tools are not open.
Well, I think there's the open source and closed
thing, but there's also the cost. I mean,
if you think about the cost
of a tool like Data Integrator,
you're talking kind of like,
an average Oracle deal size used to be 100 grand,
and nobody's ever going to spend
that sort of money on something that you'd rather
build yourself anyway and learn from that process.
But there's interesting things there.
Another thing you mentioned was the deployment of code into different environments, so this thing about being able to deploy into test or development. But one thing you did say then that I don't think was addressed by those old tools was that if you're deploying into production, the database volume might be different, or you might deploy different code into different environments because of, I suppose, the size of it. That isn't taken care of. But yeah, in general it sounds like what you're putting forward there is a kind of spec for an ETL tool. And it's interesting you mentioned Maxime and Airflow, because that tool deliberately stops at a certain point. Maybe describe what Airflow is for people, just in case people aren't aware of what it is, and tell us what it involves and what problem it solves, first of all.
Yeah, Airflow is an orchestration tool
which is just a fancy term for a really fancy cron job. So you can say: run this job at nine. And I have this other job that I would like to run at 10, but it has a dependency on the 9 a.m. job, so please only run it either at 10 a.m. or whenever the dependency gets done.
Okay, okay. And so, but again, that deliberately stops at a certain point.
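The scheduling semantics Mark describes (a job that becomes runnable only when both its clock time and its upstream dependencies are satisfied) can be caricatured in a few lines of plain Python. This is a toy stand-in, not Airflow's actual API, and the job names and times are made up:

```python
# Toy illustration of Airflow-style dependency scheduling: a job runs
# only when its scheduled hour has passed AND every upstream job it
# depends on has completed.

jobs = {
    "load_rides":   {"hour": 9,  "deps": []},
    "build_report": {"hour": 10, "deps": ["load_rides"]},
}

def run_ready_jobs(current_hour, done):
    """Run every not-yet-done job whose time has come and whose deps are done."""
    ran = []
    for name, spec in jobs.items():
        if name in done:
            continue
        time_ok = current_hour >= spec["hour"]
        deps_ok = all(d in done for d in spec["deps"])
        if time_ok and deps_ok:
            done.add(name)
            ran.append(name)
    return ran

done = set()
print(run_ready_jobs(9, done))   # ['load_rides']
print(run_ready_jobs(10, done))  # ['build_report']
```

If the 9 a.m. job overruns, the 10 a.m. job simply waits for the dependency rather than the clock, which is the behavior described above.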
It's very much done in code. It's not done perfectly. But one other question I wanted to ask you before we get on to the next bit: it sounds like it's all on-premises sorts of systems you've got. Did you think about... you must obviously have heard of the whole serverless approach, and things like Athena and BigQuery and Dataflow. Have they been considered, or not really?
Yeah, so a lot of these systems were developed before all these serverless systems were the rage. These are on-prem systems in the sense... let's take Airflow as an example. It's an on-prem system in the sense that it can be used on premise, but we use it in the cloud. So, in fact, we have a set of machines on AWS that we use to host our Airflow instance.
In terms of serverless, I think it's something we're thinking about. It's not currently an
important part of the architecture. Okay. And I suppose another reason we use that here is
dealing with the kind of, I suppose, the auto scaling side. So when, for example, in the States,
it's a very busy time for ride-sharing,
how do you handle that kind of scaling up and scaling down?
Has that been a problem at all with things like Flink
or have you handled that really with the way you do things?
Right. Yeah, many of these pieces of software,
like you're alluding to, Flink has, like, scaling properties
and we have been able to handle the traffic
with the current functionality
of Flink.
I haven't seen a lot of problems related to the scaling.
I do think the other factor here is cost and that I have admittedly less visibility to
because you may provision for the max capacity, but if you're not utilizing that max capacity
all the time, then obviously you're paying the cost.
So auto scaling helps both with scaling for higher traffic and with lowering the cost when there aren't as many rides or requests.
Okay, okay. So you've done the ETL side and you've put the data in. What challenges do you face around kind of making that data available to people? And, I suppose, what are your thoughts around the data discovery side of things, really?
Yeah. And this is the first two parts of the ETL development
workflow, right? So discovery and profiling. But as you point out, this is not just in the ETL
development workflow, right? If you are an analyst who just wants to do some analysis,
they also have to go through the data discovery process. So there are multiple different personas for data discovery. We can focus just on the ETL development and the data-finding persona.
Yeah. And so here, like I was saying before,
one has to figure out which tables to use, but the problem is even more compounded in larger
organizations where you may not actually have just one database, right? So you may have some Oracle, some Teradata,
some Kafka, some HBase, and of course, some tables on S3 via Hive or Presto or Impala,
something like that. And you may want to learn all the attributes that we were talking about
earlier, like what's the schema, when was the last updated, who's the owner, who are the most popular users.
But you may also want to see, like, where actually is this table?
Is this an HBase table?
Is it a Grafana dashboard?
Is it a Teradata table?
And then you may also have organization-specific tags, right?
We obviously can't predict how an organization is structured,
so you may have certain tables tagged as marketing tables. And being able to essentially have like a search engine
that becomes your single go-to place, like Google has become for us on the web, right?
If there is a portal that becomes your single go-to place for finding all data. And so far,
we have talked only about the tables.
But imagine if we raised that level and started talking about dashboards.
So not only this search engine is indexing all the tables,
but it's indexing all the dashboards, right?
So when you search for a particular thing like weather,
it shows you all the dashboards that people have previously created
with weather, related to weather, ranked by their order of popularity.
And it shows you some tables as well, and you can see which one you want to use.
So that means that you can actually start reusing analysis instead of just figuring out what tables you need to use.
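The kind of cross-tool search Mark describes (one index spanning tables and dashboards, with results ranked by popularity) can be sketched roughly as follows. The asset names, types, and popularity scores here are all invented for illustration:

```python
# Toy metadata search: a single catalog spanning tables AND dashboards,
# with results ranked by popularity, as in the 'weather' example above.
# All asset names, types, and scores are hypothetical.

catalog = [
    {"name": "weather_daily",     "type": "table",     "popularity": 40},
    {"name": "weather_dashboard", "type": "dashboard", "popularity": 90},
    {"name": "rides_by_weather",  "type": "dashboard", "popularity": 75},
    {"name": "driver_earnings",   "type": "table",     "popularity": 60},
]

def search(term):
    """Return every matching asset, most popular first, whatever tool owns it."""
    hits = [asset for asset in catalog if term in asset["name"]]
    return sorted(hits, key=lambda asset: asset["popularity"], reverse=True)

for asset in search("weather"):
    print(asset["type"], asset["name"])
```

A real implementation would index far richer metadata (schema, owner, last-updated time, lineage), but the single go-to search box is the essential idea.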
Okay.
So that's interesting because there's a few points in there I want to drill into.
The profiling thing you talked about and how we can do that differently.
But interestingly, the search idea you said there,
I think is a kind of interesting one.
And in fact, we've got another show coming up.
I think it's probably going to be the one after yours.
I've recorded it already, which is with Doug Bordonaro,
I think it is, from ThoughtSpot.
And ThoughtSpot's actual product is what you're saying there.
It's to say, let's use the search interface
as our way of discovering the data
that's in our environment.
Can we use that, not so much for the profiling side,
but for the actual analytic side there, really?
And we spent a good half an hour with him,
him trying to actually describe effectively
an empty search box,
which was him describing the interface for his tool.
You know, it just was you just type stuff in and you find things from there.
Is that the sort of thing you're thinking of really, that kind of idea?
Right.
Yes, I am.
But I think what's perhaps different here is what all information is being indexed, right?
And so it's not just a particular kind of dashboards. I think it's very common for folks to build search boxes within one tool. But what I'm really advocating
for here is search boxes across multiple tools and at multiple levels, right? Both at the table
level as well as the dashboard level. So you're looking at the kind of lineage side and the
provenance side as well, really. So, you know, where did this come from?
How is it kind of loaded?
Or even just finding things at a level beyond the kind of things visible to the end user, really.
So the profiling side. Profiling, again, you know, putting my old-man hat on, these ETL tools have had data profilers for a long time now. But you sort of need to know what it is that you're profiling, and it's a very expensive and time-consuming process to do. One wonders if there are better ways we can do this now, with search and with what I think we've learned over the last few years. So with the profiling side, how do you think that can be improved, really?
Yeah, I think there are a few technical ways
that are super interesting around profiling.
So I'll give you an example.
Parquet is a really common file format
to use for storing your data.
In fact, Lyft uses Parquet heavily.
And Parquet already has some information in its footers
around stats that are included in that particular row group,
which is just a segment of
data in Parquet. So if you store your data in Parquet, you already have information, for example,
in the file itself, in the footer of the file, around some stats like max, min, and so on.
What that means is that the historical process of finding stats, for example, running a SQL query and then computing all these very heavy statistics on all the data you have, is now already in some ways available to you.
And it's already used for query optimization.
That was the original goal for having these stats in Parquet footers.
Can we leverage those same stats to make them available directly to the users, so humans can actually see the profile of the data in the same way a query optimizer would? That's exactly the kind of improvement I mean.
So what about, you mentioned Airflow again earlier on, the kind of metadata that is provided in Airflow? How do you use that?
For example, Airflow knows the delivery times of all tables.
So you could definitely and should definitely surface that in this one single system.
There's also lineage metadata in Airflow that you're referring to.
But I think lineage metadata is only available within DAGs and therefore doesn't really map across DAGs.
And if you forgot to put a dependency,
let's say your SQL query actually depended on a few different tables,
but in the DAG you forgot to express that dependency,
that dependency is lost.
One other kind of lineage information that's not available through just Airflow
is lineage across systems or lineage across like tables and dashboards. So if a dashboard uses a
particular table, then that obviously is not available to Airflow. So Airflow has a very small
legitimate view of dependencies, but only within the context of tables, not necessarily in the
larger context of the usage across data. Okay, so it's almost like you can imagine a kind of Google search really for the data within your organization and I
suppose that'd be really quite valuable. I mean have you seen that idea out there in the
market at all? Is it something that you think would be useful and you've seen it
being used at companies?
Yeah, I haven't seen a direct manifestation of that. There are definitely attempts that have been made at doing something similar.
So there's a vendor, Alation, that's fairly commonly used for data discovery, but it's slightly different from what we've been talking about.
Airbnb has a data portal.
LinkedIn has WhereHows, which is open source.
Facebook has an internal system called iData
that's doing something similar. Netflix has Metacat, again, doing something similar. So lots of attempts have been made.
It's been great hearing what you're doing on a customer site now, as opposed to sort of the Cloudera era.
So thank you very much for coming on the show.
And how do people find out about you really, Mark?
Yeah, that's continued.
I still speak at conferences.
I still try to do blogs whenever I get time. And I'm still really enjoying that.
And it's always a pleasure being here at Drill to Detail and talking to you, Mark.
So I'm really, really grateful for you having me here and giving me a chance to share.
That's great.
And your book that you wrote a while ago, maybe just, kind of, what's that called again? Because the book itself, the bit about the architectures, I found was so timeless, really. You know, tell us about that again. Where was that published, and how do people get hold of it?
Yeah, for sure.
The book is called Hadoop Application Architectures.
And even though the name suggests it's related to Hadoop, it's really big data architectures.
And the book's divided into two parts.
The first part talks about design decisions that one makes in general when storing and
processing big data, like do you need to store them in a row format or a column format? What
should your row keys look like if you're using something like HBase or Cassandra, and so on and
so forth. But I think what has become really popular amongst people is the last part. The
second half of the book talks about these larger architectures, things like data warehousing using these big data tools or fraud detection using these big data
tools or event analysis using similar tools. And I think we put a lot of thought into
what is the state-of-the-art architecture for building a fraud detection system or building
a data warehousing system. And that's become super, super popular. There were four co-authors for the book: myself, Ted, Jonathan, and Gwen. And we're really glad with the response we've seen to the book. We still keep getting great feedback, and we look forward to hearing more and to how we can improve that knowledge share in the future.
Yeah, I mean, I think I said to you in the first episode I recorded with you that, from my Oracle days, it reminded me of a Tom Kyte book.
You know, it was not only how you do things, but it was why.
And that for me was the thing that was really interesting.
And that's why I originally approached you to come on the show then, because it was why do we do things and what is the rationale behind it?
You know, that was that's quite rare in kind of big data books.
And that for me made it a book that I still refer to now, actually. So I'd totally recommend that. So, Mark, thank you very much for coming on the show. It's been great to speak to you again. Take care, and hopefully speak to you again at some point.
Yeah, thank you, Mark. Really appreciate your time as well. Really enjoyed being here.