Drill to Detail - Drill to Detail Ep.52 'Lyft, Ride-Share Analytics and ETL Developer Productivity ' With Special Guest Mark Grover
Episode Date: April 9, 2018

Mark Grover LinkedIn Profile and Github Profile
"Hadoop Application Architectures"
Drill to Detail Ep. 7 'Apache Spark and Hadoop Application Architectures'
Lyft Engineering Blog
"Software Engineer... to Product Manager" blog by Gwen Shapira
"Introduction to the Oracle Data Integrator Topology" from the Oracle Data Integrator docs site
Apache Airflow and Amazon Kinesis homepages
"Experimentation in a Ridesharing Marketplace" by Nicholas Chamandy, Head of Data Science at Lyft
"How Uber Eats Works with Restaurants"
"Deliveroo has built a bunch of tiny kitchens to feed more hungry Londoners" - Wired.co.uk
Transcript
Hello and welcome back to the Drill to Detail podcast after the Easter break, and I'm your host
Mark Rittman. So one of the most popular episodes we've had over the last two years was with Mark
Grover, one of the committers on the Apache Sentry and Apache Spot open source projects, a contributor to Apache Spark, and co-author of the O'Reilly book
Hadoop Application Architectures along with Gwen Shapira from Confluent who also came on the show
back in 2017 to talk about Apache Kafka streaming data pipelines and of course her book.
So back at the time Mark worked at Cloudera, but he has since moved on to Lyft,
and he's kindly agreed to come back on the show and tell us a bit about the technology he's using
there right now, and some of the things he sees that the big data world hasn't solved yet,
and where it might be headed. So Mark, welcome back to the show, and great to have you on here,
and maybe just do a quick introduction and what you've been doing so far.
Yeah, thanks Mark. Yeah, it's great to be back here again. I am now a product manager at
Lyft. Lyft is a big player in the ride-sharing space here in the US and now in Canada as well.
Like you said, I was previously an engineer at Cloudera, and over the past seven, eight years
at various different companies, I've built and seen many different architectures with big data
tools, and I've contributed and been a committer on many different projects in the open source space for big data as well. I have a keen interest in
sharing what I've seen in the larger data community through blog posts and books and
podcasts and that's exactly why I'm here and I'm super excited to be here. That's great thanks Mark
so just before we get into the detail of what you've been doing at Lyft just tell us a bit about
your involvement in the Apache Spark project.
So what interested you about that problem space and what was your contribution really to that project?
Yeah, Spark was quite the rage when it started.
It started as a replacement for MapReduce for batch computes.
So essentially you could do your ETL or any ad hoc batch compute in Spark.
But then it opened up two new use cases
with the same engine.
It opened up a very popular streaming use case,
which allowed you to do near real-time processing
with Spark streaming.
So you could do fraud detection or anomaly detection.
And it opened up a machine learning use case.
So you could use the same engine
to train and score models.
It tried to open up a few other use cases around graph processing.
But I would say at the high level, batch streaming and machine learning became the very common use cases of Spark.
When I focused on Spark, I was mostly spending my energy around Spark streaming.
So both contributing as a platform to make it more resilient, more secure, and so on,
but helping a lot of organizations
who are Cloudera customers
to build near real-time applications.
And there's a chapter
in the Hadoop Application Architectures book
on fraud detection, for example,
which is a total manifestation
of the kind of work that people were doing
with Spark Streaming, and the kind of work I involved myself with day-to-day around Spark Streaming.
Excellent. So you've moved on now from Cloudera, and you've moved on to Lyft. So what interested you about going into the ride-sharing industry, obviously albeit in an IT role, and Lyft in particular, to the point where you'd move from, say, a platform vendor to actually kind of working in a more focused role with a customer?
Yeah, that's a great question. A question I spent a lot of time thinking about when I was thinking
of this transition too. And I think there are two parts to it. One part is my changing of roles from
an engineer to product manager. And then the second part is me changing companies. And by the
way, this is all specific to me, right? So it's only my reflection of how I think about things. So let's talk about changing roles first, going from an engineer to a product
manager. I think going to a product manager, it has allowed me to step back and think about
what we need to build, how does it fit into the longer term vision, and why is it that we need
to build it? And I got a lot of that independence as an engineer too,
but it's even been more refreshing now
to be a product manager
and still be thinking about execution,
but at the same time,
have a little more time to think strategically
about what we need to build long-term
and how does that map to short-term
and why is it that we need to build it? So the second part of that was changing companies, which was like
moving from Cloudera to Lyft. And I actually learned a lot from Cloudera and I definitely
apply much of what I learned at Cloudera every day at my job at Lyft. And my decision to move
to Lyft was a hard one, but it was mostly around scaling my impact and being able to contribute end-to-end
in the lifecycle of data at a company. So at Cloudera, I went and helped a customer with
the big data platform, but here at Lyft, I'm not only contributing with a platform,
but thinking about end-to-end things, like how do we make this platform the most productive
platform? How do we make Lyft the most data-driven company in using
this platform and things like that? Okay, so I think it was Gwen Shapira,
actually, who wrote a good blog post a while ago about going from an engineer to a product manager.
And it's an interesting transition, isn't it, in terms of your focus and what you look at in terms
of the longer-term vision, as you say, and the value.
But also, I mean, one of the things I found particularly unexpected about being a PM myself
is the fact that you don't really have any, I suppose, direct line, you know, you don't
manage people, you don't have the ability to actually tell people what to do.
It's a lot through influence, really, isn't it?
It's more down to the strength of your vision.
Do you think that's the case?
That is definitely the case, yeah.
So there's a lot of influencing happening,
and you influence through data,
through what the industry standards are looking to be,
what are the trends that are happening,
and it's a very interesting role.
Excellent.
Okay, so let's go into what you're doing at Lyft then.
So first, what does the technology platform look like there? You know,
what kind of tools do you use or certainly what kind of, I suppose, problems are you solving?
Maybe start by walking us through the data platform, that side of it really, first of all.
Yeah, that sounds great. So to walk folks through it, the mission of Lyft is to improve people's lives
with the world's best transportation, which literally translates to having drivers offering rides and making money, and having passengers requesting
those rides. The catch is all of this happens, all the drivers and the passengers have to
be at the right place at the right time. So if we were to try to map this larger goal of Lyft to how data can help Lyft meet this goal, there are three
big categories of users that data could serve and make more data-driven at Lyft, right? So the first
category of users are analysts or data scientists. For example, if you wanted to see the growth of drivers and passengers on a per region basis and be able to forecast how much we need to invest and just do sort of business related analytical decisions, that would be the first category of users.
So long and short of that is analysts and data scientists.
Second category of users is operations. So there may be a general manager of a particular region, let's say San Francisco,
who wants to make sure that the drivers are correctly incentivized to drive at particular
times of the day when, for example, a baseball game may have finished, and that they have
actionable insights from that. So not only do we show them the data around where the drivers should
be, but we also
give them recommendations: hey, we recommend you deploy this particular incentive so drivers are
then driving over to this area. Then the third category of users is experimenters, and all
engineers in the company and all product managers in the company fall in this category. But these are users
who can use the data platform to experiment with changes
to the app and then decide based on data whether they want to do a full-on rollout or not. And
when people think experimentation, they think, you know, changing button colors, but it can be
much more nuanced too. You could test out a new pricing algorithm in a certain region with certain
group of users and see if you want to roll that out to the entire region or the entire country. So to recap the three kinds of users that the data platform at
Lyft is serving are analysts slash data scientists. The second category is operations. And the third
category is experimenters. Okay. So imagine in that second category, you know, you've got the
operational side, you know, you're in a uniquely competitive situation because a driver using, say, Lyft in their car, you know, the Lyft app, would have other ride sharing platforms there as well.
And so you're competing against kind of other platforms they can go with.
And it's all kind of, in a way, you've got to make a decision there and then because, you know, that's when the opportunity is to earn some money from getting a ride.
You know, you've got to be on time as well. So you must have a uniquely competitive situation
and an opportunity there to differentiate through analytics, really. Absolutely. And actually,
that leads to an interesting point that I want to make and want to share. I think
when people think of data and using data to power organizations, many times we only think of
providing the data. So let's use the example that you were using, Mark. So a general manager has to
see what is the state of the health of the drivers and passengers right now in the moment in the
region, right? One way for us to do that is to just provide data. Be like, oh, there's like a collection of passengers here that need rides.
And then there's an oversupply of drivers over here in the other region.
But at this point, all we have done is given them just the data.
And now they have to act upon this data and they have to figure out what they need to do in order to tackle that supply and demand difference in various different
regions. But imagine if we elevated this conversation to not just be providing data,
but to be recommending actions, right? So take it one more level higher. So instead of saying,
here's an extra supply in this region, and here's some extra demand in this region. We say, based on some data
and what you've done in the past, like learn essentially from what they've done in the past
and actually recommend, here's an incentive that we recommend you deploy to these category of
drivers, which will make hopefully like this percentage of drivers from this part of oversupply
to the over demand area at the moment,
right? So raising that level of conversation around actions instead of just data is something
that we think about constantly. Okay, okay. And I suppose, I mean, other examples I've seen,
I mean, one of your competitors has, you know, a food delivery service as well. And that you could
imagine that the activity that has been seen using a ride-sharing app and the concentrations of people and people travelling to restaurants and so on,
that maybe even prompted the idea of branching out into a different area,
say things like food delivery or logistics.
And in the UK, we have a company called Deliveroo
that delivers food from restaurants into houses.
But actually now, when they realize
there's demand in an area where there are no restaurants, they actually work with the
restaurants to open pop-up restaurants, pop-up shops, in there, so they actually
tell the restaurants where to open additional branches, which is fascinating, isn't it?
Yeah, that's crazy. I think there's a lot of room for improvement, a lot of work to do, and a lot of
impact in the space. Yeah, so let's get into a bit more technical stuff.
Let's go into the ETL analytics use case.
So what does the architecture look like there?
I mean, do you use things like Kudu at all?
I mean, what kind of platforms and solutions do you have there?
Yeah, great question.
So for the ETL analytics use case, I'll start from the very ingest side of things. So we have a custom ingest
code that is used in your phone app. So every time you have some activity that's of
interest, we log that. And then we also have instrumentation in our services. So the service,
for example, that provides you the ETA or the pricing, also logs certain information about the ETA
and the price it provides.
So we have instrumentation in cell phones and in services.
All this instrumentation flows through a custom logger
to a message queue.
And from the message queue,
there's a streaming system that will take this data
and put that into S3.
From there on, we have SQL systems.
So the three SQL systems that exist today,
we have Redshift, we have Hive and Presto, which I put in the same category. I'll explain later why.
And then the third system is Druid. And from each of these three systems, we can connect
various different tools for BI or analytics,
and we use those tools to do that analysis. Okay, okay. So I mean, all those names I recognize
there, but they're not particularly names I recognize from the Cloudera side. So how does
that stack, you know, each of the different levels there? How does that compare to what you were used
to using? And, you know, maybe tell us about some of the pros and cons of why they chose that. I
mean, start maybe with the kind of storage layer, HDFS, and you mentioned S3 there.
Yeah, that's a good question.
So let's start again from the top there, from the ingest side.
We have a custom collector; on the Cloudera side,
many times people use Flume.
And I think it may just have been a difference here of when this code was developed and the familiarity of
the engineers with Flume and so on. I don't think this is a big difference. I think we're going to
find other interesting, bigger differences as we go downstream here. So the next one is the
PubSub system. So at Lyft, we use Kinesis and we are investing in Kafka because of some limitations
we have come through Kinesis.
And I think that decision was, again, because of the heavy reliance of Lyft on AWS components,
especially on the operational software side and services side, so the services that serve ride sharing.
And then Kinesis was an easier choice there. But I think we have run into some limitations of Kinesis that's making us go to Kafka. And Cloudera architectures are also using Kafka. So we are pretty convergent there.
Tell us about Kinesis. So tell me about Kinesis. I've heard of that. But coming more from the
Google side and Cloudera, what is Kinesis really? What does it try to solve and describe it for us?
Yeah, very good question. So Kinesis is similar to Kafka.
It's a PubSub system, message queue.
You throw a bunch of producers on one side
and have a bunch of consumers on the other side
that read from it.
So it's pretty competitive to Kafka.
And I think to dive deep into the differences
between Kinesis and Kafka,
I think one limitation we're hitting is the number of reads we can do on the consumer side of Kinesis, how that scales in comparison to Kafka.
And we've seen Kafka scale better than Kinesis in that case.
Okay. So basically solving the same problem, but in your experience so far, Kafka has been a more scalable solution for doing that.
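The pub/sub model that Kafka and Kinesis share can be pictured with a toy in-memory sketch in Python. This is not either system's real API; the topic, record, and consumer-group names are invented for illustration. The core idea it shows is that producers append to a log and each consumer group reads independently at its own offset.

```python
from collections import defaultdict

class MiniLog:
    """Toy append-only log illustrating the pub/sub model shared by
    Kafka and Kinesis: producers append, consumer groups read by offset."""
    def __init__(self):
        self.topics = defaultdict(list)   # topic -> list of records
        self.offsets = defaultdict(int)   # (topic, group) -> next offset to read

    def produce(self, topic, record):
        self.topics[topic].append(record)

    def consume(self, topic, group, max_records=10):
        start = self.offsets[(topic, group)]
        batch = self.topics[topic][start:start + max_records]
        self.offsets[(topic, group)] += len(batch)  # commit the new offset
        return batch

log = MiniLog()
log.produce("ride_events", {"event": "ride_requested", "region": "SF"})
log.produce("ride_events", {"event": "ride_accepted", "region": "SF"})

# Two independent consumer groups each see the full stream.
etl = log.consume("ride_events", group="etl")
fraud = log.consume("ride_events", group="fraud")
```

In the real systems the interesting differences are operational, partition/shard counts, retention, and (as Mark notes) how consumer read throughput scales, which a sketch like this deliberately leaves out.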
Absolutely. Great. Okay. So from there on, how do you get data consumed from Kafka into your
analytics system? And there are different solutions to that. Even at Cloudera, some folks used
Flume and some folks used Spark streaming. Outside of Cloudera, if you talk to Gwen,
for example, at Confluent, they could have many users
using Kafka Connect.
At Lyft, we use Flink.
And I think
this is probably a part of
ingestion pipeline that hasn't been super
standardized right now.
The reason being, you could use, there are two sets
of tools you could use here in order to do ingestion.
First one is a dummy ingestion
tool. It simply picks up data, doesn't do a whole lot of transformation. It can do, maybe in the context
of an event, it may mask something or create a new record based on existing fields in the events,
but it doesn't do larger windowing or anything more intelligent than that.
So like Flume, indeed.
Yeah, exactly.
Yeah.
So that category is similar to Flume and Kafka Connect.
And that's one category of tools you can use.
But then folks have started using richer, heavier-weight streaming systems in order to do stuff like this, right?
So you could use Spark Streaming or Flink in order to ingest along the way.
Maybe you will change a few things here and there.
Maybe you will do some windowing.
Maybe you'll do some deduplication in the context of that window and so on and so forth.
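The windowed deduplication mentioned here can be sketched in plain Python. This is not Flink's or Spark Streaming's API, just a simplified illustration of the idea: drop events whose ID has already been seen inside a time window, and assume events arrive in timestamp order.

```python
def dedupe_in_window(events, window_seconds):
    """Drop duplicate event IDs seen within a time window, a simplified
    version of the deduplication a streaming engine like Flink can do
    during ingestion. Events are (timestamp, event_id) pairs assumed
    to arrive in timestamp order."""
    seen = {}   # event id -> timestamp of the last copy we kept
    out = []
    for ts, event_id in events:
        last = seen.get(event_id)
        if last is None or ts - last > window_seconds:
            out.append((ts, event_id))
            seen[event_id] = ts
    return out

events = [(0, "a"), (5, "b"), (10, "a"), (70, "a")]
result = dedupe_in_window(events, window_seconds=60)
# (10, "a") is dropped as a duplicate inside the 60-second window;
# (70, "a") is kept because the window has passed.
```

A real streaming job would also have to handle out-of-order events and expire old state, which is exactly the extra machinery that makes the heavier engines attractive.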
And I think there is a fine line between deciding which one are you going to use and that line is not clear in general.
And so I see more and more people using a larger heavyweight streaming system
because of the flexibility it can offer in the long term. And between that, I think the choice
between Spark streaming and Flink is still, again, something that's not super standardized yet.
And Lyft has investments in Flink and has a few Flink committers. And so that's why Lyft chose
Flink. I think there are also a lot of technical differences between Spark streaming and Flink
when we get to the guts of it,
but we will save that for another conversation later.
And then, okay, so let's move on from there
and talk about analytical databases.
So at Lyft, I was saying we have Redshift
and then we have Hive and Presto,
and then we have Druid.
And Redshift is mostly a sign of historical existence at Lyft.
Think of it as competitive to Teradata or Netezza, your heavy MPP systems. And that was essentially for easier setup, quick and dirty analysis back in
the day. But these things stick, you know. And the next system is Hive and Presto. And
a Cloudera architecture would be very similar. The only difference there would be Presto versus Impala.
And I do continue to see a lot of contention between that.
Airbnb, for example, uses Presto too, so does Lyft.
And the third system, really quick, would be we use Druid for our sort of cubing,
really fast analytical store, which connects to Superset.
And in the Cloudera space, that would be Kudu.
I'm happy to dig into whichever ones you want to dig into,
but that's the sort of differences. So, I mean, there's a ton of stuff there I'd be interested
to talk to you about. So, first of all, no use of Drill? I mean, Drill, again, is one of those ones
in the Presto and Impala space. Is that something you guys have looked at, or
is it just not one that you've chosen to use at all? Yeah, not one we've chosen to use. I think Impala versus Presto versus Drill is, yeah, it's
that category. Okay, so looking at, you mentioned Druid versus Kudu there, and
so we actually had the guys from Impala on the show a couple of weeks ago, not
Impala, sorry, from Imply, and actually one of the people behind Druid was on the show,
and we were talking about it. And, you know, Druid is kind of, I think it's a fantastic product
and it's got quite a high barrier to entry,
I think, as far as I'm concerned,
in terms of getting data into it.
And there's a lot of kind of manual management
of kind of servers and so on there.
But the performance is super fast.
And like you say, you can cube stuff,
you can load data in aggregate.
I mean, what's your experience been like with Druid?
And is it something you'd recommend using further, or what, really? Yeah, good question. And actually, I did see and
listen to Fangjin's talk around Druid here at Drill to Detail. And I really enjoyed it. That's
the best part for me of Drill to Detail, to listen to all these amazing talks. And I know Fangjin
personally, and he came by at Lyft as well to give a tech talk on
Druid and help us understand how do we make this decision, right?
So the first thing to note is that Druid and Kudu solve similar problems, but they don't solve them in the
same way.
So an example is Druid is a storage engine as well as execution engine.
So you store your data in Druid, and Druid will also handle how you actually process your data. Whereas
Kudu is just a storage engine, which means if you have to process your data that's stored in Kudu,
you have to have an execution engine or a processing engine with it. And the most commonly
used engine is Impala, right?
So when you're comparing Druid and Kudu,
that's not quite apples to apples
unless you compare Druid and Kudu plus Impala
or something like that.
Yeah, exactly, exactly.
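The narrow workload that a cubing store like Druid handles end to end, grouping by a few dimensions and rolling up a metric, can be pictured in a few lines of plain Python. This is not Druid's query API; the column names and data are invented for illustration.

```python
from collections import defaultdict

def rollup(rows, dims, metric):
    """Group rows by a small set of dimensions and sum a metric, the
    narrow but very fast slice of SQL that a cubing store like Druid
    targets (no joins, no spilling to disk)."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dims)   # group-by key
        totals[key] += row[metric]
    return dict(totals)

rides = [
    {"region": "SF",  "hour": 17, "rides": 120},
    {"region": "SF",  "hour": 17, "rides": 80},
    {"region": "SEA", "hour": 17, "rides": 40},
]
by_region = rollup(rides, dims=["region"], metric="rides")
```

What Druid adds on top of this shape of query is pre-aggregation at ingest time and a columnar, indexed storage layout, which is where the interactive speed comes from.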
So what about, I mean, because I've been,
as an aside, I've been trying to persuade people
where I'm working now to consider Druid
as a universal front end for lots and lots of, say,
big data, but also using it for kind of like, like you say, things like Superset,
for ad hoc queries and so on. I mean, with Druid, how well have you found that it handles
both large amounts of data and fast queries and generally how much admin is involved in
maintaining it really? Yeah. So I don't have a whole lot of context
into the admin. I have not seen it to be as heavy as perhaps you may be suggesting. I think
Druid is obviously a newer product as compared to many other products in the ecosystem. So it's
understandable for it to be a little further earlier in the maturity curve, but I don't have any anecdotes or data
to prove that it's really operation heavy.
Our experience has been really good.
It's very fast for analytical sort of,
you know, here are my few dimensions
that I want to group by.
Here's some attributes
that I want to roll up to and aggregate
and powering a dashboard.
You know, there's no SQL syntax in Druid today, but you could power a
dashboard using the Druid syntax and have that be super interactive. And I think one of the main
reasons you may ask, like, oh, you already have Presto, others may have Impala, like why do we need yet another system engine
in order to do this stuff?
And I think there are,
so Impala and Presto,
all these things and Drill
have SQL syntax that you use to query the system, right?
So it supports joins
and it has code, for example,
that will spill to disk if your join is too big.
It's fast compared to Hive, but in many cases, it's not fast enough. In cases where you want
to do simple aggregations and want to do group-bys on a certain very small number of attributes
and you don't need the full power of SQL, there's room for much more interactive analysis. And
that's the gap that Druid, for
example, is trying to fill. Yeah, excellent. So what about, and this might lead on to the next
thing I want to talk about, which is the kind of ETL side, but what about things like orchestration
and workflow around ETL? I mean, have you tried things like, well, Oozie is the obvious one, but have
you looked at things like Airflow at all and those sort of things? Yeah, absolutely. In fact, at Lyft, we use Airflow.
I've seen outside of Cloudera, Airflow being a pretty popular orchestration tool.
Oozie had become the standard and is still used pretty heavily.
Before Oozie, there was Azkaban from LinkedIn. There was Luigi and so on.
And I think all of these mostly standardized to Oozie.
And I think perhaps the only choice that folks have to make now is Airflow versus Oozie.
And I've been working with Airflow now, and I'm a big fan of Airflow, having used Oozie for a long time too.
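What all of these orchestrators, Airflow, Oozie, Azkaban, Luigi, fundamentally provide is dependency-ordered execution of tasks in a DAG. Here is a toy sketch of that scheduling core in plain Python, not any of their real APIs, with invented task names:

```python
def topo_order(dag):
    """Return tasks in an order that respects dependencies, the
    scheduling core of orchestrators like Airflow or Oozie.
    `dag` maps each task to the set of tasks it depends on."""
    order, done = [], set()

    def visit(task, path=()):
        if task in done:
            return
        if task in path:                      # dependency cycle guard
            raise ValueError(f"cycle involving {task}")
        for dep in dag.get(task, set()):
            visit(dep, path + (task,))
        done.add(task)
        order.append(task)

    for task in dag:
        visit(task)
    return order

# A tiny ETL DAG: ingest first, then transform, then load and report.
dag = {
    "ingest": set(),
    "transform": {"ingest"},
    "load": {"transform"},
    "report": {"load", "transform"},
}
plan = topo_order(dag)
```

The real tools layer scheduling, retries, backfills, and monitoring on top of this ordering, which is where most of their value (and their differences) lies.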
Okay.
And I think this is a great example here.
We're talking a lot about all these kind of crazy words, Oozie and all this kind of stuff.
And I guess one thing I've found, certainly over the last year and a half when I've been working on a customer site, is that the things we were obsessing over were a lot more about how do we solve more generic kinds of problems around things like ETL and so on.
I mean, are you finding it's less about technology and more about kind of approaches now, taking ETL as an example?
Oh, my God, 100%. Yes. I think as the ecosystem is evolving, first, open source had this problem real bad because the barrier to entry for creating your own SQL engine is very low, right?
So you saw a lot of frameworks and engines pop up.
So if you were to take the example of ETL, there was MapReduce, then came Spark, then came like Tez, and you could run Hive on Tez.
You could run maybe even Spark on Tez, I don't
know how that flew, but then you could do Hive on Spark, and now there's Flink, right? So there's a
lot of initial conversation about this stuff. But as the ecosystem evolves, I think that conversation
is becoming more and more standard. So then I see myself that less conversations are happening around what framework to use, but more conversations are now happening around, okay, I've chosen a framework, but how do I use this framework to be the most productive, most effective in my organization? And so I can also give an example of what an ETL workflow look like.
And even if assuming you have standardized on a particular ETL engine, there's a whole slew of things that are still unresolved that an ETL developer has to do.
Shall we talk about that, Mark?
Yeah, this is similar to discussions I'm having at work at the moment, where you have very, very clever data
engineers that are writing
very small pieces
of code and
so on. But I suppose in a way, how do
they become more productive and how do they
think about making them...
Yeah, the productivity question. What do you
think about that?
Yeah, that's something I've been thinking a lot about.
So let's say,
I'll use Lyft as an example. And I'll use an example that many are familiar with. Let's say you work at Lyft, and you just want to do analysis on the number of Lyft rides that happen on a daily
basis, right? So your boss comes to you, like, can you provide me this analysis? So the first thing
you have to do as an ETL developer is search for which is the right table to use for rides, right?
There may be three or four different tables that may have rides in the name.
You don't know when they were last updated.
You don't know who owns them.
If you have a question, which table do I use?
And this is the problem of discovery, right?
Discovering data and references.
Okay, let's say somehow you figured out that this is a table
that you should use for calculating
number of rides in a day.
Then you have to look at this data and have a sense of what are the various columns, what
is the profile of those columns.
So if it's a string column, how many times does null appear as a percentage on a daily
basis?
If it's an int column, what are the min and max?
What are the count distinct?
How many different rows or different values
does this column have?
So you could perhaps do a count distinct
and understand how much cost it's gonna take.
So this category of questions is profiling.
So the first one is discovery.
Once you have discovered, now you want to profile
and understand the shape of the data.
Now, once you understand the shape of the data, you have to go and develop your SQL query.
Okay, and you're likely developing this in a fast iterative way in your local machine.
And for this, it's important that even if the backing data set is a terabyte or a petabyte, that you have interactive
response so you can develop your SQL query really fast. If you missed a semicolon or spelled the
column name wrong, you get the results really quickly and you can move fast, right? So this is
your iterative development phase. After this, you make your query go through some integration tests.
And this is usually a staging cluster where you may want
to bake your query for a certain
amount of time before you move it to production.
You may also hook on
a tool like Dr. Elephant to
analyze the performance characteristics
of this query and make sure that
it's not consuming too much CPU
and it just is a reliable query
in general. And once it passes
the staging test,
you may promote it to production.
So if we were to recap,
there are these five steps, I think.
There's a discovery, prototyping,
iterative development,
staging slash integration test,
and productionalization,
which an ETL developer
has to go through when developing an ETL pipeline or an ETL developer has to go through when developing
an ETL pipeline. And the point I'm trying to make here is if we ignore the first two steps,
which is discovery and profiling, because they're big enough to talk about on their own,
and just focus on the last three steps, which is development, staging, and production.
If you simply wrote a SQL statement, you would have to modify
that SQL statement as it goes from development to staging to production.
The reason being in development, your test data set is probably an anonymized data set
that's lying on your local machine that you're using to develop this.
And it may not obviously be in the same location where your data is in production.
In staging, maybe you're using a sampled-down data set.
So you only have one for every hundred records in the production data set.
And you're doing that for just speed of execution, right?
And this sample data set may be actually in a different schema or a different place.
And of course, in production, you have your data in the standard place.
So within the context of each of these environments,
data may be different. The kind of properties that you may set for the execution of this query
may be different. And it's not good for you to modify this code as you move from one context
to the other. So what we really need, and what's making these engineers unproductive today is the lack of one, is a more declarative, functional way of expressing
the SQL statement. So this functional ETL would then decipher the scope it's in. It figures out,
oh, I'm in the dev environment right now. And so I'm going to substitute all these schema names
to a particular local location that has been set for this anonymized data. Or, oh, I am in the
staging scope
and that I'm going to change all these schemas
to be the sampled data sets.
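A minimal way to picture this scope-aware, declarative style is a query template whose schema is resolved per environment at run time. The environment names, schema names, and query below are invented for illustration; real setups would typically do this with a templating engine and a config store rather than a dict.

```python
# A sketch of scope-aware SQL: the query is written once, and the
# schema is resolved from the environment (dev / staging / prod)
# rather than hard-coded into the statement.
SCHEMAS = {
    "dev": "local_anonymized",     # anonymized data on your machine
    "staging": "sampled_1_in_100", # sampled-down data for fast tests
    "prod": "warehouse",           # the real production data
}

QUERY = "SELECT ds, count(*) AS rides FROM {schema}.rides GROUP BY ds"

def render(query, env):
    """Resolve the schema placeholder for the current scope."""
    return query.format(schema=SCHEMAS[env])

dev_sql = render(QUERY, "dev")
prod_sql = render(QUERY, "prod")
```

The point is that the ETL developer never edits the query as it moves between environments; only the scope changes.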
And Maxime Beauchemin, the creator of Airflow and Superset,
also talks about this in a blog post.
And I think that's really the productivity conversation
and the tools that we need to build.
It's not just about the frameworks anymore,
but it's about like,
here's the workflow of an ETL developer, and here are the inefficiencies there today. What can we do in order to remove those inefficiencies?
Okay. So, interestingly, you know,
if any of my kind of old, old fogey kind of friends who worked in kind of Oracle data
warehousing and ETL tools and so on years ago were listening to this, they would go,
you've just described, you've just set out the requirements for an ETL
tool. And, you know, it's interesting how things
go in a circle, because some of the things you picked up
on there, you said things like template-driven or declarative
definition of kind of ETL flows that then get translated into the actual code.
That is what tools like in my old days,
tools like Oracle Data Integrator would do,
where you would have this graphical,
more symbolic mapping of source to target,
and then you could choose, for example,
the actual language or the platform you're going to deploy to.
It might be in, say, PySpark or it might be in whatever.
And so that's what they would do.
And so I suppose, you know, I suppose why is that kind of tool not used in a modern environment?
I mean, that's probably an interesting question as well.
I mean, I never hear of those tools ever being considered in a place like Lyft, for example.
Yeah, that's a good question.
I don't have a good answer to that.
I would say that many of these tools are reinventing the wheel, right?
And much of that has been for good, right?
So the invention of Hadoop and the SQL systems and Airflow and all that, I think, have been great innovations. But there are things that were figured out before. An example is the one you just used, Mark: Oracle Data Integrator. We don't have to fully embody that and have a really bulky, old-school tool, but we definitely can learn from the best practices, especially around productivity and effectiveness, from the past, and incorporate some of them in the newer ecosystem of big data.
Yeah, it's interesting. I think it's a classic thing. I've got two teenage kids and the one
thing they never want to do is ever take advice from their dad about anything. So it's one of
those things where I suspect there's an element there where people have to find things out for
themselves. And so I'm not putting you in that position, but sometimes you want to discover things yourself rather than be lectured by old people about, oh, back in our day we used to do this.
But certainly template-driven, I mean, we have, again, where I am now, a system where there are templates that generate SQL from some Clojure script and so on. And effectively, it's a rebuilding of a template ETL engine.
I guess the reasons they're not used in your sort of environments is, I'd imagine, you
know, things like, I think the whole thing of having a graphical interface that you have
to kind of step through things very kind of like, you know, in a certain way, that just
does not lend itself to being used in environments where people are used to using like GitHub
and they're used to using kind of, you know, code and all that.
So I think there's a cultural difference there for a start.
And a lot of these tools are based on Windows. And again, where I am now, there isn't a single Windows machine in the building, so in a practical sense that doesn't help. Cost is the other thing.
Yeah, the other thing there is also the pricing.
Yeah, exactly. All these new hipsters want open source tools, myself included, and I think the historical tools are not open.
Well, I think there's the open source and closed
thing, but there's also the cost. I mean,
if you think about the cost
of a tool like Data Integrator,
you're talking kind of like,
an average Oracle deal size used to be 100 grand,
and nobody's ever going to spend
that sort of money on something that you'd rather
build yourself anyway and learn from that process.
But there's interesting things there.
Another thing you mentioned was the deployment of code into different environments, so this thing about being able to deploy into test or development. But one thing you did say then that I don't think was addressed by those old tools was that if you're deploying into production, the database volume might be different, or you might deploy different code into different environments because of, I suppose, the size of it. That isn't taken care of. But yeah, in general it sounds like what you're putting forward there is a kind of spec for an ETL tool. And it's interesting you mentioned Maxime and Airflow, because that tool deliberately stops at a certain point. Maybe describe what Airflow is for people, just in case people aren't aware of what it is, and tell us what it involves and what problem it solves, first of all.
Yeah, Airflow is an orchestration tool
which is just a fancy term for a really fancy cron job. So you can say: run this job at nine. And I have this other job that I would like to run at 10, but it has a dependency on the 9 a.m. job, so please only run it either at 10 a.m. or whenever the dependency gets done.
Okay, okay. And so, but again, that deliberately stops at a certain point.
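The scheduling semantics Mark describes (a job that becomes runnable only when both its clock time and its upstream dependencies are satisfied) can be caricatured in a few lines of plain Python. This is a toy stand-in, not Airflow's actual API, and the job names and times are made up:

```python
# Toy illustration of Airflow-style dependency scheduling: a job runs
# only when its scheduled hour has passed AND every upstream job it
# depends on has completed.

jobs = {
    "load_rides":   {"hour": 9,  "deps": []},
    "build_report": {"hour": 10, "deps": ["load_rides"]},
}

def run_ready_jobs(current_hour, done):
    """Run every not-yet-done job whose time has come and whose deps are done."""
    ran = []
    for name, spec in jobs.items():
        if name in done:
            continue
        time_ok = current_hour >= spec["hour"]
        deps_ok = all(d in done for d in spec["deps"])
        if time_ok and deps_ok:
            done.add(name)
            ran.append(name)
    return ran

done = set()
print(run_ready_jobs(9, done))   # ['load_rides']
print(run_ready_jobs(10, done))  # ['build_report']
```

If the 9 a.m. job overruns, the 10 a.m. job simply waits for the dependency rather than the clock, which is the behavior described above.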
It's very much done in code. It's not done perfectly. But one other question I wanted to ask you before we get on to the next bit: it sounds like it's all on-premises sorts of systems you've got. Did you think about... you must obviously have heard of the whole serverless approach, and things like Athena and BigQuery and Dataflow. Have they been considered, or not really?
Yeah, so a lot of these systems were developed before all these serverless systems were the rage. These are on-prem systems in the sense... let's take Airflow as an example. It's an on-prem system in the sense that it can be used on premise, but we use it in the cloud. So, in fact, we have a set of machines on AWS that we use to host our Airflow instance.
In terms of serverless, I think it's something we're thinking about. It's not currently an
important part of the architecture. Okay. And I suppose another reason we use that here is
dealing with the kind of, I suppose, the auto scaling side. So when, for example, in the States,
it's a very busy time for ride-sharing,
how do you handle that kind of scaling up and scaling down?
Has that been a problem at all with things like Flink
or have you handled that really with the way you do things?
Right. Yeah, many of these pieces of software,
like you're alluding to, Flink has, like, scaling properties
and we have been able to handle the traffic
with the current functionality
of Flink.
I haven't seen a lot of problems related to the scaling.
I do think the other factor here is cost and that I have admittedly less visibility to
because you may provision for the max capacity, but if you're not utilizing that max capacity
all the time, then obviously you're paying the cost.
So auto scaling helps both with scaling for higher traffic and with lowering the cost when there aren't as many rides or requests.
Okay, okay. So you've done the ETL side and you've put the data in. What challenges do you face around kind of making that data available to people? And, I suppose, what are your thoughts around the data discovery side of things, really?
Yeah. And this is the first two parts of the ETL development
workflow, right? So discovery and profiling. But as you point out, this is not just in the ETL
development workflow, right? If you are an analyst who just wants to do some analysis,
they also have to go through the data discovery process. So there are multiple different personas for data discovery. We can focus just on the ETL development and the data-finding persona.
Yeah. And so here, like I was saying before,
one has to figure out which tables to use, but the problem is even more compounded in larger
organizations where you may not actually have just one database, right? So you may have some Oracle, some Teradata,
some Kafka, some HBase, and of course, some tables on S3 via Hive or Presto or Impala,
something like that. And you may want to learn all the attributes that we were talking about
earlier, like what's the schema, when was the last updated, who's the owner, who are the most popular users.
But you may also want to see, like, where actually is this table?
Is this an HBase table?
Is it a Grafana dashboard?
Is it a Teradata table?
And then you may also have organization-specific tags, right?
We obviously can't predict how an organization is structured,
so you may have certain tables tagged as marketing tables. And being able to essentially have like a search engine
that becomes your single go-to place, like Google has become for us on the web, right?
If there is a portal that becomes your single go-to place for finding all data. And so far,
we have talked only about the tables.
But imagine if we raised that level and started talking about dashboards.
So not only this search engine is indexing all the tables,
but it's indexing all the dashboards, right?
So when you search for a particular thing like weather,
it shows you all the dashboards that people have previously created
with weather, related to weather, ranked by their order of popularity.
And it shows you some tables as well, and you can see which one you want to use.
So that means that you can actually start reusing analysis instead of just figuring out what tables you need to use.
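The kind of cross-tool search Mark describes (one index spanning tables and dashboards, with results ranked by popularity) can be sketched roughly as follows. The asset names, types, and popularity scores here are all invented for illustration:

```python
# Toy metadata search: a single catalog spanning tables AND dashboards,
# with results ranked by popularity, as in the 'weather' example above.
# All asset names, types, and scores are hypothetical.

catalog = [
    {"name": "weather_daily",     "type": "table",     "popularity": 40},
    {"name": "weather_dashboard", "type": "dashboard", "popularity": 90},
    {"name": "rides_by_weather",  "type": "dashboard", "popularity": 75},
    {"name": "driver_earnings",   "type": "table",     "popularity": 60},
]

def search(term):
    """Return every matching asset, most popular first, whatever tool owns it."""
    hits = [asset for asset in catalog if term in asset["name"]]
    return sorted(hits, key=lambda asset: asset["popularity"], reverse=True)

for asset in search("weather"):
    print(asset["type"], asset["name"])
```

A real implementation would index far richer metadata (schema, owner, last-updated time, lineage), but the single go-to search box is the essential idea.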
Okay.
So that's interesting because there's a few points in there I want to drill into.
The profiling thing you talked about and how we can do that differently.
But interestingly, the search idea you said there,
I think is a kind of interesting one.
And in fact, we've got another show coming up.
I think it's probably going to be the one after yours.
I've recorded it already, which is with Doug Bordonaro,
I think it is, from ThoughtSpot.
And ThoughtSpot's actual product is what you're saying there.
It's to say, let's use the search interface
as our way of discovering the data
that's in our environment.
Can we use that, not so much for the profiling side,
but for the actual analytic side there, really?
And we spent a good half an hour with him,
him trying to actually describe effectively
an empty search box,
which was him describing the interface for his tool.
You know, it just was you just type stuff in and you find things from there.
Is that the sort of thing you're thinking of really, that kind of idea?
Right.
Yes, I am.
But I think what's perhaps different here is what all information is being indexed, right?
And so it's not just a particular kind of dashboards. I think it's very common for folks to build search boxes within one tool. But what I'm really advocating
for here is search boxes across multiple tools and at multiple levels, right? Both at the table
level as well as the dashboard level. So you're looking at the kind of lineage side and the
provenance side as well, really. So, you know, where did this come from?
How is it kind of loaded?
Or even just finding things at a level beyond the kind of things visible to the end user, really.
So the profiling side. Profiling, again, you know, putting my old-man hat on, these ETL tools have had data profilers for a long time now. But you sort of need to know what it is that you're profiling, and it's a very expensive and time-consuming process to do. One wonders if there are better ways we can do this now, with search and with what I think we've learned over the last few years. So with the profiling side, how do you think that can be improved, really?
Yeah, I think there are a few technical ways
that are super interesting around profiling.
So I'll give you an example.
Parquet is a really common file format
to use for storing your data.
In fact, Lyft uses Parquet heavily.
And Parquet already has some information in its footers
around stats that are included in that particular row group,
which is just a segment of
data in Parquet. So if you store your data in Parquet, you already have information, for example,
in the file itself, in the footer of the file, around some stats like max, min, and so on.
What that means is that the historical process of finding stats, for example, running a SQL query and then computing all these very heavy statistics on all the data you have, is now already in some ways available to you.
And it's already used for query optimization.
That was the original goal for having these stats in Parquet footers.
Can we leverage those same stats to make them available directly to the users, so humans can actually see the profile of the data in the same way a query optimizer would? That's exactly the kind of improvement I mean.
So what about, you mentioned Airflow again earlier on, the kind of metadata that is provided in Airflow? How do you use that?
For example, Airflow knows the delivery times of all tables.
So you could definitely and should definitely surface that in this one single system.
There's also lineage metadata in Airflow that you're referring to.
But I think lineage metadata is only available within DAGs and therefore doesn't really map across DAGs.
And if you forgot to put a dependency,
let's say your SQL query actually depended on a few different tables,
but in the DAG you forgot to express that dependency,
that dependency is lost.
One other kind of lineage information that's not available through just Airflow
is lineage across systems or lineage across like tables and dashboards. So if a dashboard uses a
particular table, then that obviously is not available to Airflow. So Airflow has a very small
legitimate view of dependencies, but only within the context of tables, not necessarily in the
larger context of the usage across data. Okay, so it's almost like you can imagine a kind of Google search really for the data within your organization and I
suppose that'd be really quite valuable. I mean have you seen that idea out there in the
market at all? Is it something that you think would be useful and you've seen it
being used at companies?
Yeah, I haven't seen a direct manifestation of that. There are definitely attempts that have been made at doing something similar.
So there's a vendor, Alation, that's fairly commonly used for data discovery, but it's slightly different from what we've been talking about.
Airbnb has a data portal.
LinkedIn has WhereHows, which is open source.
Facebook has an internal system called iData
that's doing something similar. Netflix has Metacat, again, doing something similar. So lots of attempts have been made.
It's been great hearing what you're doing on a customer site now, as opposed to sort of the Cloudera era.
So thank you very much for coming on the show.
And how do people find out about you really, Mark?
Yeah, that's continued.
I still speak at conferences.
I still try to do blogs whenever I get time. And I'm still really enjoying that.
And it's always a pleasure being here at Drill to Detail and talking to you, Mark.
So I'm really, really grateful for you having me here and giving me a chance to share.
That's great.
And your book that you wrote a while ago, maybe just, kind of, what's that called again? Because the book itself, the bit about the architectures, I found was so timeless, really. You know, tell us about that again. Where was that published, and how do people get hold of it?
Yeah, for sure.
The book is called Hadoop Application Architectures.
And even though the name suggests it's related to Hadoop, it's really big data architectures.
And the book's divided into two parts.
The first part talks about design decisions that one makes in general when storing and
processing big data, like do you need to store them in a row format or a column format? What
should your row keys look like if you're using something like HBase or Cassandra, and so on and
so forth. But I think what has become really popular amongst people is the last part. The
second half of the book talks about these larger architectures, things like data warehousing using these big data tools or fraud detection using these big data
tools or event analysis using similar tools. And I think we put a lot of thought into
what is the state-of-the-art architecture for building a fraud detection system or building
a data warehousing system. And that's become super, super popular. There were four co-authors for the book: myself, Ted, Jonathan, and Gwen. And we're really glad with the response we've seen to the book. We still keep getting great feedback, and we look forward to hearing more and to how we can improve that knowledge share in the future.
Yeah, I mean, I think I said to you in the first episode I recorded with you that, from my Oracle days, it reminded me of a Tom Kyte book.
You know, it was not only how you do things, but it was why.
And that for me was the thing that was really interesting.
And that's why I originally approached you to come on the show then, because it was why do we do things and what is the rationale behind it?
You know, that was that's quite rare in kind of big data books.
And that for me made it a book that I still refer to now, actually. So I'd totally recommend that. So, Mark, thank you very much for coming on the show. It's been great to speak to you again. Take care, and hopefully speak to you again at some point.
Yeah, thank you, Mark. Really appreciate your time as well. Really enjoyed being here.