Drill to Detail - Drill to Detail Ep.26 'Airflow, Superset & The Rise of the Data Engineer' with Special Guest Maxime Beauchemin
Episode Date: May 15, 2017. Mark Rittman is joined by Maxime Beauchemin to talk about analytics and data integration at Airbnb, the Apache Airflow and Airbnb Superset open-source projects, and his recent Medium article on "The Rise of the Data Engineer"
Transcript
So my guest on this week's episode is Maxime Beauchemin, who works at Airbnb as well as being
a main committer on the Apache Airflow and Airbnb Superset projects. So Maxime also wrote a
very good and influential blog post recently
entitled The Rise of the Data Engineer.
And I've invited him onto the show to talk about that post,
his work as a data engineer at Airbnb,
and how he got to that point having worked in a more traditional BI developer role many years ago.
And also his work on Airflow and Superset,
which I know many of you have been kind of listening to and hearing about and so on.
It's quite good to get the person behind it to talk about it as well. So Maxime, first of all, thank
you for coming on the show. Welcome. And just introduce yourself properly and what you do at
Airbnb at the moment. Perfect. Thank you for having me on the show. So it's an honor to be
on the show. So I'm going to talk a little bit about how I got to Airbnb and what I do at Airbnb now.
So I come from Facebook.
So I used to work at Yahoo and Facebook and now at Airbnb.
And what brought me to Airbnb was, well, first, it's important to believe in the mission and believe in the company. And the mission of Belong Anywhere at Airbnb really resonated with me. So, you know, this idea that home is not necessarily a constant, or maybe
home can be something that changes over time as you change lifestyle through your life.
And, you know, I like some of these ideas.
And then, you know, I spoke with people at Airbnb a few times casually about potentially working there.
And it was just really apparent to me that I could, you know, have so much impact there coming from Facebook where arguably they are a few years ahead,
at least in terms of data and tooling and all that stuff.
I spoke to people at Airbnb and I was like,
it felt like I had seen the future.
I could bring that to Airbnb and help them jump
and maybe skip forward and have a lot of impact there.
And it was also clear that they needed something like Airflow at the time.
So Airflow, for context, is a batch process orchestrator.
And as I was speaking with the people there, they're like, if you join Airbnb, you can start working on this project and make it open source.
And it was really important to me, this idea of I want to manage a, I want to start a big open source project.
And this might be just the opportunity.
So that's what brought me there in the first place.
Okay.
Okay.
So your background, what's interesting as well is your
background, and your route into this development is from quite a traditional kind of BI development
role. And I think when you worked at Facebook at the start, you were classed as a kind of BI
developer. What's your kind of history in that sort of area? Right, so I started my
career very early on, around 2000. I did a little bit of web development.
But soon after, I started getting involved in the data projects at Ubisoft at the time.
And they were starting to talk about building a data warehouse, which I guess some of the theory, some of the books about data warehousing had been written in the 90s.
And bigger companies were building data warehouses
and, you know, Ubisoft was starting to get serious around that and building a warehouse.
And they bought this package called Hyperion Essbase, and they were looking for
a techie to manage it and, you know, help make that project successful internally.
So then I started working on all of these things.
So building the warehouse, we had the Microsoft SQL Server suite at the time.
I believe it was SQL Server 7.
So very early on in the projects, we had, I believe a little bit later, we got Business Objects, but kind of this traditional stack.
And we started basically reading the books and building the warehouse and working with people at Hyperion to build our financial solution around their tools. So that's where, you know, coming from, so I've got seven years or so at Ubisoft
where I was just focused on traditional data warehousing,
business intelligence, ETL, store procedures,
and all that stuff.
So that's my foundation.
And when I left to go to Yahoo,
that was a big shift because Yahoo was a lot more,
you know, somewhat closer to what we think of a
data engineer nowadays. So more programming and scripting, and perhaps a little bit less
tooling, more thinking in parallel, big data type stuff. And it was also the rise of Hadoop at
Yahoo at the time. Okay, okay. So that leads quite nicely into the reason
that I wanted to speak to you really.
So you wrote a blog post recently on Medium
called The Rise of the Data Engineer.
And I think, you know,
I don't know what the numbers are like on it,
but certainly it looks like a very kind of,
it looks like an article that resonated
with a lot of people.
And I think it summarized some of the changes
that are happening within our industry
and how the role of BI and data warehouse and developers changed over time
and how it's different now within organizations like Airbnb, for example.
So maybe just tell us a bit about, to summarize what the article was
and what was the background to it?
What kind of motivated you to write the article, really?
Right, so I had this thought for, I think it had been at least a year or two,
that I'd been thinking about writing something like The Rise of the Data Engineer. And I'm sure if you Google "the rise of
the data scientist", you'll find an equivalent or similar post in a lot of ways, where, you know,
at a point in time, someone decided to kind of ground this idea of like, what is a data scientist?
What do they do? Why do organizations need them? And I'd been thinking, you know, now that the
word, or the title, data engineer was getting thrown around and was becoming quite popular,
there was nothing that had really defined: what is a data engineer?
How does it relate to existing positions like, you know, business intelligence engineer or data warehouse architect?
Or how do data engineers and data scientists collaborate
together? So I thought there's a great opportunity here for me to, from my specific perspective,
to explore what is data engineering and to kind of define it myself since no one had done it before.
I was thinking maybe I have the opportunity here to define it for others so that my vision becomes the actual vision for this industry.
And you mentioned the numbers a little bit. So this post, I was surprised to find that
it got extremely popular on Medium. And so I have about 65,000 views and
believe it or not, 20,000 people read through the entire article.
So that means it's something important.
Something I wanted to mention, too, is this other post that came, I think, soon after.
What is it called?
I'll try to dig it out.
But it was around data engineering as well, stating the fact that at the time they wrote it,
there were about 6,500 people on LinkedIn
calling themselves data engineers,
while there were just about the same number
of job reqs open trying to hire
6,500 data engineers in San Francisco alone.
So there's definitely something big happening in this space.
Okay, so again, I think what resonated was, certainly with me,
was that the world you described and the way that kind of BI development
and ETL development and so on is done within startups
and within kind of companies working with large amounts of data and so on is kind of different.
It's a different kind of role really to BI development and so on there.
So why don't you just outline in a way what is it that you do day to day at Airbnb in
terms of development and the development process?
How does that differ, do you think, from the things that people are more used to with kind
of formal ETL and formal BI development and so on?
What's different? What warrants it being a different kind of role, in your mind, to BI
development? Right. So the first thing is, you know, business intelligence engineering and the
tool set and those processes from before, they still exist in a lot of
organizations. And some organizations are taking a different approach to data and analytics and ETL.
But I want to say that the old approach still exists and is still valid.
And the tools from the past still work well for a lot of organizations.
What is different, though? So one major factor, I believe, is the rise of, I guess, I hate to say that word, but big data, and the big data tooling and the Hadoop ecosystem are very different, as traditional databases and computation and storage have changed quite a bit. So the tool set in that environment
and the scale has grown quite a bit. And I think a lot of the tooling and processes from the past
don't work anymore, which warrants a new set of tools and a new approach.
I believe also the information work is getting more technical in
general. So that means traditional analysts might be able to write SQL nowadays, but everyone is
climbing that ladder of complexity and becoming more technical. And for data engineers, that means,
in a lot of cases, writing more code, where in the
past maybe ETL tools were more drag and drop. At this point in time, we're expecting modern data
engineers to write high-quality code, because the problems we're solving are complex and require,
you know, potentially more abstract tooling and being able to
write code. That touches some of the elements of the answer; there's so much more to it, but yes.
Okay, okay. So, and again, I suppose with data engineer there's a deliberate
distinction there made between data engineer and data scientist. So again, how does it differ?
How would you say a data engineer differs from a data scientist? What point are you
trying to make there, really? Right, so first, to try to ground, you know, what a data scientist
is: so to me the term, you know, there's kind of a real definition, and it's been
overloaded quite a bit. But to me, data scientist has something to do with,
well, first, it's an analyst that can write code,
or someone with strong analytical skills who is able to code.
There's also an element of publishing, perhaps, right?
So science is academic, and there's an element where potentially you could
say like a data scientist, if they're really doing science, they should publish articles,
do peer review, and follow this scientific process, I would say. Now, where I see the term
as being very overloaded is, you know, I think data analysts that live in San Francisco are called data scientists because they want to be called that, because it's a sexy name.
And that's a modern appellation.
It's something that people aspire to, this title.
So it's been overloaded quite a bit.
And now in relation to data engineering, so to me, the core of data engineering is, you know, the core role is someone who would build data structures and data pipelines for an organization.
And that's, you know, essentially what we used to call ETL.
But ETL has changed quite a bit in the face of new tools, a new set of tools.
Also, some of the processes and some of the new tools have redefined some of the foundational concepts of ETL.
For instance, data modeling, I think data modeling hasn't changed necessarily that much. If you look at concepts like, you know, star schemas and dimensional modeling, I would say some of this still applies, but has changed enough in the light of new tools and databases that don't necessarily have the same constraints as they used to have. So where is the line between data science and data engineering?
There's probably a fair amount of overlap too,
and we want people to overlap.
We don't necessarily want to put a wall there,
but I would say data engineers care most
about building data structures and data pipelines for longer-term solutions.
While data scientists might be focused on something this week and something else next week,
the engineer would be building a longer-term solution.
Also, on the side of data science, there's this idea of using machine learning quite a bit.
And that is also true on the data engineering side, but maybe on a slightly longer term vision.
Okay.
Okay.
So I think certainly, I think the first mention I heard of the term data engineer was,
that was obviously your post there.
I think Curt Monash posted something a while ago, again, making this distinction between
not everything
you do within big data and so on is data science. You know, there's
people that specialize more in the infrastructure and the architecture and the
pipelines, as you said, and that is a distinct kind of role in itself, really. I think
you hit on it there with the ETL part, and I think, having come into that world myself
from kind of a more traditional world of ETL tools and Informatica and so on: ETL, like you say, ETL is changing. And I think there's
a question as to whether or not what we do now, with scripting and everything being at this
more kind of, I wouldn't say immature, but certainly a more basic
level, whether that is a function
of how new this is, or whether the way we do ETL has changed completely. And I
think I'd be interested to talk to you later about Airflow and so on. Did you see that blog
post? There was a blog post by somebody else as well, which was Engineers Shouldn't Write
ETL, and it was by Jeff Magnusson, and it was out recently as well. And it was a similar kind of
topic, but it was talking about how, because ETL has changed, different people should be doing it,
and doing it in a different way. I mean, fundamentally, do you think ETL is different now? How
would you approach it differently, really, and what's different about doing it in
this environment? Yes, so I'm not familiar with the article you're mentioning, but I'll definitely look
it up, and I'm curious. It sounds controversial, so now you've got my interest.
Yeah, there's two points to it, really.
One is that ETL has changed, like you said, but then there's a point of saying that actually if it is an ETL task, then actually it should still be done the old way.
But if it's different, if it is data engineering, I don't know.
The point of it is saying that, in a way, you shouldn't make data scientists be ETL developers just because it's different data. It's an interesting kind of area, really. Yeah, that's one thing:
you know, data engineers are kind of here to save data scientists from doing ETL in a very
poor manner. So I believe that was the end goal. When I got to Airbnb, there were
already, you know, dozens of data scientists that clearly did not know much about data structures and data
pipelining and were doing a horrible
job. While they were really good
at what they do, they were not good
at data engineering.
As I came
in, there was a small team
of data engineers or
ETL people really at Airbnb that were
building data structures
and pipelines that data scientists
could use so that their analysis would be built on the foundation of strong pipelines.
So instead of going back to the raw tables and the raw ingredients and building their
dirty derivatives, they would start from where data has been cleansed and organized
and where there's been consensus on defining metrics and dimensions.
And it then becomes a lot easier for them to get the right metrics
and get analyses that were in line with each other.
Now talking about ETL and how that's changed, I don't know if you want to take the tangent.
No, no, please.
Okay, so on ETL, so how has it changed?
So in the 2000s, I would say there's been this rise of a lot of ETL tools by vendors,
business intelligence vendors, selling things like Informatica,
IBM DataStage, SQL Server Integration Services,
and Ab Initio.
So a whole set of tools that were all drag-and-drop tools.
So the idea was you have this software package,
you connect to your data sources,
you drag and drop your table,
you drag and drop transformers,
and you build a small graph of data, objects, and transformations.
And so really often they would have these data flows and workflows,
and you'd build those by drag and dropping.
And that's all fine and dandy.
The premise was that people working with data
perhaps did not know or didn't want to write code,
you know, so that they would do drag and drop.
In theory, it would make that easier for them.
But then, you know, in the post,
in The Rise of the Data Engineer,
I argued that the problems we're solving now, and perhaps that we were
solving at the time, are too complex to be done with drag and drop tools.
With drag and drop, while it might seem easy at first, you lose everything that you get in software engineering from writing code: things like source control and being able to diff different branches,
being able to create abstractions,
being able to create blocks of code that you're going to reuse
through looping, inheritance, composition.
You'd have some version of that in the drag and drop tools.
But it still made it hard to do things that are easy to do
when you're writing code.
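The reuse argument can be sketched in plain Python: a small function plus a loop replaces what would be N copies of the same drag-and-drop box. The task model here is a hypothetical illustration, not any particular ETL tool's API.

```python
# Hypothetical task model: each "task" is just a dict describing a load step.
def make_load_task(table):
    """Build a load-task definition for one table (illustrative only)."""
    return {
        "name": f"load_{table}",
        "command": f"INSERT OVERWRITE TABLE {table} SELECT * FROM staging_{table}",
    }

# Looping over a list replaces dragging the same transformer onto the
# canvas once per table, and adding a table later is a one-line change.
tables = ["users", "listings", "bookings"]
tasks = [make_load_task(t) for t in tables]

for task in tasks:
    print(task["name"])
```

The same idea extends to inheritance and composition: a family of similar pipelines becomes one parameterized function rather than many hand-maintained diagrams.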
And I think maybe I argue in the post,
and I'd like to maybe write a post that would be more specific
about why drag and drop is not the right abstraction, or why it was a mistake, almost, to have a decade of drag and drop tools in the ETL space. Code is the best way to express logic. And there's a reason why software engineers
are not drag and dropping for loops on the screen,
and that they write in an actual programming language.
And I believe a lot of those reasons
why software engineers write code
and don't do drag and drop
in some sort of development environment
is because it's a superior abstraction
and it's something solid that is timeless.
And that applies to data engineering as well as it does to software engineering.
But is that not something that is true but therefore limits the people who can do this to a very small set of people?
I mean, I guess the point of the drag and drop, point and click kind of ETL tools was to make it possible for people other than software engineers
to do this work.
So do you not think that within the industry we're in now,
this is more, at some point,
how they're going to scale up to handle this really?
I mean, do you not think at some point drag and drop will come to this
or is it just fundamentally flawed, do you think?
Well, so drag and drop might be okay for a certain level of abstraction.
So if you're doing something simple, I'm trying to equate it to, maybe, these Lego tools for kids who want to learn how to code: they might go into some toy environment where they can specify a series of actions as visual blocks. And it might be a good mental model for some, or for people ramping up.
But if you're writing software at scale,
you need things like source control
and you need to be able to diff your code
and you need to be able to create a class,
create a function, create these reusable blocks.
And I believe that in drag and drop,
people have created that, right?
So you can have a for loop as a block, right? That way you would drag and drop a for loop,
and maybe the abstraction is more visual, but it is the same or a similar level of abstraction; it's
just the means, the process to do it, that is different. But if you can understand a tool like Informatica
in all of its glory and complexity,
I believe you can probably understand
the same level of abstraction written as code, right?
Like, I don't think people would be,
oh, I'm able to drag and drop a source table,
but I'm not able to instantiate a source object.
I believe it's the same abstraction.
And if people are capable of these abstractions in a drag and drop environment,
they would be able to do that in code.
It's definitely interesting.
I mean, my experience has been, within this kind of industry,
there is no equivalent of something like Informatica. You know, a lot of things
that we had from the BI world have now resurfaced in the kind of big data world as
such. You know, we've now got platforms like BigQuery and Athena and so on that give us
a more tabular interface over the data we've got; you know, tools like
Looker, for example, and Superset that do a more user-friendly BI and analytics sort of platform
on top of this. But there is no equivalent of Informatica; there is no
kind of graphical point and click tool for big data. But what there is, is things like Airflow
that you're working on. So, I mean, tell us about Airflow. What that is, tell us what problem it solves,
and what it is, first of all.
Right, so Airflow is a,
I would call it a workflow orchestrator
for modern enterprises or organizations
that are working with data.
And I guess, you know, fundamentally,
Airflow is just a way
to schedule and run a set of jobs and tasks with complex dependencies. And, you know,
in modern organizations, you have, you know, perhaps a few people or dozens of people or hundreds of people working with data every day.
These people will write jobs that need to be on a schedule, and that typically depend on each other.
So ETL is a very classic example of that, where, hey, I want to load my fact table,
but first I need to make sure that the source data for the day has landed. Once the data lands, I'm going to populate my dimensions in a certain order, based on whether things have landed, whether all the dependencies are met. So you have these complex sets of processes that need to run on a schedule with really complex dependencies, and Airflow is a tool that helps people orchestrate all of
that. And to give you an idea of the complexity of these workflows in modern
organizations: at Facebook, I believe, at the time where I left, so about three years ago, we were running hundreds of thousands of tasks every day. And at Airbnb now, I believe, using Airflow, we run around 60,000 tasks every day. And these tasks need to run in a very specific order. Each one of these tasks depends on a complex network of other tasks. And these tasks can go from, you know,
populating, you know, data in a table
or in partitions to, you know,
data that can help different parts of the business.
So you can picture there's whole workflows of tasks
for areas like, you know, payments and fraud detection
and search ranking.
So each team has their own sets of complex data pipelines or workflows
that need to be orchestrated in a very specific way
and run every day on a schedule.
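The dependency logic described here, wait for source data, then populate dimensions, then load the fact table, boils down to topological ordering of a task graph. A minimal sketch in plain Python using only the standard library (this is the underlying idea, not Airflow's actual API; the task names are illustrative):

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on: the classic ETL
# shape from the discussion, expressed as a dependency graph.
dependencies = {
    "wait_for_source_data": set(),
    "load_dim_users": {"wait_for_source_data"},
    "load_dim_listings": {"wait_for_source_data"},
    "load_fact_bookings": {"load_dim_users", "load_dim_listings"},
}

# An orchestrator's core job: produce an order where every task runs
# only after all of its upstream dependencies have completed.
order = list(TopologicalSorter(dependencies).static_order())
print(order)  # source lands first, the fact table loads last
```

A real orchestrator layers scheduling, parallelism, retries, and monitoring on top of exactly this ordering step.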
Airflow also makes it easy to not only author these jobs,
but to monitor them and track them, and to stay sane
while trying to understand: why did the data not land today? Or why has it not landed yet?
And where is the error report? And can I get some retries when there are some transient errors? Can I
get some tasks to retry within the parameters that I set?
Can I get alerted?
Can I get easy access to my logs?
Can I get alerted when things have not landed in time? So Airflow is a whole set of tools around authoring, monitoring, and troubleshooting these complex workflows of jobs.
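The "retries within the parameters that I set" behaviour can be sketched in a few lines of plain Python, as a stand-in for what an orchestrator does around each task (the function and parameter names here are illustrative, not Airflow's API):

```python
import time

def run_with_retries(task_fn, retries=3, retry_delay=0.0):
    """Re-run a task on transient errors, up to a configured number
    of attempts; re-raise once the retries are exhausted."""
    for attempt in range(1, retries + 1):
        try:
            return task_fn()
        except Exception:
            if attempt == retries:
                # Out of retries: this is where alerting would fire.
                raise
            time.sleep(retry_delay)

# A task that fails twice with a transient error, then succeeds.
calls = {"n": 0}
def flaky_task():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")
    return "landed"

print(run_with_retries(flaky_task, retries=3))  # prints: landed
```

In a real orchestrator the same wrapper would also emit logs per attempt and send the alert email or page when the final attempt fails.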
Okay, and this was developed at Airbnb and it's open source, is that correct?
I mean, I guess this is something that you felt was a key thing you needed to be able
to do and have to do what you're doing now as a data engineer.
Right, so as I left Facebook: Facebook has a set of similar tools, one called Dataswarm
and something called DataBee. And those were internal tools that were not open source at the time, but were also similar
in a lot of ways.
So one thing is, like, they were written in Python, they worked at scale, they allowed people
to author their workflows and troubleshoot them. And one of the core ideas was being able
to dynamically author workflows, and maybe I'll get into that in a little bit: the idea of
being able to not only write a static workflow, but to write a program that will define a workflow dynamically.
And as I left Facebook, I thought it's going to be really hard for me to operate at the same level that I was operating at Facebook without these tools.
So first thing, I'm going to build the tools that I need,
and then I'm going to be able to solve the problems that I've been solving with the right set of tools.
And I believe that people, so the people at Airbnb at the time were looking at some of the open source solutions that existed.
So there's something called Oozie, and Azkaban, and Luigi.
And we looked at all these tools and we decided that we wanted to build something new in the light of people coming from the places
where these tools had been written
and saying you shouldn't use,
like someone from Yahoo said,
please do whatever but don't use Uzi.
And someone from LinkedIn was like,
do whatever you want to do
but make sure to not use Azkaban.
And so together we're like,
and I came from Facebook and I was like,
I wish I had the tools from Facebook.
And we decided to take a new approach:
can we build something similar,
perhaps better than these tools,
open source it in the process,
and give it out to the community?
Excellent, excellent.
So, I mean,
in terms of your involvement with this,
I mean, obviously you're heavily involved there.
You're a committer.
You know, what's the, how much time do you spend on this
and how big an involvement have you got with this?
Right, so Airflow specifically was really my baby.
So I started the project.
I wrote the first line of code.
I was probably the lone committer on the project
for the first, let's say, six months to a year, before the project
started getting any attention externally, or before we even announced that we'd open sourced it. So that was a piece of software that
I wrote from scratch, you know, and that I pushed forward, and wrote the code, the documentation, the unit tests,
and eventually onboarded all sorts of people onto the project.
And now, I would say, my first year and a half at Airbnb,
so it's been two and a half years now,
but I was mostly focused on Airflow
and solving internal problems at Airbnb using Airflow.
Things like rewriting our experimentation or A/B testing framework,
and then collaborating with teams and making sure they were able to build
what they needed to build using Airflow.
Okay, so you mentioned dynamic generation
there. I suppose, in a way, you know, you've solved some things with
Airflow, and you mentioned dynamic generation there and so on. What are the ways in which you're taking
this forward that are kind of non-obvious to people from more traditional backgrounds?
Because, I mean, it sounds really interesting what you're trying to do there
and so on. Tell us about that, and where you see this going, really. Right. So, dynamic workflow generation. So if you think of concepts like, I would say,
analytics, say as a service or analysis automation, or the whole idea that potentially,
instead of having a data engineer writing individual workflows
that are static, a data engineer could build something that can be used to generate workflows.
So it's a level of abstraction over what a data engineer would normally do.
So let's say you need a specific kind of ETL
for an experiment that you want to run
and you want to run an A-B test on your data platform.
And perhaps, you know, there was a time, you know,
early on at Airbnb where maybe you would write
a small pipeline just for that specific use case.
And the day after you want to run another experiment,
but it's slightly different.
And this time you're going to have to write a different pipeline for that new experiment.
Now, Airflow allows us to write a piece of code that perhaps can read a
config file, or some configuration in a database, and based on that create complex workflows
for each experiment with a set of parameters.
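The config-driven idea can be sketched in plain Python: one piece of code reads per-experiment configuration and emits a workflow per entry. The config shape, task names, and `build_workflow` helper are hypothetical illustrations, not Airbnb's actual framework or Airflow's DAG API.

```python
# Hypothetical per-experiment configuration, as might be read from a
# config file or a database table.
experiments = [
    {"name": "new_search_ranking", "metrics": ["bookings", "clicks"]},
    {"name": "homepage_redesign", "metrics": ["signups"]},
]

def build_workflow(config):
    """Generate the ordered task list for one experiment from its config."""
    tasks = [f"wait_for_{config['name']}_events"]
    tasks += [f"compute_{m}_for_{config['name']}" for m in config["metrics"]]
    tasks.append(f"publish_{config['name']}_report")
    return {"dag_id": f"experiment_{config['name']}", "tasks": tasks}

# One generator, many workflows: adding an experiment is a config change,
# not a new hand-written pipeline.
workflows = [build_workflow(c) for c in experiments]
for wf in workflows:
    print(wf["dag_id"], len(wf["tasks"]))
```

In Airflow the same pattern works because a DAG file is an ordinary Python program: it can loop over configuration and instantiate a DAG per entry.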
Other examples of that could be things like,
well, experimentation is a good use case for it.
I have a talk called
Advanced Data Engineering Patterns with Apache Airflow that tries to describe a bunch of use cases for doing this sort of stuff.
An example of that for us is we have this tool called AutoDAG, where people, say an analyst or a data scientist who wants to run a certain
query every day, you know, they can put in a config file with some configuration elements,
and easily get to a point where this is going to be scheduled and run on their behalf, and there's
going to be some automation there. You know, another example of this would be: say, in an organization like
Facebook or Airbnb, we want to compute the same set of metrics for different
areas of the business over and over. So things around, say, engagement and growth accounting.
So understanding how many people are using
a certain feature on the website, and how many people are new, churned, resurrected, stale, or active.
So you can picture that we would allow people to fill in a simple form, saying,
hey, I would like to compute this for my area of the business, and configure it in a very specific way,
perhaps saying I'm interested in specific timeframes,
dimensions, demographics.
And by filling in this form, or this configuration,
they would build, dynamically, a complex workflow on their behalf.
So that becomes kind of the work of a data engineer as a service,
somehow. I can get into more complex use cases; I'm not sure, do you want me to go on? Well, that's interesting, and that actually leads on. I want to get onto the data modeling bit you talked
about as well in a second. But going back to that post, which you haven't read, so it's not
kind of fair to go into much detail, but the thrust of the other blog post that I mentioned,
the one about should
engineers be writing ETL code, I think is interesting given what you just said. Because the
thrust of it was that data scientists and
data engineers are always looking for new and novel ways to solve things like ETL, whereas
actually, in fact, by doing that, you know, we end up building systems that are
not as stable, and a lot of this work is more doing than thinking. You know, do you think
that's the case? Or do you think, in the world that you operate in now, that you've got to be a bit more agile, a bit more forward thinking, a
bit more dynamic in how you do ETL? I mean, what do you think on that? Well, ETL, it's true that in some ways it's mind-numbing,
but I would say the easy component of ETL, or the mind-numbing component of ETL, can, with the right set of tools, be abstracted out and solved very quickly.
Now, there are things like consensus-seeking,
say, how should we define metrics and dimensions?
And how should we structure our table and our workflows?
And how should we write optimized, performant ETL at scale? That's more challenging.
Change management is horrible in ETL, right? It's so hard. Say you want to change the definition of a metric slightly; then there are all these derivative tables that you need to reprocess, and Airflow certainly helps with these problems. But ETL is necessary, right? Whether it should be in batch, or whether data pipelines should run in a batch or streaming fashion, I think is less important. But how the data in your organization should flow and get organized is a really important and core problem for modern companies. And there's no way to get around it, I would say. Yeah.
You mentioned also that data modeling is changing. So not only did you talk about ETL changing, but data modeling as well, and I guess that's a big part of it as well, really?
Right. A few more words around the idea of ETL and why it's necessary. It's a little bit like, if you think of the data engineers as the librarians of data, right?
Like, they're the people who, the equivalent in a library would be, they organize all the books, put them on the shelves in the right place, file them, and are in charge of managing the metadata, or the little cards by which you would search and find books.
So it's really important to take all of this data that you get, that's dirty and complicated and comes from different sources and doesn't line up in a lot of ways, and to line it up and organize it and store it for the future, for the well-being of analytics at your company, so you can actually ask questions, get answers, and be somewhat structured in the way you do this.
Now, data modeling is changing. I would say a lot of the books I would still recommend people read, you know, the Kimball books. I believe star schemas and dimensional modeling are still true in a lot of ways, but there are things that are somewhat less relevant.
One thing is the way that we store data now, with columnar databases or columnar file formats like Parquet and ORC.
Things like creating surrogate keys, and now I'm getting a little bit technical, so I'm not sure what percentage of the audience will relate to what a surrogate key is. But now that we have dictionary encoding, and we have file formats that are potentially columnar, do we need surrogate keys anymore? I mean, there are other reasons why we may need surrogate keys, but maintaining surrogate keys in traditional data warehousing was fairly complicated and heavy.
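To illustrate why dictionary encoding changes the calculus: it does transparently what a hand-maintained surrogate-key pipeline used to do, replacing repeated values with small integer codes plus a lookup table. A rough pure-Python sketch of the idea (columnar formats like Parquet and ORC do this automatically at the storage layer; the example values are made up):

```python
# Sketch of what dictionary encoding does under the hood in columnar
# formats like Parquet/ORC: repeated values become small integer codes
# plus a lookup dictionary, with no surrogate-key pipeline to maintain.

def dictionary_encode(column):
    """Return (codes, dictionary) for a column of repeated values."""
    dictionary = []
    index = {}   # value -> code
    codes = []
    for value in column:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return codes, dictionary

# A low-cardinality natural-key column, as you'd see in a fact table.
listings = ["paris-loft", "tokyo-flat", "paris-loft", "paris-loft", "tokyo-flat"]
codes, dictionary = dictionary_encode(listings)
print(codes)       # [0, 1, 0, 0, 1]
print(dictionary)  # ['paris-loft', 'tokyo-flat']
```

The storage-size win of an integer surrogate key is thus largely achieved for free, which is part of the argument that hand-maintained surrogate keys matter less than they did in row-oriented warehouses.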
And so you'd have all these problems around late-arriving facts
and preloading dimension members and this whole idea of,
you know, there's entire chapters in these books written around slowly changing dimensions,
and you're probably getting bad flashbacks thinking about these slowly changing dimension ideas.
But I would definitely argue that for slowly changing dimensions, we have kind of shortcuts, that we have new solutions for these problems that are simpler. And perhaps in some cases that's due to the fact that storage and compute are cheaper than they used to be in relation to engineering time.
And that's one reason. And then some of the new serialization formats or database engines make some of the optimizations, or some of the performance gains, we would get from, say, managing surrogate keys not as significant anymore. From a perf standpoint, we don't necessarily need that, because the databases are able to kind of do that on our behalf
without us thinking too much about it.
Yeah, definitely. So I think you're also involved in the Superset project as well. Is that something you can tell us about, what it is, and I suppose what it's trying to achieve as well?
Right. So it's my second big project, and I believe I started this at Airbnb about a year and a half ago. Originally, the premise was, we wanted to use this database called Druid. So you can check it out.
So Druid is this column-oriented, distributed, real-time database. It's a really cool database, and we had tons of use cases at Airbnb for Druid. The problem with Druid at the time was there was no way to really consume the data or visualize the data easily, as none of the tools that existed on the market had Druid connectors. Druid used a REST API to query. So you would have to write a JSON blob to query it, and then get a JSON blob back, and somehow write a custom application to, say, visualize your Druid datasets.
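For flavor, a native Druid query really is just a JSON blob POSTed to the broker's REST endpoint. A sketch, assuming a hypothetical `bookings` datasource (the datasource, dimension, and metric names are made up; the `topN` structure and `/druid/v2/` endpoint follow Druid's native query API):

```python
# Sketch of a native Druid query: build a JSON blob and POST it to the
# broker's REST endpoint. Datasource/dimension/metric names are made up.
import json
import urllib.request

def build_topn_query(datasource, dimension, metric, interval):
    """Assemble the JSON blob that Druid's native query API expects."""
    return {
        "queryType": "topN",
        "dataSource": datasource,
        "dimension": dimension,
        "metric": metric,
        "threshold": 10,
        "granularity": "all",
        "intervals": [interval],
        "aggregations": [{"type": "longSum", "name": metric, "fieldName": metric}],
    }

def run_query(broker_url, query):
    """POST the query and parse the JSON response (needs a live broker)."""
    req = urllib.request.Request(
        broker_url + "/druid/v2/",
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

query = build_topn_query("bookings", "country", "booking_count",
                         "2017-01-01/2017-02-01")
```

Hand-rolling blobs like this for every question is exactly the friction a point-and-click front end removes.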
So coming out of Facebook,
I really wanted to recreate something similar
to Scuba internally,
which is also a non-open source project
that exists at Facebook.
And Scuba is just this really fast database backend. It's mostly in-memory columnar data, where you can query gigabytes, terabytes of data in under a second. And Scuba
at Facebook has this really nice front end that allows you to query the Scuba backend and get
answers very, very quickly. So it's very high velocity. You point to a data
set, you say, I want to see these metrics grouped by that, give me this visualization, all in a,
you know, click interface that is very high velocity. So you can ask hundreds of questions in minutes, just because the database is so fast and the UI is very high velocity.
So looking at Druid, you know, at the time,
Druid had a lot of the properties
that the Scuba backend offered,
and there was just no front end for it.
So I was like, what about I start writing a front end
for Druid as a hackathon project?
And then this went pretty well, and we ended up selecting Druid, deciding to use it, as we were doing a proof of concept with it, along with the little UI I was writing at the time. It worked pretty well, and it seemed like it had a lot of potential. And quickly after that, the scope grew around Superset, which was called Panoramix at the time. We changed the name multiple times on the
project.
The use case grew over time to become
pretty much this open source, enterprise-ready business intelligence web application.
Really, at this point in time, Superset has become the main means by which people query and consume data at Airbnb. Superset is essentially a set of tools that allows you to point to a table, explore your data, visualize it, assemble dashboards. And since then, we also built a SQL IDE on top of it. So very much like a classic SQL IDE, you can write SQL, you can navigate your database to get your different table definitions and metadata, write SQL, see your results, run a CREATE TABLE AS statement, then visualize this in Superset.
So Superset is this full-on, you know, business intelligence web application
that is completely enterprise-ready.
So that means it can be seen as a competitor to, say, Looker, Periscope, Mode Analytics, and eventually Tableau. Internally, we also use Tableau, and we like Tableau, but more and more people choose Superset just because it's higher velocity, and it makes it easy for people to assemble a dashboard very, very quickly,
perhaps still a bit more scrappy. But in light of the lifecycle of a dashboard
being shorter and shorter over time,
how much time do you want to spend crafting a dashboard
that will be somewhat obsolete a few weeks from now
when the business is shifting and thinking about new questions
and new problems to solve?
So that's an overview of Superset.
The project is going to Apache.
So as of last week, we started incubating with the Apache Software Foundation.
So that's my second Apache project.
And we really believe, you know, at Airbnb and personally,
I really believe in the Apache Software Foundation way of doing things,
which is, you know, it's a meritocracy.
And, you know, there's all sorts of nice processes around how to organize your project,
how to collaborate with other companies.
How do we define the release process
for this piece of software?
So it's been super exciting to work on this.
And it's been my main focus over the past year and a half or so. The Airflow community is super solid and strong now, and I feel like it's autonomous in a lot of ways, so it doesn't necessarily need me as much as a benevolent dictator. So things have been going really well there, and now I'm focusing on Superset more
and more.
So you mentioned Looker there as a BI tool in the same sort of space. And one thing that I didn't see in Superset that is in a tool like Looker is this concept of a semantic model, or a kind of business metadata layer. Is that something that you see as having value in this kind of space, and something that will be there in Superset at some point? Or do you think it's maybe superfluous in this kind of environment? What are your thoughts on that?
So we do have a semantic layer. And we can talk, as BI guys from the previous generation, we can talk about this semantic layer a lot. I'm really interested to talk about this. So Superset has a very simple semantic layer.
Superset will not do joins on your behalf. So that means the semantic layer is focused on a single table or view, and this is where you would define what the labels are for your different columns and metrics, and for your calculated metrics or calculated columns or dimensions, what their expressions are and how they should be exposed in the UI.
Now, for people coming from that previous era
of business intelligence tools,
there was this, say if you take Business Objects or MicroStrategy, these things would have a very heavy, complicated semantic layer that would hold a lot of business logic. So that business logic was, in part, in the data pipelines and data structures, and then you had this map on top of that. For Business Objects, it was called the universe designer, and then the project in MicroStrategy, where you would bring in your physical tables and explain to the tool, or give the metadata to the tool, to say how it can join these tables to basically not produce bad results. So which table can be joined to what table, and basically how this tool can generate queries on your behalf. So in Superset, we decided that this layer of complexity, of how
data should come together in the tabular format was not going to be part of Superset,
and it would be upstream.
So either you provide a table
that has all the summary information,
the denormalized information
that you need to answer your questions,
or you can provide a view as well.
So in a view, you can write your own joins,
and you can write your own metric definitions in a view too. So we're just shoveling that problem upstream, and deciding that the tool should not take care of that.
My opinion, too, is that if you look at, you mentioned Looker and LookML, which is the Looker modeling language. So that's where you also define that semantic layer.
And in the case of semantic layers in general, there's so much information that exists on that layer.
And that layer is usually not accessible to many, right?
And it also forces a really strong consensus on how the data is modeled and organized. And it requires a whole set of specific tools, right? So if you expect every
single analyst or data scientist or, you know, data engineer that plays with a little bit of
data to go and create that layer on top of the data they produced, that can be pretty prohibitive, right, to learn, say, something like LookML, or even to get access to it. You might just be like, okay, I created my set of three tables, I'd like to query them now and make my dashboard and move on with my week. In the case of Looker, it'd be like, oh, now you have to learn about LookML, and we need to grant you access to that layer, or you need someone to do that on your behalf, right?
And that person might be like, you know, you created a set of tables here that are very similar to these other tables.
Why don't you use these?
And let's together have a consensus on how your data should fit in the warehouse. And then the person that's just trying to get something done is brought into this extra layer of complexity and consensus. The tools that I've seen working really well in other environments are these high-velocity tools where you can just move forward and do your own thing.
Okay, interesting.
Yeah. So just to round things off, at the end of that blog post we talked about, The Rise of the Data Engineer, you talked about organization within the department, and you talked about roles and responsibilities and so on within there. So maybe just outline what the key roles are within an organization like yours that has data engineers and does work on this kind of scale. And again, what's different about it, and why have you done it differently to more traditional roles?
One thing I didn't talk about early on in this conversation, we spoke about data scientists and software engineers, but I did not talk about data infrastructure engineers. A lot of these positions, as the company grows, you need clearer role definitions, and maybe it makes more sense at that point in time to start making distinctions between the different roles and teams. Certainly at Airbnb, we're at a certain scale where certain roles become really clear-cut, where maybe originally you could hire a few data engineers and data scientists, and the data engineers are going to be in charge of the infrastructure to a certain point, and might be building data products. In smaller
organizations, people do more things, roles are not as clear-cut. In larger organizations, though,
I like to make a distinction, or typically we'll see a distinction between people who do data
infrastructure and people who do data engineering. And that specialization would be in the direction
of: a data infrastructure engineer would be in charge of basically installing, maintaining, keeping up, and doing some DevOps-type workload around data platforms. So that means people that will be in charge of Hadoop and Hive and Druid,
and making sure these clusters are scaling with the need.
They'll do capacity planning.
They'll do all sorts of work to get alerted as they need to grow the clusters
in different ways.
And often these data infrastructure engineers will also, like, they're engineers, they'll build solutions. So data infrastructure engineers at LinkedIn built something like Kafka to solve a use case that they had. Or at Airbnb, our data infrastructure engineers will build frameworks and solutions to, say, load data into Druid, to glue these systems together, and to do automation around the work that they would do manually, things like moving data from one Hadoop cluster to another, or things like a retention management solution so that data at Airbnb can get anonymized and summarized, or not summarized, but put into longer-term storage, get archived in some ways.
So that would be trying to describe what a data infrastructure engineer would be.
And then the data engineer is more specialized in data modeling, data pipelines, so building the data structures, the data pipelines that the company requires. And also, since we're talking about engineering and software, there's always this component of trying to automate your work over time. So data engineers will build more abstract solutions to try to automate their own work too.
Fantastic. So just to round things off, as you said earlier on, you were hoping in a way to have a chance to define this term and where we're going with this, and it's a bit of a manifesto, really, you're doing here. Where are you taking this? Is it something that, now you've got people's attention, you want to develop out further? What's the end game with this, or where do you want to get to with this initiative?
Well, at first it was kind of a shot in the dark, just doing that and seeing what would happen, and whether it would
stimulate some conversations and define the role. And it's really interesting, this exercise of writing a manifesto, and I think it just turned out that it was really needed at that point in time, that a lot of people were waiting for something like that. And I'm inspired to go and write more blog posts just because of the success of this blog post. I started writing one around kind of timeless best practices in data engineering and data modeling, so things that in a lot of cases used to be good practice in the past and are still today, and some new concepts, too, that are slightly more modern.
So some ideas around using concepts from functional programming and applying them to ETL. So immutability and idempotency, and this idea of pure functions, which in modern data engineering would be pure tasks that apply these concepts.
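A toy sketch of those concepts: an ETL task as a pure function of an immutable input partition, writing by overwriting the target partition so that re-runs are idempotent (the table and column names are made up for illustration):

```python
# Sketch: an ETL task as a pure function of an immutable input partition.
# Re-running it overwrites the target partition with identical output,
# so backfills and retries are safe. Table/column names are made up.

warehouse = {}  # stand-in for partitioned storage: (table, ds) -> rows

def summarize_bookings(rows):
    """Pure task: output depends only on the input partition."""
    total = sum(r["amount"] for r in rows)
    return [{"metric": "booking_amount", "value": total}]

def run_task(ds, source_rows):
    """Idempotent: overwrite the target partition, never append to it."""
    warehouse[("agg_bookings", ds)] = summarize_bookings(source_rows)

raw = [{"amount": 120}, {"amount": 80}]
run_task("2017-05-01", raw)
run_task("2017-05-01", raw)  # re-run: same state, not doubled
print(warehouse[("agg_bookings", "2017-05-01")])
# [{'metric': 'booking_amount', 'value': 200}]
```

The overwrite-a-partition pattern is what makes change management less scary: redefining a metric just means recomputing the affected partitions.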
So I have a whole blog post that is probably half written on that subject.
I believe I had a few other ideas to follow up on this one.
And sometimes I try to get people at work too,
so my colleagues to go and write some of these posts too.
So I've been talking with people that are writing posts
that are somewhat related or complementary to this one.
Though it's hard to justify.
So I've got all this software to write, and I've got these very thriving communities, you know, for Superset and Airflow. And sometimes it's so hard to just kind of hit the pause button on the universe and write a blog post. But, you know, it's really rewarding. So I'm looking forward to doing some more of that.
My goal is to write one blog post a month.
And I think it's been at least like two or three months since I wrote that one.
Fantastic.
And what's Free Code Camp?
I mean, that's obviously the host that you ran the Medium post on.
Is that something you're involved in as well?
No. So what happened is, I wrote the blog post, and these guys saw it was taking off, and they offered to put it in their organization. And what they would provide in return is more readership, and to do kind of a review and correct some of the structure. So someone there did a very thorough pass on the article and changed the structure a little bit, which helps, right?
I'm not a professional writer.
Oh, yeah.
That's very good.
Very good, yeah.
And they were like,
oh, you'll get access to our tens of thousands of readers.
So I was like, okay, why not?
I believe I kind of did a disservice to my organization, so it should have been under the Airbnb Medium organization. But I just didn't know how popular it was going to get. So I was like, I'm just trying this, you know, if I can get more readers, why not?
Just to round things off, where would people find out more information about Airflow and Superset?
So I believe now we're moving the documentation into some of the repositories, but GitHub is
really definitely the place to find the root of all the information for these two projects.
So one is at github.com/airbnb/superset, and Airflow is under, I believe, apache/incubator-airflow. But these things are, well, search engine optimization is kicking in, so it's pretty easy to find these things. There's tons of documentation now for Airflow, not only Airflow's own documentation, but people's blog posts and best-practice guides. So there are tons and tons of resources at this point for Airflow, and a growing amount of resources for Superset too. So tons there.
And yeah, you were talking about my accomplishments in this project, but this blog post is so little when you think about all the work that goes into creating and starting an open source project. Versus a blog post, the blog post is so much easier. It's just a one-time thing. But it's great, it brings a different kind of feeling, and it's been awesome. I definitely want to do it again.
Definitely. I always find that the most impactful and simple blog posts are the ones that have had the most time, where you've thought about it in the background, really. So what appears to be a very cogent and concise and very well-put-together blog post actually has a huge amount of work in there. So, yeah, well done for doing that. And I just want to say thanks very much for coming on the
show. It's been fantastic speaking to you.
And good luck with everything going forward.
And I look forward to reading the rest of your blog posts in the future on this topic.
Perfect.
Thank you so much for having me.
Okay.
Cheers.
Thanks. Thank you.