Drill to Detail - Drill to Detail Ep.26 'Airflow, Superset & The Rise of the Data Engineer' with Special Guest Maxime Beauchemin

Episode Date: May 15, 2017

Mark Rittman is joined by Maxime Beauchemin to talk about analytics and data integration at Airbnb, the Apache Airflow and Airbnb Superset open-source projects, and his recent Medium article on "The Rise of the Data Engineer"

Transcript
Starting point is 00:00:00 So my guest on this week's episode is Maxime Beauchemin, who works at Airbnb as well as being a main committer on the Apache Airflow and Airbnb Superset projects. So Maxime also wrote a very good and influential blog post recently entitled The Rise of the Data Engineer. And I've invited him onto the show to talk about that post, his work as a data engineer at Airbnb, and how he got to that point having worked in a more traditional BI developer role many years ago. And also his work on Airflow and Superset,
Starting point is 00:00:40 which I know many of you have been kind of listening to and hearing about and so on. It's quite good to get the person behind it to talk about it as well. So Maxime, first of all, thank you for coming on the show. Welcome. And just introduce yourself properly and what you do at Airbnb at the moment. Perfect. Thank you for having me on the show. So it's an honor to be on the show. So I'm going to talk a little bit about how I got to Airbnb and what I do at Airbnb now. So I come from Facebook. So I used to work at Yahoo and Facebook and now at Airbnb. And what brought me to Airbnb was, well, first, it's important to believe in the mission and believe in the company. And, you know, I really,
Starting point is 00:01:32 the mission of like belong anywhere at Airbnb really resonated with me. So, you know, this idea that home is not necessarily a, you know, constant, or maybe home can be something that changes over time as you change lifestyle through your life. And, you know, I like some of these ideas. And then, you know, I spoke with people at Airbnb a few times casually about potentially working there. And it was just really apparent to me that I could, you know, have so much impact there coming from Facebook, where arguably they are a few years ahead, at least in terms of data and tooling and all that stuff. I spoke to people at Airbnb and I was like, it felt like I had seen the future.
Starting point is 00:02:17 I could bring that to Airbnb and help them jump and maybe skip forward and have a lot of impact there. And it was also clear that they needed something like Airflow at the time. So Airflow, for context, is a batch process orchestrator. And as I was speaking with the people there, they're like, if you join Airbnb, you can start working on this project and make it open source. And it was really important to me, this idea of I want to manage a, I want to start a big open source project. And this might be just the opportunity. So that's what brought me there in the first place.
Starting point is 00:03:01 Okay. Okay. So your background, what's interesting as well is your background, and your route into this development is from quite a traditional kind of BI development role. And I think, when you worked at Facebook at the start, you were classed as a kind of BI developer. What's your kind of history in that sort of area? Right, so I started my career very early on, around 2000. I did a little bit of web development. But soon after, I started getting involved in the data projects at Ubisoft at the time.
Starting point is 00:03:32 And they were starting to talk about building a data warehouse, which, I guess some of the theory, some of the books about data warehousing, had been written in the 90s. And bigger companies were building data warehouses, and, you know, Ubisoft was starting to get serious around that, around building a warehouse. And they bought this package called Hyperion Essbase, and they were looking for a techie to manage it and, you know, help make that project successful internally. So then I started working on all of these things. So building the warehouse, we had the Microsoft SQL Server suite at the time. I believe it was the '97 one, or I think it was called SQL Server 7.
Starting point is 00:04:17 So very early on in the projects we had, I believe a little bit later, we got Business Objects, but kind of this traditional stack. And we started basically reading the books and building the warehouse and working with people at Hyperion to build our financial solution around their tools. So that's where I'm coming from. So I've got seven years or so at Ubisoft where I was just focused on traditional data warehousing, business intelligence, ETL, stored procedures, and all that stuff. So that's my foundation. And when I left to go to Yahoo, that was a big shift because Yahoo was a lot more,
Starting point is 00:05:03 you know, somewhat closer to what we think of as a data engineer nowadays. So more programming and scripting, perhaps a little bit less tooling, more thinking in parallel, big data type stuff. And it was also the rise of Hadoop at Yahoo at the time. Okay, okay. So that leads quite nicely into the reason that I wanted to speak to you, really. So you wrote a blog post recently on Medium called The Rise of the Data Engineer. And I think, you know,
Starting point is 00:05:32 I don't know what the numbers are like on it, but certainly it looks like a very kind of, it looks like an article that resonated with a lot of people. And I think it summarized some of the changes that are happening within our industry and how the role of BI and data warehouse and developers changed over time and how it's different now within organizations like Airbnb, for example.
Starting point is 00:05:52 So maybe just tell us a bit about, to summarize what the article was and what was the background to it? What kind of motivated you to write the article, really? Right, so I had this thought for, I think it had been at least a year or two that I'd been thinking about writing something like the rise of the data engineer. And but I'm sure if you Google the rise of the data scientists, you'll find an equivalent or similar post in a lot of ways that, you know, at a point in time, someone decided to kind of ground this idea of like, what is a data scientist? What do they do? Why do organizations need them? And I'd been thinking, you know, now that the
Starting point is 00:06:42 word or the title data engineer was getting thrown around, and it was becoming quite a popular thing, there was nothing that had really defined what is a data engineer. How does it relate to existing positions like, you know, business intelligence engineer or data warehouse architect? Or how do, you know, data engineers and data scientists collaborate together? So I thought there's a great opportunity here for me, from my specific perspective, to explore what is data engineering and to kind of define it myself, since no one had done it before. I was thinking maybe I have the opportunity here to define it for others, so that my vision becomes the actual vision for this industry. And you mentioned the numbers a little bit. So this post, I was surprised to find that
Starting point is 00:07:33 it got extremely popular on Medium. And so I have about 65,000 views and believe it or not, 20,000 people read through the entire article. So that means it's something important. A little bit, something I wanted to mention too is this other post that came, I think, soon after. What is it called? I'll try to dig it out. But it was around data engineering as well and stating the fact that at this time, I believe, or at the time they wrote it,
Starting point is 00:08:07 there were 6,500 people on LinkedIn calling their title, or saying, I'm a data engineer, while there were just about the same number of job reqs open trying to hire 6,500 data engineers in San Francisco alone. So there's definitely something big happening in this space. Okay, so again, I think what resonated was, certainly with me,
Starting point is 00:08:34 was that the world you described and the way that kind of BI development and ETL development and so on is done within startups and within kind of companies working with large amounts of data and so on is kind of different. It's a different kind of role really to BI development and so on there. So why don't you just outline in a way what is it that you do day to day at Airbnb in terms of development and the development process? How does that differ, do you think, from the things that people are more used to with kind of formal ETL and formal BI development and so on?
Starting point is 00:09:04 What's different? What warrants it being a different kind of role, in your mind, to BI development? Right. So the first thing is, you know, business intelligence engineering, and the tool set and those processes from before, they still exist in a lot of organizations. And, you know, some organizations are taking a different approach to data and analytics and ETL. But I want to say that the old approach still exists and is still valid. And the tools from the past still work well for a lot of organizations. What is different, though? So one major factor, I believe, is the rise of, I guess, I hate to say that word, but like big data and the big data tooling, and the Hadoop ecosystem is very different from traditional databases; computation and storage have changed quite a bit. So the tool set in that environment and the scale have grown quite a bit. And I think a lot of the tooling and processes from the past
Starting point is 00:10:14 don't work anymore, which warrants a new set of tools and a new approach. I believe also the information work is getting more technical in general. So that means traditional analysts might be able to write SQL nowadays, but everyone is climbing that ladder of complexity and becoming more technical. And for data engineers, that means, in a lot of cases, writing more code, where in the past maybe ETL tools were more drag and drop. At this point in time, we're expecting modern data engineers to write high quality code, because the problems we're solving are complex and require, you know, potentially more abstract tooling, and being able to
Starting point is 00:11:06 write code that touches some of the elements of the answer; there's so much more to it, but yes. Okay, okay. So, again, I suppose there's a deliberate distinction there made between data engineer and data scientist. So how does it differ? How would you say a data engineer differs from a data scientist? What point are you trying to make there, really? Right. So first, to try to ground, you know, what a data scientist is. So to me, the term, you know, there's kind of a real definition, and it's been overloaded quite a bit. But to me, data science has something to do with,
Starting point is 00:11:47 well, first, it's an analyst that can write code or someone with strong analytical skills who is able to code. There's also an element of publishing, perhaps, right? So science is academic, and there's an element where potentially you could say like a data scientist, if they're really doing science, they should publish articles, do peer review, and follow this scientific process, I would say. Now, where I see the term as being very overloaded is, you know, I think analysts that work in San Francisco or just, you know, data analysts that live in San Francisco are called data scientists because they want to be called that because it's a sexy name. And that's a modern appellation.
Starting point is 00:12:35 It's something that people aspire to, this title. So it's been overloaded quite a bit. And now, in relation to data engineering: so to me, the core of data engineering, the core role, is someone who would build data structures and data pipelines for an organization. And that's, you know, essentially what we used to call ETL. But ETL has changed quite a bit in the face of a new set of tools. Also, some of the processes and some of the new tools have redefined some of the foundational concepts of ETL. For instance, data modeling: I think data modeling hasn't necessarily changed that much. If you look at concepts like, you know, star schemas and dimensional modeling, I would say some of this still applies, but it has changed enough in the light of new tools and databases that don't necessarily have the same constraints as they used to have. So where is the line between data science and data engineering? There's probably a fair amount of overlap too,
Starting point is 00:13:53 and we want people to overlap. We don't necessarily want to put a wall there, but I would say data engineers care most about building data structures and data pipelines as longer-term solutions. While data scientists might be focused on something this week and something else next week, the engineer would be building a longer-term solution. Also, on the side of data science, there's this idea of using machine learning quite a bit. And that is also true on the data engineering side, but maybe with a slightly longer-term vision.
Starting point is 00:14:32 Okay. Okay. So I think certainly, I think the first mention I heard of the term data engineer was, that was obviously your post there. I think Curt Monash posted something a while ago, again making this distinction that not everything you do within big data and so on is data science, you know; there's people that specialize more in the infrastructure and the architecture and the
Starting point is 00:14:53 pipelines, as you said, and that is a distinct kind of role in itself, really. I think you hit on it there with the ETL part. And I think, having come into that world myself from a more traditional world of ETL tools and Informatica and so on, ETL, like you say, ETL is changing. And I think there's a question as to whether what we do now with scripting, and everything being at this, I wouldn't say immature, but certainly a more basic level, whether that is a function of how new this is, or whether the way we do ETL has changed completely. And I think I'd be interested to talk to you later about Airflow and so on. Did you see that blog
Starting point is 00:15:35 post? There was a blog post by somebody else as well, which was Engineers Shouldn't Write ETL, and it was by Jeff Magnusson, and it was out recently as well. And it was a similar kind of topic, but it was talking about how, because ETL has changed, different people should be doing it, and doing it in a different way. I mean, fundamentally, do you think ETL is different now? How would you approach it differently, really, and what's different about doing it in this environment? Yes, so I'm not familiar with the article you're mentioning, but I'll definitely look it up, and I'm curious. It sounds controversial, so now you've got my interest. Yeah, there's two points to it, really.
Starting point is 00:16:09 One is that ETL has changed, like you said, but then there's a point of saying that actually, if it is an ETL task, then it should still be done the old way. But if it's different, if it is data engineering, I don't know. The point of it is saying that in a way you shouldn't make data scientists be ETL developers just because it's different data. It's an interesting kind of area, really. Yeah, that's one thing, you know: data engineers are kind of here to save data scientists from doing ETL in a very poor manner. So I believe that was the end goal. When I got to Airbnb, there were already, you know, dozens of data scientists that clearly did not know much about data structures and data pipelining and were doing a horrible job. While they were really good
Starting point is 00:16:50 at what they do, they were not good at data engineering. As I came in, there was a small team of data engineers or ETL people really at Airbnb that were building data structures and pipelines that data scientists
Starting point is 00:17:06 could use, so that their analysis would be built on the foundation of strong pipelines. So instead of going back to the raw tables, the raw ingredients, and building their dirty derivatives, they would start from where data has been cleansed and organized, and where there's been consensus on defining metrics and dimensions. And it then becomes a lot easier for them to get the right metrics and to produce analyses that are in line with each other. Now, talking about ETL and how that's changed, I don't know if you want to take the tangent. No, no, please.
Starting point is 00:17:51 Okay, so on ETL, so how has it changed? So in the 2000s, I would say there's been this rise of a lot of ETL tools by vendors, business intelligence vendors, that were selling things like Informatica, IBM DataStage, SQL Server, I believe it was called, Integration Services, and Ab Initio.
Starting point is 00:18:26 So a whole set of tools that were all drag-and-drop tools. So the idea was you have this software package, you connect to your data sources, you drag and drop your table, you drag and drop transformers, and you build a small graph of data objects and transformations. And so really often they would have these data flows and workflows, and you'd build those by drag-and-dropping. And that's all fine and dandy. The premise was that people working with data perhaps did not know or didn't want to write code,
Starting point is 00:18:51 you know, so that they would do drag and drop. In theory, it would make that easier for them. But then, you know, in the post, in The Rise of the Data Engineer, I argued that the problems we're solving now, and perhaps that we were solving at the time, are too complex to be done with drag-and-drop tools. With drag and drop, while it might seem easy at first, you lose the whole software engineering side, everything that you get in software engineering because you're writing code: things like source control and being able to diff different branches,
Starting point is 00:19:29 being able to create abstractions, being able to create blocks of code that you're going to reuse through looping, inheritance, composition. You'd have some version of that in the drag and drop tools. But it still made it hard to do things that are easy to do when you're writing code. And I think maybe I argue in the post, and I'd like to maybe write a post that would be more specific
Starting point is 00:20:03 or why it was almost a mistake to have had a decade of drag-and-drop tools in the ETL space. Code is simply the best way to express logic. And there's a reason why software engineers are not drag-and-dropping for loops on the screen, and that they write in an actual programming language. And I believe a lot of those reasons why software engineers write code and don't do drag and drop in some sort of development environment is because it's a superior abstraction
Starting point is 00:20:41 and it's something solid that is timeless. And that applies to data engineering as well as it does to software engineering. But is that not something that is true but therefore limits the people who can do this to a very small set of people? I mean, I guess the point of the drag-and-drop, point-and-click kind of ETL tools was to make it possible for people other than software engineers to do this work. So do you not think that, within the industry we're in now, this is, at some point, how they're going to scale up to handle this, really?
Starting point is 00:21:14 I mean, do you not think at some point drag and drop will come to this or is it just fundamentally flawed, do you think? Well, so drag and drop might be okay for a certain level of abstraction. So if you're doing something simple, I'm trying to equate maybe there's like these Lego tools for kids that want to learn how to code and they might go into some toy environment where they can specify a series of actions as visual blocks. And it might be a good mental model for some or for people ramping up. But if you're writing software at scale, you need things like source control and you need to be able to diff your code and you need to be able to create a class,
Starting point is 00:21:59 create a function, create these reusable blocks. And I believe that in drag and drop, people have created that, right? So you can have a for loop as a block, right, where you would drag and drop a for loop. And maybe the abstraction is more visual, but it is the same or a similar level of abstraction; it's just the means, the process to do it, that is different. But if you can understand a tool like Informatica in all of its glory and complexity, I believe you can probably understand
Starting point is 00:22:32 the same level of abstraction written as code, right? Like, I don't think people would be, oh, I'm able to drag and drop a source table, but I'm not able to instantiate a source object. I believe it's the same abstraction. And if people are capable of these abstractions in a drag-and-drop environment, they would be able to do that in code. It's definitely interesting.
Starting point is 00:23:00 I mean, my experience has been, within this kind of industry, there is no equivalent of something like Informatica. You know, a lot of things that we had from the BI world have now resurfaced in the kind of big data world as such. You know, we've now got platforms like BigQuery and Athena and so on that give us a kind of more tabular interface over the data we've got, and tools like Looker, for example, and Superset, that do a more user-friendly BI and analytics sort of platform on top of this. But there is no equivalent of Informatica, and there is no
Starting point is 00:23:35 kind of graphical point-and-click tool for big data. But what there is is things like Airflow, that you're working on. So tell us about Airflow: what it is, and what problem it solves, first of all. Right, so Airflow is, I would call it a workflow orchestrator for modern enterprises, or organizations that are working with data. And I guess, you know, fundamentally,
Starting point is 00:24:03 you know, Airflow is just a way to schedule and run a set of jobs and tasks with complex dependencies. And, you know, in modern organizations, you have, you know, perhaps a few people, or dozens of people, or hundreds of people, working with data every day. These people will write jobs that need to be on a schedule, and that typically depend on each other. So, say, ETL is a very classic example of that: hey, I want to load my fact table, but first I need to make sure that the source data for the day has landed. Once the data lands, I'm going to populate my dimensions in a certain order, based on whether things have landed, whether all the dependencies are met. So there are these sets of processes that need to run on a schedule with really complex dependencies, and Airflow is a tool that helps people orchestrate all of that. And to give you an idea of the complexity of these workflows in modern organizations: at Facebook, I believe, at the time I left, so about three years ago, we were running hundreds of thousands of tasks every day. And at Airbnb now, I believe, using Airflow, we run around 60,000 tasks every day. And these tasks need to run in a very specific order. Each one of these tasks depends on a complex network of other tasks. And these tasks can go from, you know,
Starting point is 00:25:46 populating, you know, data in a table or in partitions to, you know, data that can help different parts of the business. So you can picture there's whole workflows of tasks for areas like, you know, payments and fraud detection and search ranking. So each team has their own sets of complex data pipelines or workflows that need to be orchestrated in a very specific way
Starting point is 00:26:16 and run every day on a schedule. Airflow also makes it easy to not only author these jobs, but to monitor them and track them, and to stay sane while trying to understand: why did the data not land today? Or why has it not landed yet? And where is the error report? And can I get some retries when there are transient errors? Can I get some tasks to retry within the parameters that I set? Can I get alerted? Can I get easy access to my logs? Can I get alerted when things have not landed in time? So Airflow is a whole set of tools around authoring, monitoring, and troubleshooting these complex workflows of jobs.
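(As an aside for readers: here's a minimal sketch of what such a DAG definition can look like, written in the Airflow 1.x style of the era. The task names, shell scripts, and email address are hypothetical placeholders, not Airbnb's actual pipeline code; in practice a sensor operator would usually do the waiting.)

```python
# A hypothetical daily warehouse load: wait for source data to land,
# load a dimension, then load the fact table, with retries and alerts.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'data-eng',
    'retries': 2,                        # retry on transient errors
    'retry_delay': timedelta(minutes=5),
    'email': ['data-eng@example.com'],   # hypothetical alert address
    'email_on_failure': True,            # get alerted when a task fails
}

dag = DAG(
    dag_id='load_warehouse',
    default_args=default_args,
    start_date=datetime(2017, 1, 1),
    schedule_interval='@daily',          # one run per day of data
)

wait_for_source = BashOperator(
    task_id='wait_for_source',
    bash_command='check_source_landed.sh {{ ds }}',  # placeholder script
    dag=dag,
)

load_dim_users = BashOperator(
    task_id='load_dim_users',
    bash_command='load_dim_users.sh {{ ds }}',
    dag=dag,
)

load_fact_bookings = BashOperator(
    task_id='load_fact_bookings',
    bash_command='load_fact_bookings.sh {{ ds }}',
    dag=dag,
)

# The dependency graph: the fact load only runs once the source data
# has landed and the dimension is populated.
wait_for_source >> load_dim_users >> load_fact_bookings
```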
Starting point is 00:26:53 Okay, and this was developed at Airbnb and it's open source, is that correct? I mean, I guess this is something that you felt was a key thing you needed to have to do what you're doing now as a data engineer. Right, so as I left Facebook. So Facebook has a set of similar tools, one called Dataswarm and something called Databee. And those were internal tools that were not open source at the time, but were also similar in a lot of ways. So one thing is, they were written in Python, they worked at scale, they allow people
Starting point is 00:27:38 to author their workflows and troubleshoot them. And one of the core ideas was being able to dynamically author workflows, and maybe I'll get into that in a little bit: the idea of being able to not only write a static workflow, but to write a program that will define a workflow dynamically. And as I left Facebook, I thought, it's going to be really hard for me to operate at the same level that I was operating at Facebook with these tools, without these tools. So, first thing, I'm going to build the tools that I need, and then I'm going to be able to solve the problems that I've been solving, with the right set of tools. And I believe the people at Airbnb at the time were looking at some of the open source solutions that existed. So there's something called Oozie, and Azkaban, and Luigi.
Starting point is 00:28:38 And we looked at all these tools, and we decided that we wanted to build something new, in the light of people coming from the places where these tools had been written and saying, you shouldn't use them. Like, someone from Yahoo said, please do whatever, but don't use Oozie. And someone from LinkedIn was like, do whatever you want to do, but make sure to not use Azkaban.
Starting point is 00:29:01 And I came from Facebook, and I was like, I wish I had the tools from Facebook. And so together we decided: can we build something similar, perhaps better than these tools,
Starting point is 00:29:16 in the process, open source it and give it out to the community. Excellent, excellent. So, I mean, in terms of your involvement with this, I mean, obviously you're heavily involved there. You're a committer.
Starting point is 00:29:28 You know, what's the, how much time do you spend on this and how big an involvement have you got with this? Right, so Airflow specifically was really my baby. So I started the project. I wrote the first line of code. I was probably the lone committer on the project for the first, let's say, six months to a year, before the project started getting any attention from
Starting point is 00:29:51 externally, or before we even announced that we'd open sourced it. So that was a piece of software that I wrote from scratch, you know, and that I pushed forward: I wrote the code, the documentation, the unit tests, and eventually onboarded all sorts of people onto the project. And, I would say, my first year and a half at Airbnb, so it's been two and a half years now, I was mostly focused on Airflow and solving internal problems at Airbnb using Airflow. Things like rewriting our experimentation, or A/B testing, framework,
Starting point is 00:30:31 and then collaborating with teams and making sure they were able to build what they needed to build using Airflow. Okay, so you mentioned dynamic generation there. I suppose in a way, you know, you've solved some things with Airflow, and you mentioned dynamic generation and so on. What are the ways in which you're taking this forward that are kind of non-obvious to people from more traditional backgrounds? Because it sounds really interesting, what you're trying to do there. Tell us about that, and where you see this going, really. Right. So dynamic workflow generation. So if you think of concepts like, I would say,
Starting point is 00:31:15 analytics as a service, say, or analysis automation, or the whole idea that potentially, instead of having a data engineer writing individual workflows that are static, a data engineer could build something that can be used to generate workflows. So it's a level of abstraction over what a data engineer would normally do. So let's say you need a specific kind of ETL for an experiment that you want to run, and you want to run an A/B test on your data platform. And perhaps, you know, there was a time, you know,
Starting point is 00:31:55 early on at Airbnb, where maybe you would write a small pipeline just for that specific use case. And the day after, you want to run another experiment, but it's slightly different, and this time you're going to have to write a different pipeline for that new experiment. Now, Airflow allows us to write a piece of code that perhaps can read a config file, or some configuration in a database, and based on that create complex workflows for each experiment, with a set of parameters.
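(A quick editorial sketch of that pattern: a single Python file reads a config, here just a made-up in-line dict, and emits one DAG per experiment. Airflow picks up any DAG object exposed at the module level, so looping over the config yields a workflow per entry.)

```python
# Hypothetical dynamic DAG generation: one program, many workflows.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# In practice this could be read from a config file or a database table.
EXPERIMENTS = {
    'exp_homepage_cta': {'metric': 'bookings'},
    'exp_search_ranking': {'metric': 'clicks'},
}

for exp_id, conf in EXPERIMENTS.items():
    dag = DAG(
        dag_id='ab_test_%s' % exp_id,
        start_date=datetime(2017, 1, 1),
        schedule_interval='@daily',
    )
    BashOperator(
        task_id='compute_%s' % conf['metric'],
        bash_command='compute_stats.sh %s {{ ds }}' % exp_id,  # placeholder
        dag=dag,
    )
    # Expose each generated DAG so the scheduler can discover it.
    globals()[dag.dag_id] = dag
```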
Starting point is 00:32:30 Other examples of that could be things like, so experimentation is a good use case for it. I have a talk called Advanced Data Engineering Patterns with Apache Airflow that tries to describe a bunch of use cases for doing this sort of stuff. An example of that for us is we have this tool called Autodag, where an analyst or a data scientist that wants to run a certain query every day can put in a config file with some configuration elements, and easily get to a point where this is going to be scheduled and run on their behalf, and there's going to be some automation there. You know, another example of this would be, so say an organization like
Starting point is 00:33:29 Facebook or, you know, Airbnb, we want to compute the same set of metrics for different areas of the business over and over. So things around, say, engagement and growth accounting. So understanding how many people are using a certain feature on the website, and how many people are new, churned, resurrected, stale, or active. So you can picture that we would allow people to fill in a form, a simple form, saying, hey, I would like to compute this for my area of the business and configure it in a very specific way, perhaps saying I'm interested in specific timeframes, dimensions, demographics.
Starting point is 00:34:13 And they would, by filling in this form or this configuration, dynamically build a complex workflow on their behalf. So that becomes kind of the work of a data engineer as a service, somehow. I can get into more complex use cases, I'm not sure if you want to go there. Well, that's interesting, and that actually leads on. I want to get onto the data modeling bit you talked about as well in a second. But going back to that post, which you haven't read, so it's not kind of fair to go into much detail, but the thrust of the other blog post that I mentioned, the one about should engineers be writing ETL code, I think is interesting given what you just said. Because the
Starting point is 00:34:51 thrust of it was that engineers, data scientists, and data engineers are always looking for new and novel ways to solve things like ETL, whereas actually, in fact, by doing that, you know, we end up building systems that are not as stable, and a lot of this work is more doing than thinking. You know, do you think that's the case? Or do you think, in the world that you operate in, that I operate in now, that you can't have it that way, that you've got to be a bit more agile, a bit more forward thinking, a bit more dynamic in how you do ETL? I mean, what do you think on that? Well, ETL, it's true that in some ways it's mind-numbing, but I would say the easy component of ETL,
Starting point is 00:35:30 or the mind-numbing component of ETL, can, with the right set of tools, be abstracted out and be solved very quickly. Now, there are things like consensus-seeking: say, how should we define metrics and dimensions? And how should we structure our tables and our workflows? And how should we write optimized, performant ETL at scale? That's more challenging.
Starting point is 00:35:56 Change management is horrible in ETL, right? It's so hard: if you want to change the definition of a metric slightly, then, you know, there's all these derivative tables that you need to reprocess, and Airflow certainly helps with these problems. But, you know, ETL is necessary, right? Whether it should be in batch, or whether data pipelines should be in batch or in streaming fashion, I think is less important. But how the data in your organization should flow and get organized is a really important and core problem for modern companies. And there's no way to get around it, I would say. Yeah. You mentioned also data modeling is changing. So not only did you talk about ETL changing, but data modeling as well, and I guess that's a big part of it as well, really.
Starting point is 00:36:56 Right, a few more words around the idea of ETL and why it's necessary. It's a little bit like, if you think of the data engineers as the librarians of data, right? The equivalent in a library would be the people who organize all the books, put them on the shelves in the right place, file them, and are in charge of managing the metadata, the little cards by which you would search and find books.
Starting point is 00:37:19 So it's really important to take all of this data that you get, data that's dirty and complicated and comes from different sources and doesn't line up in a lot of ways, and to line it up and organize it and store it for the future, for the well-being of analytics at your company. So you can actually ask questions, get answers, and be somewhat structured in the way you do this. Now, data modeling is changing. So I would say a lot of the books I would still recommend people to read, you know, the Kimball books. I believe star schemas and
Starting point is 00:37:58 dimensional modeling are still true in a lot of ways, but there are things that are somewhat less relevant. One thing is the way that we store data now, with columnar databases or columnar file formats like Parquet and ORC. Things like creating surrogate keys, and now I'm getting a little bit technical, so I'm not sure what percentage of the audience will relate to, you know, what is a surrogate key? But now that we have dictionary encoding, and we have file formats that are potentially columnar, do we
Starting point is 00:38:37 need surrogate keys anymore for, I mean, there are other reasons why we may need surrogate keys, but maintaining surrogate keys in traditional data warehousing was fairly complicated and heavy. And so you'd have all these problems around late-arriving facts and preloading dimension members and this whole idea of, you know, there's entire chapters in these books written around slowly changing dimensions, which you're probably getting like bad flashbacks thinking about these slowly changing dimension ideas. But I would definitely argue that slowly changing dimensions, we have kind of shortcuts,
Starting point is 00:39:29 But I would definitely argue that for slowly changing dimensions, we have kind of shortcuts, that we have new solutions for these problems that are simpler. And perhaps in some cases it's due to the fact that storage and compute are cheaper than they used to be in relation to engineering time. That's one reason. And then some of the new serialization formats or database engines mean some of the optimization, some of the performance gain, we would get from, say, managing surrogate keys is not as significant anymore. From a perf standpoint, we don't necessarily need that, because the databases are able to kind of do that on our behalf, without thinking too much about it.
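(For readers who want that made concrete: one such shortcut, which cheap storage makes viable, is to snapshot an entire dimension into a date-stamped partition every day instead of maintaining surrogate keys and type-2 history. A sketch with hypothetical table names:)

```python
# Hypothetical "daily snapshot" alternative to slowly changing dimensions:
# overwrite a full copy of the dimension into one partition per day, so any
# day's version of every attribute can be read back by filtering on ds.
def dim_users_snapshot_sql(ds):
    """Build the Hive statement for one day's snapshot of a users dimension."""
    return """
        INSERT OVERWRITE TABLE dim_users PARTITION (ds='{ds}')
        SELECT id, email, country, is_active
        FROM raw.users
    """.format(ds=ds)

# "What country was user 123 in on 2017-03-01?" then becomes:
#   SELECT country FROM dim_users WHERE ds = '2017-03-01' AND id = 123
```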
Starting point is 00:40:12 Yeah, definitely, definitely. So I think you're also involved in the Superset project as well. Is it something you can tell us about, what it is and, I suppose, what it's trying to achieve as well? Right, so it's my second big project, and I started this at Airbnb about a year and a half ago. Originally the premise was, you know, we wanted to use this database called Druid. So you can check it out. Druid is this column-oriented, distributed, real-time database. That's a really cool database, and we had tons of use cases at Airbnb for Druid. The problem with Druid at the time is there was no way to really consume the data or visualize the data
Starting point is 00:41:05 easily, as none of the tools that existed on the market had Druid connectors. Druid used a REST API to query, so you would have to write a JSON blob to query it, then get a JSON blob back, and somehow write a custom application to, say, visualize your Druid datasets. So coming out of Facebook, I really wanted to recreate something similar to Scuba internally, which is also a non-open-source project that exists at Facebook.
Starting point is 00:41:41 And Scuba is just this really fast database backend. It's mostly in-memory, columnar data, where you can query gigabytes, terabytes of data in under a second. And Scuba at Facebook has this really nice front end that allows you to query the Scuba backend and get answers very, very quickly. So it's very high velocity. You point to a data set, you say, I want to see these metrics grouped by that, give me this visualization, all in a, you know, click interface that is very high velocity. So you can really ask hundreds of questions in minutes, just because the database is so fast and the UI is very high velocity. So looking at Druid, you know, at the time,
Starting point is 00:42:29 Druid had a lot of the properties that the Scuba backend offered, and there was just no front end for it. So I was like, what about I start writing a front end for Druid as a hackathon project? And then, you know, this went pretty well, and we ended up selecting Druid, deciding to use it, as we were doing a proof of concept with it,
Starting point is 00:42:52 along with the little UI I was writing at the time. It worked pretty well, and it seemed like it had a lot of potential. And quickly after that, the scope grew around Superset, which was called Panoramix at the time. We changed the name multiple times on the project. The use case grew over time to become pretty much this open source, enterprise-ready business intelligence web application.
Starting point is 00:43:24 Really, at this point in time, you know, Superset has become the main means by which people query and consume data at Airbnb. And, you know, Superset is essentially a set of tools that allows you to point to a table and explore your data, visualize it, assemble dashboards. And since then, we also built a SQL IDE on top of it. So very much like a classic SQL IDE: you can write SQL, you can navigate your database to get your different table definitions and metadata, write SQL, see your results, you know, run a CREATE TABLE AS statement, then visualize this in Superset. So Superset is this full-on, you know, business intelligence web application that is completely enterprise-ready.
Starting point is 00:44:19 So that means it positions as a competitor to, say, Looker, Periscope, Mode Analytics, you know, and eventually Tableau. Internally, we also use Tableau, and we like Tableau, but more and more people choose Superset just because it's higher velocity and it makes it easy for people to assemble a dashboard very, very quickly. Perhaps it's still a bit more scrappy, but, you know, in the light of the lifecycle of a dashboard being shorter and shorter over time, how much time do you want to spend crafting a dashboard that will be somewhat obsolete a few weeks from now,
Starting point is 00:44:56 when the business is shifting and thinking about new questions and new problems to solve? So that's an overview of Superset. The project is going to Apache. So as of last week, we started incubating with the Apache Software Foundation. So that's my second Apache project. And we really believe, you know, at Airbnb and personally, I really believe in the Apache Software Foundation way of doing things,
Starting point is 00:45:23 which is, you know, it's a meritocracy. And, you know, there's all sorts of nice processes around how to organize your project, how to collaborate with other companies, how do we define the release process for this piece of software. So it's been super exciting to work on this. And it's been my main focus over the past year and a
Starting point is 00:45:45 half or so, where, you know, the Airflow community is super solid and strong now, and I feel like it's autonomous in a lot of ways, so it doesn't necessarily need me as much as a benevolent dictator. So things have been going really well there, and now I'm focusing on Superset more and more. So you mentioned Looker there as a BI tool in the same sort of space. One thing that I didn't see in Superset, that is in a tool like Looker, is this concept of a semantic model, or a kind of business metadata layer. Is that something that you see as having value in this kind of space, something that will be there in Superset at some point? Or do you think it's maybe superfluous in this kind of environment? What's
Starting point is 00:46:28 your thoughts on that? So we do have a semantic layer. And, you know, we can talk, as BI guys from the previous generation, we can talk about this semantic layer a lot; I'm really interested to talk about this. So Superset has a very simple semantic layer. Superset will not do joins on your behalf, so that means the semantic layer is focused on a single table or view, and this is where you would define, you know, what are the labels for your different columns and metrics, and how your calculated metrics or calculated columns or dimensions are defined,
Starting point is 00:47:09 what are their expressions and how should they be exposed in the UI. Now, for people coming from that previous era of business intelligence tools, so there was this, say if you take Business Objects or MicroStrategy, these things would have a very heavy, complicated semantic layer that would hold a lot of business logic. So that business logic was, in part, you know, in the data pipelines and data structures,
Starting point is 00:47:37 then you had this map on top of that, for Business Objects it was called the Universe Designer, and then the project in MicroStrategy, where you would bring in your physical tables and explain to the tool, give the metadata to the tool, to say: how can you join these tables so as to basically not produce bad results? So which table can be joined to what table, and basically how this tool can generate queries on your behalf.
Starting point is 00:48:27 So in Superset, we decided that this layer of complexity, of how data should come together in the tabular format, was not going to be part of Superset, and it would be upstream. So either you provide a table that has all the summary information, the denormalized information that you need to answer your questions, or you can provide a view as well. So in a view, you can write your own joins, and you can write your own metric definitions in a view too. So we're just shoveling that problem upstream, and deciding that the tool should not take care of that.
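(An editorial illustration of pushing the problem upstream, with made-up table and column names: the joins and metric expressions live in a database view, written once in SQL, and Superset's single-table semantic layer then points at that view. SQLAlchemy 1.x-style execution is assumed here.)

```python
# A hypothetical denormalized view created upstream of Superset: Superset
# then just sees one "table" with the metrics already defined.
from sqlalchemy import create_engine, text

engine = create_engine('postgresql://localhost/analytics')  # assumed connection

engine.execute(text("""
    CREATE OR REPLACE VIEW bookings_summary AS
    SELECT b.ds,
           u.country,
           COUNT(*)      AS bookings,      -- metric defined once, here
           SUM(b.amount) AS gross_revenue  -- ditto
    FROM bookings b
    JOIN users u ON u.id = b.user_id
    GROUP BY b.ds, u.country
"""))
```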
Starting point is 00:48:54 My opinion, too, is that, you mentioned Looker, and LookML, which is their modeling language, so that's where you also define that semantic layer. And in the case of semantic layers in general, there's so much information that exists in that layer, and that layer is usually not accessible to many, right? And it also forces a really strong consensus on how the data is modeled and organized. And it requires a whole set of specific tools, right? So if you expect every single analyst or data scientist or, you know, data engineer that plays with a little bit of data to go and create that layer on top of the data they produced, that can be pretty
Starting point is 00:49:47 prohibitive, right, to learn, say, something like LookML, or even to get access to it. You might just be like, okay, I created my set of three tables, I'd like to query them now and make my dashboard and move on with my week. In the case of Looker, it'd be like, oh, now you have to learn about LookML, and we need to grant you access to that layer, or you need someone to do that on your behalf, right? And that person might be like, you know, you created a set of tables here that are very similar to these other tables, why don't you use these? And let's together reach a consensus on how your data should fit in the warehouse. And then the person that's just trying to get something done, you know, is brought into this extra layer of complexity and consensus. And, you know, the tools that I've seen working really well in other environments are
Starting point is 00:50:37 these high velocity tools where you can just move forward and do your own thing. Okay, interesting, interesting. So just to round things off: at the end of that blog post, The Rise of the Data Engineer, you talked about organization within the department, and you talked about roles and responsibilities and so on within there. So maybe just outline what the key roles are within an organization like yours, that has data engineers and does work on this kind of scale. And again, what's different about it, and why have you done it differently to more traditional kind of roles? One thing I didn't talk about early on in this conversation, we spoke about data scientists and software engineers, but I did not talk about data infrastructure engineers. So I guess, yeah, you know, a lot of these positions, as the company grows,
Starting point is 00:51:33 you need clearer role definitions, and, you know, maybe it makes more sense at that point in time to start drawing distinctions between the different roles and teams. But certainly at Airbnb, we're at a certain scale where certain roles become really clear cut. Maybe originally you could hire a few data engineers and data scientists, and the data engineers are going to be in charge of the infrastructure to a certain point, might be building data products. In smaller organizations, people do more things; roles are not as clear-cut. In larger organizations, though, I like to make a distinction, or typically we'll see a distinction, between people who do data infrastructure and people who do data engineering. And that specialization would be in the direction
Starting point is 00:52:27 of a data infrastructure engineer being in charge of basically installing, maintaining, keeping up, and doing the devops-type workload around data platforms. So that means people that will be in charge of Hadoop and Hive and Druid, and making sure these clusters are scaling with the need. They'll do capacity planning. They'll do all sorts of work to get alerted as they need to grow the clusters in different ways. And often these data infrastructure engineers will also, they're engineers, they'll build solutions.
Starting point is 00:53:11 So data infrastructure engineers at LinkedIn built something like Kafka to solve a use case that they had. Or at Airbnb, our data infrastructure engineers will build frameworks and solutions to, say, load data into Druid, to glue these systems together, and to do automation around the work that they would do manually. Things like moving data from one Hadoop cluster to another, or things like a retention management solution, so that data at Airbnb can get anonymized and summarized, or not summarized, but put into longer-term storage, get archived in some ways. So that would be trying to describe what a data infrastructure engineer would be. And then the data engineer is more specialized in data modeling, data pipelines,
Starting point is 00:54:11 so building the data structures, the data pipelines, that the company requires. And also, since we're talking about engineering and software, there's always this component of trying to automate your work over time, so data engineers will build more abstract solutions to try to automate their own work too. Fantastic. So just to kind of round things off: as you said earlier on, you were kind of hoping to have a chance to define this term and where we're going with this, and it's a bit of a manifesto, really, that you're doing here. I mean, where are you taking this? Is it something that, now you've got people's attention, you want to develop out further? Or
Starting point is 00:54:51 what's the kind of end game with this, or where do you want to get to, really, with this initiative? Well, so at first it was kind of a shot in the dark, just doing that and seeing what would happen, and whether it would stimulate some conversations and define the role. And it's really interesting, this exercise of writing a manifesto. And I think it just turned out that it was really needed at that point in time, that, you know, a lot of people were waiting for something like that. And, you know, I'm inspired to go and write more blog posts just because of the success of this blog post. I started writing one around kind of timeless best practices in data engineering
Starting point is 00:55:35 and data modeling. So things that in a lot of cases used to be good practice in the past and still are today, and some new concepts, too, that are slightly more modern. So some ideas around taking concepts from functional programming and applying them to ETL: immutability and idempotency, and, you know, this idea of pure functions, whose equivalent in modern data engineering would be pure tasks that apply these concepts. So I have a whole blog post that is probably half written on that subject.
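(A hedged sketch of what that half-written idea can look like in practice: a "pure" task reads only the input partition for its execution date and overwrites exactly one output partition, so a rerun or a month-long backfill is safe and reproducible. Table names are made up, and Airflow 1.x-style operators are assumed.)

```python
# An idempotent, "pure" task in the functional-ETL sense: the output is one
# date-stamped partition, fully determined by the same date's inputs.
from datetime import datetime

from airflow import DAG
from airflow.operators.hive_operator import HiveOperator

AGG_EVENTS_HQL = """
INSERT OVERWRITE TABLE agg_daily_events PARTITION (ds='{{ ds }}')
SELECT event_type,
       COUNT(*) AS n_events
FROM raw_events
WHERE ds = '{{ ds }}'   -- read only this run's input partition
GROUP BY event_type
"""

dag = DAG(
    dag_id='functional_etl_example',
    start_date=datetime(2017, 1, 1),
    schedule_interval='@daily',
)

# Airflow renders {{ ds }} to the run's execution date, so rerunning any
# given date overwrites its own partition and touches nothing else.
agg_daily_events = HiveOperator(
    task_id='agg_daily_events',
    hql=AGG_EVENTS_HQL,
    dag=dag,
)
```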
Starting point is 00:56:18 I believe I had a few other ideas to follow up on this one. And sometimes I try to get people at work, my colleagues, to go and write some of these posts too. So I've been talking with people that are writing posts that are somewhat related or complementary to this one. Though it's hard to justify: I've got all this software to write, and I've got these very thriving communities,
Starting point is 00:56:49 you know, for Superset and Airflow. And sometimes it's so hard to just kind of hit the pause button on the universe and write a blog post. But, you know, it's really rewarding. So I'm looking forward to doing some more of that. My goal is to write one blog post a month. And I think it's been at least two or three months since I wrote that one. Fantastic.
Starting point is 00:57:16 And what's Free Code Camp? I mean, that's obviously the host that you ran the Medium post on. Is that something you're involved in as well? No. So what happened is I wrote the blog post, and these guys saw it was taking off, and they offered to put it in their organization. And what they would provide in return is more readership, and a kind of review to correct some of the structure. So someone there did a very thorough pass on the article and changed the structure a little bit,
Starting point is 00:57:52 which helps, right? I'm not a professional writer. Oh, yeah. That's very good. Very good, yeah. And they were like, oh, you'll get access to our tens of thousands of readers. So I was like, okay, why not?
Starting point is 00:58:03 I believe I kind of did a disservice to my organization, so it should have been under the Airbnb Medium organization, yeah. But I just didn't know how popular it was going to get, so it was like, I'm just trying this, you know, if I can get more readers, why not? Just to kind of round things off, where would people find out more information about Airflow and Superset? So I believe now we're moving the documentation into some of the repositories, but GitHub is really definitely the place to find the root of all the information for these two projects. So one is at github.com slash Airbnb slash Superset, and Airflow is under, I believe, Apache slash incubator dash Airflow. But these things are, well, you know, search engine optimization is kicking in.
Starting point is 00:58:53 It's pretty easy to find these things. There's tons of documentation now for Airflow, not only Airflow's documentation, but people's blog posts and best practice guides. So there are tons and tons of resources at this point for Airflow, and a growing amount of resources for Superset too. So tons there. And yeah, you were talking about my accomplishments in these projects, but when you think about all the work that goes into creating and starting an open source project versus a blog post, the blog post is so much
Starting point is 00:59:32 easier. Yeah, it's just a one-time thing. But it's great, it brings a different kind of feeling, and it's been awesome; I definitely want to do it again. Definitely. I always find that the most impactful and simple blog posts are the ones that have had the most time, where you've thought about it in the background, really. So what appears to be a very cogent and concise and very well-put-together kind of blog post actually has a huge amount of work in there. So, yeah, well done for doing that. And I just want to say thanks very much for coming on the show, it's been fantastic speaking to you. And good luck with everything going forward.
Starting point is 01:00:08 And I look forward to reading the rest of your blog posts in the future on this topic. Perfect. Thank you so much for having me. Okay. Cheers. Thanks. Thank you.
