The Data Stack Show - 64: Data Stack Composability and Commoditization with Michel Tricot of Airbyte

Episode Date: December 1, 2021

Highlights from this week's conversation include:

- Announcement: Data Stack Live! (1:00)
- Michel's career background (4:13)
- Solving the technical and process challenges of moving data (7:04)
- Lessons learned from managing data at LiveRamp (9:35)
- How to build a modern data stack (16:19)
- Triggers to signal when more data infrastructure is needed (23:19)
- Why Airbyte is an open-source product (30:23)
- Airbyte's role in providing support to open-source problems (38:15)
- How important dbt is for the Airbyte protocol and platform (41:03)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by Rudderstack, the CDP for developers. You can learn more at rudderstack.com. We have a really exciting episode coming up. And what's most exciting is we're going to live stream it.
Starting point is 00:00:31 The topic is the modern data stack. And we're going to talk about what that means. It's December 15th and you'll want to register for the live stream. Now, Costas, it's really exciting because we have some amazing leaders from some amazing companies. So tell us who's going to be there. Yeah, amazing leaders and also an amazing topic. I think we have mentioned the modern data stack so many times on this show. I think it's time to get all the different vendors who have contributed to creating this new category of products, and have them define the modern data stack and discuss
Starting point is 00:01:01 about what makes it so special. So we are going to have people from Databricks, dbt, and Fivetran, companies that are implementing state-of-the-art technologies around their data stack, like Hinge. And we are also going to have VCs and see what their opinion is about the modern data stack. So in a sense, VCs are also going to be there. And yeah, it's going to be super exciting and super interesting. So we invite everyone to our first live stream. Yeah, we're super excited. The date is December 15th.
Starting point is 00:01:35 It's going to be at 4 p.m. Eastern time, and you can register at rudderstack.com slash live. So that's just rudderstack.com slash live, and we'll send you a link to watch the live stream. We can't wait to see you there. Welcome to the Data Stack Show. Today, we're talking with Michel Tricot, and he is one of the founders of Airbyte. And Airbyte moves data for companies. Really interesting company. They've grown a ton. I think they've been around for a year, but he has a pretty long history in data. And this isn't going to surprise
Starting point is 00:02:11 you, Costas. He worked at a company called LiveRamp, which anyone who knows marketing knows LiveRamp. They do have a ton of marketing and audience data. And so, of course, I have to ask him about that experience. He was there pretty early, I believe. And so I want to hear what it's like to talk to a data sort of engineer, data integrations leader at a marketing data company like LiveRamp. So that is, I'm going to try and sneak that in if I can. Well, first of all, I love your French accent. We have to get more French people on this show.
Starting point is 00:02:46 It really is. Yeah. I remember our first French guest, Alex, and I just loved hearing him talk about data. It was great. Yeah, it's always very interesting to hear like Americans speak French. Anyway, so what I'm going to ask him, I mean, for me me like it's a very special episode because we are talking about the person who's building like a data pipelines company right so there are like many different things that i'd love like to ask him but i think the most important and the most
Starting point is 00:03:14 interesting part is the open source dimension of her bike and how building a community is part of the product and how this can actually become some kind of, let's say, mode also like for the company. And it's very interesting in this case, because you have to understand that like Airbyte came in a time where, let's say, the market of data pipelining was supposed to be, it was done, right? We had 5Tran, 5Tran1, like it was like, it's probably the best, the biggest vendor right now.
Starting point is 00:03:45 Suddenly you have like a bite coming, doing something like different. And this has impact. So I think it's going to be very interesting to talk with him, both like from a technical perspective, but also like from a business perspective. I agree. All right, well, let's jump in and talk with Michel. Let's do it. Michel, welcome to the Data Sack Show. We're really excited to talk with Michel. Let's do it. Michel, welcome to the Data Stack Show.
Starting point is 00:04:06 We're really excited to talk with you today. Hey, Eric. Thank you so much for having me. All right. Well, you've been working in data for a really long time. Can you just give us an overview of kind of where you started, what you've done, and then what you're working on today at Airbyte? Yeah, sure.
Starting point is 00:04:23 So I've been in the data space for the past 15 years. I started my career in financial data, so I would say medium volumes of data, a few hundred gigabytes. And then in 2011, I moved to the US and started at this, back in the day, small company called LiveRamp, and was able to experience the hyper-growth from finding product-market fit, to getting to an IPO, to getting through an acquisition. I was head of integration and director of engineering over there. I was leading a team of, yeah, 30 people, and we built thousands of different data integrations. And data integration is basically how you take data from one place, and it could also be about how you get data into another place.
Starting point is 00:05:07 And we're moving- A thousand integrations back then is huge. It is. It is. I think we got burned quite a few times. It's a very hard problem. But in the end, what you need to do is really thinking about how you build, how you maintain, and how you scale.
Starting point is 00:05:22 And it's just that these pipes, you keep having having more of them and they keep becoming bigger and bigger i think when when i left in 2017 we were moving hundreds and hundreds of terabytes of data every single day so i had to learn the how to learn how to build this system the the hallway wow yeah and, and after that, after LiveRamp, I joined another startup, started to do the same thing, which is how do you get data from point A to point B? And I was, okay, what if I go for the crazy idea of solving it for more than just one company at a time? And that's how John and I started Airbyte.
Starting point is 00:06:02 So helping people move data from point A to point B without having to spin up massive data teams to do it. Yeah. So I'd love to hear just a little bit about, well, I was watching a movie with my son the other night. And it's an older movie from the 90s, I think. And they're in this lab and a piece of equipment breaks. And they said, we lost all 30 gigabytes of data from this experiment. That's just so funny. Cause you're like, man, that was like catastrophic back then. But so you went from a couple hundred gigabytes to hundreds of terabytes a day. Can you just explain, I mean, I'm sure some of our listeners have gone
Starting point is 00:06:47 through that, but probably a lot of them haven't. Like, what are the key things that you sort of took away from that experience of sort of this exponential growth in magnitude of just trying to move data? Yeah. The key thing about moving data is you need to think about it as a, almost as a factory, which is, it is not just a technical challenge. It is a process challenge. And it is not something that you only solve with, that you only solve with code or you only solve with some software. It's also something that you need to solve with people. Because when you think about the amount of places where you have data, it's impossible to write software that can get it everywhere. I mean, you're going to spend years and years and years doing it. So for a single company, it's very a single company it's very hard so it's about like how do you set up the
Starting point is 00:07:45 right process so that you enable people to actually be able to pull data from there and you start dispatching the responsibility to more and more people and you can actually almost like crowdsource the maintenance crowdsource the building and crowdsource the the scaling of this of these connectors and once you start also the other thing is you need to think about it in a sense that you're not building a system that works 100% of the time. You're building a system that has to be resilient. That's the thing is data connectors, data integration, it always breaks one day or another and you need to build your system with this in mind. Because, I mean, in the end, you depend on a ton of external places
Starting point is 00:08:28 that you have no control on. I mean, I don't know, like tomorrow, Facebook can decide to change how the IPA behave or it can strike with the same. And you don't have control on their product's decision. And you need to make sure that the system you build is resilient to it. And you have the process in place to solve.
Starting point is 00:08:46 Yeah, interesting. I remember there were a bunch of companies that were sort of built around an API that Facebook had made available that sort of made it really easy to gather large amounts of information on sort of individuals. And they changed it overnight and like 20 startups just evaporated.
Starting point is 00:09:08 I mean, that's kind of an extreme example, but even minor changes, especially if you think about enterprise scale, can make a really big difference on a daily basis. An e-commerce company that relies on data to sort of understand conversion or sort of send repeat purchase emails or other things like that. I mean, if something breaks, it literally costs huge amounts of revenue. One thing we talked about as we were prepping for the show, which I would love for you to speak to. So you were at LiveRamp a while ago and you were dealing with data at a really large scale. And I think in many ways, like a scale that a lot of sort of your
Starting point is 00:09:45 average like data engineer, data analyst doesn't get to experience just because that was such a huge amount of data. But that was still a while ago. And so have you seen sort of the lessons that you learned there translate? I mean, the technology stack is very different today than it was back then in some ways. But there's sort of a trickle down effect from companies that you were sending audience data, hundreds of terabytes a day. What's the trickle down effect and how long does it actually take for sort of the problems that you solve to hit the average sort of data engineer or show up as tooling for the average data engineer? Yeah, that's a very good question. One thing that I love looking at right now is when I look at the data landscape
Starting point is 00:10:33 and how things are moving and the new type of product is you look at who are building these tools, who are building these products, and you realize that all of them, like most of them, have had this problem way before and it's just that with scale you encounter new challenges that you have to solve the hard way because there is no solution that exists on the market and because it's data it
Starting point is 00:11:00 just grows exponentially so everything that we've learned 10 years ago was something that was specific to LiveRamp, or it can be specific to Google or specific to Facebook, specific to Netflix, or any of these big companies that were built and that really became massive at that time. And engineers there had to learn what kind of technical asset, what kind of technical skills they had to build. And once they what kind of technical asset, what kind of technical skills they had to build. And once they are out of these companies, they realize that, hey, but data is actually scaling exponentially. So all these other companies that don't have the same volume of data are actually about
Starting point is 00:11:35 to face the same type of problems. And with this technical knowledge of how do you actually solve this problem. Now you get this new generation of products that allow the more common consumer, the more mainstream consumer to actually be able to be very, very good with data. So it was more like we were the early adopter and now we're in the land of the, of the mainstream. So, and I'd love for you to talk about sort of, okay, so five years ago,
Starting point is 00:12:07 you're solving huge power, you know, however long ago at LiveRamp or an engineer solving this at Facebook or Netflix. And so they sort of learn the fundamental components of the problem from an engineering standpoint. And then at the same time, technology is advancing. Right. And so then when they leave that company and sort of encounter the new technology that's there, is that sort of the point where they say, okay, I can now build something that solves this in a way that sort of meets needs of the mass market in a way that wasn't possible before? Yeah, that's correct. I mean, just think about warehouses, for example. I mean, 10 years ago, you had some warehouses, but most of the analytics was done on Hadoop
Starting point is 00:12:52 at scale. And it's just that people using Hadoop started to realize, okay, that's not the best system. On top of that, you start putting hives, you start putting more and more layers. And one day you have BigQuery. The other day you have Snowflake, and people were taking the analytics to the next level. And now for all these engineers who've been working with this technology, they say, okay, I have this amazing processing engine. What was I doing with all this complex system that is now becoming much simpler, or that can enable more use cases by using a data warehouse. And it's just, yes, technology is just growing and it makes creating
Starting point is 00:13:32 this product more easier and more approachable for maybe less data heavy companies. And for example, we always talk about the modern data stack. You do extract, you do load, you have your warehouse, you have your analytics, you try to orchestrate all of that with airflow, with transformation, with dbt, etc. But if I look at it in 2014, 2013, with Redshift, we already had exactly the same system internally. And it's just this type of system becomes more mainstream and there is more tooling so that you don't need a huge team
Starting point is 00:14:14 to build all the tooling around it. I mean, at the time, I don't know if Airflow existed, but we had our own workflow manager. We had our own transformation manager. Just that people coming out of companies are actually building that so that it can be used by more and more people. Michel, you mentioned the...
Starting point is 00:14:32 Hi, by the way. I'm sorry, I'm a bit late. But maybe you can relate to that from your accent. But after two years, I'm still struggling with the difference between kilometers and miles. I'm still struggling with the difference between kilometers and miles. So I'm sorry. I'm very sorry for my delay.
Starting point is 00:14:53 But I miscalculated a few things. So you mentioned the term like the modern data stack. What's your definition of the modern data stack? Like, yeah, what are the, let's say, like the main components of it? And you mentioned also that like, it's not like we are doing something new that we didn't do in the past, right? But why it's modern? I think it's something that is enabled by technology, which is the composability of your system.
Starting point is 00:15:19 What we've seen back in the day with Hadoop, with Spark, is that you have a very monolithic way of working with data. And with more and more tools being added, I mean, if you look at all the Apache projects, basically most of them are about data. And it's just all these little tools that are coming on top of it. And the modern data stack is more about how you go from an end-to-end solution to something
Starting point is 00:15:47 where you use the best of breed for every single piece in your data value chain. Because data is so tied to your business that generally using an I do everything solution doesn't work. You get to 70% of what you need and then you go a little bit outside of how it was thought about, then you need to build your own parallel data system. And for me, the modern data stack is more about the composability of a system. And the fact that as your business changes, evolves, you can start adding more and more building blocks. And you have the choice between picking which vendor, which solution you want to use, and it's
Starting point is 00:16:30 a matter of SME. And that's why products like Airflow, Prefect, and others are becoming so powerful because they become also a bit of... That's where you encode the logic of how you glue all these different tools together. Yeah, that makes where you encode the logic of how you glue all these different
Starting point is 00:16:46 tools together. Yeah, that makes total sense. And what are the main components of the data stack? I mean, you mentioned Airflow, for example, which is the orchestration part and somehow like glues everything together. But what else is needed there to have, let's say, the minimum viable data stack. Yeah, I would say ingestion, processing, transformation, visualization. Okay, makes sense.
Starting point is 00:17:13 And a little bit now with, maybe a little bit on the reverse ETL side, which is how do you actually activate the data back and put it back into a place where it can be activated? Yeah, what about a mail? Where do you think that this fits? Or it's something that's like, okay, activate the data back and put it back into a place where it can be activated. Yeah. What about a mail? Where do you think that this fits? Or it's something that's like, okay, let's have the data stack first in place.
Starting point is 00:17:32 Let's solve the basics. And then we move and do like, say, the more advanced like ML Ops and like all these things that you see like coming up right now, like with products like Tecton with feature stores and all that stuff. Yeah. So ultimately, I put that in the activation part. So whether it's about making the data available elsewhere, whether it's about an operational use case.
Starting point is 00:17:59 I mean, yes, you have the operational use case. And I think also that's where it's not just about analytics. And that's where all the orchestrator are important because as you said, you have like ML, you have quality, you have a lot of things that you might want to do. It's just depending on your business, you might or might not need that particular function. And it's just about where does that fit in your pipeline? But yes, that's part of it.
Starting point is 00:18:24 It's just the composability of all your data value chain from beginning to the end product. And where does AirBike fit now, today? So today we fit on the ingestion and loading piece, which is just breaking down silos and making sure that you don't have to think about the physical and the technical complexity of pulling data from one place and feeding it into another place. And what is our bite going to be in the future? What's the reason?
Starting point is 00:18:59 So the first thing is the goal today is really about commoditizing data integration. But when you think about data integration, there is a purpose behind it, which is moving data around. You have data on point A, you want to get it to point B. And that is the vision behind Airbodies. How can we make sure that there are pipes that allow the data to flow and to get to the place where it's going to be the most valuable for the organization. And it's not about extracting insight. It's not about visualization.
Starting point is 00:19:33 It's not about transformation. It's just let's focus on having a perfect movement of data. And that could come also with adding quality on top of it, adding a lot of additional features to make sure that you don't just have pipes, but you have smart pipes. Okay. Yeah. That's very interesting. And do you think that, I mean, you mentioned like composability, right? So we have like all these different like parts of the data stack and like, we try to like to make them all to work together. And that's why data engineers have so much demand right now.
Starting point is 00:20:06 So you mentioned quality. Quality right now, we have all these different products out there, like Monte Carlo, Big Eye, all these new guys who are entering the market, that somehow in order to deliver value, they need to work very closely with another part of the data stack. Either this might be the data warehouse or it might be the data pipelines, right? Yeah. Do you see in the future quality being part of some more fundamental part
Starting point is 00:20:38 of the data stack or do you see a different category remaining there? What do you see happening there? I think it's a matter of who is using it and who gets value from it. I think quality can exist at multiple layers, which is you can have physical quality like is there a missing field or is is there like a lower volume of data and that could be done at the pipe level but then you might have business quality which is is the sum of my revenue less than i don't know 200 000 yeah and000. Yeah. And that's why the composability is so important. Companies will learn what is important to them.
Starting point is 00:21:32 And quality is just something that you will put at different places. I mean, it's the same thing when you're thinking about factories. You have quality checks in multiple places because that allows you also to know where you have a problem. So, yeah, quality is just omnipresent in the data stack. And it's just who gets value from it. So companies like Big Eyes or others, they need to be there because you have so many people that are interacting with the data warehouse
Starting point is 00:21:59 that know what is good data and what is bad data and that need to have the tool. I think maybe one thing we didn't talk about when we talk about the modern data stack is it's about making data a platform instead of something that is fully controlled by data engineers. And once you start exposing data as a platform to the rest of your organization, then you need to have more than one tool for doing quality at different step of the pipeline. Yeah, 100%.
Starting point is 00:22:30 I think that's very, very, I would say, obvious when you enter data quality in the ML space, where the tools that you need to use there to figure out if you have to retrain your model, for example, do stuff about the model itself and trying to figure out if you have to retrain your model for example like do like stuff about the model itself and like trying to figure out if something goes wrong there like it's a completely different kind of beast that you have like to uh work with so yeah i agree i think we're just like at the beginning of like figuring out quality to be honest like it's it's a huge issue and i'm very curious to see what else will come out there, like in the market. I have a question for each of you though, because, and I've actually, this is, Kostas might be tired of hearing this, but I think it's a really interesting question for our listeners because
Starting point is 00:23:14 they just stand a sort of enterprise to startup. But Michel, when do you talk about composability and we think about quality, size of company and sort of complexity as a proxy for size has a really significant influence on the pain you feel from sort of lack of data quality, right? So the example is when you're, I heard someone say like, okay, what is, what is your analytics when you're a two person startup in a garage? Like you just directly query your Postgres like app database, right? And you learn everything you want to know, right? But then when you're a thousand person company, that's a completely different game. And like you said, you sort of have, you need to pull data from many sources. You need to do transformations on it. There's a quality component, there's visualization, and then sort of the activation side of it. What are the triggers that you've seen that are sort of indicators that people need to address those issues? And I mean, I'll also caveat that by saying in an ideal world, I think smart companies try to solve these
Starting point is 00:24:23 problems sooner with good infrastructure, good orchestration, good data quality practices. But I think anyone who's been inside a company knows it's really hard to do that while you're growing a business. So how does size influence all of these factors that we're talking about? So first of all, it's just a matter of how much context and who interacts with your system. And that's why composability is so important. Before, it was an easy persona that was working with the data. As your organization grows, it's not just your data that grows, it's your team is growing. The people that are interested in data is growing.
Starting point is 00:25:04 You might have marketing that wants to know something about data. You might have sales. You might that are interested in data is growing. Like you might have marketing that wants to know something about data. You might have sales, you might have finance, you might have product and they all want something with data. And that generates complexity because they don't have the context about all the data that is flowing through these pipes.
Starting point is 00:25:21 And that's why when we think about the modern data stack as becoming a platform for other roles, that is the complexity that needs to be fixed. And that's why composability is so important because you don't know tomorrow who is going to need data to make your organization better and go faster. And so at that point, you want to make sure that you bring a system that is not just frozen in time, but it's one that can actually evolve with your company and with your teams. So, and of course that comes with complexity, but in general, complexity can be not addressed, but can be made simple or simpler with more composability and more choice. Yeah, that's fascinating. We had a guest recently who made an observation I've been thinking about a lot where they said the move to the cloud was supposed to simplify a lot of things and it did simplify
Starting point is 00:26:17 a lot of things, right? Deployment, sort of managing on-prem stuff, right? But he said it's made the tech stack way more complicated because everything's easier. And I think it's such a good observation that complexity is not driven primarily by technology or only by technology, but by demand for data inside the organization. And the lack of context is a huge challenge there. That's such an interesting observation. A hundred percent. I think, first of all, what you said, Eric,
Starting point is 00:26:49 about like what your analytics look like when you're like in the garage and you just have like a Postgres. I would argue that like, that's not real anymore. Like if you think about it and because of the cloud, right? Like even when you just start, you will have probably some data on Google Analytics. You will run some experiments with
Starting point is 00:27:13 some ads. You will probably have a basic CRM or at least some Google seats where you keep track of some things. So my feeling at least is that like more and more smaller companies will need something like Airbyte really soon, like to use it and get like the value out of the data that they have. I think why size is important and it matters is because of organizational complexity. Like that's where things like get really messy because suddenly, as Michelle mentioned, you don't know who else inside the company is going to need the data. But at the same time, it's much harder to communicate any issues with the data or fix or identify. When there are just
Starting point is 00:27:57 two founders and there is a problem in a spreadsheet, they just talk to each other and they fix it. Now think Now think about like a company that you have to collect the data that might be edited by salespeople on Salesforce. And then when you take this data, there are some analysts that they go clean it and create like some dashboards. And then the data scientists will take these dashboards and based on that, we'll create like a subset of the data to go and create a model and build a model that's going to magically i don't know come up with some numbers and then the data engineer will go take this data and push it back to the sales force for example so the sales people again they can do something
Starting point is 00:28:36 like just think about all the different departments we are talking already and how difficult it is like to communicate with all this even for like much simpler problems than the data that is has to be moved from one place to the other so i think this organizational complexity is like super important and i think that's one of the reasons that like you have like some influencers let's say like in this space like the guy from like local optimistic where Local Optimistic, where he posted the post where he said that the problem about data, it's an organizational problem, and it's not like a technical problem and all these things. You have also this model that
Starting point is 00:29:15 Michel is talking about, the different parts of, let's say, the supply chain of data where you need quality, for example, in different parts, and someone else cares for each one of these right anyway i think we're still at the beginning and we're scratching the surface of of the complexity of building like at the end a data-driven organization it's going to be very very different working with these systems compared to building mobile apps for example like the complexity is very very different. We're in a new era for data. It's just by opening and making it more accessible, you discover a new thing that you can do with it. And it's going to continue to grow and people are going to become greedier and greedier
Starting point is 00:29:56 for having more data and make better decisions. So intimately, people we've worked already with that much data have an edge on the type of product that can be built to enable this new generation of data consumers. Yeah. Michel, I want to go back to our bike a little bit because we can't be talking a lot about data in general. We can't have multiple episodes, the three of us, talking about that stuff for sure.
Starting point is 00:30:24 But I would like to share a little bit more about the product and the company with our audience. So Airbyte is an open source product. Why open source? Why it's important and how it has helped the company grow? Yeah. One thing that I was mentioning before was really that solving data movement and solving data integration is not just a code or a technological problem. It is a process and people problem. Because when you look at all existing solutions, they generally plateau at an amount of connector that they can support. And the reason is simple, it's very hard for one single entity to manage that many connectors, because that's the problem with data connectivity. That's why it's very hard to solve. You have so many places that you don't know what are all these use cases. And at that point, when we thought about open source, that was basically because of that.
Starting point is 00:31:26 It's something that needs to be built and that needs to be almost like crowdsourced. You want to make sure that you have more than one company that has the control on how you actually move data around and what kind of connector matters. Because building is's relatively easy but maintenance is where the cost is so at that point what you want is you want to give the power to people who are using the platform to actually solve the problems when it when it arrives because if they're using a closed source solution if they have a problem they will have to wait four weeks so in that case what they will do is they will start building it internally. But then when you build it internally, this becomes a gigantic monster
Starting point is 00:32:08 that grows and grows and grows until it's out of control. And here you have access to something that sometimes you need to fix, and the rest of the community has access to it, or someone else from the community fixes it, and you get access to the fix. And by creating this very virtuous cycle, then you get more connectors and you get access to the fix. And by creating this very virtual cycle, then you get more connectors and you get more people that contribute and that actually have a seamless experience with data integration.
Starting point is 00:32:33 So open source was really about solving the people aspect of it. Yes, open source is also technology, but it was really about let's build a community and let's make sure that we make data available across the community and users of Airbyte. Right. So, okay, that makes total sense. I mean, the part of the connectors themselves. So before Airbyte, there was Singer. I mean, it's still out there. And I know that Airbyte, like as a protocol, it's actually an extension
Starting point is 00:33:05 of Singer. What is Airbyte doing that is different in what Singer did and what stitch the company behind Singer, before they got acquired at least, did? Yeah, so interestingly when we started Airbyte, we started to build on top of Singer. So we discovered some of the flaws. The thing is, the team has a lot of experience in building data integration. So we saw flaws in how it was. And we have a compatibility with Singer to make sure that people who've invested time of their team into Singer can also leverage these connectors within Airbyte. But in the end, the protocol,
Starting point is 00:33:46 I won't call them out, like the guidelines are way too permissive. And that breaks the contract of solving data integration by having almost like pairwise compatibility. And the day you have this absence of rules and this absence of guidelines, then you're basically building one-to-one pipelines and that's all and you get to this n square problem instead of n times n n plus m uh
Starting point is 00:34:13 problem so that's what that was what we saw with singer also the the community of singer was very i mean after stitch got acquired by talent I think Talent dropped the ball on Seager. And a community like that needs to, you need to really invest in it. And that's something we've done very, very early on. Like one of the first hire we had at Airbyte was really someone who was here with the community and helping them be successful with open source.
Starting point is 00:34:41 And because we started a year ago, so obviously the first version were pretty unstable. So having someone to just help every single person in the community was very important. And we've continued to grow that function and make sure we have seamless experience
Starting point is 00:34:57 on open source. But it's just that if you don't support your community, you cannot build that network of people who just help each other and build and maintain connectors together. So you said that like Airbyte is actually fixing some of the issues that had, not states, sorry, Singer. So what are these new elements that the Airbyte, not exactly protocol, let's say guidelines or whatever we want to call it, brings that Singer didn't have. Yeah.
Starting point is 00:35:27 So we actually call airbyte a protocol for the reason that we have very strong... We've encoded a lot of behavior and logic, and there is almost like a specification on how you build it and what messages should look like. But there are a few things. First one is airbytes. You don't have a problem with environment. That was a big problem with Singer, which is you want to use a tap, you don't have the right pattern, you don't have the right C library, the right bindings. So 80% of the time, you need to do a lot more digging to get it to
Starting point is 00:36:03 work. So first thing. Second thing was, it has to be programmatically configurable. It means that a connector should expose what kind of input it requires, like what does the state look like, so that you can be smarter on the platform level, and you can start building on top of connectors instead of hard coding behavior with that that's something we've actually learned while using singer which was if you want to use a tap
Starting point is 00:36:30 from singer you have to read through the code you don't have a way to automatically know oh i need an api key and the start date i need something and that's we made it part of the interface of the protocol. Now, the other thing was about being language agnostic. And that was very important because if you look, for example, at data integration, not everything is an API. You have queues, you have databases, you have Kafka, you have a lot of things. Very often, they've been thought with the programming language just going to be consuming data or pushing data to that. And I would hate to have to push data at scale on Kafka with Python. If I want to do it, I want to do it in Java. And so having the
Starting point is 00:37:19 flexibility and being language agnostic was a very important requirement that we had. So these are like, I'm just summarizing, but that's what kind of the criteria that we had and how we thought about data connectors. And it's also like, if you want to grow your community, not everybody knows how to write Python. Sometimes they want to write it in C sharp if they want to. So they should be able to contribute with C sharp. Like one of the first contribution we had was in Elixir. Like I've never played with Elixir ever, but sure.
Starting point is 00:37:49 Wow. I mean, I know Elixir is growing in popularity, but that's kind of obscure. Yeah, but in the end, it worked. And it was really a proof of concept of how Airbyte can work with more than just one language and can be used by people that have the talent and that
Starting point is 00:38:07 are using the tools that are the best suited to solve that particular problem. Yeah, I have a question about that because I understand and I believe that there's always some kind of trade-off between flexibility and quality. If you, for example,
Starting point is 00:38:24 let's say, take Kafka Connect, which is, let's say take Kafka Connect, right? Which is, let's say, another framework that you can use to create connectors, of course, specifically for Kafka. But the whole idea of the community around and all these things, they are similar, right? But at some point, you as a buyer, you will have or you want to ensure the quality, right?
Starting point is 00:38:43 How you can do that when you have like so much freedom in terms of like how someone can code something or what framework they can use. Let's say someone comes with Elixir or someone comes with Haskell and writes like something on Haskell, right? What are you going to do as a byte with that? Yeah, that's a, that's a very, very good question. So one thing that we're working on right now is a contribution. We're basically creating a new contribution model and that is going to be powered by cloud.
Starting point is 00:39:16 So what we want is there will be a set of connectors that is fully maintained by Airbyte. And just that some of them are so core, like typically database connector, we need to make sure we have very, very strong quality and like, not quality, but very strong say on the roadmap. But for the other ones that are not part of this subset of certified Airbyte connectors,
Starting point is 00:39:44 what we're going to be doing is making them available on the Airbyte cloud and provide a ref share to the rest of our community, the people who are actually maintaining these connectors, whenever in exchange for an SLA. And then community members can be individual contributors, or they can be data agencies, or they can even be vendors. So if you're a vendor and you want to create a new revenue stream via Airbyte, that's something you can do. And today with Oktoberfest, we got massive, massive amount of connectors that have contributed to Airbyte. People are really seeing the value of having this connector to run on Airbyte. And so there is this will and desire to be part of the program to get rev share as the connector becomes more successful. And at that point,
Starting point is 00:40:32 you also have a nice balance, which is if someone stops maintaining it, or if the SCA is not there, either this connector gets transferred to someone else or someone is going to create a better one. So there is a bit of a race to some extent on making sure that the connector is high quality. Yeah, it's very interesting and I'd love to chat about that again in the future. One last question from me. How important is dbt for the airbyte protocol and for airbyte as a platform i'm separated too so yeah i'd like to hear about that not at all for the protocol we use it more as a post processing piece on warehouses but in the end what is just making the data a bit more consumable when it's been loaded but it's not required for the protocol. The protocol is just about configuration,
Starting point is 00:41:27 data exchange, and connection. That's all. And for the platform, it's more who you're talking to and who you're working with. I mean, most people we work with are data analysts, data scientists, data engineers, analytics engineers.
Starting point is 00:41:41 And if they don't have an airflow running running or some orchestrator on top of it, they want to have a very simple way to kick off like dbt jobs, whether it's by using open source or right now we're also working on how we can make it work with dbt cloud. But it's more a handoff to the rest of the data set. As I mentioned, we want to be the best at just extract load and data movement. That's all. We don't want to do transformation. What we want though is to have a way, a mechanism to hand off what happens to the downstream system. Okay. That's super interesting. I could keep talking about that like for a long time, but I think we are getting close to our
Starting point is 00:42:22 time here, right, Eric? We have a few more minutes. If you have another question, go for it. I'm good. I want you to ask. So, Michel, one question, and I'm interested in your perspective on this. So, of course, our listeners know I have a background in marketing. You were at LiveRamp. LiveRamp has been a major player in the marketing data space for a long time. And anyone who works in data inside of a company
Starting point is 00:42:45 knows that marketing tends to be the most hungry or one of the most hungry, sort of, or generates a lot of demand for data. They're very hungry. And you mentioned that, like the complexity around you give people data and then there's more demand for data because it creates more questions and more value. And marketing is a major consumer there. I'd love to know, when you think about marketing, a lot of times it's sort of audiences, advertising conversion data. A lot of it's happening on sort of the client side or sort of actual experience and then feeding experience and conversions back into the system to sort of
Starting point is 00:43:21 optimize basically advertising algorithms. I know it's more complex than that, but so that's a huge need in marketing. What are the major sort of use cases or the biggest areas of demand that you see for companies that are using Airbyte? Are there particular types of data? Does it fall sort of to one department? Who are the most greedy data consumers and even use cases around that when it comes to Airbyte? Yeah, so I would say, I would give two. So definitely marketing is a big one, but it's rarely marketing by itself. It's generally more like bigger initiative and marketing doesn't care so much about replicating
Starting point is 00:44:04 product databases. But at some point they realized that they need this information and marketing is really a consumer. In the end, it's Airbyte empowers them to just move the data. So they don't even have to talk to a data team to do it. We work mostly with the data teams to build a platform and marketing can serve. For the use case, it's going to be about, as you mentioned, like attribution, 360 views of customers. So across all the touch points, whether it's on
Starting point is 00:44:30 the product, whether it's on the finance, whether it's on Stripe payment ads, like how do you get this whole like 360 view of your customers? Now, the other use case that we see a lot is on product that are actually building, like companies that are building a product and that need to have connectivity to that product. So if you look at e-commerce analytics company, they are good at measuring analytics. They are good at providing value to their customers. But to do that, they need to actually pull data from Shopify, from Google, from Bing, from Facebook, etc. And they want
Starting point is 00:45:11 to focus on their value prop. They don't want to focus on the connectivity part of it. So at that point, we're more in an operational use case, which is we become the layer for them to acquire that data on behalf of their customers. And that's been a pretty big use case for us as well. But otherwise, yes, marketing analytics is huge. We also have product use cases, which is larger engineering team or product team wants to understand, like, get analytics on Git commits.
Starting point is 00:45:43 They want to have analytics on peers. They want to have analytics on on peers they want to have analytics on who closes closes demos and they build their own internal tool or analytics to actually measure the efficiencies of their teams i think by creating a protocol it allows you to stay away from very very specific use case that could narrow like the scope of your product. And at that point, if we only focus on the piece about data movement, then we can enable use case that we don't even have idea about.
Starting point is 00:46:16 Some people were using Airbyte to prime a cache on Redis. Every hour, they would just drop everything on Redis and just refill all their database into Redis. And that, they would just drop everything on Redis and just refit all their database into Redis. And that's something you cannot predict, but it's possible because the platform is flexible and focuses on movement instead of silos. Yeah, it is super interesting. I think you talked about machine learning as an activation use case. And I think that's a really helpful way to think about it because in many ways, if you think about really well-done marketing analytics, that's actually what you need to feed a machine learning model that's going to drive business for you, right?
Starting point is 00:47:01 And so it really is almost like you sort of get the marketing analytics layer correct. And then that opens the door to machine learning, which is super interesting. Okay. That's where complexity comes from then. You answer one question and now you have 10 more. Sure. Exactly. And so you need more, you need more team, you need more specialization in how you extract insights. So yes. Yeah, for sure. Okay. One more question for you need more specialization in how you extract insights. So yes. Yeah, for sure. Okay, one more question for you.
Starting point is 00:47:29 And we've talked a little bit about this on the show over multiple episodes. So as Costas knows, I've talked about a world where, can you imagine that all sort of data movement and sort of processing aligns with a particular business model. And in a few clicks, you can basically set up an entire stack, right? We're not necessarily completely there, but we're getting closer. And you've talked about commoditization of data products a lot. So I kind of want to assume that a lot of these things are commoditized.
Starting point is 00:48:05 What does commoditization, number one, what's your definition of commoditization? But number two, I'm really interested to know, what does commoditization unlock for us, especially for people working in the data industry? Because I think there's still a lot of inefficiency just because companies are figuring out how to build technology, things are getting cheaper, but at different rates. And so there's still a lot of complexity or sort of froth, as people would say in the market. But let's assume everything gets commoditized. What does that unlock? But first, what does commoditization mean? Yeah. So first of all, I just want to go on. One thing is I don't think every data product can become a commodity. What I'm
Starting point is 00:48:46 saying is more about data integration should be commoditized. Like the ability to pull your own data and your fragmented data asset should become a commodity. It shouldn't be something you have to think about. It's just, it's your data. You need to be able to move it where it's going to have the most value for you. So when we think about commodity, that's how we think about it. It's just, let's make sure that you can very quickly break down these silos. So that's what we mean by community.
Starting point is 00:49:17 And it's also by the simplification, like how simple it is to use and almost to a point where you shouldn't have to think about the fact that moving data is something that is a problem for an engineer and that's that's what we mean at that point by commodity now i mean i i won't call like a machine learning algorithm a commodity even though it works with data i won't call like now processing becomes more for community but then it becomes the role of this this company like what kind this company. How do you build on top of
Starting point is 00:49:48 commodity? And typically for data movement, it's about quality. It's about observability. You have a lot of things that you can build on top of it that makes something that is commoditized even more valuable. It's like the infrastructure piece, right? If we think about movement and then even, I mean, I don't know if you'd call Snowflake a commodity. It's not really the language people use.
Starting point is 00:50:14 But if you think about warehousing in general, you can set up a really robust pipeline structure and warehouse really easily, very easily. Those things are becoming commoditized, which is great. Like it's opening so many doors. Yeah. And it's just like commodity means it's something that people believe to always work. And that's where we want to be with data integration.
Starting point is 00:50:40 And then it becomes like, what intelligence did you build on top of this infrastructure? And what kind of additional value that rely on this fundamental you can actually start building and what kind of use case it enables. I would like to think about this as the Maslow pyramid, which is if your fundamentals are not there, basically your fundamental is your commodity
Starting point is 00:51:01 and you want to make sure that your fundamentals are addressed so that you can start thinking higher level and higher level with things that have even, even more value, that bring even more value to your business. Yeah, I love that. Thinking about the data stack is sort of as now as higher, it's great. One last question for you. And I'm just thinking about people in our audience who are excited by learning from
Starting point is 00:51:24 your experience. Having such a long history working with data in a variety of contexts and now building a data company. Do you have any advice for data engineers out there who are thinking about the future of their career working in data? Yeah, I would say think about trading. Trading, trade clever and enables other people to be good with data i would say that's the secret for data engineer because you you don't want to be building data connectors for example because that's something that we we discussed it which is is going to take you a ton of time what you want is what can you do to actually enable other people to be extremely, extremely good with data? And what kind of tooling, what kind of new technology you need to
Starting point is 00:52:11 build to make people who consume data even better with data? Because that's how you get your level with the rest of your team. And that's how, as an engineer, you can actually, as a data engineer, you can really grow quickly. Think about use case, think about enabling other people. Incredible advice. Well, Michel, I'm sad that we're out of time. We're going to have to have you back on the show because there are so many questions that I think both Costas and I didn't get to ask, but thank you for joining us. This has been a great conversation and we'll have you back on the show again soon.
Starting point is 00:52:41 Yeah. Thank you so much, Eric. Thank you so much, Eric. Thank you so much, Chris. I think one of the most interesting takeaways from this show, we've talked about the increasing complexity of the data stack. No one has framed it in the context of demand from various parts of the organization. And I almost feel a little bit stupid not kind of having thought of that as a way to frame it. But I thought that was a very elegant explanation of a main driver of the complexity because it's so easy for us to think about the tool. Oh, there's a new tool for this. Oh, there's a new tool for this, right? CDC, streaming, all this sort of stuff. And really demand from different parts of the organization and their different needs are the main driver. And that's a great reminder for me. And I hope everyone who's listening.
Starting point is 00:53:30 Yeah. Yeah. Well, I think, okay, obviously like Michelle is like very good in like articulating pretty complex concepts, which makes sense. Like, I think it's one of the skills that people who are successful in building like tech companies have, right? So I think it's one of the skills that people who are successful in building like tech companies have right so i think it's an indication of the success of airbite also that he's able like to do that what i'll keep like from our conversation with him is the concept of composability i think that's a very interesting like way of thinking about what the data stack is that's one thing the other thing that i found also like very very interesting is that machine learning at the end is also activation, which again, it's something that makes a ton of sense. Just keep thinking of it in a very
Starting point is 00:54:15 different way. And it's actually interesting because if you think about in the market, the companies that they are doing, they are building products around like serving models and the companies that they are doing reverse it daily. Although at the end, the end result is the same in the sense of the need from the company is the same. Like they are very, very different like products and companies. So that's another like something very interesting like to observe in the market and see how it's going to evolve as the market grows.
Starting point is 00:54:46 All right. Well, thank you for joining us on this episode. And we have many more great shows for you coming up before the end of the year as we round out the season of The Data Stack Show. We'll catch you on the next one. We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, ericdodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers.
Starting point is 00:55:22 Learn how to build a CDP on your data warehouse at rudderstack.com.
