The Data Stack Show - 64: Data Stack Composability and Commoditization with Michel Tricot of Airbyte
Episode Date: December 1, 2021

Highlights from this week's conversation include:
- Announcement: Data Stack Live! (1:00)
- Michel's career background (4:13)
- Solving the technical and process challenges of moving data (7:04)
- Lessons learned from managing data at LiveRamp (9:35)
- How to build a modern data stack (16:19)
- Triggers to signal when more data infrastructure is needed (23:19)
- Why Airbyte is an open-source product (30:23)
- Airbyte's role in providing support to open-source users (38:15)
- How important dbt is for the Airbyte protocol and platform (41:03)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are
run at top companies.
The Data Stack Show is brought to you by RudderStack, the CDP for developers.
You can learn more at rudderstack.com.
We have a really exciting episode coming up.
And what's most exciting is we're going to live stream it.
The topic is the modern data stack.
And we're going to talk about what that means.
It's December 15th and you'll want to register for the live stream.
Now, Costas, it's really exciting because we have some amazing leaders from some amazing companies.
So tell us who's going to be there.
Yeah, amazing leaders and also an amazing topic. I think we have mentioned the modern data stack so many times on this show. I think it's time to get all the different vendors who have contributed to creating this new category of products, have them define the modern data stack, and discuss what makes it so special. So we are going to have people from Databricks, dbt, and Fivetran; companies that are implementing state-of-the-art technologies in their data stacks, like Hinge; and VCs, to hear their opinion on the modern data stack. And yeah, it's going to be super exciting and super interesting.
So we invite everyone to our first live stream.
Yeah, we're super excited.
The date is December 15th.
It's going to be at 4 p.m. Eastern time
and you can register at rudderstack.com/live. So that's just rudderstack.com/live, and we'll send
you a link to watch the live stream. We can't wait to see you there.
Welcome to the Data Stack Show. Today, we're talking with Michel Tricot,
and he is one of the founders of Airbyte. And Airbyte moves data for companies. Really interesting
company. They've grown a ton. I think they've
been around for a year, but he has a pretty long history in data. And this isn't going to surprise
you, Costas. He worked at a company called LiveRamp, which anyone who knows marketing
knows LiveRamp. They do have a ton of marketing and audience data. And so, of course, I have to
ask him about that experience. He was there pretty early, I believe.
And so I want to hear what it's like to talk to a data sort of engineer,
data integrations leader at a marketing data company like LiveRamp.
So that is, I'm going to try and sneak that in if I can.
Well, first of all, I love your French accent.
We have to get more French people on this show.
It really is.
Yeah.
I remember our first French guest, Alex, and I just loved hearing him talk about data.
It was great.
Yeah, it's always very interesting to hear like Americans speak French.
Anyway, so what I'm going to ask him... I mean, for me it's a very special episode, because we are talking with the person who's building a data pipelines company, right? So there are many different things that I'd love to ask him, but I think the most important and most interesting part is the open-source dimension of Airbyte: how building a community is part of the product, and how this can actually become some kind of, let's say, moat for the company.
And it's very interesting in this case, because you have to understand that Airbyte came at a time when, let's say, the market of data pipelining was supposed to be done, right? We had Fivetran; Fivetran won, and it's probably the biggest vendor right now. Suddenly you have Airbyte coming in, doing something different. And this has impact. So I think it's going to be very interesting to talk with him, both from a technical perspective and from a business perspective.
I agree.
All right, well, let's jump in and talk with Michel.
Let's do it.
Michel, welcome to the Data Stack Show.
We're really excited to talk with you today.
Hey, Eric.
Thank you so much for having me.
All right.
Well, you've been working in data for a really long time.
Can you just give us an overview of kind of where you started, what you've done, and then
what you're working on today at Airbyte?
Yeah, sure.
So I've been in the data space for the past 15 years, started my career in financial data.
So I would say medium volume of data, a few hundred gigabytes.
And then in 2011, I moved to the US and started at this, back in the day, small company called LiveRamp, and was able to experience the hyper-growth from finding product-market fit, to getting to an IPO, to getting through an acquisition. I was head of integration and director of engineering over there. I was leading a team of, yeah, 30 people, and we built thousands of different data integrations. Data integration is basically how you take data from one place, and it could also be about how you get data into another place.
And we're moving-
A thousand integrations back then is huge.
It is.
It is.
I think we got burned quite a few times.
It's a very hard problem.
But in the end, what you need to do is really thinking about how you build,
how you maintain, and how you scale.
And it's just that these pipes, you keep having more of them, and they keep becoming bigger and bigger. I think when I left in 2017, we were moving hundreds and hundreds of terabytes of data every single day. So I had to learn how to build these systems the hard way.
Wow.
Yeah. And after that, after LiveRamp,
I joined another startup, started to do the same thing,
which is how do you get data from point A to point B?
And I was, okay, what if I go for the crazy idea
of solving it for more than just one company at a time?
And that's how John and I started Airbyte.
So helping people move data from point A to point B without having to spin up massive data teams to do it.
Yeah.
So I'd love to hear just a little bit about, well, I was watching a movie with my son the other night.
And it's an older movie from the 90s, I think.
And they're in this lab and a piece of equipment breaks. And
they said, we lost all 30 gigabytes of data from this experiment. That's just so funny. Cause
you're like, man, that was like catastrophic back then. But so you went from a couple hundred
gigabytes to hundreds of terabytes a day. Can you just explain, I mean, I'm sure some of our listeners have gone
through that, but probably a lot of them haven't. Like, what are the key things that you sort of
took away from that experience of sort of this exponential growth in magnitude of just trying
to move data?
Yeah. The key thing about moving data is you need to think about it almost as a factory. It is not just a technical challenge; it is a process challenge. And it is not something that you only solve with code or with some software. It's also something that you need to solve with people. Because when you think about the number of places where you have data, it's impossible to write software that can get it everywhere. I mean, you're going to spend years and years doing it. So for a single company, it's very hard. It's about how you set up the right process so that you enable people to actually pull data from there, and you start dispatching the responsibility to more and more people, and you can almost crowdsource the maintenance, crowdsource the building, and crowdsource the scaling of these connectors. The other thing is, you need to think about it in the sense that you're not building a system that works 100% of the time. You're building a system that has to be resilient. That's the thing with data connectors, with data integration: it always breaks one day or another, and you need to build your system with this in mind.
Because, I mean, in the end, you depend on a ton of external places that you have no control over. I mean, I don't know, tomorrow Facebook can decide to change how the API behaves, or change the schema, and you don't have control over their product decisions. You need to make sure that the system you build is resilient to that, and that you have the process in place to solve problems when they come up.
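As a concrete illustration of designing for failure rather than for 100% uptime, here is a minimal sketch in Python. The names (fetch_page, api_call) are hypothetical stand-ins, and a real connector would also distinguish retryable errors such as rate limits from permanent ones such as schema changes:

```python
import random
import time

def fetch_page(api_call, max_retries=5, base_delay=1.0):
    """Call a flaky external API, retrying with jittered exponential backoff.

    `api_call` is any zero-argument callable that raises on failure,
    a stand-in for a real connector's HTTP request.
    """
    for attempt in range(max_retries):
        try:
            return api_call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up and surface the failure to the orchestrator
            # back off 1s, 2s, 4s, ... plus jitter, so retries don't pile up
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
```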
Yeah, interesting.
I remember there were a bunch of companies
that were sort of built around an API
that Facebook had made available
that sort of made it really easy to gather
large amounts of information on sort of individuals.
And they changed it overnight
and like 20 startups just evaporated.
I mean, that's kind of an extreme example, but even minor changes, especially if you think about
enterprise scale, can make a really big difference on a daily basis. Think of an e-commerce company that relies on data to understand conversion, or to send repeat-purchase emails, or other things like that. If something breaks, it literally costs huge amounts of revenue. One thing we talked about
as we were prepping for the show, which I would love for you to speak to. So you were at LiveRamp
a while ago and you were dealing with data at a really large scale. And I think in many ways,
like a scale that a lot of sort of your
average like data engineer, data analyst doesn't get to experience just because that was such a
huge amount of data. But that was still a while ago. And so have you seen sort of the lessons
that you learned there translate? I mean, the technology stack is very different today than it was back then in some ways.
But there's sort of a trickle down effect from companies that you were sending audience data, hundreds of terabytes a day.
What's the trickle down effect and how long does it actually take for sort of the problems that you solve to hit the average sort of data engineer or show up as tooling for the average data engineer?
Yeah, that's a very good question.
One thing that I love looking at right now
is when I look at the data landscape
and how things are moving
and the new type of product
is you look at who are building these tools,
who are building these products,
and you realize that all of them,
like most of them,
have had this problem way before. It's just that with scale, you encounter new challenges that you have to solve the hard way, because no solution exists on the market. And because it's data, it just grows exponentially. So everything that we learned 10 years ago was specific to LiveRamp, or it could be specific to Google, or to Facebook, or to Netflix, or to any of these big companies that were built and really became massive at that time. And engineers there had to learn what kind of technical assets, what kind of technical skills, they had to build. And once they are out of these companies, they realize that, hey, data is actually scaling exponentially. So all these other companies that don't have the same volume of data are actually about to face the same type of problems. And with this technical knowledge of how to actually solve the problem, now you get this new generation of products that allow the more mainstream consumer to actually be very, very good with data. So we were more like the early adopters, and now we're in the land of the mainstream.
So, and I'd love for you to talk about that. Okay, so five years ago, or however long ago, you're solving these huge problems at LiveRamp, or an engineer is solving them at Facebook or Netflix. And so they sort of learn the fundamental components of the problem from an engineering standpoint. And then at the same time, technology is advancing, right? And so then when they leave that company and encounter the new technology that's there, is that the point where they say, okay, I can now build something that solves this in a way that meets the needs of the mass market, in a way that wasn't possible before?
Yeah, that's correct. I mean, just think about warehouses, for example.
I mean, 10 years ago, you had some warehouses, but most of the analytics was done on Hadoop
at scale.
And it's just that people using Hadoop started to realize, okay, that's not the best system.
On top of that, you start putting Hive, you start putting more and more layers.
And one day you have BigQuery.
The other day you have Snowflake, and people were taking the analytics to the next level. And now for all these engineers who've been working with this technology, they say, okay,
I have this amazing processing engine. What was I doing with all this complex system that is now
becoming much simpler, or that can enable more use cases by
using a data warehouse. And it's just, yes, technology is growing, and it makes creating these products easier and more approachable for maybe less data-heavy companies. And for example,
we always talk about the modern data stack. You do extract,
you do load, you have your warehouse, you have your analytics, you try to orchestrate all of
that with airflow, with transformation, with dbt, etc. But if I look at it in 2014, 2013,
with Redshift, we already had exactly the same system internally.
And it's just this type of system becomes more mainstream
and there is more tooling
so that you don't need a huge team
to build all the tooling around it.
I mean, at the time, I don't know if Airflow existed,
but we had our own workflow manager.
We had our own transformation manager.
Just that people coming out of companies
are actually building that
so that it can be used by more and more people.
Michel, you mentioned the...
Hi, by the way.
I'm sorry, I'm a bit late.
But maybe you can relate to that from your accent.
But after two years,
I'm still struggling with the difference between kilometers and miles. So I'm sorry, I'm very sorry for my delay, but I miscalculated a few things. So, you mentioned the term modern data stack.
What's your definition of the modern data stack?
Like, yeah, what are the, let's say, like the main components of it?
And you mentioned also that like, it's not like we are doing something new
that we didn't do in the past, right?
But why is it modern?
I think it's something that is enabled by technology,
which is the composability of your system.
What we've seen back in the day with Hadoop, with Spark,
is that you have a very monolithic way of working with data.
And with more and more tools being added,
I mean, if you look at all the Apache projects,
basically most of them are about data.
And it's just all these little tools that are coming on top of it.
And the modern data stack is more about
how you go from an end-to-end solution to something
where you use the best of breed for every single piece in your data value chain. Because data is
so tied to your business that generally using an "I do everything" solution doesn't work. You get to 70% of what you need, and then when you go a little bit outside of how it was thought about, you need to build your own parallel data system. And for me, the modern data stack is more about the composability of a system, and the fact that as your business changes and evolves, you can start adding more and more building blocks. And you have the choice of picking which vendor, which solution you want to use; it's a matter of subject-matter expertise. And that's why
products like Airflow,
Prefect, and others are
becoming so powerful because they become
also a bit of...
That's where you encode
the logic of how you glue
all these different tools together.
Yeah, that makes total sense.
And what are the main components of the data stack?
I mean, you mentioned Airflow, for example, which is the orchestration part and somehow
like glues everything together.
But what else is needed there to have, let's say, the minimum viable data stack?
Yeah, I would say ingestion, processing,
transformation, visualization.
Okay, makes sense.
And now, maybe, a little bit on the reverse ETL side, which is: how do you actually take the data and put it back into a place where it can be activated.
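To make that composability concrete, here is a minimal sketch of the glue in Airflow. Every task name and shell command below is hypothetical; a real stack would typically use the vendors' own operators or API calls rather than these placeholder scripts:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Each stage is a separate best-of-breed tool; the orchestrator only
# encodes the order in which they are glued together.
with DAG("modern_data_stack", start_date=datetime(2021, 12, 1),
         schedule_interval="@daily", catchup=False) as dag:
    extract_load = BashOperator(task_id="extract_load",
                                bash_command="run_el_sync.sh")   # e.g. an Airbyte sync
    transform = BashOperator(task_id="transform",
                             bash_command="dbt run")             # models in the warehouse
    reverse_etl = BashOperator(task_id="reverse_etl",
                               bash_command="push_to_crm.sh")    # activate the data

    extract_load >> transform >> reverse_etl
```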
Yeah. What about ML? Where do you think that this fits?
Or it's something that's like, okay, let's have the data stack first in place.
Let's solve the basics.
And then we move and do like, say, the more advanced like ML Ops and like all these things
that you see like coming up right now, like with products like Tecton with feature stores
and all that stuff.
Yeah.
So ultimately, I put that in the activation part.
So whether it's about making the data available elsewhere, whether it's about an operational
use case.
I mean, yes, you have the operational use case.
And I think also that's where it's not just about analytics.
And that's where all the orchestrators are important, because, as you said, you have ML, you have quality,
you have a lot of things that you might want to do.
It's just depending on your business,
you might or might not need that particular function.
And it's just about where does that fit in your pipeline?
But yes, that's part of it.
It's just the composability of all your data value chain
from beginning to the end product.
And where does Airbyte fit now, today?
So today we fit on the ingestion and loading piece,
which is just breaking down silos
and making sure that you don't have to think about the physical and the technical complexity of pulling data from one place and feeding it into another place.
And what is Airbyte going to be in the future? What's the vision?
So the first thing is the goal today is really about commoditizing data integration.
But when you think about data integration, there is a purpose behind it, which is moving data around.
You have data on point A, you want to get it to point B.
And that is the vision behind Airbyte.
How can we make sure that there are pipes that allow the data to flow and to get to the place where it's going to be
the most valuable for the organization.
And it's not about extracting insight.
It's not about visualization.
It's not about transformation.
It's just let's focus on having a perfect movement of data.
And that could come also with adding quality on top of it,
adding a lot of additional
features to make sure that you don't just have pipes, but you have smart pipes.
Okay. Yeah. That's very interesting. And do you think that... I mean, you mentioned composability, right? So we have all these different parts of the data stack, and we try to make them all work together. And that's why data engineers are in so much demand right now. So, you mentioned quality. On quality right now, we have all these different products out there, like Monte Carlo, Bigeye, all these new players who are entering the market, that somehow, in order to
deliver value, they need to work very closely with another part of the data stack.
Either this might be the data warehouse
or it might be the data pipelines, right?
Yeah.
Do you see in the future quality
being part of some more fundamental part
of the data stack
or do you see a different category remaining there?
What do you see happening there?
I think it's a matter of who is using it and who gets value from it.
I think quality can exist at multiple layers. You can have physical quality, like: is there a missing field, or is there a lower volume of data? That could be done at the pipe level. But then you might have business quality, which is: is the sum of my revenue less than, I don't know, $200,000? And that's why the composability is so important.
Companies will learn what is important to them.
And quality is just something that you will put at different places.
I mean, it's the same thing when you're thinking about factories.
You have quality checks in multiple places
because that allows you also to know where you have a problem.
So, yeah, quality is just omnipresent in the data stack.
And it's just who gets value from it.
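To make the two layers concrete, here is a toy sketch; the field names, thresholds, and the idea of passing rows around as plain dicts are all invented for the illustration:

```python
def pipe_level_checks(rows, expected_min_rows=1000):
    """Physical quality: checks a pipeline could run on raw records."""
    missing = [r for r in rows if r.get("order_id") is None]
    assert not missing, f"{len(missing)} records are missing order_id"
    assert len(rows) >= expected_min_rows, "row volume dropped below threshold"

def business_level_check(rows, min_revenue=200_000):
    """Business quality: needs domain context, e.g. a revenue floor."""
    total = sum(r.get("revenue", 0) for r in rows)
    assert total >= min_revenue, f"sum of revenue {total} is below {min_revenue}"
```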
So companies like Bigeye or others, they need to be there
because you have so many people that are interacting with the data warehouse
that know what is good data and what is bad data
and that need to have the tool.
I think maybe one thing we didn't talk about when we talk about the modern data stack is
it's about making data a platform instead of something that is fully controlled by data
engineers.
And once you start exposing data as a platform to the rest of your organization, then you
need to have more than one tool for doing quality at different steps of the pipeline.
Yeah, 100%.
I think that's very, very, I would say, obvious when you enter data quality in the ML space, where the tools that you need to use to figure out if you have to retrain your model, for example, to do stuff about the model itself and figure out if something goes wrong there, are a completely different kind of beast that you have to work with. So yeah, I agree. I think we're just at the beginning of figuring out quality, to be honest. It's a huge issue, and I'm very curious to see what else will come out there in the market.
I have a question for each of you, though. Costas might be tired of hearing this, but I think it's a really interesting question for our listeners, because they span sort of enterprise to startup. But Michel, when you talk about composability and we think about quality: size of company, and sort of complexity as a proxy for size, has a really significant influence on the pain you feel from lack of data quality, right?
So the example is, I heard someone say, okay, what is your analytics when you're a two-person startup in a garage? You just
directly query your Postgres like app database, right? And you learn everything you want to know,
right? But then when you're a thousand person company, that's a completely different game.
And like you said, you need to pull data from many sources. You need to do
transformations on it. There's a quality component, there's visualization, and then sort of the activation side of it. What are the triggers that you've
seen that are sort of indicators that people need to address those issues? And I mean,
I'll also caveat that by saying in an ideal world, I think smart companies try to solve these
problems sooner with good infrastructure, good orchestration, good data quality practices.
But I think anyone who's been inside a company knows it's really hard to do that while you're growing a business.
So how does size influence all of these factors that we're talking about?
So first of all, it's just a matter of how much context and who interacts with your system.
And that's why composability is so important.
Before, it was one persona that was working with the data. As your organization grows, it's not just your data that grows; your team is growing too. The number of people that are interested in data is growing. You might have marketing that wants to know something about data. You might have sales, you might have finance, you might have product, and they all want something from the data.
And that generates complexity
because they don't have the context
about all the data that is flowing through these pipes.
And that's why when we think about the modern data stack
as becoming a platform for other roles, that is the complexity that needs to be fixed. And that's why composability
is so important because you don't know tomorrow who is going to need data to make your organization
better and go faster. And so at that point, you want to make sure that you bring a system that is not just frozen in time, but it's one that can actually evolve with your company and with your teams.
And of course that comes with complexity. But in general, complexity cannot be fully addressed; it can be made simple, or simpler, with more composability and more choice.
Yeah, that's fascinating.
We had a guest recently who made an observation I've been thinking about a lot where they
said the move to the cloud was supposed to simplify a lot of things and it did simplify
a lot of things, right?
Deployment, sort of managing on-prem stuff, right?
But he said it's made the tech stack way more complicated because everything's easier.
And I think it's such a good observation that complexity is not driven primarily by technology
or only by technology, but by demand for data inside the organization. And the lack of context
is a huge challenge there. That's such an interesting observation.
A hundred percent.
I think, first of all, what you said, Eric,
about like what your analytics look like
when you're like in the garage
and you just have like a Postgres.
I would argue that that's not real anymore, if you think about it, because of the cloud, right? Even when you just start,
you will have probably some data on Google Analytics. You will run some experiments with
some ads. You will probably have a basic CRM, or at least some Google Sheets where you keep track of some things. So my feeling, at least, is that more and more smaller companies will need something like Airbyte really soon, to use it and get value out of the data that they have.
I think why size is important and it matters is because of organizational complexity.
That's where things get really messy, because suddenly, as Michel mentioned,
you don't know who else inside the company is going to need the data. But at the same time,
it's much harder to communicate any issues with the data or fix or identify. When there are just
two founders and there is a problem in a spreadsheet, they just talk to each other
and they fix it. Now think about a company where you have to collect data that might be edited by salespeople in Salesforce. And then when you take this data, there are some analysts who go clean it and create some dashboards. And then the data scientists will take these dashboards and, based on that, create a subset of the data to go build a model that's going to magically, I don't know, come up with some numbers. And then the data engineer will take this data and push it back into Salesforce, for example, so the salespeople can again do something with it. Just think about all the different departments we are already talking about, and how difficult it is to communicate across all of them, even for much simpler problems than data that has to be moved from one place to another. So I think this organizational complexity is super important, and I think that's one of the reasons you have some influencers, let's say, in this space, like the guy from Locally Optimistic, who posted that the problem with data is an organizational problem, not a technical problem, and all these things. You also have this model that Michel is talking about: the different parts of, let's say, the supply chain of data, where you need quality, for example, in different places, and someone else cares about each one of them. Anyway, I think we're still at the beginning, and we're scratching the surface of the complexity of building, at the end, a data-driven organization. It's going to be very, very different working with these systems compared to building mobile apps, for example. The complexity is very, very different.
We're in a new era for data.
It's just by opening and making it more accessible, you discover a new thing that you can do with
it.
And it's going to continue to grow and people are going to become greedier and greedier
for having more data and make better decisions.
So ultimately, people who've already worked with that much data have an edge on the type of products that can be built to enable this new generation of data consumers.
Yeah.
Michel, I want to go back to Airbyte a little bit, because we could keep talking about data in general. We could have multiple episodes, the three of us, talking about that stuff, for sure.
But I would like to share a little bit more about the product and the company with our audience.
So Airbyte is an open-source product. Why open source? Why is it important, and how has it helped
the company grow? Yeah. One thing that I was mentioning before was really that solving data movement and solving data integration is not just a code or a technological problem.
It is a process and people problem.
Because when you look at all existing solutions, they generally plateau at a number of connectors that they can support. And the reason is simple: it's very hard
for one single entity to manage that many connectors, because that's the problem with data
connectivity. That's why it's very hard to solve. You have so many places, and you don't know what all the use cases are. And at that point, when we thought about open source, that was basically because of that.
It's something that needs to be built and that needs to be almost like crowdsourced.
You want to make sure that you have more than one company that has the control on how you actually move data around and what kind of connector matters.
Because building is relatively easy, but
maintenance is where the cost is so at that point what you want is you want to give the power to
people who are using the platform to actually solve the problems when it when it arrives because
if they're using a closed source solution if they have a problem they will have to wait four weeks
so in that case what they will do is they will start building it internally.
But then when you build it internally, this becomes a gigantic monster
that grows and grows and grows until it's out of control.
And here you have access to something that sometimes you need to fix,
and the rest of the community has access to it,
or someone else from the community fixes it,
and you get access to the fix.
And by creating this very virtuous cycle, you get more connectors and you get more people that contribute
and that actually have a seamless experience with data integration.
So open source was really about solving the people aspect of it.
Yes, open source is also technology,
but it was really about let's build a community and let's make sure that we make
data available across the community and users of Airbyte.
Right. So, okay, that makes total sense. I mean,
the part of the connectors themselves. So before Airbyte, there was
Singer. I mean, it's still out there. And I know that Airbyte, like as a
protocol, it's actually an extension
of Singer. What is Airbyte doing that is different from what Singer did, and from what Stitch, the company behind Singer, did (before they got acquired, at least)?
Yeah, so interestingly, when we started Airbyte,
we started to build on top of Singer. So we discovered some of the flaws.
The thing is, the team has a lot of experience in building data integration.
So we saw flaws in how it was.
And we have a compatibility with Singer to make sure that people who've invested time
of their team into Singer can also leverage these connectors within Airbyte.
But in the end, the protocol, I won't even call it that, the guidelines, are way too permissive. And that breaks the contract of solving data integration, because you end up with almost pairwise compatibility.
And the day you have this absence of rules and this absence of guidelines, then you're basically building one-to-one pipelines, and that's all. You get to this N squared, N times M, problem instead of an N plus M problem.
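To put rough numbers on that point (a back-of-the-envelope sketch, not figures from the episode): with pairwise pipelines, every source-destination pair needs its own integration, while a shared protocol needs only one connector per system.

```python
n_sources, m_destinations = 10, 8

pairwise = n_sources * m_destinations         # 80 bespoke pipelines to maintain
shared_protocol = n_sources + m_destinations  # 18 connectors, each written once

print(pairwise, shared_protocol)  # 80 vs. 18, and the gap widens as N and M grow
```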
That's what we saw with Singer. Also, the community of Singer was... I mean, after Stitch got acquired by Talend, I think Talend dropped the ball on Singer.
And a community like that needs to,
you need to really invest in it.
And that's something we've done very, very early on.
One of the first hires we had at Airbyte was really someone who was there for the community, helping them be successful with open source.
And because we started a year ago,
so obviously the first versions were pretty unstable.
So having someone to just help
every single person in the community
was very important.
And we've continued to grow that function
and make sure we have seamless experience
on open source.
But it's just that
if you don't support your community,
you cannot build that network
of people who just help each other and build and maintain connectors together.
So you said that Airbyte is actually fixing some of the issues that Stitch, sorry, Singer, had. So what are these new elements that the Airbyte, not exactly protocol, let's say guidelines or whatever we want to call it, brings that Singer didn't have?
Yeah.
So we actually call Airbyte a protocol for the reason
that we have very strong...
We've encoded a lot of behavior and logic,
and there is almost like a specification on how you build it
and what messages should look like. But there are a few things. The first one is, with Airbyte, you don't have a problem with environments. That was a big problem with Singer: you want to use a tap, but you don't have the right Python version, you don't have the right C library, the right bindings. So 80% of the time, you need to do a lot more digging to get it to work. So that's the first thing.
Second thing was, it has to be programmatically configurable.
It means that a connector should expose what kind of input it requires,
like what does the state look like,
so that you can be smarter on the platform level,
and you can start building on top of connectors
instead of hard-coding behavior. That's something we actually learned while using Singer: if you want to use a tap from Singer, you have to read through the code. You don't have a way to automatically know, oh, I need an API key and a start date, I need something. So we made it part of the interface of the protocol. Now, the other thing was about being language agnostic.
And that was very important because if you look, for example, at data integration,
not everything is an API.
You have queues, you have databases, you have Kafka, you have a lot of things.
Very often, they've been designed around the programming language that is going to be consuming data or pushing data to them. And I would hate to have to push data
at scale on Kafka with Python. If I want to do it, I want to do it in Java. And so having the
flexibility and being language agnostic was a very important requirement that we had.
So, I'm just summarizing, but that's kind of the criteria we had and how we thought about data connectors.
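As a toy sketch of those criteria (an illustration of the idea, not the actual Airbyte protocol specification): the connector declares the inputs it requires, emits structured JSON messages, and because everything flows over stdout, the same contract could just as well be implemented in Java, C#, or Elixir.

```python
import json
import sys

def spec():
    # Programmatically discoverable configuration: the platform reads this
    # instead of a human reading through the connector's source code.
    return {"required": ["api_key", "start_date"],
            "properties": {"api_key": {"type": "string"},
                           "start_date": {"type": "string", "format": "date"}}}

def read(config):
    # Emit records and state as JSON messages on stdout; any runtime that
    # can print JSON can implement the same contract, in any language.
    for record in [{"id": 1}, {"id": 2}]:
        print(json.dumps({"type": "RECORD", "record": record}))
    print(json.dumps({"type": "STATE", "state": {"cursor": config["start_date"]}}))

if __name__ == "__main__":
    if sys.argv[1] == "spec":
        print(json.dumps(spec()))
    elif sys.argv[1] == "read":
        read(json.loads(sys.argv[2]))
```

Run as `python connector.py spec`, or `python connector.py read '{"api_key": "...", "start_date": "2021-12-01"}'` (hypothetical file name and inputs).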
And it's also like, if you want to grow your community, not everybody knows how to write
Python.
Sometimes they want to write it in C#, so they should be able to contribute in C#. One of the first contributions we had was in Elixir. I've never played with Elixir, ever, but sure.
Wow.
I mean, I know Elixir is growing in popularity,
but that's kind of obscure.
Yeah, but in the end, it worked.
And it was really a proof of concept
of how Airbyte can work with more than just one language
and can be used
by people that have the talent and that
are using the tools that
are the best suited to solve that particular problem.
Yeah, I
have a question about that because
I understand and I
believe that there's always some kind of trade-off
between flexibility and quality.
If you, for example,
let's say, take Kafka Connect, right?
Which is, let's say, another framework
that you can use to create connectors,
of course, specifically for Kafka.
But the whole idea of the community around
and all these things, they are similar, right?
But at some point, you at Airbyte will have to, or will want to, ensure the quality, right? How can you do that when you have so much freedom in terms of how someone can code something, or what framework they can use?
Let's say someone comes with Elixir, or someone comes with Haskell and writes something in Haskell, right? What are you going to do at Airbyte with that?
Yeah, that's a very, very good question.
So one thing that we're working on right now is contribution: we're basically creating a new contribution model, and that is going to be powered by cloud.
So what we want is there will be a set of connectors
that is fully maintained by Airbyte.
And some of them are so core, like typically the database connectors, that we need to make sure we have very, very strong quality, and not just quality, but a very strong say on the roadmap.
But for the other ones that are not
part of this subset of certified Airbyte connectors,
what we're going to be doing
is making them available on Airbyte Cloud and providing a rev share to the rest of our community, the people who are actually maintaining these connectors, in exchange for an SLA.
And then community members can be individual contributors, or they can be data agencies,
or they can even be vendors. So if you're a vendor and you want to create a new revenue stream via Airbyte, that's something
you can do. And today with Hacktoberfest, we got a massive, massive number of connectors contributed to Airbyte. People are really seeing the value of having their connector run on Airbyte. And so there is this will and desire to be part of the program, to get rev share as the connector becomes more successful. And at that point,
you also have a nice balance, which is: if someone stops maintaining it, or if the SLA is not met,
either this connector gets transferred to someone else or someone is going to create a better one.
So there is a bit of a race to some extent on making sure that the connector is high quality.
Yeah, it's very interesting and I'd love to chat about that again in the future. One last question
from me. How important is dbt for the Airbyte protocol and for Airbyte as a platform? I'm separating the two, so yeah, I'd like to hear about that.
Not at all for the protocol. We use it more as a post-processing piece on warehouses. In the end, it's just making the data a bit more consumable once it's been loaded, but it's not required for the protocol. The protocol is just about configuration,
data exchange, and connection.
That's all.
And for the platform,
it's more who you're talking to
and who you're working with.
I mean, most people we work with
are data analysts, data scientists,
data engineers, analytics engineers.
And if they don't have an Airflow running, or some orchestrator on top of it, they want
to have a very simple way to kick off like dbt jobs, whether it's by using open source
or right now we're also working on how we can make it work with dbt cloud.
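As one hypothetical shape of that handoff (not Airbyte's actual mechanism), a platform could shell out to the dbt CLI once a sync lands:

```python
import subprocess

def on_sync_complete(models="staging"):
    # Extract-load is done; hand off to the transformation layer.
    # `dbt run --select <models>` rebuilds just the affected models.
    subprocess.run(["dbt", "run", "--select", models], check=True)
```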
But it's more a handoff to the rest of the data stack.
As I mentioned, we want to be the best at just extract load and
data movement. That's all. We don't want to do transformation. What we want though is to have
a way, a mechanism to hand off what happens to the downstream system. Okay. That's super interesting.
I could keep talking about that like for a long time, but I think we are getting close to our
time here, right, Eric? We have a few more minutes. If you have another question, go for it.
I'm good. I want you to ask.
So, Michel, one question,
and I'm interested in your perspective on this.
So, of course, our listeners know I have a background in marketing.
You were at LiveRamp.
LiveRamp has been a major player in the marketing data space for a long time.
And anyone who works in data inside of a company
knows that marketing tends to be the most hungry or one of the most hungry,
sort of, or generates a lot of demand for data. They're very hungry. And you mentioned that,
like, the complexity around giving people data, and then there's more demand for data
because it creates more questions and more value. And marketing is a major consumer there.
I'd love to know,
when you think about marketing, a lot of times it's sort of audiences,
advertising conversion data. A lot of it's happening on sort of the client side or sort
of actual experience and then feeding experience and conversions back into the system to sort of
optimize basically advertising algorithms. I know it's more complex than that, but so that's a huge need in marketing. What are the major sort
of use cases or the biggest areas of demand that you see for companies that are using Airbyte? Are
there particular types of data? Does it fall sort of to one department? Who are the most greedy
data consumers and even use cases around that when it comes to
Airbyte?
Yeah, so I would say, I would give two.
So definitely marketing is a big one, but it's rarely marketing by itself.
It's generally more like a bigger initiative, and marketing doesn't care so much about replicating
product databases.
But at some point they realized that they need this information and marketing is really
a consumer.
In the end, Airbyte empowers them to just move the data.
So they don't even have to talk to a data team to do it.
We work mostly with the data teams to build a platform, and marketing can self-serve.
For the use case, it's going to be about, as you mentioned,
like attribution, or 360 views of customers: across all the touchpoints, whether it's on the product, whether it's on the finance side, whether it's on Stripe payments or ads, how do you get this whole 360 view of your customers? Now, the other use case that we see a lot is on product: companies that are building a product and that need to have connectivity in that product.
So if you look at an e-commerce analytics company, they are good at measuring analytics.
They are good at providing value to their customers.
But to do that, they need to actually pull data from Shopify, from Google, from Bing, from
Facebook, etc. And they want
to focus on their value prop. They don't want to focus on the connectivity part of it.
So at that point, we're more in an operational use case, which is
we become the layer for them to acquire that data on behalf
of their customers.
And that's been a pretty big use case for us as well.
But otherwise, yes, marketing analytics is huge.
We also have internal product use cases, where a larger engineering team or product team wants to understand, like, get analytics on Git commits. They want to have analytics on PRs, analytics on who closes demos, and they build their own internal tools or analytics to actually measure the efficiency of their teams. I think by creating a protocol, it allows you to stay away from very, very specific use cases that could narrow the scope of your product.
And at that point, if we only focus
on the piece about data movement,
then we can enable use cases that we don't even have an idea about.
Some people were using Airbyte to prime a cache on Redis.
Every hour, they would just drop everything on Redis and refill their whole database into Redis. And that's something you cannot predict, but it's possible because the platform is flexible
and focuses on movement instead of silos. Yeah, it is super interesting. I think you talked about
machine learning as an activation use case.
And I think that's a really helpful way to think about it because in many ways, if you think about really well-done marketing analytics, that's actually what you need to feed a machine
learning model that's going to drive business for you, right?
And so it really is almost like you sort of get the marketing analytics
layer correct. And then that opens the door to machine learning, which is super interesting.
Okay. That's where complexity comes from then. You answer one question and now you have 10 more.
Sure, exactly. And so you need more team, you need more specialization in how you extract insights. So yes.
Yeah, for sure.
Okay, one more question for you.
And we've talked a little bit about this on the show over multiple episodes.
So as Costas knows, I've talked about a world where,
can you imagine that all sort of data movement and sort of processing
aligns with a particular business model.
And in a few clicks, you can basically set up an entire stack, right?
We're not necessarily completely there, but we're getting closer.
And you've talked about commoditization of data products a lot.
So I kind of want to assume that a lot of these things are commoditized.
What does commoditization, number one, what's your definition of commoditization?
But number two, I'm really interested to know, what does commoditization unlock for us,
especially for people working in the data industry?
Because I think there's still a lot of inefficiency just because companies are figuring out how to build technology,
things are getting cheaper, but at different rates. And so there's still a lot of complexity
or sort of froth, as people would say in the market. But let's assume everything gets
commoditized. What does that unlock? But first, what does commoditization mean?
Yeah. So first of all, I just want to touch on one thing: I don't think every data product can become a commodity. What I'm
saying is more about data integration should be commoditized. Like the ability to pull your own
data and your fragmented data asset should become a commodity. It shouldn't be something you have to
think about. It's just, it's your data. You need to be able to move it where it's going to have
the most value for you.
So when we think about commodity,
that's how we think about it.
It's just, let's make sure that you can very quickly break down these silos.
So that's what we mean by commodity. And it's also about simplification,
like how simple it is to use
and almost to a point
where you shouldn't have to think about the fact that
moving data is something that is a problem for an engineer. That's what we mean at that point by commodity. Now, I mean, I won't call a machine learning algorithm a commodity, even though it works with data. Processing, on the other hand, is becoming more of a commodity. But then it becomes the role of the company: how do you build on top of
commodity? And typically for data movement, it's about
quality. It's about observability.
You have a lot of things that you can build on top of it that makes
something that is commoditized even more valuable.
It's like the infrastructure piece, right?
If we think about movement and then even, I mean,
I don't know if you'd call Snowflake a commodity.
It's not really the language people use.
But if you think about warehousing in general,
you can set up a really robust pipeline structure
and warehouse really easily, very easily.
Those things are becoming commoditized, which is great.
Like it's opening so many doors.
Yeah.
And it's just that commodity means it's something that people believe will always work.
And that's where we want to be with data integration.
And then it becomes like, what intelligence did you build on top of this infrastructure?
And what kind of additional value
that rely on this fundamental
you can actually start building
and what kind of use case it enables.
I would like to think about this as the Maslow pyramid,
which is if your fundamentals are not there,
basically your fundamental is your commodity
and you want to make sure that your fundamentals
are addressed so that you can start thinking
higher level and higher level, with things that bring even more value to your business.
Yeah, I love that.
Thinking about the data stack sort of as a Maslow hierarchy, it's great.
One last question for you.
And I'm just thinking about people in our audience who are excited by learning from
your experience.
Having such a long history working with data in a variety of contexts and now building a data company.
Do you have any advice for data engineers out there who are thinking about the future of their career working in data?
Yeah, I would say: think about leverage. Create leverage that enables other people to be good with data. I would say that's the secret for a data engineer, because you don't want to be building data connectors, for example, because, as we discussed, that's going to take you a ton of time. What you want is: what can you do to actually enable other people to be extremely,
extremely good with data? And what kind of tooling, what kind of new technology you need to
build to make people who consume data even better with data? Because that's how you get leverage with the rest of your team. And that's how, as a data engineer, you can really grow quickly. Think about use cases; think about enabling other people.
Incredible advice.
Well, Michel, I'm sad that we're out of time.
We're going to have to have you back on the show because there are so many questions that
I think both Costas and I didn't get to ask, but thank you for joining us.
This has been a great conversation and we'll have you back on the show again soon.
Yeah.
Thank you so much, Eric. Thank you so much, Costas.
I think one of the most interesting takeaways from this show: we've talked about the increasing
complexity of the data stack. No one has framed it in the context of demand from various parts
of the organization. And I almost feel a little bit stupid not kind of having thought of that as a way to frame it. But I thought that was a very elegant explanation of a main driver of the complexity because it's
so easy for us to think about the tool. Oh, there's a new tool for this. Oh, there's a new
tool for this, right? CDC, streaming, all this sort of stuff. And really demand from different
parts of the organization and their different needs are the main driver. And that's a great reminder for me. And I hope everyone who's listening.
Yeah. Well, I think, okay, obviously Michel is very good at articulating pretty complex concepts, which makes sense. I think it's one of the skills that people who are successful in building tech companies have, right? So I think it's also an indication of the success of Airbyte that he's able to do that. What I'll keep from our conversation with him is the concept of composability. I think that's a very interesting way of thinking about what the data stack is. That's one thing. The other thing that I found very, very interesting is that machine learning, at the end, is also activation, which, again, is something that makes a ton of sense; I just hadn't been thinking of it that way. And it's actually interesting because, if you think about the market, the companies that are building products around serving models and the companies that are doing reverse ETL are very, very different products and companies, although at the end the result is the same, in the sense that the need from the company is the same. So that's another very interesting thing to observe in the market, and to see how it's going to evolve as the market grows.
All right.
Well, thank you for joining us on this episode.
And we have many more great shows for you coming up before the end of the year as we round out the season of The Data Stack Show.
We'll catch you on the next one.
We hope you enjoyed this episode of The Data Stack Show.
Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me,
Eric Dodds, at eric@datastackshow.com. That's E-R-I-C at datastackshow.com.
The show is brought to you by RudderStack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.