The Data Stack Show - 72: Building Data Ops Into the Data Lifecycle with Douwe Maan of Meltano
Episode Date: January 26, 2022
Highlights from this week’s conversation include:
Douwe’s career journey (3:04)
The missing piece in GitLab’s data tooling (7:35)
The open-source offering in the data space (12:38)
Singer’s connection with Meltano (22:31)
How Meltano manages connectors on a diverse codebase (35:21)
The data house side of Meltano (39:47)
Data house operating versus Airflow (44:06)
Meltano’s vision present today (47:02)
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by RudderStack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome back to the Data Stack Show.
Today, we're going to talk with Douwe, who is the CEO of Meltano.
And I almost caught myself saying CEO and founder, but Meltano has such an interesting story.
It was a project started inside of GitLab, which is a really large company that builds a DevOps platform.
And Douwe worked on the project inside of GitLab. And I'm so interested to hear from him about how
Meltano came to be inside of GitLab. We've talked with several guests on the
show who have been part of technologies that were spun out. So we talked with someone from Netflix. Recently, we talked with someone who
worked on building Hudi, you know, and several other technologies like that.
GitLab isn't quite as big as some of those companies. You know, they recently IPO'd. And
so to see this happen and kind of have it be so fresh, I'm really excited to hear the origin story
about Meltano. How about you,
Kostas? Having built tools in the ETL space, I'm sure you have a ton of questions.
Yeah, yeah. I really want to discuss with him the evolution of Meltano. Meltano
has gone through a transformation as a platform. I mean, many people probably remember
it as an ELT tool, like a competitor to Stitch Data and
Fivetran. Today, it's something different. It's more of a platform in this new category that
they call DataOps, which is very exciting for me because what it tries to do is
bring all these best practices from software engineering into data engineering. And yeah, I'd love to see what happened,
how the project changed, how it became a company with VC money right now.
And discuss open source projects like Singer, because Meltano is very active there.
So yeah, we will have plenty of things to chat about, for sure.
I have no doubt. Well, let's jump in and talk with Douwe.
Let's do it.
Douwe, welcome to the Data Stack Show.
We can't wait to talk to you about Meltano.
Thanks for having me, Eric.
I'm very excited to be here.
Okay.
You have such an interesting pathway that led you to, you know, being the CEO of Meltano.
Can you just tell us a little bit about your career trajectory,
how you got involved with Meltano, and then sort of the story of how you became its CEO,
because it was inside of another company before.
Yeah, that's right. So Meltano was founded inside
GitLab. So if we go a little bit further back, I can kind of describe how I ended up there.
I personally got into programming and computers at a very early age.
At the age of nine, you know, my father always had computers around the house and not just
stuff running Windows, but we had like Linux.
So I always saw computers as something that would be tinkered with.
And that was an outlet for creativity rather than just something that does a thing and
you use it when you need it.
So from a very early age, I got into programming and through open source, I was able to teach myself a lot of things that, you know, in another
time might've required going to college to the extent that by the end of high school, I had
built a bunch of web applications and I had founded a company. The company I had at the
time was called Stingo, and it built products for bed and breakfast owners to manage their
reservations and their calendar and communication. And yeah, right at the end of high school, I was initially working for a
company that builds iOS and Mac apps as lead engineer. And then through that company, with
one of my bosses at the time, we ended up co-founding a company that built software
for bed and breakfasts. So the thing is that when I was, you know, in high school, early college,
there were not a lot of people around me who were kind of building products at this level already.
So I really looked for like-minded individuals in the Netherlands and European kind of tech and programming space.
So I ended up at a European Ruby
conference in Athens, where I was by myself and I was, you know, over lunch, I walked up to a table
and I introduced myself to someone speaking there because I wanted to, you know, have a place to put
my sandwich down. And I told him about what I was doing and that I was from the Netherlands.
And he mentioned to me that his boss, and he pointed to the corner of the room,
was from the Netherlands as well. So I walked up to his boss and I explained to him what I was doing and about this bed and breakfast company we had. And it
turned out I was talking to Sid Sijbrandij, the CEO and founder of GitLab, which at the time
was this tiny little Dutch company that had been built around a Ukrainian open source project
called GitLab. You know, one of these version control, code review, kind of GitHub-like tools.
And it turned out that Sid's
parents owned a bed and breakfast in the north of the Netherlands. So his parents became customers
of the product that I had basically, from an engineering perspective, single-handedly built,
although I won't take full credit for the company side of things. And coincidentally, Sid and I kept
running into each other at different conferences around Europe in the coming months, up to the
point where he asked me to join GitLab just around the time that it was going through its Y Combinator program and
was raising its first funding. And then the timeline kind of worked out because the company
I was running at the time, I was 18, but my co-founders were 35 and 56 or something. So we
were at very different risk tolerance levels in our lives. So we decided to kind of wind that down
and I jumped on the chance to join GitLab. And for the first year or so, I was a software engineer and then I became
responsible also for building out the engineering team, hiring more engineers from the open source
community, which is always a really great position to be in where you can bring people in that have
already kind of proven themselves and their enthusiasm for the product and their ability to,
yeah, to come up with solutions that will help them and others. And then over a number of years, I got into engineering
management up to the point where in 2019, GitLab had grown massively from 10 people to 1400.
And I was starting to feel that itch and want to go back to earlier startup days where you have a
smaller team and there's so much to do every day. You can really feel the impact of the decision
you're making in a very short term. And in general, that way of being at the forefront of solving some new
problem and having super happy users. So as I'm sure everyone in the room today is familiar with,
it's really great. So I joined Meltano in 2019, but Meltano had been around since 2018.
Meltano was originally founded inside GitLab because the GitLab data team and GitLab as a whole realized that the state of data tooling was very different from the types of developer tools we had gotten used to that embrace best practices such as version control and code review and allow entire teams to collaborate on their product in a way that enables really quick iteration and
makes it easy to experiment and make sure that people can just make changes without being worried
that they'll break stuff in production. And as an engineer looking at the state of data tooling,
myself, but also other engineers in GitLab, we were kind of surprised to see that a lot of these
best practices that we saw as pretty transferable and a lot of the problems that these teams have
as parallels were not being addressed yet by the tooling of the day.
So just like-
Sorry, go ahead.
Well, just to dig in a little bit,
that's super interesting.
And so just to say that another way,
you were looking at data tooling.
So let's just say, you know, whatever,
traditional ETL or streaming or whatever.
Were those more, the challenges were that
they were primarily sort of UI based and like
tucked a lot of the, a lot of the mechanics under the hood.
And so you don't have things like version control or other sort of, there's not really
like a development life cycle with data tooling as there is with normal software.
Was that sort of the key piece that was missing?
Yeah.
Yeah.
We can talk about that a little bit more.
That's great.
So GitLab was relatively late to start setting up its data team. So the initial beginnings of that was really just
GitLab engineers looking around and seeing, okay, you know, we got to build a data stack,
we got to move data from A to B, and we want to analyze it. And they came into it with certain
expectations, like, oh, yeah, you know, we're developers, this is all kind of like building
application or building these pipelines. And then what they found is exactly what you're describing.
Some of the things that they had started taking for granted, even though even in the software
development world, DevOps was not really a thing 10 years ago.
I grew up FTPing into a web server and making live changes to PHP files in production.
And that very much feels like the way the data space is still today, or at least a couple
years ago.
So the big thing is definitely a lot of these tools being UI based, being kind of proprietary
SaaS tools that run in a browser somewhere, and don't give you a lot of the flexibility and
customizability and ownership and say over a really core component of your stack that developers
expected in combination with these tools not being open source, which also ties into
being sort of limited by what they do today and not having that opportunity to improve them or to
make them fit your workloads better. But the fact that they're UI-based and that they come from a
world where, you know, companies have these big end-to-end data tools they log into and they make
all the changes in the user interface didn't jibe with these expectations that pipelines are code.
Everything can be code, version controlled. Everyone in the team, no matter their discipline or their
comfort around EL, for example, is able to go in, see the configurations, and propose
changes, and trace how data flows through the system by having a full overview of everything. And exactly
like you're saying: version control, code review, continuous integration and deployment, having
automatic tests run so that things don't accidentally break, having isolated environments so that you can make
changes locally with complete freedom without ever worrying about accidentally breaking the
dashboard the CFO is looking at. These were things that we were expecting to find and did not. So we
saw it as an opportunity not just to build an internal tool for GitLab to use, but we saw that
there was an opportunity in the market here to build data tooling that really embraces at a really deep level, the software development
best practices of DevOps and open source. And from day one, GitLab realized that by building a tool
that would help GitLab in this way, we would also be able to help people externally. So from
day one, the hope was that this would one day develop into its own business unit, its own
business per se, by building something valuable for us that would transfer to others. And we saw an
opportunity to make DataOps a reality, similarly to how GitLab had been pivotal in making DevOps
and, you know, DevSecOps a reality. So in 2018, when the data team was really small, GitLab set up
this team to start building this tool called Meltano, Meltano
being an abbreviation for model, extract,
load, transform, analyze,
notebook, and orchestrate.
I didn't know that. That's great. Yeah, no, it's awesome.
It's some of the stages of the data lifecycle
that we identified, and I don't know who
it was that put it together in this particular order,
but I think Meltano has a really great sound
and kind of mouthfeel to it.
And it's cool that it kind of relates back to all of those aspects of the data lifecycle.
But we also saw that GitLab's data needs were growing at a pace that the internal team
building Meltano was just not able to keep up with.
So GitLab did end up using some of the more traditional tools in the space, you know,
Fivetran and Stitch for EL, a bunch of different tools we tried out for the BI side of things.
But we always believed that the future of data tools, not just for GitLab internally, but also for the whole world,
would look a lot more like software development tools and data people becoming more and more
comfortable, not with programming per se, but at least with concepts of version control and
command line interfaces, managing your configuration in YAML files. And the
Meltano team never gave up on that, on that, yeah, that goal or that vision for the future.
I love it.
I, you know, it's interesting if you think about
some of the more UI-based tools,
a lot of those are driven by analytics use cases
from other parts of the organization.
And so it makes sense, you know,
sort of the way they were built,
you know, sort of with the SaaS model
and tucking the mechanics under the hood.
And so now we've had a lot of people on the show
where they're trying to bring software development principles
into the data space because they realize the need there.
But I just love thinking about the team at GitLab
who's been building DevOps stuff coming into the data space
and saying, whoa, like what's going on?
Like, you know, where is,
where's all the componentry? Like, so great.
For me, it was really interesting, and we're
jumping around in the timeline a little bit, but I'll talk more about how, you know, I came to join
Meltano. But when I joined Meltano, I was very new to the data space and I knew that, you know,
clearly there was a need there for something GitLab was building, but I was just really surprised to find also the breadth and depth of the open
source offering in the data space. I was positively surprised on some fronts because there exist
really great, or, you know, set up to be really great with a few more years of iteration,
BI tools, for example, like Metabase and Superset and Redash, and there's a bunch. And dbt is phenomenal as a
transformation tool that also kind of introduces a lot of analysts to some of these software
development best practices. But at some point, I was surprised to see that, especially
on the data integration side, of course, there exist tools like RudderStack that are kind of
focusing a little bit more on what we now call reverse ETL. But everyone still seems to be using a Fivetran or a Stitch,
and there exists a library of connectors in the Singer standard
that have been built around Stitch,
but a full stack that can replace a Fivetran, for example,
that you can just run open source.
I was surprised to find in 2019
that that hadn't been a completely solved problem already.
So we can go back to 2019 or 2018 when Meltano was
founded and kind of cover a little bit of the time and the changes that have gone on in Meltano
during that time. When Meltano was founded in 2018, we had this hope of building an end-to-end
platform that could do everything from data integration to helping you build the dashboards.
"End-to-end, from data to dashboard" is what we called it at the time. And we were on the one hand looking at great open source technologies
that we could leverage. And we were also willing to build our own new stuff that would really
work well with this software development way of thinking. From day one, this was going to be open
source. We were going to build it with the community and we really wanted them involved,
not just from a feedback perspective, but also actively helping us make this a reality.
But we came to realize over the course of 2018 and 2019, and I joined at the end of 2019,
that this end-to-end vision was too heavy in a way for people to adopt and start using and start contributing to, because we kind of assumed that you would replace your entire data stack with this
Meltano thing, which meant that we had a lot of ground to cover until we could actually plausibly replace
whatever best-in-class tools companies had picked so far.
So by the end of 2019, when I joined, we were working on making the end-to-end thing work
where you could bring plugins into Meltano for a particular data source like a Stripe
or Shopify or what have you.
We were kind of focusing on the business to business
or the B2C rather e-commerce field
just to have a use case in mind to focus on.
And we had built something
where you could bring plugins in for these sources
and you could indeed with one kind of one click
go from entering your credentials for Stripe or Shopify
or one of a number of tools we supported
and then having a dashboard show up at the end. But we were getting some interest from really early startup founders
who didn't have the resources to build a data team and set up their own stack. But we were not
actually getting the interest from the data engineering or data analytics community that
we were looking for. So in early 2020, from GitLab's perspective, the decision was made that
the numbers that we were seeing in terms of
traction and usage did not warrant the continued kind of full-time staff of six people on the team
at the time, which was a general manager, Danielle Morrill; myself as engineering lead; and then
four engineers. One of them, we found out earlier, is actually a friend of Kostas's, Yannis Roussos.
He's really awesome. But we realized that six people on a product that was kind of flat in
terms of growth was just not going to work. So the decision was made to reduce the headcount down
to one, to essentially extend the runway sixfold, and I was left by myself on the product, essentially
to figure out how I could turn Meltano around. So over those first few months, that was of course
super daunting because I was essentially the newest to data out of the entire team. My background is in software engineering. And I realized that I was kind of blind to the needs of data professionals
themselves. And I was very aware of the whole "if all you have is a hammer, everything looks like a nail" thing.
Am I just seeing things that aren't there? Is there really a big problem in the data world that calls for
bringing more developer-style tooling in and making open source data stacks
more of a compelling alternative? So I started talking to a lot of the data people that had
become Meltano fans and followers over the years, not users, not contributors in many cases,
but at least people who were willing to talk to us about what they liked and what resonated
originally. And I found out that sort of accidentally in Meltano, by identifying these great open
source technologies for different stages of the lifecycle, we had found Singer as the
standard for open source data connectors, which was built by Stitch, as we talked about a
second ago, and which has this ecosystem of, at this point, more than 300 connectors
for different sources and destinations.
And the question I was getting from these users was that like, well, you know, you're
building your own open source BI, but there's already a bunch of solutions for that.
You're embracing dbt for transformation.
That is great.
But you know, dbt is great standalone, but this Singer thing could really benefit from
better tooling around running these pipelines, deploying them, configuring them, building
new connectors for data sources.
So we realized, I realized, that we could get there not necessarily by changing the
product, but by changing the positioning to focus exclusively on open source ELT and say: look, this is
the best way to run Singer- and dbt-powered pipelines on your own infrastructure, on your
own machine. And you get all of these DevOps and DataOps advantages for free because your pipelines
are managed in a YAML file, and you get testing and all of this stuff. Over the course of 2020,
just through the simple act of changing
the way the website talked about what Meltano was,
we suddenly started picking up tons and tons of usage
as an ELT tool.
Even though from our perspective,
Meltano had always been an end-to-end platform
that picks best-in-class technologies
to build integrations with
that can run on top of the platform.
So by the end of 2020,
we had really kind of created the change
in the Singer ecosystem
that we and the community agreed was needed.
There was always this weird situation
where Stitch itself is a paid proprietary
SaaS data integration platform,
but the connectors that run on it in many cases
are open source and available for free
and you can just download them.
But those connectors by themselves
don't give you all of the EL functionality
to actually want to run the stuff in production.
And that is where we stepped in, to the point where in early 2021, earlier this year, I
got the permission from GitLab to start bringing in some more people.
And we started talking about setting Meltano up for best success in the market and really
becoming the tool that makes data ops a reality for the data lifecycle and data teams as a whole.
And GitLab was a 1,400-person company,
where literally 1,399 people were working on this big thing called GitLab and marketing for GitLab and sales for GitLab and everything GitLab,
and I was by myself working on this tiny little other thing. And we realized that
some of the stuff you need as a startup to be able to move fast and make compelling offers to great
candidates, GitLab was just not set up to do anymore because the realities and the needs were
so different. So we realized that in order for Meltano not to be slowed down by
the inevitable increase in bureaucracy that had kind of come up in GitLab, our best path forward was to spin out.
So over the course of 2020, as we were gaining traction,
I had already had literally dozens of VC firms that had reached out
to talk about this eventually, like what's Meltano going to be?
Is it always going to be internal?
Is it going to be its own thing one day?
So early 2021, earlier this year,
we started concretely talking to some of these potential VCs
and that led to us raising a seed funding round from GV, formerly known as Google Ventures.
And that led to my transition from literally in January, I was a general manager of a product
by myself.
In February or March, I hired two people while we were still in GitLab, so we were three.
And then three months later, I was founder and CEO of a startup that really quickly built a team to about eight, nine, ten people.
And six months earlier, I had just been by myself.
So that was amazing.
But as you can imagine, also a whole new challenge and opportunity for myself to be pushed to my limits and have to overcome them, which, of course, is extremely rewarding.
Yeah, that's amazing. I think we should spend some time later for you to share with us a little bit of what this
transition felt like.
Because to be honest, like you have like a quite amazing, let's say, journey so far from
like, as you said, from being a teenager, building apps, going like very early on GitLab,
being a manager for engineers and now a CEO.
I think there's a lot of like wisdom like to share there,
like even for just like the emotion side of things,
right?
Like how the emotions change.
But let's do that a little bit later
because I want to ask you about Singer.
Singer is a very interesting,
how to say that,
like case of open source projects, especially like in the data space, because I had like the opportunity to, let's say, experience the war between Fivetran and Stitch Data as it was happening, because I was also competing with them.
And it was very interesting how these companies were positioned and how Singer came into the game, like to support the positioning that Stitch Data had.
But Stitch Data left the game a little bit early.
They launched this thing, it got traction, then they got acquired by Talend.
And then we were left with Singer out there, where people keep using it.
And at the moment, today, after all these years, we have Meltano, which is building tooling
around it. We have Airbyte, which is pretty much based on the Singer protocol. And I'm pretty sure
we will see more stuff happening around it. So I'd like to ask you, first of all, what was
Singer like when you first started working with it and what was missing from it? What was it
that Stitch Data didn't do about Singer?
Yeah, great question. So when I came, you know,
when I really started digging into the data space and Meltano and the tools we had adopted in 2019,
Singer had already been the standard for data connectors that we had adopted because the
library at the time was, I think, somewhere in the 100 to 200 range of connectors that were supported. And there was
a community of a few thousand people around it. And there seemed to be, at least on the more
popular connectors in the ecosystem, frequent enough updates that they would be production
ready. But from talking to the people, what we realized is that connectors for sources and
destinations, just these tiny little executables
that you can run on your terminal
and you can pipe them together
to have data flow from A to B,
are not enough to actually replace
an entire EL solution.
And that's, of course, also why Stitch itself,
the hosted platform for running these Singer connectors,
is paid because a lot of the value
is not just in the connectors themselves,
but in the tooling that manages incremental replication,
that manages backfills, that manages all kinds of aspects
about the production level reliability of these pipelines
that goes beyond just running the code.
And Meltano had already built that.
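As a rough sketch of the mechanics described above: a Singer tap writes JSON messages (SCHEMA, RECORD, STATE) to standard output and a target reads them from standard input, which is why the two can be piped together. The Python below imitates that flow in memory rather than through a real pipe. The stream name `users`, the sample rows, and the `tap` and `target` functions are all hypothetical, and the bookmark logic is just one simple way to picture incremental replication, not how Stitch or Meltano actually implement it.

```python
import json
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical in-memory "source", sorted by updated_at.
# A real tap would call an API or query a database instead.
SOURCE_ROWS = [
    {"id": 1, "updated_at": "2022-01-01"},
    {"id": 2, "updated_at": "2022-01-02"},
    {"id": 3, "updated_at": "2022-01-03"},
]


def tap(state: Dict[str, str]) -> Iterator[str]:
    """Yield Singer-style JSON lines for rows newer than the saved bookmark."""
    bookmark = state.get("users", "")
    # SCHEMA message: describes the stream before any records are sent.
    yield json.dumps({
        "type": "SCHEMA",
        "stream": "users",
        "schema": {"properties": {"id": {"type": "integer"},
                                  "updated_at": {"type": "string"}}},
        "key_properties": ["id"],
    })
    for row in SOURCE_ROWS:
        if row["updated_at"] > bookmark:
            # RECORD message: one row of data for the stream.
            yield json.dumps({"type": "RECORD", "stream": "users", "record": row})
            bookmark = row["updated_at"]
    # STATE message: the bookmark the next run should resume from.
    yield json.dumps({"type": "STATE", "value": {"users": bookmark}})


def target(lines: Iterable[str]) -> Tuple[List[dict], Dict[str, str]]:
    """Consume messages; return the loaded records and the last state seen."""
    loaded: List[dict] = []
    state: Dict[str, str] = {}
    for line in lines:
        message = json.loads(line)
        if message["type"] == "RECORD":
            loaded.append(message["record"])
        elif message["type"] == "STATE":
            state = message["value"]
    return loaded, state


# First run: no saved state, so every row is replicated.
first_load, saved_state = target(tap({}))
# Second run: resumes from the saved bookmark.
second_load, _ = target(tap(saved_state))
```

On the second run the tap starts from the saved bookmark and emits no records. Persisting that state between runs is the kind of bookkeeping a runner has to take on beyond just executing the connector code.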
The other thing that we saw is that people found it too difficult
to build new connectors and to improve existing ones.
There existed this Singer Python library that had a number of helper functions, and most of the connectors were built around this library. But there was a lot of decision-making on the side
of the engineer as for how exactly to use these, how to deal with incremental replication state,
how to manage, how to deal with selection of specific streams and columns, which are roughly
analogous to like tables and database table columns. So we realized there was also an opportunity for
better tooling around building these connectors. And then finally, the big problem was discoverability.
Singer.io, the official website for Singer, has a list of about 99 connectors, but in most cases,
those link to the connectors in the singer-io namespace on
GitHub, where a lot of these repos are housed. And as we've been talking about, Singer,
unfortunately, I think because of the Talend acquisition, sort of lost the motivation to
really actively maintain these projects. So a lot of these repositories ended up with,
I mean, dozens of unanswered open issues and pull requests and bugs that had been
known for ages, but just had not been fixed. Even if a fix had been provided by the community,
the plugin you would have downloaded would still have had the bug. So there's two issues there in
discoverability. One of them being that in many cases, these Singer.io repositories actually had
forks that were more actively maintained. And those are really the ones you should be using if you want to have the highest quality
and everything.
And the other part was that Singer.io only listed these connectors that Singer at one
point had adopted into their own GitHub namespace.
There existed hundreds of connectors in other companies', consulting firms', and other data products'
own GitHub repositories
that were also available for free and,
in many cases, better maintained,
but were not discoverable at all
unless you knew how to do the special search on GitHub.
So we identified these three issues,
building these pipelines and running them in production,
building connectors, and then discovering connectors.
So we just set out essentially to address them one by one
to lift up the Singer ecosystem and empower it,
not to necessarily own it and make it our own,
but to make it, give it all the tools it needs
to be able to stand on its own and keep growing,
even without our kind of continued heavy-handed involvement.
So Meltano itself became this runner
that makes it really easy to run, configure, deploy.
We built the Meltano SDK for Singer taps
and targets that makes it easier than ever to build new connectors. The code footprint of an
existing connector that is ported to the SDK is reduced by about 90%. And people have told us
that getting a new connector up and running with all of the Singer bells and whistles like
replication, incremental replication, and stream and column selection takes as little as two hours
because of some of these abstractions that we have built around REST APIs, GraphQL APIs,
and other custom methods. And then finally, we built Meltano Hub for Singer taps and targets to catalog
all of the different taps and targets in the ecosystem, and it turns out there are more
than 300 sources and destinations that have Singer connectors for them. And about half of those have been updated in the last year.
And the other ones are not necessarily outdated.
Those might just be APIs that don't require quite as frequent updates.
So the Singer ecosystem is in a really great place now compared to how we found it as Meltano
about a year and a half ago.
And we have recently also set up the Singer Working Group, which has us in it, along with a
number of big players in the Singer ecosystem, including the Stitch team at Talend, who were,
of course, the original creators of the spec, other tools that use Singer to power
their connections, like Hotglue and Y42, and there's a few others, as well as some of these
consulting firms that had built a lot of these connectors over the years for their clients
that needed sources that were not supported by some of the tools like Fivetran. So Singer is now
at a place where it can, in combination with Meltano and these other tools we've built, rival
Fivetran and a lot of these other tools, especially on the size of the connector library and the
advantage of it being open source, which means that you are never limited by anyone else
if you want to improve or extend or customize these connectors
or if you want to build a new one for a new source.
And interestingly, having put Singer in such a place
has actually given Meltano the opportunity
to look at what we're doing and what our mission is
and what our goal is and to take a step back
from this really narrow focus on EL, which we kind of took as a strategic decision in early 2020, as I was describing,
and to focus again on bringing DataOps to the entire data lifecycle by building Meltano
into a DataOps operating system that can form the foundation of every team's ideal data
stack by allowing best-in-class open source components
for various stages of the data lifecycle
to be brought on top of the OS,
with the OS taking care of the consistent installation,
configuration, deployment,
and the integration between the various tools.
And I can talk a ton more about that
because it's kind of where we're going,
but it is good to stand still a little bit on Singer
and what it was and what it is today
and what we've been doing.
Yeah, yeah. I have a few questions about the future, but I'm sorry, I'm a little bit curious about the evolution of Singer, right? Because from what I hear from you, we are talking about, okay, we had Singer, and the ergonomics of the SDKs and all that stuff were not the best. You created, on top of that, the Meltano SDK, or like the extension. How does this...
To be clear, we have not extended Singer in any way so far. We are working with the Singer Working Group on Singer extensions, but we want to make sure that those are supported and approved by all of the different players in the Singer ecosystem, because we think a big part of its power is the fact that it is no longer purely connected to one particular product or company,
like it used to be when it was just the connector framework for Stitch.
And similarly, we're seeing other open source data integration vendors, like somebody mentioned before, coming up and building their own connector standards on top of Singer with private extensions.
But we believe that Singer is kind of special in that it is agnostic and really community led, and everyone in the ecosystem, different consulting firms and different tools,
can adopt it because it is the de facto open source standard without any particular company
that owns it today, which is a strength.
Okay, perfect, perfect. The reason that I'm asking is
because I'm quite aware of how the Airbyte version of Singer works, which is built on top of Singer. It's not Singer exactly, right? They have made some very smart decisions in terms of how the interfaces work, with the standard input and output between Docker images and stuff like that, which gives a lot of, let's say, interoperability between different languages and frameworks and stuff like that.
But it's something different.
It's not exactly Singer.
I mean, there are elements of Singer,
but I cannot imagine, I'd say,
backward compatibility in this thing.
It's something different at the end, right?
So that's why I was asking if it's something similar
at the end, what Meltano is doing,
or if you are focusing on maintaining and reviving Singer at the end.
Yeah, the interesting thing
is that Singer as a standard is really great. Stitch came up with it, and it served
their needs for a long time, but it also hasn't evolved a lot over the years since they
have sort of lost interest. So there are definitely a lot of areas
in which it can be improved.
But at the same time,
a lot of the issues with current Singer
or existing Singer connectors
were not actually because of limitations
in what Singer can do,
but just in the fact that a lot of these connectors
were not even making the most
of what Singer can already do today.
So we wanted to first address that
by making it so easy with the new SDK
to start using everything that Singer can already do today, to kind of reach the full potential
that was already there before starting to look ahead and see, okay, how can we make Singer better?
So the first important thing for us was to increase the consistency in behavior across
different connectors in the ecosystem, especially for newly written ones. And the SDK has delivered on that
and makes it so that you can opt into
some of these Singer capabilities
without having to completely figure out yourself
how to implement them.
And it automatically leads to more consistent behavior
across the board.
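For readers who have not looked at what a Singer connector actually exchanges, the sketch below may help. It assumes the three message types defined by the Singer spec (SCHEMA, RECORD, STATE), each written as one JSON document per line. The `users` stream, the `tap` and `target` function names, and the shape of the bookmark inside the STATE value are illustrative assumptions, and this is not how the Meltano SDK is implemented. Real taps write these lines to standard output and real targets read them from standard input; here they are passed in memory to keep the example self-contained.

```python
import json


def tap(rows, bookmark=None):
    """Yield Singer-style messages as JSON lines for a hypothetical 'users' stream."""
    schema = {
        "type": "SCHEMA",
        "stream": "users",
        "schema": {
            "properties": {
                "id": {"type": "integer"},
                "name": {"type": "string"},
            }
        },
        "key_properties": ["id"],
    }
    yield json.dumps(schema)
    for row in rows:
        # Incremental replication: skip rows at or below the saved bookmark.
        if bookmark is not None and row["id"] <= bookmark:
            continue
        yield json.dumps({"type": "RECORD", "stream": "users", "record": row})
        bookmark = row["id"]
    # The STATE value is free-form; this bookmark layout is an assumption.
    yield json.dumps(
        {"type": "STATE", "value": {"bookmarks": {"users": {"id": bookmark}}}}
    )


def target(lines):
    """Consume JSON lines; return the records seen and the last STATE value."""
    records, state = [], None
    for line in lines:
        message = json.loads(line)
        if message["type"] == "RECORD":
            records.append(message["record"])
        elif message["type"] == "STATE":
            state = message["value"]
    return records, state
```

Calling `target(tap(rows, bookmark=1))` on rows with ids 1, 2 and 3 would load only the last two rows and hand back a state whose bookmark is 3, which the next run can pass in to resume where this one stopped.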
But now that people can actually make the most of Singer
through Meltano and the SDK,
we are starting to work on improvements to the spec.
Airbyte was in this, you know, in their case, great position where they could just say,
okay, we don't need backward compatibility.
We're going to just call it, you know, the Airbyte spec.
We're going to take a lot of inspiration from Singer, and then we're just going to fix
everything we think is broken and improve it.
And they could do so unilaterally.
But we think that there is so much potential in the Singer ecosystem and the existing community
of literally hundreds or thousands of consulting firms and different data engineering teams and data product developers that we didn't want to just let it go,
because then you get the situation of that famous XKCD comic that says there are 12 standards,
they all suck, I'm going to make a new standard. And then the next frame says,
now there are 13 standards. And then it kind of becomes this loop. So we decided the only way to really make Singer better
is to bring, well, first kind of increase people's confidence
and trust and belief that this is going somewhere.
And through these things we've brought
in the Singer ecosystem,
we have definitely kind of revived that enthusiasm.
And then the next thing was to get all of the big players
invested in Singer kind of together in a room
to start working on those next iterations
of Singer together. And the first priorities for the Singer working group I've been talking about
are to address some of the same concerns that Airbyte has been able to already address because
they could do so. But we are starting to do this through a more standardized process where we get
everyone involved around the table and also bought into supporting this in their connectors going
forward and implementing in their tools. So that has to do with things that improve performance and throughput. It has to do
with, like, the automatic discoverability of a connector's configuration features, for example,
which is now something that kind of lives separately in the repo from the actual connector.
And there's a whole list of other things that you can find if you Google the Singer Working Group and
you find this repo where we're working with these players. And we were actually really grateful to see that the Stitch team at Talend was just as excited as
us about this opportunity to kind of keep growing and improving this for the benefit of the entire
data community. And that ties back to the importance of Singer being seen as something
kind of separate and agnostic and something that will always survive as long as enough people use
it rather than something whose fate is tied
to one particular product.
In part, because from Meltano's perspective,
we don't want to take over ownership of Singer forever
because we are building a data ops operating system.
We're not just building an EL tool.
So it's in our interest for there to be
independently thriving open source technologies
for every step of the data lifecycle
that we can make better than the sum of their parts. But ultimately, it has to be this ecosystem and community around Singer that keeps it alive.
And we are happy to have a big role in that and put development resources and everything towards
it. But we cannot do it ourselves.
I have a question that I think is also going to lead us into the future of Meltano and DataOps. And I want to ask you about how you, as Meltano, can manage the quality of these connectors.
And I think this is one of the biggest, let's say, arguments that a closed system like Fivetran
has: that, yeah, sure, you can go download something from GitHub.
And of course, many of these versions of the
connectors are just crap, right? They are not updated, they are simply not made well,
all these things. So how do you deal with that, with such a diverse, let's say,
code base?
Yeah, yeah, it's a really interesting question, and it kind of goes to the trade-off
between the decentralized maintenance
of an open source ecosystem, where you get a ton of advantages, like there's not a single bottleneck
who can slow things down and the number of connectors is essentially endless if
you decentralize the maintenance to different kind of invested parties. But that also means that
we cannot fix a bug ourselves unilaterally in some particular connector if we want to,
because we do not necessarily have ownership over that repository.
The way we're thinking about it is that in any open source ecosystem,
if there are enough users who are okay with this deal of, okay, I get to use it,
but I maybe occasionally have to fix stuff, then the top used connectors will automatically get enough usage and eyeballs that they are in a good state. And for us, it's more important to have a
decentralized ecosystem that can scale indefinitely than to have a smaller or controlled ecosystem
that we have tighter control over. But that does mean that if you are a company that just needs
connectors that will always work and you never have to worry about maybe fixing a bug yourself, Meltano or rather Singer might not be the best choice for you today.
But the more companies become involved that do this work, the higher the quality that even companies that aren't willing to put in their own contributions can expect. And the quality of connectors in the ecosystem is already higher than a lot of people might have thought a year or so ago, because the best variants of a lot of these connectors
are in forked repositories rather than the initial singer-io one that you will find,
and a lot of them are seeing maintenance. So in part to address this maintenance question, we have
also set up Meltano Labs, which is a way of pooling decentralized maintenance so that people don't
have to take on the maintenance burden indefinitely, but they can say, okay, for a period of time,
we are heavily using this one, or we are improving it for our clients, so we are okay with kind of
taking on the maintenance hat for the next three months or so, but then it stays within the Meltano
Labs pool where we have some control over it, but we are not a bottleneck per se. The flip side of
this though, is that in the open source ecosystem already,
web applications you use every day,
including RudderStack and Meltano,
but also massive ones like Reddit
and Facebook and whatever,
are all built on open source technology
that in many cases is also just managed
by individual contributors.
And you have the same motivation of,
or the same trade-off of,
can we expect that quality to always be there?
But we all know that there are high quality
maintained API client libraries
for all of the big APIs,
for all of the big programming languages.
You can find Shopify API clients
in every programming language.
In many cases, these are built even
by the vendor themselves,
or they're maintained by an active community of maintainers.
And if we trust these API client libraries enough
to use them in production software, then in the limit, there is no reason to not trust an ecosystem of connectors
at a similar level. But from the perspective now as a DataOps OS, we don't really care which
particular technology you bring into Meltano, whether that is Singer or dbt, or even, you know,
Airbyte or RudderStack. We have plans to support
all of these in the future, because we think that it's up to us to provide teams choice to put
together their ideal stack where they can make the trade offs they need. And we will build the data
ops OS that kind of ties it all together and allows them to treat their entire data stack as
a product in the way of the software product development lifecycle, rather than just a set
of disparate kind of tooling
and purchasing decisions.
So Singer is not going to be for everyone,
maybe not ever, but that's okay
because there are lots and lots of organizations
that do like the trade-off of,
I can fix it and improve it and customize it
without needing to ask someone for permission.
And I'm okay spending a few engineering hours per month
to do so, just as is the case today
with other open source projects.
Yeah.
Well, it's a huge conversation.
I mean, we could probably do multiple episodes
just chatting about how you can structure
this kind of open source project.
And for me, it's very, very interesting.
And I think there's a lot of value in there,
but let's keep that for another episode.
I'll be more than happy to dedicate one just
for this. And let's get into the DataOps side of Meltano. So you mentioned at some point that
Meltano started as an end-to-end platform, okay? And it has transitioned now into a DataOps or
transitioning into a DataOps platform. What's the difference? What's the difference between the two?
Yeah, good question.
So when you're looking at kind of the previous generation of data tools,
what you primarily saw is these big products that kind of do it all.
They do everything from the integration to the analytics.
And this is potentially a consequence of these tools maybe having started
with a less technical analytics audience with a BI tool
and then working backwards into the rest of the stack until they do it all. But they do it all from a
kind of a UI-based SaaS web browser perspective. And the tools you'll find today that call themselves
data ops platforms are also these types of tools that try to do everything really well while
bringing in some of these data ops qualities and software development best practices. But the data
space of today is uniquely
horizontally integrated in the sense that you have for every kind of step in the data lifecycle and
every layer in the stack, you have a number of competing solutions and new ones coming up every
day and being funded by VCs and going through accelerator programs like Y Combinator. So it's
not realistic anymore for any data team really to find one tool that does
it all that they will actually be happy with in the long run, because you're going to be missing
out on a lot of these new improvements. But with the data space having turned from one big
application with full visibility and control of every aspect of the data stack into this world
where you have tools with a really narrow focus that need to be kind of individually integrated
between them, in many cases, manually by data teams. What has gone missing is this sense of a unified
unit called the data stack that can be reasoned about as a whole, that can be version controlled
as a whole, that can be end-to-end tested, and that can be experimented with and played around
with without worrying that there's some SaaS thing running somewhere that doesn't have this concept of an isolated environment.
So the way we're seeing the world now is that there is a really big opportunity for a new
foundation, a new layer in the data stack that we are calling the DataOps operating
system that forms the foundation of every team's ideal data stack.
That's how we've described our vision.
What that means is that these best-in-class open source components, like a Singer or an Airbyte for EL, a dbt for transformation,
RudderStack or similar tools for reverse ETL, Superset, Metabase, et cetera, for BI and analytics.
And of course, also you have all of these data science tools like Jupyter that can be brought
in that are also part of the data stack. We want all of this stuff to live together and be defined in a single repository in a declarative way
so that a team can reason about their data stack again
as one unit and get these advantages I was describing.
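One way to picture a stack that is defined declaratively in a single repository is as a list of desired plugins that a tool compares against what is actually installed. The sketch below is only an illustration of that idea under stated assumptions: the `Plugin` class, the `plan` function, and the plugin names and version numbers are all hypothetical, and none of this is Meltano's real project format or API.

```python
from dataclasses import dataclass, field


@dataclass
class Plugin:
    """One declared component of the stack (hypothetical structure)."""
    name: str
    kind: str      # e.g. "extractor", "loader", "transformer"
    version: str
    config: dict = field(default_factory=dict)


def plan(declared, installed):
    """Return the actions needed to make `installed` match `declared`.

    declared:  dict mapping plugin name -> Plugin (the version-controlled definition)
    installed: dict mapping plugin name -> installed version string
    """
    actions = []
    for name, plugin in sorted(declared.items()):
        current = installed.get(name)
        if current is None:
            actions.append(("install", name, plugin.version))
        elif current != plugin.version:
            actions.append(("upgrade", name, plugin.version))
    # Anything installed but no longer declared gets removed.
    for name in sorted(set(installed) - set(declared)):
        actions.append(("remove", name, installed[name]))
    return actions


# Illustrative names and versions only.
declared = {
    p.name: p
    for p in [
        Plugin("tap-github", "extractor", "1.2.0"),
        Plugin("target-postgres", "loader", "0.9.0"),
        Plugin("dbt", "transformer", "1.0.0"),
    ]
}
installed = {"tap-github": "1.1.0", "superset": "1.3.0"}

for action in plan(declared, installed):
    print(action)
```

Because the desired state lives in one reviewable definition, a change to the stack becomes a diff that can be version controlled and tested like any other code change, which is the advantage being described here.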
So compared to DataOps platforms of the past,
the big difference in Meltano is that we are modular
from first principles and architecture
and that we want to earn a new place in the data stack
instead of trying to replace something existing.
And we call ourselves a DataOps OS
because what we care about a lot
is in kind of merging these worlds
of software development and data engineering,
or at least allowing them to cross-pollinate
and learn from each other more.
Because we think that a lot of work
that we currently call data engineering is really data stack development, and it's far closer to software
development, where you're also picking, you know, off-the-shelf components, custom components, or
some open source technology. You might be using some SaaS that you have to connect with over an API.
And we are trying to allow data teams to start treating their work more like software development
and get those same advantages.
And our path is sort of,
you know, prepared for us
a little bit by dbt already
making analysts more comfortable
with some of these concepts.
And we are trying to go all the way
and bring data ops,
not just to EL in the case
of what Meltano has been over the last year
or to T as dbt is doing,
but to the entire data stack.
And we think data stacks can be better than the sum of their parts if you bring in Meltano to help
manage it all and help the integration between the different components of the stack.
That's great. I have one last question because I start feeling like really bad that I'm
monopolizing the conversation here.
Oh, you're not.
I'm pretty sure I'm talking way more than you are, but yeah, your colleagues should
talk too.
Yeah, exactly.
And I'll wear my engineering hat and I'll ask a question here about DataOps.
So what's the difference between like the DataOps operating system and something like
Airflow?
Yeah, that's a great question. So one big difference is that
in your data stack, data movement is kind of the domain of Airflow and similar workflow
orchestrators like, you know, Dagster or Prefect. And they, within their workflow orchestrator,
reach out, of course, to different tools that handle parts of that workload. But there's more to the data stack than that. You have a BI tool at some point,
you might have tools that don't really fit within the Airflow way of working.
And if you're using Airflow, you still have to install it somewhere and deploy it somewhere and
manage the version control of your orchestrators. And similarly, if you're using a BI tool,
you still have to install it somewhere
and manage your dashboards and version control those.
So Meltano forms essentially the package manager
for your entire data stack
that all of these things can be brought into,
even things that are completely out of scope for Airflow,
which only cares about data movement, for example.
So Meltano allows you to, any tool your data team uses,
whether it's the analyst or the analytics engineer or the engineer, whether it's about the movement or the consumption at the end, they form part of a greater product where in some sense, the end users are your colleagues within the company.
The interface or the features are some of those consumption methods and dashboards.
And then the backend, so to speak, is more of where Airflow lives. But that front end and the whole product is what Meltano brings together by forming a package manager
for every tool in the data stack,
which from an engineering perspective,
you can also see as a Terraform for data stacks
because we allow people to really easily
bring in tools declaratively or with a CLI.
And then Meltano manages the configuration
and the deployment and all of that stuff.
So that an engineer that wants to put together
a data stack
doesn't have to pick six tools,
learn how to install them,
learn how to configure them,
and then be the only person in the team
who really knows how it all works.
We want to also sort of democratize that,
make it, give it a single source of truth
that the entire team feels comfortable collaborating in
and also trying out new tools,
swapping out new tools really easily
by giving them the confidence
that if Meltano has support for your tool, adding it, trying it locally or wherever is going to take
just a few minutes of work instead of this daunting task of figuring out how am I going to integrate
it. Maybe this one is Docker, maybe this one is Python, maybe this one is npm. We want to unify
all of that.
Yeah, that's great. Eric, all yours. I have to apologize, by the way, to both of you,
because I just realized that, based on the outline of the conversation that we created before we started the recording,
the stuff that I asked was completely different.
That's great.
It was awesome.
I learned a ton.
Like you said, it would be an organic conversation, right?
So we'll take it wherever it goes.
Yeah, I know we're close to time here,
but Douwe, a couple of quick
questions. So one is, how much of what you just talked about... I know there's sort of part vision,
this is where Meltano is going. How much of that exists today? I mean, how much of that can you
actually use today?
Well, that's the perfect question. So architecturally, even during the year or so that Meltano was talked about
and perceived as an ELT tool,
Meltano was always
this plugin-based architecture
that allows different
open source technologies
and tools to be brought in.
So from a software perspective,
we're essentially already there.
The only thing we're still lacking
is in the specific plugins we support.
So far, we have invested
really heavily on support
for Singer taps and targets for EL, dbt for transformation, and Airflow for orchestration.
And the biggest challenge for us now is to kind of keep building out in the breadth of types of
plugins we support. And of course, the level to which we support each individual plugin.
So in the very near roadmap, we will be investing a lot in the dbt integration that we already have
to make it as
good as it possibly can be. And at the same time, we are investing in bringing more parts of the
data stack and lifecycle into Meltano. So very soon, we're going to release support
for Great Expectations within Meltano. We are looking at Superset and Lightdash as some of
these BI analytics tools that you can bring into your Meltano project and manage and configure consistently
with everything else.
And similarly, we are looking at open source
reverse ETL solutions like RudderStack,
like Grouparoo, and a number of others.
And even on the EL side,
just to kind of show to the world also
that we are not just here to push Singer
or to push dbt,
we plan to support Fivetran through an API connection.
And even Airbyte is in scope for us,
even though, in how people previously thought about Meltano,
it would have looked like a direct competitor.
But from day one, we have been building an end-to-end platform
to make data ops a reality.
Originally, we thought we could do so by just building one platform that does it all.
We've come to realize that it has to be plugin-based.
And in that new world, we leave it completely up to data teams
what tools they want to use on top of Meltano.
We just want to make sure
we support all the current
kind of popular best-in-class tools,
make sure that data ops
is somewhat possible with them,
version control and all of this stuff.
And we don't really care
to be a kingmaker
for one particular technology.
So over the coming months,
especially Q1 of the coming year,
we will be kind of building out
this broader and deeper plugin support, as well as data
ops specific functionality, like isolated environments, end-to-end testing, and a lot
of these things that software developers have already been using.
And we have to just figure out how to make them work with data and data tools and how
to explain them in ways that will resonate with data professionals.
So this is all going to pan out over the next three months or so.
But we have a Slack community
of more than 2000 people right now
that are with us on this journey
and are giving us feedback every day,
are giving us contributions to make it on this path.
So I would like to suggest to the people joining us,
of course, keep an eye on the features
we'll release over the coming months.
But if you want to be part of this conversation
and you want to shape the data tooling of the future
and be part of this wave that's going to make data teams as effective and
productive as software development teams have become over the last 10 years through the
introduction of DevOps, then the Meltano Slack community is the place to be. And just a very
quick pitch as well. We are also hiring both in engineering and marketing. So if you go to
meltano.com slash jobs, you can look at ways to help us out.
We are all remote.
We're hiring across the world
and we pay really competitively everywhere.
So check us out.
Awesome.
Well, Douwe, this has been such a fun episode.
Really appreciate you sharing some of the backstories,
and an incredible story, in six months
going from being the lone project manager
or product manager for an internal product
to raising a round
and becoming CEO.
So congratulations, incredible journey.
And we're excited to see where you take it.
Thank you so much, Eric.
Yeah, I think there's tons that we could keep talking about, like Kostas already mentioned.
So I think we'll have to come back maybe in Q1 of next year when we have made some more
progress in the data ops vision.
Let's do it.
We can talk about how that's panning out.
And we can also spend some more time talking about the transition
from an engineering manager inside GitLab to a CEO.
That's definitely been an opportunity for myself
to run into my own kind of limitations
and then past assumptions that don't hold anymore.
We could easily fill an hour just on that topic alone.
Great. We'll definitely do it.
Thank you so much.
Thank you.
That was such a unique individual in that he has a depth of knowledge across such a
wide variety of subject matter.
And I think that's certainly been accelerated by him taking on the role of CEO at Meltano.
This is my takeaway from the show.
There's the old adage, I think, from the Netscape fundraising story,
I think it was, that you're successful in two ways, you bundle or you unbundle.
And I've been thinking about that a lot lately in the data tooling space, because there are
companies actively trying to bundle and actively trying to unbundle in general
across tooling, but then also within specific disciplines.
And thinking about Meltano as sort of the package manager for the entire data stack
is a really fascinating way to bundle.
And I think it opens up a lot of opportunity for them that a lot of other companies
aren't going to have because they don't have to necessarily make choices about specific tooling.
And so I know I'm going to be thinking about that all week because, you know, it's sort of a very
unique approach to bundling, or I guess bundling is, you know, an interesting way to describe what
they're doing. So how about you, Kostas?
Yeah, a hundred percent. I totally agree with you. It's very,
it's very interesting to see platforms like this, and at the same time we have a team behind it that, you know,
has like the best possible pedigree to succeed in this because they are coming
like from, from GitLab, right?
Where that's exactly what they were doing,
like building these kinds of tools, but for software engineering.
So I'm very excited to see how they are going to move forward.
Hopefully we will have him on another show like pretty soon.
So because things are like changing really fast, but I would also like to add that if
they succeed in what they're doing, I think we are also, they are also going to act as
a great accelerator also for the open source projects out there,
which is very interesting because we have open source projects with a varying degree of maturity,
let's say, especially when it comes to the EL part with all the connectors and all that stuff.
So putting in place something like Meltano and also all the governance that Meltano brings
with all the initiatives around open source, I think we are going to see these communities actually maturing
much much faster which is nice because me as a person who has experienced let's say the
the birth of Singer then it got into like some kind of winter situation where it was like existing but not existing,
maintained but not maintained. And today seeing like all these actors with Meltano being the
leader like to revive the project and govern the project like in a way that's going to be
valuable. It's super super interesting like it's very fascinating and I'm really
interested to see like what's going to happen in the next couple of months.
Me too.
And we'll definitely have to have Douwe back on the show because we barely scratched the
surface on several subjects.
So thanks for joining us again on the Data Stack Show.
And we have lots of great stuff coming up.
So make sure to subscribe and we'll catch you on the next one.
We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite
podcast app to get notified about new episodes every week. We'd also love your feedback. You
can email me, Eric Dodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com.
The show is brought to you by Rudderstack,
the CDP for developers.
Learn how to build a CDP on your data warehouse
at rudderstack.com.