The Data Stack Show - 72: Building Data Ops Into the Data Lifecycle with Douwe Maan of Meltano

Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. Welcome back to the Data Stack Show. Today, we're going to talk with Douwe, who is the CEO of Meltano. And I almost caught myself saying CEO and founder, but Meltano has such an interesting story.

Starting point is 00:00:39 It was a project started inside of GitLab, which is a really large company that builds a DevOps platform. And Douwe worked on the project inside of GitLab. And I'm so interested to hear from him about how Meltano came to be inside of GitLab. We've talked with several guests on the show who have been part of technologies that were spun out. So we talked with someone from Netflix. Recently, we talked with someone who worked on building Hudi, you know, and several other technologies like that. GitLab isn't quite as big as some of those companies. You know, they recently IPO'd. And so to see this happen and kind of have it be so fresh, I'm really excited to hear the origin story about Meltano. How about you,

Starting point is 00:01:25 Kostas? You, having built tools in the ETL space, I'm sure have a ton of questions. Yeah, yeah. I really want to discuss with him the evolution of Meltano. Meltano has gone through a transformation as a platform. I mean, many people probably remember it as an ELT tool, like a competitor to Stitch Data and Fivetran. Today, it's something different. It's more of a platform in this new category that they call DataOps, which is very exciting for me because what it tries to do is bring all these best practices from software engineering into data engineering. And yeah, I'd love to see what happened, how the project changed,

Starting point is 00:02:08 how it became a company with VC money right now. And discuss open source projects like Singer, because Meltano is very active there. So yeah, we will have plenty of things to chat about, for sure.

Starting point is 00:02:23 I have no doubt. Well, let's jump in and talk with Douwe. Let's do it. Douwe, welcome to the Data Stack Show. We can't wait to talk to you about Meltano. Thanks for having me, Eric. I'm very excited to be here. Okay. You have such an interesting pathway that led you to, you know, being the CEO of Meltano.

Starting point is 00:02:42 Can you just tell us a little bit about your career trajectory, how you got involved with Meltano, and then sort of the story of how you became its CEO, because it was inside of another company before. Yeah, that's right. So Meltano was founded inside GitLab. So if we go a little bit further back, I can kind of describe how I ended up there. I personally got into programming and computers at a very early age. At the age of nine, you know, my father always had computers around the house and not just stuff running Windows, but we had like Linux. So I always saw computers as something that would be tinkered with.

Starting point is 00:03:16 And that was an outlet for creativity rather than just something that does a thing and you use it when you need it. So from a very early age, I got into programming and through open source, I was able to teach myself a lot of things that, you know, in another time might've required going to college to the extent that by the end of high school, I had built a bunch of web applications and I had founded a company. And through the company I had at the time, which was called Stingo, which built products for bed and breakfast owners to manage their reservations and their calendar and communication. And yeah, I exactly at the end of high school, I was initially working for a company that builds iOS and Mac apps as lead engineer. And then through that company with

Starting point is 00:03:54 one of my bosses at the time, we ended up co-founding a company that built software for bed and breakfasts. So the thing is that when I was, you know, in high school, early college, there were not a lot of people around me who were kind of building products at this level already. So I really looked for like-minded individuals in the Netherlands and European kind of tech and programming space. So I ended up at a European Ruby conference in Athens, where I was by myself and, you know, over lunch, I walked up to a table and I introduced myself to someone speaking there because I wanted to, you know, have a place to put my sandwich down. And I told him about what I was doing and that I was from the Netherlands.

Starting point is 00:04:37 And he mentioned to me that his boss, and he pointed to the corner of the room, was from the Netherlands as well. So I walked up to his boss and I explained to him what I was doing and this bed and breakfast company we had. And it turned out I was talking to Sid Sijbrandij, the CEO and founder of GitLab, which at the time was this tiny little Dutch company that had been built around a Ukrainian open source software called GitLab. You know, one of these version control, code review kind of GitHub-like tools. And it turned out that Sid's parents owned a bed and breakfast in the north of the Netherlands. So his parents became customers of the product that I had basically, from an engineering perspective, single-handedly built,

Starting point is 00:05:15 although I won't take full credit for the company side of things. And coincidentally, Sid and I kept running into each other at different conferences around Europe in the coming months, up to the point where he asked me to join GitLab just around the time that it was going through its Y Combinator program and was raising its first funding. And then the timeline kind of worked out because the company I was running at the time, I was 18, but my co-founders were 35 and 56 or something. So we were at very different risk tolerance levels in our lives. So we decided to kind of wind that down and I jumped on the chance to join GitLab. And for the first year or so, I was a software engineer and then I became responsible also for building out the engineering team, hiring more engineers from the open source

Starting point is 00:05:54 community, which is always a really great position to be in where you can bring people in that have already kind of proven themselves and their enthusiasm for the product and their ability to, yeah, to come up with solutions that will help them and others. And then over a number of years, I got into engineering management up to the point where in 2019, GitLab had grown massively from 10 people to 1400. And I was starting to feel that itch and want to go back to earlier startup days where you have a smaller team and there's so much to do every day. You can really feel the impact of the decision you're making in a very short term. And in general, that way of being at the forefront of solving some new problem and having super happy users. So as I'm sure everyone in the room today is familiar with,

Starting point is 00:06:33 it's really great. So I joined Meltano in 2019, but Meltano had been around since 2018. Meltano was originally founded inside GitLab because the GitLab data team and GitLab as a whole realized that the state of data tooling was very different from the types of developer tools we had gotten used to that embrace best practices such as version control and code review and allows entire teams to collaborate on their product in a way that enables really quick iteration and makes it easy to experiment and make sure that people can just make changes without being worried that they'll break stuff in production. And as an engineer looking at the state of data tooling, myself, but also other engineers in GitLab, we were kind of surprised to see that a lot of these best practices that we saw as pretty transferable and a lot of the problems that these teams have as parallels were not being addressed yet by the tooling of the day. So just like-

Starting point is 00:07:28 Sorry, go ahead. Well, just to dig in a little bit, that's super interesting. And so just to say that another way, you were looking at data tooling. So let's just say, you know, whatever, traditional ETL or streaming or whatever. Were those more, the challenges were that

Starting point is 00:07:42 they were primarily sort of UI based and like tucked a lot of the, a lot of the mechanics under the hood. And so you don't have things like version control or other sort of, there's not really like a development life cycle with data tooling as there is with normal software. Was that sort of the key piece that was missing? Yeah. Yeah. We can talk about that a little bit more.

Starting point is 00:08:03 That's great. So GitLab was relatively late to start setting up its data team. So the initial beginnings of that was really just GitLab engineers looking around and seeing, okay, you know, we got to build a data stack, we got to move data from A to B, and we want to analyze it. And they came into it with certain expectations, like, oh, yeah, you know, we're developers, this is all kind of like building application or building these pipelines. And then what they found is exactly what you're describing. Some of the things that they had started taking for granted, even though even in the software development world, DevOps was not really a thing 10 years ago.

Starting point is 00:08:33 I grew up FTPing into a web server and making live changes to PHP files in production. And that very much feels like the way the data space is still today, or at least was a couple of years ago. So the big thing is definitely a lot of these tools being UI based, being kind of proprietary SaaS tools that run in a browser somewhere, and don't give you a lot of the flexibility and customizability and ownership and say over a really core component of your stack that developers expected, in combination with these tools not being open source, which also ties into being sort of limited by what they do today and not having that opportunity to improve them or to

Starting point is 00:09:10 make them fit your workloads better. But the fact that they're UI based and that they come from a world where, you know, companies have these big end-to-end data tools they log into and they make all the changes in the user interface didn't jibe with these expectations: pipelines are code, everything can be code, version controlled. Everyone in the team, no matter their discipline or their kind of comfort around EL, for example, is able to go in, see the configurations, and propose changes, and trace how data flows through the system by having a full overview of everything. And exactly like you're saying: version control, code review, continuous integration and deployment, having automatic tests run so that things don't accidentally break, having isolated environments so that you can make

Starting point is 00:09:49 changes locally with complete freedom without ever worrying about accidentally breaking the dashboard the CFO is looking at. These were things that we were expecting to find and did not. So we saw it as an opportunity not just to build an internal tool for GitLab to use, but we saw that there was an opportunity in the market here to build data tooling that really embraces, at a really deep level, the software development best practices of DevOps and open source. And from day one, GitLab realized that by building a tool that would help GitLab in this way, that would also be able to help people externally. So from day one, the hope was that this would one day develop into its own business unit, its own business per se, by building something valuable for us that would transfer to others.

Starting point is 00:10:25 And we saw an opportunity to make DataOps a reality, similarly to how GitLab had been pivotal in making DevOps and, you know, DevSecOps a reality. So in 2018, when the data team was really small, GitLab set up this team to start building this tool called Meltano. Meltano being an abbreviation for model, extract, load, transform, analyze, notebook, and orchestrate. I didn't know that. That's great. Yeah, no, it's awesome.

Starting point is 00:10:54 It's some of the stages of the data lifecycle that we identified, and I don't know who it was that put it together in this particular order, but I think Meltano has a really great sound and kind of mouthfeel to it. And it's cool that it kind of relates back to all of those aspects of the data lifecycle. But we also saw that GitLab's data needs were growing at a pace that the internal team building Meltano was just not able to keep up with.

Starting point is 00:11:14 So GitLab did end up using some of the more traditional tools in the space, you know, Fivetran and Stitch for EL, a bunch of different tools we tried out for the BI side of things. But we always believed that the future of data tools, not just for GitLab internally, but also for the whole world, would look a lot more like software development tools and data people becoming more and more comfortable, not with programming per se, but at least with concepts of version control and command line interfaces, managing your configuration in YAML files. And the Meltano team never gave up on that, on that, yeah, that goal or that vision for the future. I love it.

Starting point is 00:11:48 I, you know, it's interesting if you think about some of the more UI-based tools, a lot of those are driven by analytics use cases from other parts of the organization. And so it makes sense, you know, sort of the way they were built, you know, sort of with the SaaS model and tucking the mechanics under the hood.

Starting point is 00:12:02 And so now we've had a lot of people on the show where they're trying to bring software development principles into the data space because they realize the need there. But I just love thinking about the team at GitLab who's been building DevOps stuff coming into the data space and saying, whoa, like what's going on? Like, you know, where is, where's all the componentry? Like, so great. For me, it was really interesting in, and we're

Starting point is 00:12:33 jumping in the timeline around a little bit, but I'll talk more about how, you know, I came to join Meltano. But when I joined Meltano, I was very new to the data space, and I knew that, you know, clearly there was a need there for something GitLab was building. But I was just really surprised to find also the breadth and depth of the open source offering in the data space. I was positively surprised on some fronts, because there exist really great, or, you know, set up to be really great with a few more years of iteration, BI tools, for example, like Metabase and Superset and Redash, and there's a bunch. Whereas dbt is phenomenal as a transformation tool that also kind of introduces a lot of analysts to some of these software development best practices. But I saw at some point, I was surprised to see that especially

Starting point is 00:13:16 on the data integration side, of course, there exist tools like RudderStack that are kind of focusing a little bit more on what we now call reverse ETL. But everyone still seems to be using a Fivetran or a Stitch, and there exists a library of connectors in the Singer standard that have been built around Stitch, but not a full stack that can replace a Fivetran, for example, that you can just run open source. I was surprised to find in 2019 that that hadn't been a completely solved problem already.

Starting point is 00:13:42 So we can go back to 2019 or 2018 when Meltano was founded and kind of cover a little bit of the time and the changes that have gone on in Meltano during that time. When Meltano was founded in 2018, we had this hope of building an end-to-end platform that could do everything from data integration to helping you build the dashboards end-to-end from data to dashboard is what we called it at the time. And we were on the one hand looking at great open source technologies that we could leverage. And we were also willing to build our own new stuff that would really work well with this software development way of thinking. From day one, this was going to be open source. We were going to build it with the community and we really wanted them involved,

Starting point is 00:14:20 not just from a feedback perspective, but also actively helping us make this a reality. But we came to realize over the course of 2018 and 2019, and I joined at the end of 2019, that this end-to-end vision was too heavy in a way for people to adopt and start using and start contributing, because we kind of assumed that you would replace your entire data stack with this Meltano thing, which meant that we had a lot of ground to cover until we could actually plausibly replace whatever best-in-class tools that companies had picked so far. So by the end of 2019, when I joined, we were working on making the end-to-end thing work where you could bring plugins into Meltano for a particular data source like a Stripe or Shopify or what have you.

Starting point is 00:15:05 We were kind of focusing on the business to business or the B2C rather e-commerce field just to have a use case in mind to focus on. And we had built something where you could bring plugins in for these sources and you could indeed with one kind of one click go from entering your credentials for Stripe or Shopify or one of a number of tools we supported

Starting point is 00:15:23 and then having a dashboard show up at the end. But we were getting some interest from really early startup founders who didn't have the resources to build a data team and set up their own stack. But we were not actually getting the interest from the data engineering or data analytics community that we were looking for. So in early 2020, from GitLab's perspective, the decision was made that the numbers that we were seeing in terms of traction and usage did not warrant the continued kind of full-time staff of six people on the team at the time, which was a general manager, Danielle Morrill, myself, an engineering lead, and then four engineers. One of them, we found out earlier, is actually a friend of Kostas's, Yannis Roussos.

Starting point is 00:16:00 He's really awesome. But we realized that six people on a product that was kind of flat in terms of growth was just not going to work. So the decision was made to reduce the headcount down to one, to essentially extend the runway sixfold, and I was left by myself on the product, essentially to figure out how I could turn Meltano around. So over those first few months, that was of course super daunting, because I was essentially the newest to data out of the entire team. My background is in software engineering. And I realized that I was kind of blind to the needs of data professionals themselves. And I was very aware of, whatever, when all you have is a hammer, everything looks like a nail. And am I just seeing things that aren't there? Is it really a big problem in the data world, bringing more developer-style tooling in and making open source data stacks

Starting point is 00:16:45 more of a compelling alternative. So I started talking to a lot of the data people that had become Meltano fans and followers over the years, not users, not contributors in many cases, but at least people who were willing to talk to us about what they liked and what resonated originally. And I found out that sort of accidentally in Meltano, by identifying these great open source technologies for different stages of the lifecycle, we had found Singer as the standard for open source data connectors, which was built by Stitch, as we talked a second ago about, which has this ecosystem of at this point, more than 300 connectors for different sources and destinations.

Starting point is 00:17:20 And the feedback I was getting from these users was like, well, you know, you're building your own open source BI, but there's already a bunch of solutions for that. You're embracing dbt for transformation. That is great. But you know, dbt is great standalone, but this Singer thing could really benefit from better tooling around running these pipelines, deploying them, configuring them, building new connectors for data sources. So we realized, I realized that not necessarily by changing the

Starting point is 00:17:45 product, but by changing the positioning to focus exclusively on open source ELT. And look, this is the best way to run Singer and dbt powered pipelines on your own infrastructure, on your own machine. And you get all of these DevOps and DataOps advantages for free because your pipelines are managed in a YAML file. And you get testing and all of this stuff. Over the course of 2020, just through the simple act of changing the way the website talked about what Meltano was, we suddenly started picking up tons and tons of usage as an ELT tool.

Starting point is 00:18:11 Even though from our perspective, Meltano had always been an end-to-end platform that picks best-in-class technologies to build integrations with that can run on top of the platform. So by the end of 2020, we had really kind of created the change in the Singer ecosystem

Starting point is 00:18:26 that we and the community agreed was needed. There was always this weird situation where Stitch itself is a paid proprietary SaaS data integration platform, but the connectors that run on it in many cases are open source and available for free and you can just download them. But those connectors by themselves

Starting point is 00:18:41 don't give you all of the EL functionality you need to actually want to run the stuff in production. And that is where we stepped in, to the point where in early 2021, earlier this year, I got the permission from GitLab to start bringing in some more people. And we started talking about setting Meltano up for best success in the market and really becoming the tool that makes DataOps a reality for the data lifecycle and data teams as a whole. And we realized that with GitLab being a 1,400-person company, where literally 1,399 people were working on this big thing called GitLab and marketing for GitLab and sales for GitLab and everything GitLab,

Starting point is 00:19:20 and I was by myself working on this tiny little other thing. And we realized that some of the stuff you need as a startup to be able to move fast and make compelling offers to great candidates, GitLab was just not set up to do anymore because the realities and the needs were so different. So we realized that in order for Meltano not to be slowed down by the inevitable increase in bureaucracy that had kind of come up in GitLab, our best path forward was to spin out. So over the course of 2020, as we were gaining traction, I had already had literally dozens of VC firms that had reached out to talk about this eventually, like what's Meltano going to be?

Starting point is 00:19:55 Is it always going to be internal? Is it going to be its own thing one day? So early 2021, earlier this year, we started concretely talking to some of these potential VCs and that led to us raising a seed funding round from GV, formerly known as Google Ventures. And that led to my transition: literally in January, I was a general manager of a product by myself. In February or March, I hired two people while we were still in GitLab, so we were three.

Starting point is 00:20:19 And then three months later, I was founder and CEO of a startup that really quickly built a team to about eight, nine, ten people. And six months earlier, I had just been by myself. So that was amazing. But as you can imagine, also a whole new challenge and opportunity for myself to be pushed to my limits and have to overcome them, which, of course, is extremely rewarding. Yeah, that's amazing. I think we should spend some time later to share with us a little bit of like what this transition felt like. Because to be honest, like you have like a quite amazing, let's say, journey so far from like, as you said, from being a teenager, building apps, going like very early on GitLab,

Starting point is 00:21:02 being a manager for engineers and now a CEO. I think there's a lot of like wisdom like to share there, like even for just like the emotion side of things, right? Like how the emotions change. But let's do that a little bit later because I want to ask you about Singer. Singer is a very interesting,

Starting point is 00:21:22 how to say that, like case of open source projects, especially like in the data space, because I had like the opportunity to, let's say, experience the war between Fivetran and Stitch Data as it was happening, because I was also competing with them. And it was very interesting how these companies were positioned and how Singer came into the game, like to support the positioning that Stitch Data had. But Stitch Data left the game a little bit early. They launched this thing, it got traction, then they got acquired by Talend. And then we were left with Singer out there, where people keep using it. And at the moment, today, after all these years, we have like Meltano, which is building tooling around it. We have Airbyte, which is pretty much based on the Singer protocol. And I'm pretty sure

Starting point is 00:22:12 we will see more stuff happening around it. So I'd like to ask you, first of all, what was like Singer when you first started working with it and what was missing from it? What was like that Stitch Data didn't do about Singer? Yeah, great question. So when I came, you know, when I really started digging into the data space and Meltano and the tools we had adopted in 2019, Singer had already been the standard for data connectors that we had adopted because the library at the time was, I think, somewhere in the 100 to 200 range of connectors that were supported. And there was a community of a few thousand people around it. And there seemed to be, at least on the more popular connectors in the ecosystem, frequent enough updates that they would be production

Starting point is 00:22:56 ready. But from talking to the people, what we realized is that connectors for sources and destinations, just these tiny little executables that you can run on your terminal and you can pipe them together to have data flow from A to B, are not enough to actually replace an entire EL solution. And that's, of course, also why Stitch itself,

Starting point is 00:23:16 the hosted platform for running these Singer connectors, is paid because a lot of the value is not just in the connectors themselves, but in the tooling that manages incremental replication, that manages backfills, that manages all kinds of aspects about the production level reliability of these pipelines that goes beyond just running the code. And Meltano had already built that.
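To make the "tiny little executables you pipe together" idea concrete, here is a minimal sketch in Python of how a Singer-style tap and target exchange data. In the Singer spec, a tap writes JSON messages of type SCHEMA, RECORD, and STATE to standard output, one per line, and a target reads them from standard input. Everything else here is an assumption for illustration: the `users` stream, the `updated_at` bookmark field, the function names, and the in-memory `StringIO` standing in for the shell pipe. The state handling shown is a simplified version of what a runner has to manage for incremental replication, not how Meltano or Stitch actually implement it.

```python
import io
import json


def run_tap(rows, state, out):
    """Hypothetical tap: write Singer messages (one JSON object per line) to `out`."""
    # Resume from the bookmark left by the previous run, if any (assumed layout).
    bookmark = state.get("bookmarks", {}).get("users", {}).get("updated_at", "")
    schema = {
        "type": "SCHEMA",
        "stream": "users",
        "key_properties": ["id"],
        "schema": {
            "type": "object",
            "properties": {
                "id": {"type": "integer"},
                "updated_at": {"type": "string"},
            },
        },
    }
    out.write(json.dumps(schema) + "\n")
    for row in sorted(rows, key=lambda r: r["updated_at"]):
        if row["updated_at"] <= bookmark:
            continue  # already replicated in an earlier run
        out.write(json.dumps({"type": "RECORD", "stream": "users", "record": row}) + "\n")
        bookmark = row["updated_at"]
    # STATE tells the runner where the next incremental run should start.
    new_state = {"bookmarks": {"users": {"updated_at": bookmark}}}
    out.write(json.dumps({"type": "STATE", "value": new_state}) + "\n")


def run_target(inp):
    """Hypothetical target: collect records and remember the last STATE seen."""
    loaded, last_state = [], {}
    for line in inp:
        message = json.loads(line)
        if message["type"] == "RECORD":
            loaded.append(message["record"])
        elif message["type"] == "STATE":
            last_state = message["value"]
    return loaded, last_state


def sync(rows, state):
    """Stand-in for `tap | target`: an in-memory buffer plays the role of the pipe."""
    pipe = io.StringIO()
    run_tap(rows, state, pipe)
    pipe.seek(0)
    return run_target(pipe)
```

Calling `sync(rows, {})` performs a full load, and passing the returned state into the next `sync` call replicates only rows with a newer `updated_at`. Persisting that state between runs, retrying, and backfilling are the parts that the connectors alone do not provide.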

Starting point is 00:23:39 The other thing that we saw is that people found it too difficult to build new connectors and to improve existing ones. There existed this Singer Python library that had a number of helper functions, and most of the connectors were built around this library. But there was a lot of decision-making on the side of the engineer as for how exactly to use these, how to deal with incremental replication state, how to deal with selection of specific streams and columns, which are roughly analogous to database tables and table columns. So we realized there was also an opportunity for better tooling around building these connectors. And then finally, the big problem was discoverability. Singer.io, the official website for Singer, has a list of about 99 connectors, but in most cases,

Starting point is 00:24:22 those link to the connectors in the singer-io namespace on GitHub, where a lot of these repos are housed. And as we've been talking about, Stitch, unfortunately, I think because of the Talend acquisition, sort of lost the motivation to really actively maintain these projects. So a lot of these repositories ended up with, I mean, dozens of unanswered open issues and pull requests and bugs that had been known for ages, but just had not been fixed. Even if a fix had been provided by the community, the plugin you would have downloaded would still have had the bug. So there's two issues there in discoverability. One of them being that in many cases, these Singer.io repositories actually had

Starting point is 00:25:02 forks that were more actively maintained. And those are really the ones you should be using if you want to have the highest quality and everything. And the other part was that Singer.io only listed the connectors that Singer at one point had adopted into their own GitHub namespace. There existed hundreds of connectors in the GitHub repositories of other companies, consulting firms, and other data products that were also available for free and in many cases better maintained,

Starting point is 00:25:29 but were not discoverable at all unless you knew how to do the special search on GitHub. So we identified these three issues, building these pipelines and running them in production, building connectors, and then discovering connectors. So we just set out essentially to address them one by one to lift up the Singer ecosystem and empower it, not to necessarily own it and make it our own,

Starting point is 00:25:50 but to make it, give it all the tools it needs to be able to stand on its own and keep growing, even without our kind of continued heavy-handed involvement. So Meltano itself became this runner that makes it really easy to run, configure, deploy. We built the Meltano SDK for Singer taps and targets that makes it easier than ever to build new connectors. The code footprint of an existing connector that is ported to the SDK is reduced by about 90%. And people have told us

Starting point is 00:26:15 that getting a new connector up and running with all of the Singer bells and whistles, like incremental replication and stream and column selection, takes as little as two hours because of some of these abstractions that we have built around REST APIs, GraphQL APIs, and other custom methods. And then finally, we built Meltano Hub for Singer taps and targets to catalog all of the different taps and targets in the ecosystem, and it turns out there are more than 300 sources and destinations that have Singer connectors for them. And about half of those have been updated in the last year. And the other ones are not necessarily outdated. Those might just be APIs that don't require quite as frequent updates.

Starting point is 00:26:53 So the Singer ecosystem is in a really great place now compared to how we found it as Meltano about a year and a half ago. And we have recently also set up the Singer Working Group, which has us in it, along with a number of big players in the Singer ecosystem, including the Stitch team at Talend, who were, of course, the original creators of the spec, other tools that use Singer to power their connections, like Hot Glue and Y42, and there's a few others, as well as some of these consulting firms that had built a lot of these connectors over the years for their clients that needed sources that were not supported by some of the tools like Fivetran. So Singer is now

Starting point is 00:27:31 at a place where it can, in combination with Meltano and these other tools we've built, rival Fivetran and a lot of these other tools, especially on the size of the connector library and the advantage of it being open source, which means that you are never limited by anyone else if you want to improve or extend or customize these connectors or if you want to build a new one for a new source. And interestingly, having put Singer in such a place has actually given Meltano the opportunity to look at what we're doing and what our mission is

Starting point is 00:28:00 and what our goal is and to take a step back from this really narrow focus on EL, which we kind of took as a strategic decision in early 2020, as I was describing, and to focus again on bringing DataOps to the entire data lifecycle by building Meltano into a DataOps operating system that can form the foundation of every team's ideal data stack by allowing best-in-class open source components for various stages of the data lifecycle to be brought on top of the OS, with the OS taking care of the consistent installation,

Starting point is 00:28:33 configuration, deployment, and the integration between the various tools. And I can talk a ton more about that because it's kind of where we're going, but it is good to stand still a little bit on Singer and what it was and what it is today and what we've been doing. Yeah, yeah, I have a few questions about the future, but I'm sorry, I'm a little bit curious about the evolution of Singer, right? Because from what I hear from you,

Starting point is 00:28:55 we are talking about, okay, we had Singer. The ergonomics of the SDKs and all that stuff were not the best. You created on top of that the Meltano SDK, or like the extension. How does this... To be clear, we have not extended Singer in any way so far. We are working with the Singer Working Group on Singer extensions, but we want to make sure that those are supported and approved by all of the different players in the Singer ecosystem, because we think a big part of its power is the fact that it is no longer purely connected to one particular product or company, like it used to be when it was just the connector framework for Stitch. And similarly, we're seeing other open source data integration vendors, like somebody mentioned before, coming up and building their own connector standards on top of Singer with private extensions.

Starting point is 00:29:41 But we believe that Singer is kind of special in that it is agnostic and really community led, and everyone in the ecosystem, different consulting firms and different tools, can adopt it because it is the de facto open source standard without any particular company that owns it today, which is a strength. Okay, perfect, perfect. The reason that I'm asking is because I'm quite aware of how the Airbyte version of Singer works, which is built on top of Singer. It's not Singer exactly, right? They have made some very smart decisions in terms of how the interfaces work, like with the standard input and output between Docker images and stuff like that, that gives a lot of, let's say, interoperability between different languages and frameworks

Starting point is 00:30:26 and stuff like that. But it's something different. It's not exactly Singer. I mean, there are elements of Singer, but I cannot imagine, I'd say, backward compatibility in this thing. It's something different at the end, right? So that's why I was asking if it's something similar

Starting point is 00:30:42 at the end, what Meltano is doing, or you are focusing on maintaining and reviving Singer at the end. Yeah, the interesting thing is that Singer as a standard is really great. Stitch came up with it, it served their needs for a long time, but it also hasn't evolved a lot over the years since they have sort of lost interest. So there are definitely a lot of areas in which it can be improved. But at the same time, a lot of the issues with current Singer

Starting point is 00:31:11 or existing Singer connectors were not actually because of limitations in what Singer can do, but just in the fact that a lot of these connectors were not even making the most of what Singer can already do today. So we wanted to first address that by making it so easy with the new SDK

Starting point is 00:31:25 to start using everything that Singer can already do today to kind of reach the full potential that was already there before starting to look ahead and see, okay, how can we make Singer better? So the first important thing for us was to increase the consistency and behavior across different connectors in the ecosystem, especially for newly written ones. And the SDK has delivered on that and makes it so that you can opt into some of these Singer capabilities without having to completely figure out yourself how to implement them.

Starting point is 00:31:52 And it automatically leads to more consistent behavior across the board. But now that people can actually make the most of Singer through Meltano and the SDK, we are starting to work on improvements to the spec. Airbyte was in this, you know, in their case, great position where they could just say, okay, we don't need backward compatibility. We're going to just call it, you know, the Airbyte spec.

Starting point is 00:32:10 We're going to take a lot of inspiration from Singer, and then we're just going to fix everything we think is broken and improve it. And they could do so unilaterally. But we think that there is so much potential in the Singer ecosystem and the existing community of literally hundreds of thousands of consulting firms and different data engineering teams and data product developers that we didn't want to just let it go, because then you get the situation of that famous XKCD comic that says there are 12 standards, they all suck, I'm going to make a new standard. And then the next frame says, now there are 13 standards. And then it kind of becomes this loop. So we decided the only way to

Starting point is 00:32:43 really make Singer better is to, well, first kind of increase people's confidence and trust and belief that this is going somewhere. And through these things we've brought into the Singer ecosystem, we have definitely kind of revived that enthusiasm. And then the next thing was to get all of the big players invested in Singer kind of together in a room

Starting point is 00:33:02 to start working on those next iterations of Singer together. And the first priorities for the Singer Working Group I've been talking about are to address some of the same concerns that Airbyte has been able to already address because they could do so. But we are starting to do this through a more standardized process where we get everyone involved around the table and also bought into supporting this in their connectors going forward and implementing it in their tools. So that has to do with things that improve performance and throughput. It has to do with, like, the automatic discoverability of a connector's configuration features, for example, which is now something that kind of lives separately in the repo from the actual connector.

Starting point is 00:33:37 And there's a whole list of other things that you can find if you Google the Singer Working Group and you find this repo where we're working with these players. And we were actually really grateful to see that the Stitch team at Talend was just as excited as us about this opportunity to kind of keep growing and improving this for the benefit of the entire data community. And that ties back to the importance of Singer being seen as something kind of separate and agnostic and something that will always survive as long as enough people use it rather than something whose fate is tied to one particular product. In part, because from Meltano's perspective,

Starting point is 00:34:10 we don't want to take over ownership of Singer forever because we are building a data ops operating system. We're not just building an EL tool. So it's in our interest for there to be independently thriving open source technologies for every step of the data lifecycle that we can make better than the sum of their parts. But ultimately, it has to be this ecosystem and community around Singer that keeps it alive. And we are happy to have a big role in that and put development resources and everything towards

Starting point is 00:34:35 it. But we cannot do it ourselves. I have a question that I think is also going to lead us into the future of Meltano and DataOps. And I want to ask you about how you, as Meltano, can manage the quality of these connectors. And I think this is one of the biggest, let's say, arguments that a closed system like Fivetran has: that, yeah, sure, you can go download something from GitHub. And of course, many of these versions of the connectors are just crap, right? They are not updated, they are simply not made well, all these things. So how do you deal with that, with such a diverse, let's say, code base? Yeah, yeah, it's a really interesting question, and it kind of goes to the trade-off between the decentralized maintenance

Starting point is 00:35:26 of an open source ecosystem where you get a ton of advantages, like there's not a single bottleneck who can slow things down, and the amount of connectors is essentially endless if you decentralize the maintenance to different kind of invested parties. But that also means that we cannot fix a bug ourselves unilaterally in some particular connector if we want to, because we do not necessarily have ownership over that repository. The way we're thinking about it is that in any open source ecosystem, if there are enough users who are okay with this deal of, okay, I get to use it, but I maybe occasionally have to fix stuff, then the top used connectors will automatically get enough usage and eyeballs that they are in a good state. And for us, it's more important to have a

Starting point is 00:36:11 decentralized ecosystem that can scale indefinitely than to have a smaller or controlled ecosystem that we have tighter control over. But that does mean that if you are a company that just needs connectors that will always work and you never have to worry about maybe fixing a bug yourself, Meltano or rather Singer might not be the best choice for you today. But the more companies become involved that do this work, the higher the quality that even companies that aren't willing to put in their own contributions can expect. And the quality of connectors in the ecosystem is already higher than a lot of people might have thought a year or so ago, because the best variants of a lot of these connectors are in forked repositories rather than the initial singer-io one that you will find, and a lot of them are seeing maintenance. So in part to address this maintenance question, we have also set up Meltano Labs, which is a way of pooling decentralized maintenance so that people don't have to take on the maintenance burden indefinitely, but they can say, okay, for a period of time,

Starting point is 00:37:09 we are heavily using this one or we are improving it for our clients. So we are okay with kind of taking on the maintenance hat for the next three months or so, but then it stays within the Meltano Labs pool where we have some control over it, but we are not a bottleneck per se. The flip side of this, though, is that in the open source ecosystem already, web applications you use every day, including Rudderstack and Meltano, but also massive ones like Reddit and Facebook and whatever

Starting point is 00:37:32 are all built on open source technology that in many cases is also just managed by individual contributors. And you have the same motivation of, or the same trade-off of, can we expect that quality to always be there? But we all know that there are high quality maintained API client libraries

Starting point is 00:37:48 for all of the big APIs, for all of the big programming languages. You can find Shopify API clients in every programming language. In many cases, these are built even by the vendor themselves, or they're maintained by an active community of maintainers. And if we trust these API client libraries enough

Starting point is 00:38:03 to use them in production software, then in the limit, there is no reason to not trust an ecosystem of connectors at a similar level. But from the perspective now as a DataOps OS, we don't really care which particular technology you bring into Meltano, whether that is Singer or dbt, or even, you know, Airbyte or RudderStack. We have plans to support all of these in the future, because we think that it's up to us to provide teams choice to put together their ideal stack where they can make the trade-offs they need. And we will build the DataOps OS that kind of ties it all together and allows them to treat their entire data stack as a product in the way of the software product development lifecycle, rather than just a set

Starting point is 00:38:44 of disparate kind of tooling and purchasing decisions. So Singer is not going to be for everyone, maybe not ever, but that's okay because there are lots and lots of organizations that do like the trade-off of, I can fix it and improve it and customize it without needing to ask someone for permission.

Starting point is 00:38:59 And I'm okay spending a few engineering hours per month to do so, just as is the case today with other open source projects. Yeah. Well, it's a huge conversation. I mean, we could probably do multiple episodes just chatting about how you can structure this kind of open source project.

Starting point is 00:39:16 And for me, it's like very, very interesting. And I think there's a lot of value in there, but let's keep that for another episode. I'll be more than happy to just dedicate one just for this. And let's get into the DataOps side of Meltano. So you mentioned at some point that Meltano started as an end-to-end platform, okay? And it has transitioned now into a DataOps or transitioning into a DataOps platform. What's the difference? What's the difference between the two? Yeah, good question.

Starting point is 00:39:47 So when you're looking at kind of the previous generation of data tools, what you primarily saw is these big products that kind of do it all. They do everything from the integration to the analytics. And this is potentially a consequence of these tools maybe having started with a less technical analytics audience with a BI tool and then working backwards into the rest of the stack until they do it all. But they do it all from a kind of a UI-based SaaS web browser perspective. And the tools you'll find today that call themselves data ops platforms are also these types of tools that try to do everything really well while

Starting point is 00:40:19 bringing in some of these data ops qualities and software development best practices. But the data space of today is uniquely horizontally integrated in the sense that you have for every kind of step in the data lifecycle and every layer in the stack, you have a number of competing solutions and new ones coming up every day and being funded by VCs and going through accelerator programs like Y Combinator. So it's not realistic anymore for any data team really to find one tool that does it all that they will actually be happy with in the long run, because you're going to be missing out on a lot of these new improvements. But with the data space having turned from one big

Starting point is 00:40:53 application with full visibility and control of every aspect of the data stack into this world where you have tools with a really narrow focus that need to be kind of individually integrated between them, in many cases, manually by data teams. What has gone missing is this sense of a unified unit called the data stack that can be reasoned with as a whole, that can be version controlled as a whole, that can be end-to-end tested, and that can be experimented with and played around with without worrying that there's some SaaS thing running somewhere that doesn't have this concept of an isolated environment. So the way we're seeing the world now is that there is a really big opportunity for a new foundation, a new layer in the data stack that we are calling the DataOps operating

Starting point is 00:41:36 system that forms the foundation of every team's ideal data stack. That's how we've described our vision. What that means is that you have these best-in-class open source components, like a Singer or an Airbyte for EL, dbt for transformation, RudderStack or similar tools for reverse ETL, Superset, Metabase, et cetera, for BI and analytics. And of course, you also have all of these data science tools like Jupyter that can be brought in that are also part of the data stack. We want all of this stuff to live together and be defined in a single repository in a declarative way so that a team can reason about their data stack again as one unit and get these advantages I was describing.
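The idea of defining the whole stack declaratively in one place can be sketched in a few lines of Python. This is an illustration only and is not Meltano's actual file format or API: the `Plugin` type, the plugin names, the version numbers, and the `plan` function are all hypothetical. The stack is plain data, and a small function compares it with what is installed to work out which actions are needed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plugin:
    """One component of the stack (hypothetical model, not Meltano's schema)."""
    role: str     # e.g. "extractor", "loader", "transformer", "orchestrator"
    name: str
    version: str


# The declared stack: one definition that can live in version control.
# Names and versions are made up for illustration.
STACK = [
    Plugin("extractor", "tap-example", "1.0.0"),
    Plugin("loader", "target-example", "2.1.0"),
    Plugin("transformer", "dbt", "1.0.0"),
    Plugin("orchestrator", "airflow", "2.2.0"),
]


def plan(desired, installed):
    """Compare the declared stack with what is installed and list actions.

    `installed` maps (role, name) to the installed version string.
    """
    actions = []
    desired_by_key = {(p.role, p.name): p for p in desired}
    for key, plugin in sorted(desired_by_key.items()):
        current = installed.get(key)
        if current is None:
            actions.append(f"install {plugin.role} {plugin.name}=={plugin.version}")
        elif current != plugin.version:
            actions.append(
                f"upgrade {plugin.role} {plugin.name} {current} -> {plugin.version}"
            )
    # Anything installed but no longer declared gets removed.
    for role, name in sorted(set(installed) - set(desired_by_key)):
        actions.append(f"remove {role} {name}")
    return actions


if __name__ == "__main__":
    for action in plan(STACK, {("extractor", "tap-example"): "0.9.0"}):
        print(action)
```

Because the stack is data rather than a series of manual installs, it can be reviewed, diffed, and rolled back like any other change in the repository.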

Starting point is 00:42:14 So compared to DataOps platforms of the past, the big difference in Meltano is that we are modular from first principles and architecture and that we want to earn a new place in the data stack instead of trying to replace something existing. And we call ourselves a DataOps OS because what we care about a lot is in kind of merging these worlds

Starting point is 00:42:36 of software development and data engineering, or at least allowing them to cross-pollinate and learn from each other more. Because we think that a lot of work that we currently call data engineering is really data stack development, and it's far closer to software development, where you're also picking, you know, off-the-shelf components, custom components, or some open source technology. You might be using some SaaS that you have to connect with over an API. And we are trying to allow data teams to start treating their work more like software development

Starting point is 00:43:03 and get those same advantages. And our path is sort of, you know, prepared for us a little bit by dbt already making analysts more comfortable with some of these concepts. And we are trying to go all the way and bring data ops,

Starting point is 00:43:17 not just to EL in the case of what Meltano has been over the last year or to T as dbt is doing, but to the entire data stack. And we think data stacks can be better than the sum of their parts if you bring in Meltano to help manage it all and help the integration between the different components of the stack. That's great. I have one last question because I start feeling like really bad that I'm monopolizing the conversation here.

Starting point is 00:43:45 Oh, you're not. I'm pretty sure I'm talking way more than you are, but yeah, your colleagues should talk too. Yeah, exactly. And I'll wear like my engineering hat and I'll ask a question here about DataOps. So what's the difference between the DataOps operating system and something like Airflow? Yeah, that's a great question. So one big difference is that

Starting point is 00:44:07 in your data stack, data movement is kind of the domain of Airflow and similar workflow orchestrators like, you know, Dagster or Prefect. And they, within their workflow orchestrator, of course reach out to different tools that handle parts of that workload. But there's more to the data stack than that. You have a BI tool at some point, you might have tools that don't really fit within the Airflow way of working. And if you're using Airflow, you still have to install it somewhere and deploy it somewhere and manage the version control of your orchestrators. And similarly, if you're using a BI tool, you still have to install it somewhere and manage your dashboards and version control those.

Starting point is 00:44:49 So Meltano forms essentially the package manager for your entire data stack that all of these things can be brought into, even things that are completely out of scope for Airflow, which only cares about data movement, for example. So Meltano allows you to, any tool your data team uses, whether it's the analyst or the analytics engineer or the engineer, whether it's about the movement or the consumption at the end, they form part of a greater product where in some sense, the end users are your colleagues within the company. The interface or the features are some of those consumption methods and dashboards.

Starting point is 00:45:18 And then the backend, so to speak, is more of where Airflow lives. But that front end and the whole product is what Meltano brings together by forming a package manager for every tool in the data stack, which from an engineering perspective, you can also see as a Terraform for data stacks because we allow people to really easily bring in tools declaratively or with a CLI. And then Meltano manages the configuration and the deployment and all of that stuff.

Starting point is 00:45:42 So that an engineer that wants to put together a data stack doesn't have to pick six tools, learn how to install them, learn how to configure them, and then be the only person in the team who really knows how it all works. We want to also sort of democratize that,

Starting point is 00:45:55 make it, give it a single source of truth that the entire team feels comfortable collaborating in and also trying out new tools, swapping out new tools really easily by giving them the confidence that if Meltano has support for your tool, adding it, trying it locally or wherever is going to take just a few minutes of work instead of this daunting task of figuring out how am I going to integrate it. Maybe this one is Docker, maybe this one is Python, maybe this one is NPM. We want to unify

Starting point is 00:46:18 all of that. Yeah, that's great. Eric, all yours. I have to apologize, by the way, to both of you, because I just realized that based on the outline of the conversation that we have created before we started the recording, like the stuff that I asked were completely different. That's great. It was awesome. I learned a ton. Like you said, it would be an organic conversation, right? So we'll take it wherever it goes.

Starting point is 00:46:42 Yeah, I know we're close to time here, Douwe, but a couple quick questions. So one is, how much of what you just talked about, I know there's sort of part vision, this is where Meltano is going, how much of that exists today? I mean, how much of that can you actually use today? Well, that's the perfect question. So architecturally, even during the year or so that Meltano was talked about and perceived as an ELT tool, Meltano was always this plugin-based architecture

Starting point is 00:47:13 that allows different open source technologies and tools to be brought in. So from a software perspective, we're essentially already there. The only thing we're still lacking is in the specific plugins we support. So far, we have invested

Starting point is 00:47:23 really heavily in support for Singer taps and targets for EL, dbt for transformation, Airflow for orchestration. And the biggest challenge for us now is to kind of keep building out in the breadth of types of plugins we support. And of course, the level to which we support each individual plugin. So in the very near roadmap, we will be investing a lot in the dbt integration that we already have to make it as good as it possibly can be. And at the same time, we are investing in bringing more parts of the data stack and lifecycle into Meltano. So very quickly, very soon, we're going to release support

Starting point is 00:47:55 for Great Expectations within your Meltano project. We are looking at Superset and Lightdash as some of these BI analytics tools that you can bring into your Meltano project and manage and configure consistently with everything else. And similarly, we are looking at open source reverse ETL solutions like RudderStack, like Grouparoo and a number of others. And even on the EL side, just to kind of show to the world also

Starting point is 00:48:16 that we are not just here to push Singer or to push dbt, we plan to support Fivetran through an API connection. And even Airbyte is in scope for us, even though in our previous kind of how people thought about Meltano, it would have looked like a direct competitor. But from day one, we have been building an end-to-end platform to make DataOps a reality.

Starting point is 00:48:35 Originally, we thought we could do so by just building one platform that does it all. We've come to realize that it has to be plugin-based. And in that new world, we leave it completely up to data teams what tools they want to use on top of Meltano. We just want to make sure we support all the current kind of popular and best-in-class tools, make sure that DataOps

Starting point is 00:48:51 is somewhat possible with them, version control and all of this stuff. And we don't really care to be a kingmaker for one particular technology. So over the coming months, especially Q1 of the coming year, we will be kind of building out

Starting point is 00:49:02 this broader and deeper plugin support, as well as data ops specific functionality, like isolated environments, end-to-end testing, and a lot of these things that software developers have already been using. And we have to just figure out how to make them work with data and data tools and how to explain them in ways that will resonate with data professionals. So this is all going to pan out over the next three months or so. But we have a Slack community of more than 2000 people right now

Starting point is 00:49:30 that are with us on this journey and are giving us feedback every day, are giving us contributions to make it on this path. So I would like to suggest to the people joining us, of course, keep an eye on the features we'll release over the coming months. But if you want to be part of this conversation and you want to shape the data tooling of the future

Starting point is 00:49:44 and be part of this wave that's going to make data teams as effective and productive as software development teams have become over the last 10 years through the introduction of DevOps, then the Meltano Slack community is the place to be. And just a very quick pitch as well. We are also hiring both in engineering and marketing. So if you go to meltano.com slash jobs, you can look at ways to help us out. We are all remote. We're hiring across the world and we pay really competitively everywhere.

Starting point is 00:50:08 So check us out. Awesome. Well, Douwe, this has been such a fun episode. Really appreciate you sharing some of the backstory, and an incredible story, in six months going from being the lone project manager or product manager for an internal product to raising a round

Starting point is 00:50:25 and becoming CEO. So congratulations, incredible journey. And we're excited to see where you take it. Thank you so much, Eric. Yeah, I think there's tons that we could keep talking about, like Kostas already mentioned. So I think we'll have to come back maybe in Q1 of next year when we have made some more progress in the data ops vision. Let's do it.

Starting point is 00:50:41 We can talk about how that's panning out. And we can also spend some more time talking about the transition from an engineering manager inside GitLab to a CEO. That's definitely been an opportunity for myself to run into my own kind of limitations and past assumptions that don't hold anymore. We could easily fill an hour just on that topic alone. Great. We'll definitely do it.

Starting point is 00:51:00 Thank you so much. Thank you. That was such a unique individual in that he has a depth of knowledge across such a wide variety of subject matter. And I think that's certainly been accelerated by him taking on the role of CEO at Meltano. This is my takeaway from the show. There's the old adage, I think, from the Netscape fundraising story, I think it was, that you're successful in two ways, you bundle or you unbundle.

Starting point is 00:51:33 And I've been thinking about that a lot lately in the data tooling space, because there are companies actively trying to bundle and actively trying to unbundle in general across tooling, but then also within specific disciplines. And thinking about Meltano as sort of the package manager for the entire data stack is a really fascinating way to bundle. And I think it opens up a lot of opportunity for them that a lot of other companies aren't going to have because they don't have to necessarily make choices about specific tooling. And so I know I'm going to be thinking about that all week because, you know, it's sort of a very

Starting point is 00:52:16 unique approach to bundling, or I guess bundling is, you know, an interesting way to describe what they're doing. So how about you, Costas? Yeah, a hundred percent. I totally agree with you. It's very interesting to see platforms like this, and at the same time we have a team behind it that, you know, has like the best possible pedigree to succeed in this because they are coming from GitLab, right? Where that's exactly what they were doing, building these kinds of tools, but for software engineering.

Starting point is 00:52:43 So I'm very excited to see how they are going to move forward. Hopefully we will have him on another show like pretty soon. So because things are like changing really fast, but I would also like to add that if they succeed in what they're doing, I think we are also, they are also going to act as a great accelerator also for the open source projects out there, which is very interesting because we have open source projects with a varying degree of maturity, let's say, especially when it comes to the EL part with all the connectors and all that stuff. So putting in place something like Meltano and also all the governance that Meltano brings

Starting point is 00:53:24 with all the initiatives around open source, I think we are going to see these communities actually maturing much, much faster, which is nice, because me, as a person who has experienced, let's say, the birth of Singer, then it got into like some kind of winter situation where it was like existing but not existing, maintained but not maintained. And today, seeing all these actors, with Meltano being the leader, revive the project and govern the project in a way that's going to be valuable, it's super, super interesting. It's very fascinating, and I'm really interested to see what's going to happen in the next couple of months. Me too.

Starting point is 00:54:09 And we'll definitely have to have Douwe back on the show because we barely scratched the surface on several subjects. So thanks for joining us again on the Data Stack Show. And we have lots of great stuff coming up. So make sure to subscribe and we'll catch you on the next one. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com.

Starting point is 00:54:43 The show is brought to you by Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
