The Data Stack Show - 117: DX for Data Tooling with Taylor Murphy of Meltano

Episode Date: December 14, 2022

Highlights from this week’s conversation include:
Taylor’s journey into data (3:09)
What’s been going on at Meltano recently? (7:28)
Addressing basic problems in data even with advancements in technology (12:23)
What makes Meltano unique in the space (16:53)
Why the CLI experience is important (25:37)
Quality vs quantity in supporting connectors (35:51)
What does data ops look like for Meltano (46:44)
Takeaways and closing thoughts (52:56)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. Welcome to the Data Stack Show, Costas. We talk all the time about how we want to have guests back on the show to catch up with them. And we were able to do that. We tracked down Taylor from Meltano. We had Douwe on a while ago. I can't remember how long ago,
Starting point is 00:00:41 but it was a while ago. And it was a fascinating conversation. They're building some super interesting things. And so we're going to catch up with Taylor who leads product. And I think I just want to hear how things are going. I mean, they were kind of building this almost command line interface, you know, sort of configuration layer for the data stack in general across pipelines, orchestration, et cetera, which is very compelling for a number of reasons. And so I want to hear how that's going. And of course, they, you know, are big investors in the Singer system and all those protocols
Starting point is 00:01:22 and that entire community. So yeah, I'm just excited to see, to hear how things have gone. How about you? Yeah. I'm also very curious to see where Meltano is today. Meltano is one of these products, or companies, both, where when you see, you know, what they have done, how they have started, how long they've been around and how hard they are trying
Starting point is 00:01:50 like to build a business around that, like really makes you like appreciate like what it means, like how important perseverance is like for building a business and like that's something that I have to recognize them and something that they should also like be very proud of. Right. So these folks just don't give up. And so that's what amazed me. So I want to see like where they are today.
Starting point is 00:02:18 And one of the things that I definitely want to discuss with Taylor is about like developer experience, like that's how they differentiate like the product, compared like to the competition out there. So yeah, I think we are going to have like a very interesting discussion about how you can approach like the problem of data pipelines in a different way. Well, let's dig in with Taylor and talk about it. Taylor, welcome to the Data Stack Show. We are so excited to talk about Meltano again.
Starting point is 00:02:53 We had Douwe on the show before, and we always say that one of the best parts is actually recording a show and then checking back in later. So we're super excited to hear about what's been happening at Meltano. Yeah, thanks for having me. Really, really excited to have the conversation. Okay, so how did you get into data? Give us the backstory. Yeah, so my background is in chemical engineering. And coming out of grad school, I decided I didn't want anything to do with that and kind of looked for a way to use the skills that I gained in grad school in an interesting way. And kind of the data side really caught my attention.
Starting point is 00:03:29 I joined a startup in Nashville that was focused on genetic testing in the healthcare space. And really, that's where I grew a lot of my data chops. Prior to that, I used MATLAB and Excel and was doing some relatively simple data modeling. But it was there that we had real business needs. That's where I fell in love with regular expressions and built my Python and SQL skills. I was there for four and a half years and then moved over to GitLab, where I started as a data engineer and was able to lead the team as the company grew from 200 people up to well over 1,000 people as it made its way
Starting point is 00:04:00 to its IPO. And that was huge for my career because we were able to be very open about everything we were doing. That's also where we started the Meltano project, which is where I'm at now. I was able to join that team in 2021 as the head of product and data. And I've been there for coming up on a year and a half now as we've grown the community, grown the company, and are really trying to make a really fantastic ELT tool. Awesome. Well, first thing I have to say is I don't actually believe you that you were doing simple things in Excel because anyone I know who's fallen in love
Starting point is 00:04:35 with regular expressions who started in Excel, my experience is that they were essentially building software in Microsoft Excel before actually discovering notebooks. And then that sort of is great freedom. Basically, yeah. It was doing things you probably shouldn't do with these tools because you're unaware of software development and the way this other industry had evolved. We were using, I think, Subversion for some of our code practices. And we literally had like four computers that were running some of the models we were doing. It was this whole world when I actually started working with actual software
Starting point is 00:05:12 engineers. I was like, oh, there's a better way to do this. Yeah, totally. No, I mean, I literally remember working with someone who would, like, we had a computer and they would just like run stuff in Excel overnight. And it's like, this is absolutely insane. Whatever it takes. I love it. Okay, Costas is going to laugh because I love when I get to ask this question. So chemical engineering background
Starting point is 00:05:35 and now you work in data. What lessons did you bring with you from chemical engineering? And do you still use any of those in your day-to-day work with data? I think so. I've talked to a lot of former chemical engineers, people who have gone from chemical engineering to other disciplines, a lot of them programming, some like went to law. The big things that I go back to from my engineering training are really about
Starting point is 00:06:01 understanding systems and understanding how these pieces fit together and things move. One of the biggest skills I learned coming from grad school in particular was really how to troubleshoot problems, how to take, you know, I'm having a bad outcome, whatever it is, maybe this result doesn't look good or this equipment isn't running. And to really have a disciplined approach to breaking problems down, subdividing them and finding, okay, is the problem, you know, before or after this step? And it seems kind of simple, but it is a practice until you kind of see it work a few times in the real world. It can be, you know, kind of foreign to some folks when they're faced with a problem on their computer, they get a stack trace in their
Starting point is 00:06:38 code. How do you then go and subdivide the problem? And that's, I think is the biggest thing. But then also just thinking, like, systematically of understanding, like, mass balances and what are my inputs, what are my outputs? Where can I see things happening? And then how can I break the problem down even further? And it's just, it's engineering. It's problem solving. It's taking, you know, what you know
Starting point is 00:06:59 and maybe learning some new things to solve interesting problems. Yeah, super interesting. Yeah. I'm always fascinated by that because you think about, and I am way outside of my expertise, but free radicals,
Starting point is 00:07:11 and when you think about chemical stuff, there's behavior that's extremely difficult to predict, even in controlled environments. It's like, oh, well, actually a lot of those same attributes are true to all sorts of data as well. Super interesting. Okay.
Starting point is 00:07:28 Well, tell us about Meltano. So, I mean, the Singer ecosystem, you know, sort of owes a huge amount of its worth to the work that you've invested in it. It's growing. That's super exciting. When we talked to you a while ago, that was a huge focus. You're also looking at sort of the ops layer as well. So tell us, you know,
Starting point is 00:07:55 what's been going on in Meltano over the last six months, you know, from a product perspective. And then why don't you just also tell our listeners like the vision of the company? Because it's been a while. Yeah. So Meltano really exists, I think, to bring a better way of working with data to this project came in, a lot of the founding team was
Starting point is 00:08:25 from GitLab. And kind of the DevOps principle was built into how we think about things. And Meltano really was, you know, a data team should build a data platform, or do their work, modeled after software development. That meant, particularly in the GitLab framing, like one tool that can kind of do it all. The big difference between GitLab and Meltano is GitLab is like all first-party stuff and Meltano has a lot of third-party software that you can integrate with it. We've gone through a couple of refocusing moments in the company. When Douwe took over the project in 2020, he really focused on kind of the open source ELT side and saw a lot of traction with that. As we spun it out, we wanted to focus on this larger vision of becoming the foundation for your ideal data stack, for any team's ideal data stack. And what that meant is
Starting point is 00:09:17 like, how do we work with the rest of the ecosystem? We're doing a really good job with making the Singer ecosystem better, enabling you to run taps and targets smoothly, orchestrate them well. But there's this whole other ecosystem of tooling that it can be hard to fit into the different parts of your stack. And so when we spun out, we started moving towards this larger vision of, okay, Meltano can be the foundation. You can bring in Airflow. You can bring in different tools, Superset, Metabase, anything really that's open source or has either a container or is Python installable. And we made specific product choices to make that happen. We introduced a new command to allow you to run composable pipelines.
Starting point is 00:09:56 It's meltano run. So you can chain together your tap, your target, dbt, Great Expectations, you know, some further downstream jobs. We've also enhanced things around the Singer ecosystem. So it's not just a tap and a target. You can also intercept data in between. It's called a stream map, and you can filter data, anonymize it, you know, drop data, do whatever you need to do, and it kind of gives you that level of control. And so we still very much believe in that larger vision.
Starting point is 00:10:23 But as we like would go to conferences and talk to people, people get really excited about this idea. Like, oh yeah, data ops, platform infrastructure. It's exciting. They understand eventually why people need it. But also we recognize it wasn't meeting people where they were today. We were maybe a little bit further
Starting point is 00:10:39 than a lot of folks in the industry actually are. And most problems are like, yeah, this is really cool. I would love to be able to do this. I'm still struggling with my extract and load, just like pure data movement problems. So what we've been doing here in the past few months really is just refocusing, doubling, tripling down on the ELT side of the story
Starting point is 00:10:59 and beefing up the SDK for writing taps and targets, enhancing functionality within Meltano, specifically around ELT to be a fantastic solution for that. But all the pieces are there for this larger story. And I'm excited for us to get to the point where we can earn the right to continue investing in that because I think we as a company still believe very much in that mission. Yeah, yeah. Super interesting. Costas and I were just talking about Coalesce. Costas wasn't able to join us there, but one of my big takeaways was,
Starting point is 00:11:31 as advanced as all the technology is, and you walk around the vendor booths, and there's some amazing stuff out there, when you talk to the practitioners who are doing this work on the ground, a huge number are still trying to solve the fundamental challenges. A huge number are. And so that really resonates because I think it's easy.
Starting point is 00:11:53 I mean, you work for a data vendor, you're building out product and all that sort of stuff. And it's way easier for us to look into the future because that's part of our job than for our customers, right, who, you know, certainly are doing that but actually have a lot of pain points that they need to solve as part of their job today. And a lot of those problems are basic. Okay, so I have a question for you on that. Why do you think, with all of this advanced technology, the problems are still basic for a huge proportion of the practitioners and companies out there? Yeah, I love this question because I think it gets to, like, an industry-wide challenge, and I think this will change over time as more data practitioners kind
Starting point is 00:12:45 of come up through the ranks of different organizations. My hypothesis and what I've seen in several places and with folks I've talked to is like data isn't a strong consideration from early in the company's life cycle or its overall genesis, or maybe it's a really old company and they've gone through a lot of change. When data is kind of an afterthought or seen as something of just like, oh, we can pay for this, we can invest X amount of dollars and we're going to get some return with our data.
Starting point is 00:13:18 I think it really does a disservice to the people on the teams that have to implement this kind of work. And for me, data has to be kind of foundational to how you think about running more modern business, particularly tech businesses. But anything you're doing in a company is generating some form of data and you need to have that data lens. One of the reasons, not to get too highfalutin here, but one of the reasons I really fell
Starting point is 00:13:44 in love with data engineering and chose the infrastructure and the hardcore, like low-level data pieces, I felt it was so foundational to functioning and to a lot of these problems that we want to solve that one, it's like great career stability. Like people are always going to have data problems, but two, I just saw like, you can't do all these fun data science-y things unless you have a solid foundation of good data engineering best practices and workflows. So part of it, I think, is just, you know, there are people who don't maybe understand what the current state of the art or capabilities are with data and how to use it to better operationalize all parts of their business. But that's changing as people kind of come up through organizations and they get a little bit of power. They're a head of data at a new company and they can effect that change. But people are just at
Starting point is 00:14:28 different stages of this journey of learning, Hey, I enjoy building charts, but now I need to learn a bit more about software engineering and how some of this works. So it's a maturing practice with professionals that are gaining more skills and gaining more influence across different industries every day. But does that kind of answer? Yeah, no, that's super helpful. That's super helpful. And the other thing, you got to the root of it, whether it's a newer company or, you know, sort of like a legacy enterprise that's been around for a long time and they're trying to become more data driven, you know, sort of different sides of the same coin. entire company committed to something that you work really hard at and early on, actually, it doesn't bear a lot of day-to-day fruit, right? It just seems like extra work that you're investing
Starting point is 00:15:32 for the future. And that takes a huge amount of commitment and foresight from a company to be able to do that. Yeah. And I think there's parallels in software engineering. Like, are you investing in a really good engineering culture that works well with your product team and can deliver, you know, bring back insights and have just a positive feedback loop? It's not a one-time thing where you put in some resources and you get something out where it's really functional, both on engineering, both on data. And there's just so many similarities, I think, between data teams and software engineering teams. It's that investment, that kind of positive flywheel across the entire organization. And I think early days for a lot of companies, it is a bit of a leap of faith if they haven't seen it in practice. And I'm hopeful now that we have more people that are true believers in a positive sense. They're informed by data and their experience.
Starting point is 00:16:21 But you are able to articulate why it's valuable to invest in data in these processes and to build that flywheel. Yep. I love it. All right, Costas. I could keep going, but please, please jump in. I know you have so many questions. Yeah. Yeah.
Starting point is 00:16:41 So first of all, I'm super excited that I have someone from the product side because I can ask like, you know, like some really hard questions. Like, for example, why should someone choose Meltano today instead of like something like Fivetran or Airbyte or Stitch Data, right? Yeah. So yeah, why? Like, what's so much better about, like, Meltano compared to,
Starting point is 00:17:09 like, let's say, the other solutions out there? Yeah. Our focus right now is on a very particular persona. So if you are a data engineer or, you know, very data engineer adjacent
Starting point is 00:17:20 who is comfortable on the command line, isn't afraid of a Python stack trace, and wants that control over your software, that's when Meltano is going to be a really good choice for you today. We kind of saw that gap in the market where there are good point and click solutions for day one situations to move your data. When we've been talking to a lot of users, and hopefully potential customers as we build out our managed offering, the pain points that we're hearing are,
Starting point is 00:17:48 cost is rising and I don't have a good sense of why or how I could even improve it. And there are problems that crop up that I can't fix and I'm stuck in some sort of support hell as it were. And what we're aiming to do is kind of give users control back over their data platform, but in a way that we are still able to help them solve
Starting point is 00:18:10 problems, but when something goes wrong, and something will go wrong, I think that's something that other companies don't necessarily like to admit, like, oh, we've solved this problem, data's moved, don't worry about it, point and click and you're good. Something's going to change. Something about the system outside of your control is going to change, and you have to be able to adapt to it
Starting point is 00:18:25 and to respond to it. So Meltano is going to be a good choice for you when you want to understand the code that's running in your system, whether it's the tap or the target or even dbt, and have that transparency. We've also built in kind of the software development best practices into the product.
Starting point is 00:18:41 So there are YAML files that define your configuration, the state of your system. And if you've worked with software engineers, they're going to be begging for tools like that because they understand the value of version control. So that's a long-winded answer, but the day one experience of Meltano is continually improving, and Meltano is going to really excel today for the day two problems that you're going to encounter when something is changing and you need to adjust your system and you want to test it and move forward with confidence. Yeah, that makes a lot of sense. I'd love to discuss more later about the developer experience and why it's so different.
Starting point is 00:19:19 But why do you think a company like Fivetran or Airbyte or Stitch Data didn't go after an experience that is, let's say, more native to the data engineer? Because at the end, it's not like Fivetran or Airbyte is used by someone else inside the organization. You will end up... The pipelines are the core of what, like, the data engineer is doing. They have a lot to do with these tools, right? So why didn't they do that?
Starting point is 00:19:51 Yeah, I'm curious about that as well. And I think there's a couple of hypotheses I have around that. One is that, you know, we have the advantage of coming into the market a bit later where these companies are a bit more established. And previously, it had been data analysts that had been doing a lot of this work. I think data engineering is still relatively a new title. I don't think data engineer is ever going to be called the sexiest job of the 21st century. And as I do more product and have, like, these, you know, pseudo-sales conversations and talk to users, it's very easy to get pulled into the idea of, oh, okay, you're facing this problem, we'll just, you know, build this UI for you and you can kind of point and click, and problems
Starting point is 00:20:38 will kind of be solved. But you're not actually... you're talking to the buyer, but not necessarily the user, all the time. The advantage that Meltano has had in the market is, I think, for three, you know, almost four years now, it's been completely open source, free to use, and has been able to organically kind of attract this audience of data engineers. And as we talk to them, you know, they're the ones implementing these products. And yeah, they want the convenience of not having to worry about things. But when they do have to worry about it, they really need to solve some of these problems. And so we talk to people who are paying customers of, you know, Fivetran or Stitch. And they're like, yeah, it works for some of these things, but I would really like, you know, Meltano to come in and give them a lot of that control back and hopefully be a better experience that they can build the kind of the foundation of their entire stack on.
Starting point is 00:21:32 Yeah, it makes a lot of sense. But I mean, Meltano is still trying like to build like a SaaS business, right? Like, with a self-serve solution that you host for your customers. So you still have like to take care of, let's say, the infrastructure, the issues there. You need to run the operations around the technology itself. Obviously, someone can do it on their own if they want to use the open source version of it. But at the end, someone who's going to pay Meltano, they're going to be paying
Starting point is 00:22:05 like for something that's hosted by you. So, I mean, that's like what is like also the similarity with something like Fivetran or like even Airbyte, because, and I'm saying like even Airbyte, because Airbyte also have like an open source version of it, but at the end, like that's how they also make money. You go like to their hosted version and you pay for it. Right. So, and things will go wrong for you too.
Starting point is 00:22:27 Like Salesforce at some point will be like, no, we're not going like to reply on your request, like what to do, you know, and like suddenly like the pipeline breaks. Right. So what is like different in the experience that needs to be made, let's say, for a cloud-hosted product that makes it, let's say, much more convenient or native as an experience for a developer compared to, let's say, data analysis. Yeah, so a couple of thoughts there. We are doubling down on the command line interface
Starting point is 00:23:06 as the primary interface, at least initially for a managed offering. What we're talking with are kind of our early alpha users. And full transparency, we're in the process of building this. We're pre-alpha, but we have some folks lined up that are excited to use it. They're comfortable using the command line interface to interact with the product.
Starting point is 00:23:27 There will be an API as well if they need to kind of orchestrate things themselves. And the UI will come eventually at some point because we're just going to need some form of UI to check basic things and not everybody always wants to go to the command line to check things. But in terms of getting like your work done,
Starting point is 00:23:42 it's going to come from the command line interface primarily. The other piece is transparency around what's happening within the managed platform. Most likely, we will at least have, like, a source-available version of what we're actually running on the managed platform. Like, the code itself will be proprietary, but you can actually see, like, here's the code. A lot of this is informed, I think, by our GitLab history, where GitLab is, you know, they have a free open source version of GitLab, and then everything else is their enterprise edition. But you can see all the code, and you can actually make contributions if you want. And I think that's a really exciting model, because there are certain groups of people that will be able to say, hey, I want you to go ahead and manage it. But I'm also like smart and I can figure these things out. If I can help you
Starting point is 00:24:29 quickly figure out a bug, it's going to help me get my support ticket figured out faster. That's the second aspect. And then the third aspect is, hey, here's the actual code that's running for your tap and your target. If for whatever reason you need to fork tap-snowflake or target-postgres or whatever it happens to be, you can fork that, still run that fork on Meltano, and then we can work with you to merge it back into the main branch of whatever connector Meltano or ourselves are managing and allow people to quickly solve their own problems because there's a lot of downstream components that rely on data engineering instead of saying, hey, there's a problem with Fivetran and it's out of my hands. Some folks may want that because it does kind of shield them from whatever political pressure
Starting point is 00:25:11 they may feel inside. But for folks who are like, this is mission critical and I don't really care to worry about the deployment of the stuff, but I do like to know what code is actually running and if it's Python and if it's built on our SDK, it would be pretty quick to change it. So those are the kind of the paths that we're threading of what makes a better developer experience
Starting point is 00:25:30 and invites people into kind of how we're building this product and business. Okay. That's super interesting. So let's start with, like, the CLI experience. Why do you think, like,
Starting point is 00:25:42 CLI is, like, so important for a developer? And it's more, let's say, important than a graphical user interface? Yeah, it definitely speaks to a different audience and definitely a different persona. When you're on the command line, it's utilitarian. I think there are fun things that you can do to make the user experience more enjoyable. But there's nothing generally, if it's a well-designed command line, that's like getting in your way of getting the job done. I think it communicates hopefully to people that we're, like, here to get the job done and kind of get out of your way. And that's why I, like, fell in love with dbt as a product, because, you know, GitLab has never used, like,
Starting point is 00:26:30 dbt Cloud, it's only ever used dbt Core, used it from the command line, and it was just a very comfortable interface. And then it also works with all of these other tools that you have on the command line in Bash, built off kind of the Unix philosophy of piping things together. And so I think it does speak to that audience. And it's also, you know, for me, as I've learned more and more over my career about software engineering, it's like, oh, if you have a good, you know, kind of API back end, you can build whatever UI you want, but you can also build this command line. It's quicker, you can iterate faster, and if you want something, it's less work than building this whole UI.
Starting point is 00:27:08 So it enables us to kind of move and iterate faster and invites people in again to kind of contribute if they have ideas. Some of our features and flags and different commands were contributed by the community because, hey, I need to be able to add this to my project, but I don't want to install it. Cool, we took a PR for that to have a no install option.
Starting point is 00:27:24 And now it's available for everybody. So that's how I think about it. Yeah. That's super, super interesting. And like, how do you, like from a product perspective, like, I mean, you know, there has been like so much work done in like research and processes around like user experience, how like to run A/B tests, to figure out what's the right color there. All the stuff that we know about building, let's say, a very graphical experience for the user.
Starting point is 00:27:55 But what about the CLI? How do you figure out what's a good experience? How do you design a CLI? How do you do that? Yeah. I think we're trying to figure that out. I think there are definitely, there's prior art that we can lean upon.
Starting point is 00:28:12 I'm, you know, for me personally, I was a data engineer prior to this, and now this is my first true product role. So there's a bit of learning on the job. But the benefits of the way I think we're building Meltano is that it's, it is in the open, it's open source. We have this community and it's a great way to,
Starting point is 00:28:27 to get that feedback. Talking to people is some of the best way that I've found to just figure this stuff out. Like my takeaway from being, you know, doing product and talking to other product managers is just like the more you can talk to your users, the better off the product will probably be because you're integrating all of
Starting point is 00:28:42 that information. We also invite people in. Like, we'll usually have specs around, hey, this is what we're thinking for this specific functionality, whether it's like a new command and like, what are the subcommands? What's the structure? We also have fantastic engineers who bring their software engineering skills and say like, hey, this is what I would recommend. What do you think of this?
Starting point is 00:29:04 And me going, okay, yeah, the problem we're trying to solve, it does this, you know, here's kind of the overall ergonomics. So yeah, it's small iterations, and then doing it in a way that it's not, you know, fully irreversible if we needed to roll something back. Yeah, I love that. Like, I hope, like, one day you write, like, a blog or something about the experience of, like, building a CLI. Like I truly believe that there's a need. I think there's like a lot of experience with people that have built that stuff out there, but I don't think that like from the perspective of like the
Starting point is 00:29:36 product discipline, we have modified this information in a way that like people can go and like learn, right? Like and find this information out there. So I don't know if you ever do it, please let me know. I'd love to read that. but like people can go and like learn, right? Like and find this information out there. So I don't know if you ever do it, please let me know. I'd love to read that. Yeah. It's super interesting. It's something about like, I carry a lot of, so like, I'm very like curious, like personally, like how we can define like developer experience and how we can build like, we see like tools in a
Starting point is 00:30:03 more structured way tools you know yeah and more products are the way you know yeah i'm starting to you know doing it a relatively you know new job i think you you learn all the things you don't actually know so i literally i just started reading the design of everyday things i can't remember the author's name but excited to dive more into to design more broadly and just kind of bring everything to bear because a lot of like what i brought to the product job is you know at one point i was in the target persona and now i get to talk to a ton of people that are in our target persona understand where you know my experience is different from theirs and that's what has made this really enjoyable it's like i
Starting point is 00:30:39 get to help build a product that is solving problems that, you know, I experienced personally in the past and that I know a lot of people are experiencing today. And yeah, that's the fun part of being in product. There are also like fun parts that are not that fun, but we'll discuss that another time. Today, let's stay positive, right? All right. So, okay. I think we've got a good idea of how the experience of working with
Starting point is 00:31:05 Meltano is different. One of the very interesting problems when it comes to ETL solutions, one that has engineering, product, and business, let's say, consequences, depending on what kind of strategy we're going to follow there, is the connectors. At the end, without the connectors, there's no ETL. You need to pull data from somewhere and push the data somewhere else. And there's a lot of discussion about this.
Starting point is 00:31:35 There's a long tail of connectors out there. There are some very important connectors out there. How do you deal with that at Meltano? For example, I was browsing the website, like, really fast. I saw the comparison between Fivetran and Airbyte. You claim that you support 300-plus connectors, for example, compared to, I don't know, 150 or 200-plus for the others. What does this mean?
Starting point is 00:32:06 How do you adapt in a situation where you have 300 connectors? What are these? Why do we need all these connectors? Yeah. So that number comes from, we have our, it's called the Meltano Hub, where we're listing all of these connectors. And to be super clear, this is our understanding of the larger Singer ecosystem. So when Meltano was started, Singer was already a project, initially supported by Stitch, now Talend.
Starting point is 00:32:37 And when we say there's 350 plus connectors for Meltano, there are at least 300 connectors that we found in the wider community that other people have made that conform to the Singer specification. And that's where the power comes in, in these long tail connectors is you can write a connector and as long as it meets the Singer spec in terms of the data that's being output from this tap, it can be accepted by any target. We, for the longest time, really took a somewhat hands-off approach to the maintenance of the connectors themselves and said, okay, we're going to address some of these problems around transparency, around testing, around building new ones. But we haven't taken on the burden and the challenge of maintaining these as first-party connectors. That has actually shifted. We've now taken on, we're starting with a lot of the database taps and targets, but it really is like a decentralized, you know, open source community
Starting point is 00:33:31 where people say, hey, I have this connector, I'm going to build this tap and it solves my problems. Maybe it solves yours. And so you might need to fork the code. We are, you know, in an effort to be more competitive with some of these other tools, like I said, taking over the maintenance of the database taps and targets. But they are built on top of the Meltano Singer SDK, which is really a lot of people's first introduction to Meltano. They're like, oh, I need to build this custom connector for whatever reason, whether it's some, you know, weird API or they just want to pull some data internally, and for whatever reason they couldn't find it. People find us a lot through the SDK. And so we are investing heavily in improving the SDK.
Starting point is 00:34:09 We recently brought in a batch message type. One key part of the Singer spec is that every record is output on standard out in a newline-delimited JSON format that says, like, record, here's the data. That's good, especially when you're maybe coming from an API, but for database sources in particular, that can obviously be very slow. So this batch message type is basically a pointer to a file, where we'll say, hey, we're going to extract all the data, write it down to a file. The batch message gets sent to the target, and the target knows where to go pick up that file. And we're seeing, you know, 30 to 90x data flow improvement doing that method. Yes, so it basically means
Starting point is 00:34:50 there's a lot of, there's an active community. I think that's one of the differences too. If you look at Fivetran, they maintain all, you can't see the code and they're going to be limited in kind of the long tail
Starting point is 00:35:00 that you can support. Airbyte is, you know, in a better place than Fivetran because they are open source. They are currently in a monorepo and so everything kind of has to be in their main repo. I don't want to completely misspeak, but I don't know that you can run forks of connectors
Starting point is 00:35:15 within the main Airbyte platform. And whereas we're just saying it's good to have a decentralized system, and that's where Meltano Hub comes in, to show just how active the community really is. But it can be really hard to tell for someone on the ground, like, is Singer dead? I go into this Slack channel, but a lot of what you don't see is people just using it day to day, pushing gigabytes of data through these connectors, because it's not as transparent. And so that's what we've really tried to do with some of the features that we've brought into the market.
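
The difference between per-record output and the batch message described above can be sketched roughly as follows. This is only an illustration, not the Meltano SDK's implementation: the function names are hypothetical, and the `encoding` and `manifest` fields of the BATCH message are assumptions about its shape. The `type`, `stream`, and `record` keys of a RECORD message come from the Singer spec.

```python
import gzip
import json
import tempfile
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname


def record_messages(stream, rows):
    """Per-record style: one Singer RECORD message (a JSON line) per row."""
    for row in rows:
        yield json.dumps({"type": "RECORD", "stream": stream, "record": row})


def batch_message(stream, rows, directory):
    """Batch style: write every row to one gzipped JSONL file and
    return a single message that only points at that file."""
    path = Path(directory) / f"{stream}.jsonl.gz"
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")
    return json.dumps({
        "type": "BATCH",
        "stream": stream,
        # Assumed field names; check the SDK docs for the real shape.
        "encoding": {"format": "jsonl", "compression": "gzip"},
        "manifest": [path.as_uri()],
    })


def read_batch(message):
    """Target side: follow the file pointers in a BATCH message and load the rows."""
    parsed = json.loads(message)
    rows = []
    for uri in parsed["manifest"]:
        local_path = url2pathname(urlparse(uri).path)
        with gzip.open(local_path, "rt", encoding="utf-8") as handle:
            rows.extend(json.loads(line) for line in handle if line.strip())
    return rows


if __name__ == "__main__":
    sample = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
    for line in record_messages("users", sample):
        print(line)  # what a tap would write to standard out, row by row
    with tempfile.TemporaryDirectory() as tmp:
        message = batch_message("users", sample, tmp)
        print(message)  # one small message, regardless of row count
        print(read_batch(message))
```

The point of the design is that the tap-to-target pipe carries one small message instead of one JSON line per row, and the target does a bulk read from the file.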
Starting point is 00:35:46 Okay. That's super interesting. Okay. So how do you balance, like, quantity and quality of connectors, right? Because I'm pretty sure that if you talk to Fivetran, they will tell you, like, yeah, we have everything closed, but the quality of our connectors is super high. When you allow everyone to go and contribute out there, which is the complete opposite of that, okay, anyone can do whatever they want with the code that they contribute there. So how do you balance that? How can, let's say, Meltano, as, let's say,
Starting point is 00:36:28 a coordinator of this decentralized hub for creating connectors, help ensure the quality of these connectors? Because at the end, it is important, right? If I'm a new user and I see out there five different implementations of a connector for Salesforce, which one do I choose and why?
Starting point is 00:36:46 Right. And what if something goes wrong? Like, is it Meltano's problem or is it the contributor's problem? And if the contributor does not reply, you know, you have all these standard open source issues, right? That you have to deal with. So how do you do that? Like, as Meltano, right? Yeah.
Starting point is 00:37:07 I think, frankly, we're going to figure that out. It's absolutely going to be based on the SDK. And so what we're seeing with that is we're getting a lot of good contributions as people maybe discover weird quirks about a particular API that they're working with. They'll implement the fix in their connector and that improvement comes into the SDK. And so likely like Meltano is not going to offer support for
Starting point is 00:37:28 connectors that weren't built on Meltano SDK. But as it makes sense to say like, hey, a lot of our users are using Facebook or Google Ads, you know, a lot of the marketing ops type data sources. If they're built on the SDK, I think we will absolutely start to take on the maintenance of those. Because that solid foundation, you know, one improvement for a particular connector can spread out across all of them. I think the other balance is recognizing that people do have like different quality and stability needs. Some folks are fine with a community tap that maybe isn't fully tested, but they can just try it out and see and see what happens. One of the things that I haven't mentioned about Meltano is that it has this native understanding and built-in feature around environments. And so if you have a staging table, or if you want to write locally to DuckDB,
Starting point is 00:38:19 you can test out the quality and the capabilities of different tools, particularly, you know, taps and targets, in a safe manner. And then if you like what you see, you can just run that in production and override certain configuration. And Meltano makes that easy. And that's kind of like the software development principle of having testing
Starting point is 00:38:39 and continuous integration and things defined in code: you can have the safe space to test things. So I think for us, as we actually build out the managed offering and actually start to onboard customers, we'll have these conversations around, like, well, what are the data sources that you want? And we'll kind of go from there. But the thing that's interesting is a lot of these connectors actually work really well for the majority of people's use cases. And it's only when you start to really push the boundaries hard on some of the data volumes that it starts to maybe be challenging for some particular data teams.
Starting point is 00:39:13 And so I'm just, I'm excited to have those conversations and see what we need to do. But, like, it's absolutely going to be based on the SDK. I actually have a question for both of you, because both of you have such deep experience in this world. One interesting thing is, if you need something, let's say, you know, modified or custom that isn't offered out of the box by a black-box SaaS provider, a la, you know, Fivetran or whatever, one of the challenges I think a lot of companies run into is, okay, well, we'll run sort of these core pipelines in, like, a Fivetran and use the interface and set it and forget it. But then you go from there and you build something custom, or even use open source technology to manage something custom. And so now you're managing the same basic data flow across two very
Starting point is 00:40:11 different ecosystems, but it's basically the same process. Orchestration becomes hard. There are a number of challenges there. One thing that's interesting to me, just hearing you talk about that, Taylor, is that, okay, so you have, let's say, supported connectors, you know, or taps that are, like, core or whatever. But if I need to develop something custom, I'm not actually going to a completely different ecosystem. That's, like, fairly compelling. Is that part of the thesis? And Costas, does that make sense to you, having built similar technology? So I would say absolutely part of the thesis is if you are quickly able to solve your own problem
Starting point is 00:40:54 and then fork the code and run it, as long as it conforms to the Singer spec, and I'm sure we'll have some guardrails around that where we're validating that it outputs Singer data. But you should be able to run that with, like, the managed Meltano platform, because you could run it
Starting point is 00:41:08 with self-hosted Meltano. So with a managed platform, you should be able to run that. And that way, you aren't forced to either go, I'm going to go buy
Starting point is 00:41:15 another SaaS tool that happens to randomly do this or I'm just going to stand up some random Python script. Yeah. We can help you like have those best practices
Starting point is 00:41:24 while quickly solving your problems. And then once it's up and running, you can kind of, behind the scenes, incrementally bring it back into the fold of, like, the well-maintained, mature data process. And you don't have to, like, reach for these other tools. Yeah. For me, what is very interesting with that, and just to add to what Taylor was saying about the developer experience, if you want to define developer experience, you have two very important interfaces.
Starting point is 00:41:59 One is the CLI, and the other one is the SDK. And there is a reason that the developers need access to both of them. Like, okay, we can chat a little about that. But having access to an SDK that you can use to modify the behavior of the system in a predictable and like safe way, it's super important when we are talking about like something that it's consumed and it's used as a system by a developer. Now, obviously like a developer will prefer to have the connector there working, right, like not have like to write that, or wouldn't like that, right?
Starting point is 00:42:37 But that's why you're an engineer because there are edge cases, there are like issues that you only care about, that's why you're in the company, and you might have to be able to extend the behavior of the system that you are working with. And that's, I think, a very big difference between developer experience and user experience is that user experience is like super guardrails, right, like what you can do on a user interface is defined by the visual components that are there with predefined behavior. While when you're talking about developers, you need also to give them,
Starting point is 00:43:10 let's say, the tools to extend or change somehow the behavior of the system. Right? And yeah, it makes total sense when you're working with this persona. Now we can debate if this persona is like the best persona for this problem, which is moving the data around. My opinion is that it is. But someone else might have like, I don't know, like, I might have like a different opinion and that's like fair, right?
Starting point is 00:43:35 That's why we're competing out there. But yeah, I think, for me, it's a very interesting approach to solving the problem, because traditionally a big problem with these platforms was that, okay, this is an open set of connectors. Like, how do you maintain that? Like, that's not scalable. You cannot have an organization with an army of developers out there who are maintaining, like, every silly connector for an API out there. And by the way, it's super hard to find people who want to do that job. Anyone who has tried to hire developers who are going to maintain connectors,
Starting point is 00:44:17 they know how hard it is to do that. So building this developer experience, I think is like a response to like how we can build like a scalable solution to the problem like moving data around so yeah I think the point that really stuck out to me
Starting point is 00:44:32 what you were saying was like the modular and like being able to extend it and it's definitely you know how we kind of built Meltano generally recently we've taken an
Starting point is 00:44:42 effort and this is moving away from the singer side a little bit but out of the box with Meltano, you can run dbt, you can run Airflow, and we've been, that's been pretty consistent for a while now. But now we've developed what we're calling an EDK, an extension developer kit, and basically solving the problem of, if I wanted to change how Airflow or even dbt was integrated with Meltano previously it took a lot of effort to do that you had to understand both the code in the Meltano code base and then like what other like weird repos we might have had for how dbt gets installed or how Airflow
Starting point is 00:45:16 gets installed, and then also, like, the Airflow DAG generator that we had. The EDK comes in to basically have a single repo, have a, you know, similar developer experience to the SDK, to make it easy to add new components that run well in Meltano. So we've rebuilt, they're in kind of preview mode and they probably won't be in GA for a while, for Superset, and we have the community contributions around Dagster, Elementary, and a couple of other tools that are built with the EDK, which gives you basically the wrapper around how this tool interfaces with Meltano. And I'm really excited about it
Starting point is 00:45:57 because it paves the way for the future for this larger data ops platform that we've talked about and hinted at. And with our managed offering, like, you'll be able to run dbt on cloud as well. It's not just for the EL side of things, even though that's what we're focused on. So that's all in an effort to make it, you know,
Starting point is 00:46:15 your data stack more composable, with a really good developer experience. That's super, super exciting. Okay, I'm going to stop asking questions about developer experience and connectors, because we can continue doing that for days. And I have one last question and then I'll give the mic to Eric.
Starting point is 00:46:35 So you mentioned a number of additional tools out there, outside of, like, ETL and ELT, like the connectors. So there is this new concept of DataOps, right? And I would assume that it's the context of DataOps that includes also, like, orchestration and quality or modeling and all that stuff. So I want to ask you, what is DataOps for you, for Meltano, and how does it relate to Meltano itself as a product? Yeah. So DataOps, I think I really give a lot of credit to the folks from DataKitchen, because they have their DataOps Manifesto, which I've looked at a number of times across my
Starting point is 00:47:20 career. And frankly, I think it does a fairly good job of describing the idea and the philosophy on it. The majority of the pieces that are or the items that are listed, I think they have like 18 or something like that. A lot of them recognize that the DataOps term is really about processes around people. A small part of DataOps is a technological solution. But the problem I think that DataOps as a term kind of addresses is just about recognizing that a lot of data problems have people problems and that there is a technological component to it and that there's a way of working that enables you to achieve the outcomes you want faster, more stably, with a higher level of quality,
Starting point is 00:48:05 and frankly, in a way that's maybe more enjoyable to do. I think the reductive way of talking about data ops is that, oh, it's just DevOps for data. That doesn't fully recognize that there are stark differences in working with data, particularly around orchestration and managing state, and that things like CI/CD are great but can be way more challenging when you're talking about working with a Snowflake database or working with multiple terabytes of data. So for me, DataOps, I think simply is just a bit of a marketing term talking about a way to work better as data professionals,
Starting point is 00:48:47 recognizing that building your data platform and building your data practice is a lot more akin to software engineering than it is to maybe another discipline. For Meltano specifically, I think we really lean into that software engineering side of things, of building your data platform like it was a software engineering product. And I think that manifests in how the features of the product look and how people experience them through the YAML files for the command line interface. But yeah, I
Starting point is 00:49:14 think in a lot of conversations I've had with folks, people like, they've heard about DataOps and they get excited, but again it comes back to like, what problems are you experiencing? And for us, it's there are better ways of working. And we believe a lot of those are working more like software engineers than working like another type of, you know, tech worker.
Starting point is 00:49:33 That's great. I think, Eric, we should try to have an episode about data ops and just chat about that, like get some people to, like... I think it would be awesome. Yeah. Yeah. And you should be part of the panel there. Like, we should do that. I think it's very interesting, like when we have like new terms entering like an
Starting point is 00:49:56 industry and being able like to, you know, like clarify, like make it more clear of like what this thing is, right. Because that's the, that's the problem you see, like, and that's, by the way, a problem that is caused a lot by marketing because the terms themselves, like, okay, they have their own meaning. Like whenever like a new term arises, I think there is a reason for that. But marketing is trying to like really aggressively capitalize on that and use it as a way like to communicate something.
Starting point is 00:50:28 And many times, problems arise from that. I've seen it a lot with concepts like data mesh, for example, right? Which is like, okay, if you read, at the end, what the data mesh is, it's okay, it makes sense what you are reading there, right? But you have such aggressive and, in some cases, also bad marketing happening around them that it really destroys the semantics behind it that are communicated to people, and that hurts the industry at the end, right? So I feel like if we can have discussions with people that
Starting point is 00:51:01 you know they are like experienced and they have like a very honest like approach and not, again, it's not, I'm not going against marketing here, right. But just trying to describe reality. I think it's going to be very beneficial, like for the people who are like listening to the show too. I think we should do that. Putting our product hat on, I think just like focusing on the problems that people are having and that data mesh, data ops, data contracts are tools that are trying to solve problems. And I just like being honest that like a tool is not going to magically solve your problem. There is always going to be some sort of people aspects that you have to deal with. But I do believe that technology can enable better ways of working. And so I don't know.
Starting point is 00:51:44 I don't know that conversation. We would have the full definition of this is what DataOps is forever and always. But inviting people in to understand these are the problems we're trying to solve and this is how this came about, I think would be very beneficial. Yeah, let's do that. Eric, all yours. I love it. Well, we're at the buzzer. So I have several more things to discuss,
Starting point is 00:52:04 but we're going to have to do it on another episode. I will say right here at the end, though, this episode has confirmed my theory, Costas, which I opined to you about in a recent Shop Talk episode about logic moving further and further down the stack. And I think CLI is the best example of that, right? It's going lower and lower. So it's been very validating for me in terms of that theory about business logic being expressed as code.
Starting point is 00:52:34 So thank you, Taylor, for validating one of my wild theories. And congrats on all the work you've done at Meltano. What an ecosystem. I mean, amazing contributions and best of luck as you continue to build. Thank you so much for having me on. I really enjoyed the conversation and glad I could confirm your hypothesis around the industry. What a fascinating product. And my big takeaway is that you don't hear this very often, but Meltano as a company has a huge vision for being a data ops layer for the stack. But they really listened to their customers
Starting point is 00:53:23 and went back to the main pain point that their customers had, which is actually on the pipeline side of things. And so I just think that takes a lot of courage as a company to say, we have this grand vision of what we set out to build, but we're probably too early for that. And so we're going to listen to our customers
Starting point is 00:53:46 and go back to those components of the product and make them better so that we can better serve those customers. And I was just really impressed by that. I think that's such a refreshing thing to hear. It doesn't sound as cool as, you know, we're breaking new ground with a data ops layer,
Starting point is 00:54:04 which they actually are doing that. But they're also just making a lot of things way better about their core product and the core problem they solved and what they're hearing from customers. And so I just really appreciated that. Yeah, a hundred percent. I think what you just described is let's say a proof of like the quality of the people that they run both the business and the product, the company, so that's not easy to achieve. And like I think we should congratulate them for that, right?
Starting point is 00:54:40 And I think you can also see how valuable it is to have someone leading your product function who comes with a very deep knowledge and understanding of the problem space. And it's awesome that this is happening here, because Taylor was a practitioner. Like, he was dealing with this. So he can empathize with the user, and he can iterate much faster on, you know, converging on the solution, compared to other products out there.
Starting point is 00:55:17 So yeah, that was like super refreshing and super encouraging. And like, it was like lovely to chat with him and hear all the opinions and share the knowledge that he has about how to build a product that is going to be successful in the long term and not just trying to capitalize on the hype today, which is great. Yep, I love it.
Starting point is 00:55:41 Well, if you enjoyed that, many more great episodes and guests to come. Subscribe if you haven't and we'll catch you on the next one. Eric Dodds at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
