Programming Throwdown - 142: Data Ops with Douwe Maan
Episode Date: September 12, 2022

Douwe Maan's journey sounds too fantastic to be true, yet the tale that Meltano's founder shares with Jason and Patrick today is very, very real. Whether it's about doing software development by 11, joining GitLab while juggling college responsibilities, or building his own company during today's challenging times, he has quite the story to tell. In today's episode, he speaks on Twitter, his perspective on remote work, and why data operations are a critical part of developer stacks in today's world.

00:01:00 Introductions
00:03:44 Hustling online at 11
00:08:08 From iOS to web-based development
00:10:20 How Douwe balanced school and work
00:12:05 Sid Sijbrandij
00:19:13 Why Twitter was integral in Douwe's journey
00:21:01 What Meltano offers for data teams
00:22:01 Remote work
00:30:59 GitLab's data team and what they do
00:44:40 What tools do data engineers use
00:47:40 Singer
00:50:26 Game designer travails
00:58:59 Where data operations come in
01:05:12 Getting started with Meltano
01:12:00 Meltano as a company
01:22:09 Farewells

Resources mentioned in this episode:

Douwe Maan:
Website: https://douwe.me/
Twitter: https://twitter.com/douwem
GitHub: https://github.com/DouweM

Meltano:
Website: https://meltano.com/
Careers: https://boards.greenhouse.io/meltano

Singer:
Website: https://www.singer.io/

Mergify:
Website: https://mergify.com/

If you've enjoyed this episode, you can listen to more on Programming Throwdown's website: https://www.programmingthrowdown.com/
Reach out to us via email: programmingthrowdown@gmail.com
You can also follow Programming Throwdown on Facebook | Apple Podcasts | Spotify | Player.FM
Join the discussion on our Discord
Help support Programming Throwdown through our Patreon

★ Support this podcast on Patreon ★
Transcript
Hey everybody! Today we're talking data ops with Douwe Maan. Take it away, Jason. Hey, everybody. So pretty excited. We have an
awesome guest to talk to us about a really interesting topic that is really on the vanguard
of things. It's data ops, which a lot of folks might have not heard of, but we're going to
cover that topic in depth right here today. And I'm so excited that I have Douwe Maan on the show
to, you know, he's really an expert in this field to go through it with us.
Thanks so much for coming on the show, Douwe.
Thanks for having me. Excited to help your audience learn about this cool new merge of data and DevOps.
Cool. Great. Really excited to have you here.
Before we get started, why don't we jump into your background as an engineer
and technologist. How did you get into the field? And how did you get your first job? And how did
you get to where you're at right now? All right. Yeah, we can go back quite a way. So when I was
growing up, there were always computers around the house. My father had been the first person
in his family to have a computer.
And this wasn't just a Windows machine, Internet Explorer.
It was always Linux of some kind.
So I grew up with Mandrake and Red Hat and SUSE Linux.
And there's a whole list of distributions that were tried out.
And very quickly, I learned that computers were not just this device that you use for email or word processing,
but it's something malleable. It's something to debug at times, you know, Linux being Linux.
And it was something that I saw as this whole world that potentially opened up for me as an avenue for creativity. Yeah. Were you worried about bricking your computer? Because
I mean, that was one thing that I was worried about when I was in high school, that I would
somehow irrevocably damage this machine.
Did you have that fear or were you kind of inoculated from that?
I think that because we had all these different Linuxes at different times anyway, the fear of just messing up the operating system wasn't that high, because we were used to reinstalling stuff every couple of weeks or months anyway. So as long as the hard drive with the data was fine, we would be fine.
And actually bricking a computer is still pretty hard,
or at least it was back then.
I guess in today's slightly more closed,
controlled environment there,
that possibility is larger,
but also less anyway.
Anyway, I know that was never a fear,
but I learned really quickly that there was something
where I could unleash my creativity.
So when I was nine years old,
a friend realized that when you go into Microsoft Word and you hit the save as HTML button, it would then open in
your web browser rather than the Microsoft Word application. And this unlocked a whole world for
us because this web browser, which had previously been the domain of big brands and professionals
and adults, was suddenly something that we, being little nine-year-old kids, could manipulate and
get our own stuff to show up.
So this went from local HTML files,
and then we realized that there's such a thing as a web host.
And of course, to us, it was crazy to ask my father for permission to spend like five bucks a year or something for a domain name,
but he relented.
And we were able to actually have a website
that we could share with our friends,
so they could sort of see our musings.
And pretty quickly, this went from HTML and CSS to learning PHP through some tutorial that was available online. And pretty much every day after high school, or primary school initially, when we would come home and our friends would go and play video games, we would go and program and learn new technologies and build little nifty tools to scratch our own itch. And this led, pretty quickly, at age 11, so just at the beginning of high school, to finding this online community where people in the Netherlands, which is where I grew up, were posting kind of projects that they wanted to solicit bids on, basically. They wanted to find developers to implement whatever their company needed, or some small PHP script. Is this like a physical bulletin board, or is there a website with job listings? Like, how did this actually materialize? So it was a website, a web forum called Sitedeals, where one of the sub-forums was about, like, I need this to be done, and then people could just respond with, I'll do it for this amount at this time. And us being 11-year-olds, we could basically undercut everyone with our prices. So we ended up working for, you know,
the equivalent of about $5 an hour
all the way through high school.
How did you masquerade as adults?
Like, how did that work?
I mean, there's no way they were actually fooled
because we didn't have any kind of tax ID.
It was just, you know, money being wired
straight into our bank account.
We were 11-year-olds,
and I'm sure that our writing
reflected that to a degree.
But we still had to
learn how to do contract negotiation and requirements and, you know, due dates, deliverables, multiple stages of delivery and payment, and everything else. So we learned pretty quickly how to be pretty convincing, not so much at being adults as at being professionals who could actually build stuff that worked, and that was the only requirement people had for us at the time. So all through high school,
it was a lot of that freelance web development. And then at some point when the iPhone came out
and the iPad came out, I was really intrigued by that platform. And I taught myself Objective-C
as well and Cocoa Touch. I saw something really interesting about Objective-C. I don't know if
this will blow your mind, like it blew my mind, but you know how in iOS there's NS everything, like NS object,
NS socket. Next step, right? Yeah, that's right. It's from next step. So short tangent here, but
Steve Jobs left Apple, was kind of kicked out of Apple, started a company called Next and built
this machine. Was the machine called the Next Step?
I mean, it's starting to hit the limits of what I know here.
But anyway, so they started calling everything NS
because of Next Step, because of the Next machines.
And then when Apple acquired Next
and Steve Jobs got back in,
that has stuck around to this day.
So that blew my mind.
I've seen NS object everywhere,
but I never knew why there's NS on everything.
Yeah, it's really interesting.
Like some of the, you know, of course, Apple, during those years Steve Jobs was away, was not doing as well as they are today, for example.
And the technology that really was at the foundation of everything from the modern Mac
OS to the iPhone and everything, definitely it came from Steve Jobs' other company, Next
Step.
Super, super interesting.
But yeah, I taught myself that.
I taught myself all the NS prefixes and this really wacky, you know, square bracket syntax
that Objective-C came with.
But this led to me building, you know, an iPhone app for my high school that pretty
much every student and teacher ended up having installed on their phones, which allowed them
to really quickly check the schedule, where they had to go for the next class, but especially also the changes in schedule when, you know, the room was changed, or when some kind of
class fell through, because the teacher was out sick. And I didn't, I never did freelance iPhone
development. But being sort of active in this Twitter, iPhone development, teenager space,
led me to getting into contact with some people in the
Netherlands who were setting up a Mac and iPhone development studio where they were going to be
building productivity applications basically for the Mac and the iPhone on the App Store.
And I ended up joining them and then really quickly becoming the lead developer at age 16,
that must have been. So for a few years there, I was after school, every Monday after school,
I would go into the office and then
work with them there until like 3 a.m. Um, and then during the week, try to fit in a few more hours of work. And that was really exciting. A few of our applications got featured on the App Store. But after a few years, we realized that it was actually pretty hard to make a lot of money on the App Store. In the beginning, when your application is new, you can drive a lot of traffic, you might get featured, blogs might cover you. But then after that, it's hard to keep that up, especially if your first version is already pretty much exactly what you had in mind and there isn't a ton of surface area left for improvements or additional features. So they decided to wind that company down. And then one of my bosses at the time, he had decided, uh, he was exploring this new startup with a previous business partner of theirs, and they decided that they wanted me to join as co-founder and CTO. So at age 18, I co-founded a company with them called Stingo, where we built a web-based platform for bed and breakfast owners to manage their entire online presence. So that includes their website, their guest communication, their reservation calendar, and basically providing a whole tool set for what an old-school bed and breakfast, not just a room on Airbnb, but one of these mom-and-pop shops that really care about their house style, and they make you breakfast and all that stuff, everything they need to sort of enter into this modern era. And I was
the CTO and built the entire Ruby on Rails monolithic application that ended up powering
this platform.
So that was super exciting.
I did at the same time, you know, coming out of high school,
decide to go to college.
I'd already realized that I didn't necessarily need it to get a job,
but I did still want that sort of college opportunity.
And the nice thing was that most of what I had learned already meant that I could skip a lot of studying for stuff in college.
And I did get to sort of deepen my knowledge in areas like cryptography and database internals
and big O notation, data structures, data algorithms, and all of that stuff,
which hasn't necessarily been useful every single day in my later life in industry.
But it did give me the confidence that I wasn't just making sort of newbie mistakes in writing
my code without any kind of understanding of the fundamentals that made one path slow-performing and another path much more optimized.
That makes sense.
How did you deal with that?
I mean, so you're CTO of this company, and the company is at its early stage, you're trying to find product-market fit and a bigger market, right, and everything like that. And you have to balance that with a full course load. And I mean, at some point you're probably thinking, if this company fails because I really wanted to pass Calc 3, like, I kind of messed up. But on the other hand, it's like, you know, once you have the college diploma, you have it for life, and you can go through a hundred companies, you know, and you'll always have that college diploma. You won't always have that job, right? So it's like, how do you strike that balance? I mean, I feel like it's a really difficult situation. Yeah, that's a really good question. Although my situation with that company, Stingo,
it wasn't really the typical startup, let's co-found it, let's give it 100 hours a week,
let's go for the stars. The situation was such that I was 18 and my co-founders were early 30s
and early 50s. They were at very different stages of their lives. The 50-year-old, who was essentially the CEO, had a big background in this whole bed and breakfast world. He had a ton of connections in the Amsterdam bed and
breakfast scene, essentially. So we were able to onboard a lot of users and customers through his
network immediately. But since we were all in these very different places in our lives,
this ended up being initially bootstrapped by this guy I was just talking about. We didn't
raise any venture funding or anything like that.
And we all saw it a little bit more as this is a lifestyle business.
We're going to spend as much time on this as we can get away with,
which in my case was about 20 to 30 hours each week.
And we were just going to build something for this market
that we knew existed in Amsterdam.
But actually, because of these very different places in our lives
and me being eager to like, okay, let's go for it.
Let's make this take off.
Let's build a billion dollar company.
I was eager to go all in.
And they were in different situations.
One of them was pretty close to retirement and starting to think about, okay, how can I cash out and like do other stuff?
The other guy had some kids, so he was more interested in some sort of stable income, right? Rather than the high-risk, high-reward startup situation.
And that's actually what led us to decide
after about three years running that company
to wind it down.
And I was still in college at that time.
I was also pretty active
in the sort of Ruby on Rails meetups in Amsterdam
and Utrecht, this college town where I went.
And through one of these
and a conference that I went to in Athens, in Greece,
I ended up running into this guy.
His name is Sid Sijbrandij.
If you're very aware of companies that IPO'd last year, you might know where this story is going.
But this was a guy who was building essentially a company around an open source clone of GitHub called GitLab, which was this open source project that had come up in
Ukraine in around 2011. And he realized that there was a business opportunity around it to start
offering support or sort of a hosted edition or enterprise editions. And at the time I met him,
this was just four or five people in the Netherlands. And we kept running into each
other a few times. And interestingly enough, his parents had a bed and breakfast
in the north of the Netherlands.
So his parents became customers
of this platform I had built single-handedly
from the code side of things.
What happened to that platform?
So when you say you wound it down,
did you sell it to another private equity
or what happened?
We did briefly consider looking for buyers,
but we ultimately decided
that since the platform was pretty much done,
people were really happy with that.
We could just keep those paying customers going for a number of years.
So we stopped all development and we stopped onboarding new users,
but it stayed up for a while.
And it was only last year, about five years after we sort of stopped investment,
that we wound it down when the user count had dropped below,
you know, a mark
where it wasn't sustainable anymore.
But for a number of years,
this was passive income,
which was, of course, very nice for all of us
while we were working on new projects
and new job opportunities.
Oh, that makes sense.
Very cool.
So I met this Sid Sijbrandij
who was building this company called GitLab.
We kept running into each other
at different meetups around Europe.
And eventually when I was in this position
where I was looking for new projects again,
I reached out to him.
He had already previously told me like,
hey, Dawa, if you're ever interested,
come see if you can come join us at GitLab.
At that point, it was just one startup out of many
that were sort of in my vicinity.
I don't even remember exactly what it was
that attracted me to the project,
although open source and building something
for developers was of course massively attractive being a developer myself
And I ended up joining GitLab as employee number 10, just when they were going through Y Combinator. Uh, that's 2015, I think, while the entire team was in the Mountain View house. And I was, you know, I wasn't able to join because my professors wouldn't allow me to take that much time off. But I did get to join the company at that stage.
Well, let me see if I can understand this.
So at this point, are you a bachelor student or a master student?
Bachelor student.
Yeah.
So I originally joined GitLab part time.
Got it.
And so when you say you're a professor, you mean like one of the people who's teaching a course that you were taking?
Yeah.
One of the lecturers.
Exactly.
Got it.
Okay.
He wouldn't let you move to Mountain View and do everything remote? Yeah, so they were pretty understanding when it came to, oh, you know, I had to go to San Francisco for a week at some point, for a course that was fine. But to say, I'm going to take two or three months off and, like, I'll have to catch up on all of the tests afterwards that we've done remotely, that definitely wasn't an option at the time. So, you know, in academics in the Netherlands as well, there's a little bit of, they don't quite understand sort of the tech startup industry, Silicon Valley world. Of course they look up to it to a degree, but they also think of it as less-than, because they are, like, professors pushing the limits, and it's different. Um, oh no, no, we're not going to give you three months off to work on some startup, that's not real anyway, you're just going to work for some kind of consultancy in the Netherlands.
That was unfortunately very much the mindset of some of the lecturers there.
Not to say anything bad generally about the education.
It's really great.
It's just a very different attitude and focus.
Yeah.
I want to jump into one thing really quick.
I'm assuming, so like, you know, you finish college and then you go to Mountain View to
work for GitLab?
No.
So GitLab from day one had been an all remote company.
It started as an open source project
that was founded in Ukraine.
And then they attracted open source contributors
from all around the world.
So there were already hundreds of active contributors
by the time that this Dutch guy Sid decided,
hey, there's an opportunity to build a business
around this project.
And at that point,
it didn't make sense to start an office somewhere
and then hire some of these contributors and ask them to move. Rather, we were a tiny little pocket of like six full-time people in this community of hundreds, and we were working with them. So we were on GitHub initially, using all of these sort of asynchronous tools to collaborate with these people, no matter where they might be. And from that day, GitLab never changed from its remote work policy. And GitLab, just before the pandemic at least, was the largest all remote company in
the world with 2000 people across 68 different countries and territories. So I joined GitLab
when I was still in college, still in the Netherlands, but half my team, or half the team I worked on, was also in the Netherlands. The rest of them were located elsewhere.
And I was able to combine that with
college for about a year and a half. By the time that that was done, GitLab had grown from 10 when
I joined to probably 100 or so. And I stayed working there for a while from the Netherlands.
But since it was all remote, I also jumped on the opportunity to go and travel and meet these
colleagues around the world. So in 2016, actually, like a month after I finished my college bachelor's degree
and was sort of no longer moored to one particular place,
I went on a six-month trip where I visited 49 of these colleagues
in 20 different cities in 14 countries on five continents
in the space of six months, which was just an amazing experience for myself,
but also the people being visited.
Getting to work with them for five days,
seeing their natural habitat,
their local coffee place, their home, their kids.
And then taking Saturday to see the city that they lived in
with an actual local tour guide.
And then Sunday we would be on to the next location,
which was an amazing experience.
Yeah, so when you were still moored to the Netherlands, we have people listening to our show from all over the world. And a lot of them will write in and say something like, you know, I don't live in a tech capital. You know, I don't live in San Francisco or Miami or Austin or New York or any of this. You know, how do I meet people locally, you know, and have that face-to-face with people who are interested in these kinds of things? I think you kind of, uh, touched on a little bit of that, but what sort of advice would you give to those people who live in, and I'm not putting Kansas City down, Kansas City could be amazing, but I was just like, Kansas City, you know, what advice would you give to, uh, to people who live in Nebraska or something? Well, I mean, I'm not going to touch, uh, any prejudices about US states, but you'll find those like-minded people if you just look hard enough.
What made a really big difference for me is finding this community in the Netherlands called Young Creators, which was basically all high-school-aged kids who had somehow been a little bit more entrepreneurial than their peers. Either they'd been designers or programmers, or they'd been starting little lemonade-stand companies
from a very young age.
And it really wasn't until I found that group
who had, I think, monthly meetups in Amsterdam at the time
that I had to take like a one-hour train ride to get to
until I felt, okay, I'm surrounded by people like me
and we can chat about the same stuff
and we have the same sort of dreams that go beyond
just the space where we grew up. And how did you find that group? Twitter, honestly. Like I said earlier, this job opportunity at this iPhone and Mac development studio came just from being active in the sort of Twitter iOS development space. And then through that, I came into contact with some Dutch people. Of course, there's a good amount of, like, really great Dutch iOS developers as well. So just by following those and interacting with that sort of part of the internet, I met those like-minded people.
Now, Patrick, you have to make a Twitter account. See, this is why.
Oh, I don't use Twitter anymore.
Oh, no.
He doesn't need it anymore.
I'm off the hook.
Oh, I just shot myself in the foot. So, okay, what would you do now? Let's say you were
16 years old over again, but in 2022,
how would you go find those young creators nowadays?
Snapchat.
TikTok, apparently.
I was talking to some Gen Z kids last week, and it just was, I mean, I'm 28.
I don't think of myself as old, but I was so out of touch
with whatever the world today looks like to them.
So probably just TikTok.
But honestly, sort of tying this back again
to how GitLab also worked
and how GitLab built this massive
all remote team around the world,
contributing to open source projects
is an amazing way to interact with
and sort of learn from the best
and show your work to people
who might actually hire you remotely.
Like the first couple dozen engineers at GitLab
were all open source contributors
who either they applied or we reached out and said like,
hey, do you want to do this full time?
Because by the time that they had proven themselves
with high quality contributions
and really great sort of async written communication skills,
it was a known quantity.
And it was so much easier for us to bring some of those in
than to go through a whole hiring process
and not know what you get.
So I would say if you don't just want to meet like-minded people, but also find a way to
potentially get a job opportunity out of it, or at least build code that can sort of form your
portfolio, joining the open source community for a commercial open source project is a really great
way to start. And GitLab is a great example of that. But Meltano, just to sort of, you know, throw in the name of my company, is another one.
It's an open source platform for data teams to work more effectively on all of their data movement and data transformation challenges with the software development best practices built in.
And it's an open source Python project.
And a good amount of our current team of about 17 people started out as open source contributors or generally users of this tool.
And they, too, are based around the world.
We don't have an office either.
So we're very much sort of following in GitLab's footsteps there.
And that's a way in which the world today is also very different from when I started,
you know, in terms of job experience almost 19 years ago, when remote work was still weird
and people didn't really trust paying someone across the internet they only saw through Zoom.
GitLab was a pioneer in that, the pandemic of course has made that extremely
mainstream and there really is no reason for you wherever you're born not to tap into those kinds
of job opportunities if you are able to get on their radar and show your work and open source
contributions are an amazing way to do that. Yeah, I mean, the whole remote work thing is still
unfolding in really interesting ways.
There was an article Malcolm Gladwell wrote, I think yesterday, I mean, very recently,
where he was saying, basically, he was saying remote work is unhealthy, unproductive, etc,
etc.
I don't agree with that at all.
So let me just put my own opinion out there.
But I mean, that's somebody who has a ton of respect around the world and has clearly
fallen on one side of the fence there.
We haven't achieved any type of consensus.
We're still, I think, as a global community trying to figure that out.
But I do think it's fascinating.
I think that, as you said, the pandemic really forced the question on the world because I
think there were a lot of companies that were
very against remote work and the pandemic forced them to embrace it for at least a year
and so that now, it's like, well, they're on much more shaky ground with that argument. So, um, so yeah, the whole thing is fascinating. Yeah, and I mean, I can talk about this some more. Like, GitLab was and is one of the companies
that had made remote work work at scale by far the best.
And anyone who has worked remote during the pandemic
should not think that they now know
what remote work is like.
It's extremely different if you set up a company
intentionally from day one to be remote.
You design all your processes
in this sort of async compatible text-based way primarily
with all kinds of processes to get that social interaction despite the geographical distribution.
Very different from if a company has to suddenly change.
It's not super motivated to change its ways because it thinks it's just temporary.
The people that are forced to do it didn't choose to do it.
That also makes a big difference.
Remote work doesn't work for everyone. It also definitely doesn't work for every industry. But if you're a company that is building software, all of the tools we use, the GitHubs, the GitLabs, the Slacks, the, you know, whatever else we have on our computers, it's all built around this async stuff anyway. And even if you're working on a computer in an office, you're probably talking with, you know, the person one office over over all of these anyway. So it is an industry that is particularly well suited to it.
But then also on the people side, it's really important that the people know what they're
getting into and know that this would work for them.
If you need that constant social interaction in order to feel connected to your company
or to feel productive, then it's never going to be a good experience.
But if people apply for a remote work job intentionally, and they know I prefer to work
from home, I want to be able to spend more time with my kids,
I don't want to have a commute, and I know
by myself that I'm pretty good at just working on a computer
for a number of hours without seeing people,
then it makes a massive difference too
if everyone is sort of into it and
equally bought into the concept.
And then also the intentionality
of the processes you design and the
things you introduce to balance
out some of that
lack of face-to-face interaction
make a huge difference.
Like at GitLab and also now at Meltano,
it's so important to still have people meet occasionally.
So every nine months,
we fly the entire company into one place
and we get like five days or so
of quality in-person time,
which really builds those relationships
and that rapport
and that sort of foundation of understanding
on top of which you can work productively and feel like you can give each other constructive feedback
without being unsure about how that will be received.
And then also at the same time, like I mentioned earlier,
allowing people to travel and visit each other makes a huge difference.
Those six months that I spent traveling around, visiting all of these colleagues,
made a massive difference for me and the people being visited.
And this actually turned into a policy at GitLab where if you travel to visit your colleagues, the company will, most of the time, pay most of your travel expenses.
That's a really good point. Yeah, it's a really good point. You know, if a company isn't paying to keep a desk for you somewhere, that money can get redirected to these types of events, which are probably much better, uh, you know, because you're going with intent. Exactly, it's intentional. So yeah, you save money from the office, you save money from, you know, all the amenities you need to have there, you save people time because of the lack of commutes, and you can make up for all of this by doing these one-off events and funding the social interaction that just does need to take place.
So that six-month trip is something that became a policy and a lot of people in GitLab started
doing where they did like a Euro trip and then they visited, you know, however many
people over the course of three months.
And in every city where we ended up having a significant amount of people, like five
plus, we ended up having monthly co-working days where everyone from that region would
just come in one Friday of the month, work together for the whole day, and then have a shared group dinner. And this also just meant that the people in our region felt great. Yeah, where would they go? We would rent a co-working space or some kind of meeting room somewhere and just all be sitting around a meeting space together, and then, you know, go to a restaurant. And of course, this would be covered by the company as well, because it is part of that social fabric that is really valuable to making a company work at scale.
And then combining these two things, the travel and these co-working days, you ended up having people do a Euro trip where they would try to hit all of the co-working days they could within the time that they were traveling. Amsterdam, Brussels, Lisbon, you know, Rome. Um, and visit seven different European countries with a local tour guide, basically, to show you around, and this opportunity to immediately meet 10-plus people there. And that's the stuff that made it work, which of course during the pandemic was impossible, because no one could see each other at all. So all you folks out there who are going for your bachelor's right now, and you're taking algorithms and data structures, and you're learning about traveling salesmen, and you're like, I will never use the traveling salesman problem, you might find yourself at GitLab needing to go, like, uh, drink beers in four different cities in a month. Yeah, yeah, exactly. It's, uh, it's not bad. But of course, it's not only GitLab anymore, which is great. There's so many remote-only companies now, or remote-first, or, you know, remote-compatible. Although we do believe that the hybrid model doesn't work nearly as well as the all-remote model or the all-in-office model, because with the hybrid thing, you do have sort of a tier system, where the people who get to actually see their boss face to face every day just do have a higher chance of promotions and stuff, because it does make a difference for that foundation of trust.
But if everyone is in the same exact spot,
it works much better.
But there's a lot of companies now,
and if you look at like the most recent YC, Y Combinator batch of startups,
a good percentage of those are just all remote
because why not?
And why limit yourself to only the talent
that is willing to move to one particular metro area
if you could hire everyone from around the US
or even from around the globe
or from around a certain time zone.
At Meltano, our team is distributed
between Mexico where I'm based,
the US, Canada, the UK, Germany.
And we're willing to keep adding places to that
as long as we have enough time zone overlap
where we can make it work.
And there's a lot of companies doing that now,
including a number of them founded
by fellow GitLab alumni like myself. And actually Sid, his next project essentially after GitLab is this open core
venture capital firm where they specifically invest in open core, which means open source
projects with a commercial business model around them.
And a good amount of those are starting out remote as well.
So yeah, it's a fascinating space.
And GitLab has, you know, we've been front runners on both the fields of commercial open
source and remote work for years now.
And that's all stuff that we bring into Meltano as well.
Cool.
So that's a good way to pivot to the topic at hand, which is data ops.
So you went from GitLab to Meltano, but when you were at GitLab, you must have seen something that made you say DataOps is really important.
Was DataOps like a cornerstone to GitLab success or what's the connection there?
Yeah, the connection is a strong connection.
So it was actually not me who realized the opportunity in DataOps. It was a number of people in GitLab who were seeing how the data team at GitLab was working
and the kind of tools they were using and how far away that seemed to be from the types
of workflows and collaboration tools available to the software developers working at GitLab.
And of course, also using GitLab itself, since it was a product, you know, it's a software development collaboration platform, essentially a DevOps platform.
And it was realized that in this data space, these people are technical enough, and they
have similar enough needs of collaboration and high reliability and stability and being
confident in the results of their work, that applying more of these software development
best practices to the work that data teams do could make them a lot more efficient, effective,
and really sort of level up that whole profession.
And it was uniquely GitLab who realized the opportunity for open source and DevOps best
practices applied to the data space.
And it was in 2018 that a dedicated team was set up inside GitLab to essentially build
a better data platform for the GitLab data team, initially as an open source project, that would maybe eventually become an open source business in its own right. So let's unpack that. What is the data team at GitLab? What does that
mean? What is the skill set? What do people do?
Yeah, good question. So people have been saying things like data is the new oil or whatever else people have been saying in these very nondescript phrases that do sort of signal the importance of
it. If you're a company, you're trying to, or if you're any organization, really, you're trying to
accomplish something, you have goals. And you want to know what the right way
or the best way is to get to that goal.
You also want to see how well are we currently doing
or how am I even going to measure my success?
And being able to measure your success
and being able to come up with a plan
where you're pretty confident
that this is probably the best way to get there
requires you to look at the data
and make predictions based on where you are now
and what you see in the data as big opportunities and giving yourself some kind of goal to go after.
And this is the same whether you are an educational institution or you're a non-profit or you're
a massive company that's just looking at the bottom line.
In any case, the more you learn from the data available to you to tell you how well you're
currently doing and how much better you could be doing, the better.
So in a company like GitLab, which is an online SaaS platform with a ton of users and of course
also a product and it has a marketing arm, some of the data you're talking about is just
usage of the product itself.
Like which features are people using?
If they use this feature, do they make it to the next step?
Are there features that are not seeing any use at all?
What are the flows people take through the product to find particular corners of the
product?
How can we surface those more effectively to make people get more value out of them?
Or something like A-B testing, where you don't know quite yet which way of talking about
something or which way of presenting a feature is going to get most people value out of it.
You want to be able to compare those.
Or with marketing, if you have a Facebook ads campaign or you have Google ads or whatever
you might have, you want to be able to measure the impact of those.
Like, did this specific phrasing actually lead to more users?
Did advertising alongside this type of TikTok content or whatever work better than other
type of content?
You want to be able to compare that and use that data.
And similarly, when you're talking about your customer base, you want to learn, is there
a particular segment of the industry that has a
far shorter time to close than another? And should we double down on those? Or do we see that people
with particular characteristics are far more likely to churn and stop paying us at some point?
So this means that you have this data that could help you as a company be successful
in all of these different tools like Zendesk or HubSpot when it's about support or CRM stuff, or in your own product and you might be using a platform like Mixpanel to track that data, or even if you want to track the efficiency or the happiness of the employees within your company just to sort of make sure that you're not falling short there.
These are a ton of data sources, and it's up to the data team to build all the pipelines that get the data from all of these different APIs or databases and bring them into a place where analysis can take place.
And analysis uses a tool, a BI, business intelligence tool,
that allows you, usually with SQL, to write these queries
and build these little dashboards that show you
how you're doing on certain metrics.
But these first steps in the process are data movement
and data transformation.
Getting this data from all the places where it's currently hidden, getting it out usually through APIs or file dumps in FTP folders or S3 buckets or something.
And then transformation means taking that raw data, whose schema was optimized for an application, and turning it into a schema that is more appropriate for analysis. And for the types of questions you want to ask it and the types of queries you want to run against it, being able to do those effectively often requires you to change the schema and transform the data in a way to aggregate things, or to anonymize things also, for example, in case you don't want to mess with PII.
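To make that PII point concrete, here is a minimal sketch of what hashing an identifying column during transformation might look like. The field names, salt handling, and in-memory records are made up for illustration; in a real pipeline this would usually be done in SQL inside the warehouse.

```python
import hashlib

def pseudonymize(value: str, salt: str) -> str:
    """Hash a PII value so rows can still be grouped per user
    without storing who the user actually is."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

# Raw rows as they might come out of an application database (made-up fields).
raw_events = [
    {"email": "alice@example.com", "screen": "screen_2"},
    {"email": "alice@example.com", "screen": "screen_4"},
    {"email": "bob@example.com", "screen": "screen_2"},
]

SALT = "rotate-me-regularly"  # assumption: a salt kept outside the warehouse

transformed = [
    {"user_hash": pseudonymize(r["email"], SALT), "screen": r["screen"]}
    for r in raw_events
]

# Distinct users are still countable, but no email addresses are stored.
print(len({r["user_hash"] for r in transformed}))  # -> 2
```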
So this challenge of data movement and transformation has been known for decades, but the tools used to do it are, from the eyes of a software developer,
still pretty... I mean, it feels like legacy tooling. It feels like when I was nine years old
and I was building a website and I would FTP into the web server and make live changes to PHP files
and then go check in the browser just to see I didn't break something. So you're always working in production. Every change you make could immediately affect the user.
And in the case of the data world, that means that if you hit the save button, you might
accidentally break the dashboard that your CFO is about to present to the board, or the thing that the CEO is looking at to see what they should be worried about today.
And that approach of just hitting the save button and crossing your fingers, as a software development community, we've moved past that with DevOps, version control, continuous integration and deployment.
And all of these best practices could also really help data teams.
And it was GitLab that saw that opportunity and started building that for its own data team, who, of course, had been especially exposed to how software development teams worked and
realized like, hey, we want some of that.
So in 2018, the team in GitLab started working on this tooling.
In 2019, I ended up joining that project as development lead when there were four engineers
and a general manager on the Meltano project.
And in 2020, the headcount of that project was brought down from six down to one because
we hadn't quite been hitting the numbers and the growth numbers and the contributor activity that we were looking for
So I was left by myself on the Meltano team, and throughout 2020, I managed to identify a more narrow sort of description of the problem we solved and a way to really reach the audience we wanted to find. And it was through 2020 that this started taking off as an open source project with hundreds
and thousands of users. And in early
2021, we spun it out as
an independent startup from GitLab with
seed funding from GV, formerly
known as Google Ventures. And since then,
we've been on our own independent
startup journey. We raised money again this
year, and we are really sort of
building out this massive vision of
building data tooling that that
adopts from the ground up all of these things that have made the development teams software teams
so effective and we're seeing a lot of interest in that fortunately yeah so so let's double click
on this so it's a data engineer or is that is that a orthogonal thing no so data engineer is
one of the titles that's most common
for these people that are challenged or tasked with these data movement and transformation
challenges so our target audience are data engineers exactly got it i see so so someone
writes an app like we can go back to your days making iphone apps right someone makes some iphone
app they're writing all this objective c they're writing they're creating ns objects they're
wondering why there's NSs everywhere.
And then they send their app out and it gets, you know, three stars or two stars or one
star.
And people say, oh, you know, it's crashing all the time.
So now, some of that data the platform will provide.
So Apple has the crash handler and you can go and look at the logs and everything.
So you fix all the crashes.
You say, OK, I'm done.
Now you get instead of getting one star,
you get two stars.
They're like, okay, the app is stable,
but it's just not what I wanted.
And so you have to sift through these reviews.
It's pretty painful.
And you've already kind of lost a big opportunity there, right?
So that is bad.
That is not the way to launch an app.
But a better way to launch an app
would be to start some really
limited beta, call it a dark launch, where you don't do a lot of advertising or anything.
You get a few people, and now you don't just want their reviews, because a lot of people
won't write reviews. You might get everyone writing about the same complaint. You want to
get really in-depth data. Did they see which of my screens in my app did they look at? If nobody
looked at screen four, well, why did I even build it, right? Or maybe I can't get to screen four,
you know, whatever. You know, if everyone's spending all their time on screen two, that's
what I need to spend my time as a developer focused on, right? And so you can continue to
like subdivide and subdivide and
subdivide down to more and more nuanced data. What are people in Mexico City doing? What are people
in Texas doing? And so you end up asking all of these questions and needing to slice in all these
different axes, right? And so that is the sort of business value of kind of data ops and data engineering.
I'll say of the whole data engineering, data science kind of process, right?
And so then data engineers and data scientists are going off and trying to build the infrastructure and sort of the semantics to answer those questions.
But you're saying they're doing the equivalent of, like, us, you know, SSHing into a machine and doing all the coding in nano or something, right? So yeah, I'm touching the production code, a production system, all the time. Like, oops, I forgot a semicolon, and now everyone who goes to my website just gets an error 500 for the next, like, three minutes until I realize it, right? And so Meltano is an attempt
to make it more productive and safer
to answer all of those questions we just talked about.
Today's sponsor is Mergify.
Mergify is a tool for GitHub that prioritizes,
tunes, automatically merges, comments,
rebases, updates, labels, backports,
closes and assigns your pull requests.
Mergify features allow you to automate what you would normally do manually.
You can secure your code using a merge queue, automatically merge it, and many more features.
By saving time, you and your team can focus on projects that matter.
Mergify can coordinate with any CI and is fully integrated into GitHub.
They have a startup program that could give your company a 12-month credit to leverage Mergify.
That's up to $21,000 of value.
Start saving time.
Visit Mergify.com to sign up for a demo and get started.
Or just follow the link in the show notes.
Back to the episode.
Is that a good summary?
Yeah, that's a really great summary. I would say, you know, you started talking about, you know, new app, dark launch.
Of course, the more data you're working with, the more valuable it will be to try to automate this with processes and have data engineers that write pipelines rather than just going in manually and doing the things that don't scale.
But yeah, definitely. By the time you have a lot of data like that and you actually have processes or people in your company relying on it, you also want to make sure that when you make a change, you don't accidentally break the thing that's live.
And you don't then want to have to scramble to fix it live because then you're even more likely to make mistakes
or not fix it in the best way.
And data teams that are becoming more and more relied on
by their business,
and sometimes this data is also fed back
into machine learning processes
or just goes back into other company systems
and it kind of spreads within the network
of all the connected parts of the business.
If there's a mistake in there, it's a problem.
And if you are hitting save and crossing
your fingers and hoping you don't break something, that just doesn't work anymore in this day and age.
And in software development, a lot of this has been solved. We know a lot of the things you can
introduce, workflows, practices, and tools, into teams to help. And the goals are the same. You want to
have high confidence in whatever is live. You want to be able to experiment safely,
try stuff out locally, have the feature branches in your Git repo with experiments,
and not feel like you have to limit yourself from making changes because you might accidentally do
something that cannot work anymore. Even being able to roll back to a previous version
is not a given in a lot of the data tooling that exists today. So if you are building a pipeline
today, especially if you are coming from a software development background, it seems like a
no-brainer to have something that can be version controlled and something where you can have CI/CD
tell you if you are accidentally going to break something. And this increases the efficiency,
effectiveness, velocity, and innovation of the team that is working to solve these challenges,
same as we've seen in software. And Meltano's approach is one in which we have identified that there are a lot of really
great open source tools for different steps of the data lifecycle.
But these tools themselves don't necessarily embrace the software development best practices.
And at the level of your entire data platform or your data stack, when you're putting together
these three or four tools that together solve the data movement and transformation problem, there's no way to manage their configuration
in a version controlled way or to manage changes that span multiple tools.
If you need to update the configuration for how to get data out of some SaaS API, and
at the same time you want to modify your transformation script, you want to do that at the same time
and you want to be able to validate that that combined change didn't break something.
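As a rough illustration of what "CI tells you before you break something" can mean for a data pipeline, here is a sketch of a test that runs a version-controlled transformation query against a tiny fixture database. The table and column names are invented, and sqlite3 stands in for whatever warehouse is actually in use.

```python
import sqlite3

# The transformation under version control (invented schema; in a real
# project this might live in its own .sql file next to the extractor config).
TRANSFORM_SQL = """
CREATE TABLE analytics_signups AS
SELECT campaign, COUNT(*) AS signups
FROM raw_users
GROUP BY campaign;
"""

def test_transform_produces_expected_columns():
    conn = sqlite3.connect(":memory:")  # sqlite stands in for the warehouse
    conn.execute("CREATE TABLE raw_users (id INTEGER, campaign TEXT)")
    conn.executemany(
        "INSERT INTO raw_users VALUES (?, ?)",
        [(1, "google_ads"), (2, "google_ads"), (3, "facebook_ads")],
    )
    conn.executescript(TRANSFORM_SQL)

    rows = conn.execute(
        "SELECT campaign, signups FROM analytics_signups ORDER BY campaign"
    ).fetchall()
    # If someone edits the transformation and drops a column or changes the
    # grouping, this fails in CI instead of breaking a live dashboard.
    assert rows == [("facebook_ads", 1), ("google_ads", 2)]

if __name__ == "__main__":
    test_transform_produces_expected_columns()
    print("transformation test passed")
```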
You cannot see those in isolation, but the current world in data engineering expects those
changes to be fully separate from themselves from each other and that is just not the reality so
meltano allows you to bring together every aspect of your data platform from start to finish and all
these different tools that you use to solve uh sort of the incremental problems along the way
in one place with a consistent approach to version control, configuration management,
and end-to-end testing. So if I'm talking to a data engineer, I would explain all of this in
context of the specific benefits because we cannot expect them to already know all of these software
development terms. But if you are a software developer and at your new company, you see your
data team scrambling with their work, if you want to make them as effective as you have been as a software developer,
then Meltano is the tool that will help them do so. Yeah, that makes sense. So going back to our
app example, so we have, you know, let's say thousands of people running our app, right?
And so we want the data from all of them, you know, in one place so that we could do all this analysis. So kind of walk us through, you know, what that looks like. So how do we go from NSLog in my app to, you know, we have some dashboard with everyone's data on it. What are the tools that data engineers
use? What does that look like? Yeah, so I haven't been in the iOS ecosystem recently enough to exactly know how to get that NSLog statement into some kind of SaaS product. But where it usually starts, from the Meltano perspective, or from the data engineer perspective, is that this information we want already lives in some kind of SaaS system. So that might be, you know, Apple's own crash report sort of interface, which hopefully has an API, and if not, that's a feature request.
And then the next step is,
how do I get that out?
So you can, of course,
learn the API documentation,
write your own little Python scripts
or whatever language you like,
but Python is sort of the lingua franca
of the data world.
And then pull out that data.
And then you want to get into it,
into a data warehouse,
which is not just a database like Postgres or MySQL,
but it is specifically optimized
for analytical workloads and use cases.
And we call those OLAP databases.
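For a rough idea of what the hand-rolled version of this looks like before reaching for a connector library: pull JSON out of some SaaS API and load the raw records into a warehouse table. The endpoint and fields are hypothetical, `requests` is assumed to be available, and sqlite3 stands in for what would really be Snowflake, BigQuery, or Redshift.

```python
import sqlite3
import requests  # assumption: installed; any HTTP client would do

API_URL = "https://api.example.com/v1/crash_reports"  # hypothetical endpoint
API_TOKEN = "..."  # deliberately left out

def extract():
    """Pull raw records out of the SaaS API, page by page."""
    page = 1
    while True:
        resp = requests.get(
            API_URL,
            params={"page": page},
            headers={"Authorization": f"Bearer {API_TOKEN}"},
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            return
        yield from batch
        page += 1

def load(records):
    """Write raw records into the warehouse (sqlite as a stand-in)."""
    conn = sqlite3.connect("warehouse.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_crash_reports (id TEXT, payload TEXT)"
    )
    conn.executemany(
        "INSERT INTO raw_crash_reports VALUES (?, ?)",
        [(str(r.get("id")), str(r)) for r in records],
    )
    conn.commit()

if __name__ == "__main__":
    load(extract())
```

This is exactly the kind of script that works once and then becomes a maintenance burden, which is the gap the connector libraries discussed below are meant to fill.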
What's an example of a data warehouse?
Yeah, so BigQuery, Google BigQuery is a big one.
Amazon Redshift is another.
But then Snowflake has really taken the world by storm over the last decade or so.
And that is what we see our users using most.
So it is a database that's optimized for running analytical queries that require a lot more
compute.
They're really complicated.
Tons and tons of joins, for example.
It's still SQL, but under the hood, it's expecting more complicated SQL.
And so the engine is better.
They're columnar data stores.
So they don't store data in rows.
They store data per column.
So if you do aggregation over a column,
all of that data is already,
the memory locality is much higher
because it's literally all packed together
instead of being spread out over these rows
with different offsets into each row.
It has sort of the columns instead of the rows
as the primary way of storing
things. But SQL with some extensions, which are then dependent on the specific framework you're
using or platform you're using, is still the main language to pull this data out. Yes.
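To illustrate why the columnar layout helps for analytics, here is a toy comparison in plain Python, with made-up fields: the same data stored as rows versus as columns. An aggregate over one field only has to touch one contiguous array in the columnar case, which is very roughly what the warehouses above do under the hood.

```python
# Row-oriented: every row carries every field, like a typical OLTP database.
rows = [
    {"user_id": 1, "country": "MX", "session_seconds": 310},
    {"user_id": 2, "country": "US", "session_seconds": 125},
    {"user_id": 3, "country": "MX", "session_seconds": 742},
]

# Column-oriented: one array per field, loosely how an OLAP store lays data out.
columns = {
    "user_id":         [1, 2, 3],
    "country":         ["MX", "US", "MX"],
    "session_seconds": [310, 125, 742],
}

# Aggregating one field from the row layout means walking every whole row...
total_row_layout = sum(r["session_seconds"] for r in rows)

# ...while in the column layout the values are already packed together.
total_col_layout = sum(columns["session_seconds"])

assert total_row_layout == total_col_layout == 1177
```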
Got it. I see. So we have some way of getting the data from, you know, it could be a website,
an app, from whatever, a video game, from whatever it is into this data warehouse, which is similar to a database as an end user, but under the hood, much more efficient for what we want to do next.
And then what happens once we have everything in the data warehouse?
Yeah, so to be clear, you could, as I just described, write your own little script to get the data from the Apple Crash Report API and load it into your data warehouse.
But this is a problem called extract and load or data movement, data integration, data ingestion.
Those are sort of the terms you want to Google, which a lot of companies have tried to solve by building these connectors for different data warehouses, different SaaS APIs, and having you pay some kind of subscription fee to do that work.
So one of the things that makes Meltano different
is that we have embraced an open source library
of connectors for data sources,
which today counts more than 300 different sources
and destinations that are supported.
And this standard is called Singer,
named after the sewing machine, actually.
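Roughly speaking, a Singer tap is just a program that writes JSON messages to stdout, one per line, and a target reads them on stdin. Here is a minimal hand-written sketch of the three main message types; the stream and field names are invented for illustration.

```python
import json
import sys

def emit(message):
    # Singer taps and targets communicate as JSON messages, one per line,
    # over stdout/stdin.
    sys.stdout.write(json.dumps(message) + "\n")

# Describe the stream and its schema (JSON Schema).
emit({
    "type": "SCHEMA",
    "stream": "users",
    "key_properties": ["id"],
    "schema": {
        "type": "object",
        "properties": {"id": {"type": "integer"}, "email": {"type": "string"}},
    },
})

# One message per extracted record.
emit({"type": "RECORD", "stream": "users",
      "record": {"id": 1, "email": "alice@example.com"}})

# Bookmark how far extraction got, so the next run can be incremental.
emit({"type": "STATE", "value": {"users": {"last_id": 1}}})
```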
And it is this Singer library of connectors
that we with Meltano have embraced
and built a platform around
so that you don't have to do this work
of writing your own Python script again each time.
And you can leverage the work
that the wider community has already done.
And you are able to self-manage this
and improve the open source scripts
instead of fully relying on some kind of
proprietary SaaS extractor and loader offering. So that's the data ingestion step. The next step then is to take that raw data, which usually matches the database schema structure of the application, which means, you know, columns, tables, rows, primary keys, and turning it into something that gets closer to the types of questions you want to ask for the analytical side. So if you just have a single database or a single data source, usually the original schema is going to be good enough. You might want to drop some PII columns, which have personally identifying information, or you want to hash them,
for example, so that you can still tell the difference between different people without
actually being able to tell who they were, and then run queries against it.
But especially once you start having data from multiple sources, like you were saying earlier,
when you want to find out whether someone from Mexico City gets stuck earlier or something,
or whether some marketing campaign on Google Ads or Facebook Ads gets people further into
the user flow or makes them more likely to convert into paying customers than some other campaign,
you got to be able to combine this data.
And those database joins are not something you want to do every single time you ask a question of the data warehouse, because joins are expensive.
So you want to do some amount of pre-aggregation
at the data transformation layer
where you combine these tables into analytics data tables
that have exactly the data already together in the same table
that you're going to want to compare or combine.
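As a sketch of what such a pre-aggregated analytics table can look like, here is an invented transformation that joins ad-campaign data onto signups once, so dashboards can query the combined table without re-running the join every time. Every table and column name here is made up, and in practice this would often live as a standalone SQL model rather than a Python string.

```python
# A transformation step, expressed as SQL the warehouse runs once per
# pipeline run (illustrative names only).
ANALYTICS_SIGNUPS_BY_CAMPAIGN = """
CREATE OR REPLACE TABLE analytics.signups_by_campaign AS
SELECT
    c.campaign_name,
    c.channel,                           -- e.g. google_ads, facebook_ads
    COUNT(u.id)          AS signups,
    SUM(u.became_paying) AS conversions
FROM raw.users u
JOIN raw.ad_campaigns c
  ON u.acquisition_campaign_id = c.id
GROUP BY c.campaign_name, c.channel;
"""
```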
So in the challenge of answering questions about any kind of data
you have in various APIs, data extraction and loading,
then data transformation, and then writing the actual analytics queries
are sort of the steps of the process.
And in all of these steps,
you're essentially dealing with code,
whether it is the SQL queries
that handle your analytics questions
or the Python scripts or the SQL queries
that define the transformations
or just the configuration of the extract-and-load pipeline,
which might change over time
as you change the schema you want.
These are things that we can think of
as different versions or iterations
or revisions of the code.
So version control sort of directly applies.
Yeah, that makes sense.
Yeah, I want to pause just for one quick moment here
and talk about how unbelievably important this is.
I was listening to a game development podcast, sorry, a game design podcast, and these are like pure game designers. A lot of them don't write any code or know how to program or anything like that; they're purely on the design side. There was one person talking about how, from testing, they found something about when the hero, you know, sort of hit the enemy. They originally had it so the enemy was this knight, this fully armored knight.
And when you hit the shield of the knight, you know, the knight didn't take any damage.
It just sort of blocked it.
And you made this, you heard this little clink sound.
But then when you hit the knight, the knight like flashed red, you know, showed that you bypassed the shield.
You got through the shield.
But they had the same clink sound, because it's like all metal, right? The shield's metal, the armor's metal. And what they found through A/B testing was that that was unbelievably unsatisfying. Like, people wanted an "ah" or something like that, you know? And when they switched from using the same sound for both of those, which is, you know, physically accurate, to a different sound. Maybe it was a different type of clank or a grunt. I don't know. The user engagement went way up.
And so it's one of these things, it's like nobody could foresee that, right?
It's only something you can see post hoc.
And they basically started down this path because they found this one boss to be much less satisfying than the rest from data, right?
So no matter what you're building, you're building it for other people.
And those people have a collective consciousness that you cannot be fully aware of, right?
It's impossible.
Even Steve Jobs, as much as people talk about Steve Jobs, you know, was constantly relying on data and iteration and feedback loops. And so you're going to have to learn all of these things we're talking about to build anything that's successful, especially nowadays, when it's so incredibly competitive. There are so many great apps out there in the App Store, the Unity store is so full of games, right?
So this is absolutely critical to understand.
And I think, Douwe, you're doing an amazing job kind of walking us through this.
So we have this ETL system in place.
We've got everything in a data warehouse, and then we've transformed it to something that allows us to do the analysis we want efficiently.
How do we go from that to a website with a pie chart on it?
You just threw out a new term, ETL, which might be new to the audience. So that stands for extract, transform, load. And I'm calling this out because the sort of more modern version of this is actually ELT, where the transformation happens after the load process. And this is relevant because, you know, for those who are interested in the history, if you do the extract, you have a script that basically loads all of the data from the sources into memory. You can do the transformation in memory before even writing into the database. But that has a lot of, you know, heavy compute and memory requirements. It is sort of the traditional model, where you use Python code or all kinds of algorithms to change that schema before it ever
lands in the database. But one of the things that these new analytics databases are just also really
good at is those kinds of transformations efficiently. So if you can define your
transformation, not in terms of a Python script, but in terms of a SQL query, where it targets the
raw data, and then the select query you write
outputs the new query
or the new schema rather.
You can define your transformations in SQL
in a way that's easily version controllable
and an analyst can actually help with
and do it all in the data warehouse
so that you can change your transformation over time.
And instead of having to redo
the entire extract pipeline,
you can just run it against the raw data again and again and iterate quicker that way.
So we think of it as ELT, not ETL.
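A small sketch of why that ordering helps with iteration, assuming a hypothetical raw table that has already been loaded: changing the transformation is just re-running SQL inside the warehouse, with no call back out to the source API.

```python
# ELT iteration sketch: the raw table is already loaded, so changing a transformation
# is just re-running SQL inside the warehouse, with no need to hit the source API again.
# Table and column names are hypothetical.
import psycopg2

TRANSFORM_V2 = """
CREATE SCHEMA IF NOT EXISTS analytics;
DROP VIEW IF EXISTS analytics.crash_reports;
CREATE VIEW analytics.crash_reports AS
SELECT id,
       occurred_at::date AS crash_date,  -- v2 of the transform: bucket by day instead of raw timestamp
       lower(device)     AS device
FROM raw_crash_reports;
"""

with psycopg2.connect("postgresql://user:pass@warehouse:5432/analytics") as conn:
    with conn.cursor() as cur:
        cur.execute(TRANSFORM_V2)
```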
But then, yes, we've now talked about a tool for EL, extract and load. Something for transformation, which in our open source land is typically dbt. Like I mentioned, for the EL side, Singer is this really great, amazing technology for extract-and-load connectors. dbt stands for data build tool, which is this way of defining transformations with SQL.
And then the last step
that you'll definitely need
is like you said,
getting some kind of pie chart somewhere.
Well, if you're going to do
a pie chart somewhere,
you can just write your,
you know, Python code
or your Jupyter notebook
and directly target the data warehouse.
What's more typical
for data organizations
is that they have a BI tool of some
sort, which might be Looker or Tableau
or Power BI or
Superset and Metabase.
BI is business intelligence, by the way.
Yeah, business intelligence, exactly.
Those are the types of tools that allow
you to define dashboards and reports and give
you all kinds of choices and visualization methods.
And you don't need to write code for it, so there's no pie chart code to write. It's sort of point and click. Because in most cases, these data analysts are less technical than the programmers who are
at the beginning of the data journey. Data engineers are quite technical. They know code.
They're pretty comfortable or getting comfortable with version control. Data analysts most of the
time come from an Excel world where they are used to just looking at the data in a tabular form and then writing the queries to get the sort of the metrics and aggregates that we're looking for.
And in the open source space, Superset and Metabase are two really great BI tools.
Outside of that, like I mentioned earlier, Tableau, Looker, Mode, there's a whole long list of them that teams use.
But yes, by then you are at the point where you have a dashboard and it shows you the number
and you can see whether this month
was better than last month
or you can see whether campaign A
did better than campaign B
or like you were saying
in the A-B testing scenario
with the video game,
you can see how much more time
people spent on the game
with the grunt versus the,
you know, the clink sounds on the sword.
But all of these things,
they are obviously really intertwined. Like, if you want to ask a new type of question of your data, or you want to have new data involved, you've got to go to the data engineer and ask them to write a new EL pipeline, to do a new transformation, and then you can write a new query. But these currently live in siloed-off little environments. The people that use these tools might not actually be talking all that much; it's more throw a request over the wall, and then two days later, hopefully it's solved.
And we think of this very much as one change set
that happens to be spread across different tools,
but it's something that you should be thinking of
as one change to your data platform
rather than separate changes to each tool involved.
And we bring those changes together in one repository.
We bring those applications,
those different tools together in one repository.
And we essentially bring the entire data team together in one place so that they can also collaborate more effectively and know what the other is doing
and give feedback on each other's work and get to a place where a data analyst feels
confident actually suggesting a change to the data pipeline through a pull request,
knowing that they can make dumb mistakes, but they won't accidentally break production
because it's going to have to go through that code review and CICD process
anyway.
So the sort of siloing we're seeing or the situation where a data team might not even
allow a junior engineer to go into the system and make changes because the chance of accidentally
breaking something that's just too high is, of course, limiting the effectiveness of these
teams and their ability to quickly iterate and improve.
And like you called out earlier, your data platform is a massive competitive advantage.
If you know better how you can make your customers or your users paying customers or make them
stick with you for longer or do exactly the new feature that they want, you're going to
do better than your competitors.
And right now, data tooling is not built in a way that actually allows data teams to get the most out of it. And that's where Meltano comes in.
And like I said, it's open source.
So everyone can just download it and give it a try today.
And we are building commercial offering around it,
but we want to make the barrier
for people getting value really low.
And that also means that we have a large community
of users and contributors that, as I said earlier,
you listening today can also become part of
if you want to get job opportunities potentially through this field, or you want to be at the vanguard, the forefront, of this modern approach to data engineering. Cool, yeah, that makes sense. So yeah, one thing that's not totally clear is how you handle the fact that there's data in the loop. So we talked about sort of parallels to software development, and in software development you have version control,
you have all of these things.
It's not clear how data fits in.
Like if I'm writing a software application, I have a bunch of .ini files or .json files that specify my config, and they just go straight into source control
with everything else.
Yeah, I can't imagine putting all the user data
into Git or I don't think it's going to work.
So yeah, how do you handle the fact
that now there's data involved with this?
Like, how do you do versioning here?
That's a really good question.
And DataOps is sort of orthogonal
in that there is the part
where we're applying these iteration strategies
and tools to the actual code or technology
that powers all the data flows,
which is what we've been focused on so far,
because that's where a ton of gain is to be had.
We have not ourselves looked into versioning the data itself
or data validation.
Although data validation,
adding some kind of testing pipeline that will tell you
if the data is suddenly looking different
than you expected it to,
or if instead of a hundred records per day, you're only seeing 20.
There's other tools that focus on that,
and we support those on Meltano as components
that can be brought into your Meltano project.
And we can then help you version the testing criteria themselves.
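A toy version of that kind of volume check might look like the following; the table name and threshold are made up, and dedicated data-quality tools do this far more thoroughly, but it shows the shape of a test that can live in the same repository and CI pipeline as the rest of the platform.

```python
# Toy volume check of the kind described: if yesterday's load is far below the usual
# daily record count, fail loudly so the pipeline (or CI job) goes red.
# Table name and threshold are hypothetical.
import sys
import psycopg2

EXPECTED_DAILY_MIN = 80  # we normally see ~100 records per day

with psycopg2.connect("postgresql://user:pass@warehouse:5432/analytics") as conn:
    with conn.cursor() as cur:
        cur.execute("""
            SELECT COUNT(*) FROM raw_crash_reports
            WHERE occurred_at >= now() - interval '1 day'
        """)
        (count,) = cur.fetchone()

if count < EXPECTED_DAILY_MIN:
    sys.exit(f"Data volume anomaly: only {count} records in the last day (expected ~100)")
print(f"OK: {count} records in the last day")
```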
The data itself is not typically versioned,
in large part because you don't always need the old versions,
and it can be extremely expensive
if you have a lot of that data.
And part of the point of these data platforms
is that you want them to be sort of reproducible,
where your data pipeline defines the entire flow
from the API to the desired format and graphs,
which means that the data itself,
as long as it's still in the SaaS APIs,
you can upgrade your platform
and just rerun the pipeline again,
and you'll pull out all the new data and be at the latest state. If you want to version your data, that's
not something we are focused on. There's other tools that have been working on that. But we
think that that's the next step. Once people are comfortable with this concept of versioning and
different feature branch pipelines, et cetera, versioning the data itself is the next step, which we will probably get to
eventually. But what we will likely do is look for the most promising open source technology that is working on that problem and adopt those as supported components on top of the Meltano platform. That makes sense. So if we do this ELT, where we get the data into the data warehouse,
and then we apply transformations to it.
How does that handle the changes on the software side? So I load today's data and I rename a field.
And so my field is called time, but I spelt it T-H-Y-M-E.
And then I realized like three months in, oh, shoot, it should be T-I-M-E.
And so I change it.
And so now I have this data that has two different schemas, right?
And so how do you handle the fact that the app developers are constantly making changes?
What happens in that case is that you need to modify your data transformation SQL queries
at the same time as the upstream changes, essentially.
So your analytics queries are written to an analytics schema,
which is derived from the raw schema
that comes from the transactional database
that your application developers are working on.
So when the application developers change their schema,
then the data pipelines would break
because the source they're targeting doesn't match anymore.
But you can update the data transformation SQL queries, still have the same output format,
so that your analytics queries and your dashboards don't need to change,
and just modify the way that that same output schema is derived from the input data.
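Picking up the thyme-versus-time example from the question, here is a hedged sketch of absorbing the rename in the transformation layer so the analytics schema stays stable; the table and column names are hypothetical, and a real project would express this as a versioned dbt model rather than an ad hoc script.

```python
# Sketch of absorbing an upstream rename (thyme -> time) in the transformation layer,
# so the analytics schema that dashboards read stays identical. Names are hypothetical.
import psycopg2

TRANSFORM = """
CREATE SCHEMA IF NOT EXISTS analytics;
DROP VIEW IF EXISTS analytics.events;
CREATE VIEW analytics.events AS
SELECT id,
       -- old rows only populated "thyme", new rows only populate "time";
       -- downstream queries keep seeing a single "event_time" column either way
       COALESCE("time", thyme) AS event_time,
       payload
FROM raw.events;
"""

with psycopg2.connect("postgresql://user:pass@warehouse:5432/analytics") as conn:
    with conn.cursor() as cur:
        cur.execute(TRANSFORM)
```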
But this is also where a lot of data teams run into problems,
because they are dependent on upstream data providers, which might be APIs,
which are relatively slow moving.
They have clear change logs.
They have different versions in many cases.
But you might also have application developers within your own company who are not really
aware of what's happening downstream of that data.
So one of the things that Meltano allows is for the CICD pipeline of the main application code and the CICD pipeline of the data platform
to also be connected so that the application developers can be informed when their change
would break something so that the data team gets a chance to accommodate that change in
the same pull request or in a related pull request to the upstream change so that they
can be deployed at the same time.
Instead of the data engineer finding out that suddenly their production dashboard is broken, and then having to scramble to
fix it while their CFO is telling them like, hey, I've got this work meeting tomorrow,
like, why isn't this done yet?
So being able to combine or to bring the data platform into the same development workflow
where the complete downstream impact of any changes is validated before stuff goes live
is part of the value you get from
building your data platform like a software project and bringing in these CICD benefits.
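One way such a connected CI job could catch this, sketched with hypothetical table names and not Meltano's actual API: after the pipeline runs against a test database, assert that the columns the dashboards depend on still exist, so the upstream pull request fails instead of the production dashboard.

```python
# Toy CI check: after the pipeline runs against a test database, verify the columns the
# dashboards rely on still exist, so a breaking upstream change fails the pull request.
# The contract contents and connection string are hypothetical.
import sys
import psycopg2

DASHBOARD_CONTRACT = {"analytics.events": {"id", "event_time", "payload"}}

with psycopg2.connect("postgresql://user:pass@ci-warehouse:5432/analytics") as conn:
    with conn.cursor() as cur:
        for table, required in DASHBOARD_CONTRACT.items():
            schema, name = table.split(".")
            cur.execute(
                "SELECT column_name FROM information_schema.columns "
                "WHERE table_schema = %s AND table_name = %s",
                (schema, name),
            )
            present = {row[0] for row in cur.fetchall()}
            missing = required - present
            if missing:
                sys.exit(f"{table} is missing columns the dashboards expect: {sorted(missing)}")
print("Downstream schema contract satisfied")
```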
Got it. I see. That makes sense. So you have a pull request that fixes a column name
in your application. That pull request kicks off some kind of continuous integration job, which is going to run your app, generate some data, then
try to ingest that data and produce, let's say, a really sparse dashboard.
And when it does that, it's going to fail because the part that's producing that dashboard
is expecting a different column name.
And so then now you realize, oh, this change actually
requires a complementary change on the data engineering side to say, you know, if the date
is newer than this, then or if the version of the software is newer than this, then use this name
for the column, otherwise that name. Now you get the dashboard working again. And maybe after a
year or so, you can go and get rid of that if statement
once you've purged all that old data or something.
Yeah, that's exactly right.
And the goal here, just like on the software development world,
is for production to never be broken,
or at least for if production is broken,
having the ability to roll back quickly.
And on the data side,
the more this becomes the brain of the organization
and the more people rely on this, the more important it is that they have the same confidence in those data dashboards never breaking as the organization has about their main, you know, web-based product or the app that they have live somewhere.
Cool.
So, okay.
So if somebody is, let's say, a college student, I mean, you know, you can use your early self as an example. Someone is a college student, they're starting a small company, or they have a senior design project, and that involves sort of a closed loop with a lot of customers out there. What is the bare bones, or not bare bones, I'd say, what is the sort of cheapest way that a college student can get a pipeline like this and data ops off the ground?
The answer is definitely Meltano.
And specifically, you can run Meltano anywhere you like.
Like it's open source software.
We are building some commercial functionality around it, but definitely if you're a college student or you're working on a small startup,
you can download the code
and get a pipeline running on your local machine
in a matter of 30 minutes,
all the way to having a dashboard up and running
based on some data
that was previously just hidden in some SaaS API.
And we have a number of demos
and speed run examples of this on our website.
Running it on your local machine,
super easy and cheap.
Then if you want to actually run it continuously somewhere, you can use even something like GitHub Actions or GitLab CI to run that pipeline in a recurring fashion. If you want to host the
dashboard somewhere, you do need some way of spinning up a Docker container and exposing
that in a web interface. But that's where Amazon and GCP and Azure
have all of their own container scheduling functionality.
And you can even just take a Linux box somewhere
and use Docker Compose to spin up the web interface,
which is also something I have on my own home lab, for example.
Yeah, so what connectors would you recommend?
So we talked about Singer.
So if someone's making a video game, let's say,
so they're writing code in Unity.
I'm assuming there's like a Singer.
Singer has a way of getting,
you know, I guess, blobs of JSON
from Unity into a data warehouse.
I mean, I don't know well enough
where Unity would store that data.
If that data is on an API somewhere,
you can write a connector or one might already exist. If you can get that data into S3 or into some kind of FTP folder, then you can use the existing connectors for those.
But the expectation is that the data is currently somewhere where you can get it out of it, usually with an API or some kind of database query.
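To make the Singer model concrete: a tap is essentially a program that writes schema, record, and state messages as JSON lines to stdout, and a target reads that stream and loads it into the destination. Here is a toy sketch using the singer-python helper library, with a made-up stream; real taps, such as those built with the Meltano SDK, add configuration, catalog discovery, and incremental state on top.

```python
# Minimal toy Singer "tap": emit a schema plus a few records as JSON lines on stdout.
# A Singer "target" (e.g. a Postgres or Snowflake loader) reads that stream and does the load.
# Stream name and records are made up; real taps also handle config, catalog discovery, and state.
import singer  # from the singer-python helper library

SCHEMA = {
    "properties": {
        "id": {"type": "integer"},
        "event": {"type": "string"},
        "occurred_at": {"type": "string", "format": "date-time"},
    }
}

singer.write_schema("game_events", SCHEMA, key_properties=["id"])
singer.write_records("game_events", [
    {"id": 1, "event": "boss_hit_shield", "occurred_at": "2022-09-01T12:00:00Z"},
    {"id": 2, "event": "boss_defeated", "occurred_at": "2022-09-01T12:05:00Z"},
])
singer.write_state({"game_events": {"last_id": 2}})
```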
Okay, got it, got it, got it.
So, okay, let me take a step back here.
Okay, so someone needs to use like Amazon Kinesis or one of these things,
you know, you have a million users out there
who are all running something,
website, app, game, et cetera.
You need to get it in one place.
And so that will be outside of Meltano.
There'll be something and there'll be Kinesis
or there'll be some, you know, Kafka,
some type of like endpoint
where you could put this data in.
It doesn't even need
to be that low level.
Like there's a lot of,
you don't even need to pick
a technology like that per se.
There's a lot of products
that have libraries for iOS
and JavaScript and whatever else
that allow you to track
these user events.
And those tools, like Mixpanel or Segment, those all have APIs that you can pull the information from into your data pipeline. Yeah, okay, got it, I got it. Okay, so there's some, like, yeah, let's say, user breadcrumb. If you're a college student, go out there, punch in, like, you know, "user breadcrumb" and then whatever you're doing, like video game, app, whatever. There's some service that you can use that's free that will allow you to kind of put that data there. And then Singer will connect to
that and put it into a data warehouse. Correct. And for the data warehouse right now, the most
popular solutions in sort of real life production use cases are some of these paid products like
Snowflake, BigQuery, Redshift.
But you can start simple with just a Postgres database, which, like I just said earlier,
is not necessarily going to handle massive analytical workloads. But for your college
project, that's more than enough. And you can get a full data pipeline end-to-end with the dashboard
and pulling stuff from various APIs up and running, like I mentioned, in a matter of 30 minutes to an hour. But what I would suggest,
if this is something that interests you and you want to play around with it,
is to identify maybe not even a business source,
but something fun in your life.
Like if you do a lot of cycling,
you could use Strava
or you can use whatever other fitness app you're using,
or you could even start with tracking your personal finances
and see if your bank has an API
or a crypto platform, why not?
But if you find a data source in your own life, some kind of tool you use that has an API,
you can build a Singer connector for that using our Meltano SDK, and then build a pipeline with Meltano and Singer and dbt and some of these other projects I've mentioned to create a customized
dashboard for whatever metric in your life you're trying to track. But there's a lot of hobbyists using it for these sort of quantified life, quantified self
use cases, which is honestly a far more fun demo than yet another business platform or business
use case. Yeah, that makes sense. One caveat there, one word of caution, if you're going to
use this to track the value of your NFTs, you really have to watch out for those underflow
errors. An integer can only hold,
what is it, negative 2 billion. So once you start losing more money than that, you're in trouble.
What about on the dashboard side? You mentioned Metabase and Superset. Do you have a particular
preference, especially if you're a beginner? Is there one that you'd recommend more than another?
Yeah. So Superset and Metabase, they're both open source business intelligence solutions.
Superset, I think, is a little bit more mature when it comes to competing with some of these
paid BI products. And there's a lot of different visualization methods. But the user interface is
also a little bit more difficult. The learning curve is slightly higher. With Metabase, it's
really easy to start
exploring your data and seeing if you can pull some graphs out of it. And it definitely has
enough visualization functionality for any sort of hobbyist or student use cases.
But large businesses, we see using Superset more often. Smaller projects, Metabase is a really
great start. Cool. That's awesome.
And so Meltano is a platform that underpins a lot of these things. So you could start with Meltano and you start adding these connectors.
As you do, they become kind of pull requests that grow into this amalgam that you end up with,
where you can now run your app maybe yourself.
And then you can look at the pie chart and see your own data and say, okay, this is working, and now I can send this out to a bigger audience. Yeah, exactly. Meltano is the foundation, essentially, of your data platform. It is the project that lets you build this repository that then brings together all
of these components that you can add one by one as you need them, from the connectors to the transformation tool like dbt, and then to a visualization tool like Superset. And then what you end up with is one repository that holds every aspect of your end-to-end data story, which can be deployed as a single Docker container onto any sort of Docker-compatible platform, including your local machine if you're using something like Docker Compose.
And it's Meltano that standardizes
the configuration of all these components,
allows all of their assets and configuration
to be version-controlled,
and helps with the deployment of the entire thing as well.
Cool, that makes sense.
So Meltano, the company,
why don't you tell us a little bit about
how that got started?
I understand it was a spin-out of GitLab, but at what point was there sort of a decision made that, yes, this should be its own company?
How does a company decide on spinning something out?
And what was that story all about?
Yeah, good question.
So, like I said earlier, from the beginning, GitLab realized that there's a big opportunity here. This is a product that by itself could revolutionize a lot of the data industry and how people think about building their data platforms.
But there was always this idea of either this is going to be an internal business unit, a second product of the big GitLab company, or it will spin out at some point to kind of go its own way. And by the time that Meltano was starting to show the kind of traction and growth
and the community activity that warranted really growing out the team again and spreading its wings,
GitLab was a 2,000-person organization
where every single person, except for myself,
was working on this one product,
one customer persona, one everything.
And there were all kinds of tools in place
that are really appropriate for a 2,000-person company,
but not for a tiny little open-source project
that was still before product-market fit,
needed to start building out its team,
needed to figure out a business model
around this open-source technology.
So we realized pretty quickly
that a lot of the process
and administrative overhead in GitLab
that was appropriate for a massive,
you know, about to IPO organization
was holding us back, slowing us down.
And from GitLab's perspective,
the estimation was made that the value of Meltano
as a spun out company in which it maintained the stake
would be larger in terms of expected value over time
than keeping it inside
and sort of limiting its ability to spread its wings and grow.
So this was really, yeah, a decision between myself and the GitLab CEO, Sid, who I referred
to earlier.
And considering that we had seen some of the difficulty around hiring people, and the types of compensation that are typical for an early startup versus a later-stage company like GitLab, that made us think this was the only route.
And that's when we started looking for outside funding
to get a seed round together,
which we completed in June last year.
So we're now a year and two months or so
into our independent journey.
And we've grown to 17 people right now.
Like I mentioned earlier,
we raised funding again earlier this year,
just before the sort of market downturn,
so that was really good timing.
I was wondering about that.
And now we're in a position where
our runway extends into 2024
so we've got a good amount of time to
work out exactly how we can
convert a good amount of our open
source user base into paying customers
as well because we love
the open source and the fact that this is by the people, for the people.
Engineers can help it out.
The barrier to using it is super low.
So a lot of our audience today will just be able to give this a try.
But in order for that growth and reaching as many people as possible, we, of course,
need some money to keep building this too.
So the commercial open source challenge is one that we have to focus on in the coming
months. And we
are going to be launching a managed
version of Meltano for those people
that don't want to self-manage their deployments.
They don't want to learn Terraform or Kubernetes
or Docker. They don't want to have
to deal with how do you have a
production environment and a staging environment?
How do you have feature branch deployments?
That's all stuff that we can automate away
and charge for,
which will be our sort of first foray into being a commercial open source business
more than just a commercial open source project.
Cool. That makes sense.
Yeah, I feel like liability is an area where if you're an open source project
and you can reduce liability, that's something that companies will pay for, and simultaneously, you know, hackers and individual developers aren't as interested in, so you're not really taking a lot away from them and you're still providing a lot to the people who want that. Yeah, exactly. There's a big difference between what small teams, organizations, individuals need
and the sort of requirements that every enterprise
will have.
Any kind of company larger than 100 people or so will need different stuff than the open
source.
And you could reimplement some of that stuff yourself if you really want to go with the
open source and the self-managed approach forever.
But of course, there's a lot of things we can make significantly easier than you having
to hire your own engineer to build all this infrastructure around Meltano
if we can handle it for you.
Yeah, definitely.
Cool.
So is Meltano hiring?
If so, are they hiring like full-time, intern, both, neither?
What's the status there?
Good question.
It's not a full hiring freeze,
but we have slowed down our hiring plan a little bit
considering the broader climate.
We do have two roles on the meltano.com/jobs page. One of them is an SRE, a platform architecture SRE, SRE standing for Site Reliability Engineer, who will help us build out this managed platform into something super reliable that we can build the business around.
And we are also looking for a UI and UX designer because so far we've had a very
developer first approach where everything is in the CLI command line interface and
YAML files, and we want to invest more on the user interface side of things as we build out this web-based sort of interface around Meltano.
And this is full time. It's all remote.
So if this sounds like you, then a
great way to sort of show off your skills is through the Open Source project. But if you
already meet some of those requirements of the roles we're looking for, then we'd also love to
talk. But generally, I would recommend that you join our Slack community, which has more than
2,600 people right now, which you can find through meltano.com/slack, which is where you can
learn about Meltano, ask questions from other experts in the space, and also get ideas. You
might want to contribute yourself to make Meltano even better and start essentially building that
portfolio of real-life code that actual companies use to power their data pipelines, which
might get you a job at Meltano or any of Meltano's users, of which, like I said, there are thousands.
Very, very cool.
Any other places that people should go to
if they're interested?
I mean, we'll definitely post the Meltano website
and the Slack.
The GitHub repo shouldn't be missed.
github.com/meltano/meltano. That'll find you all the code.
And also, if you're curious about all of the data sources
that Meltano supports, you can find these on the Hub, which is at hub.meltano.com, which has more than 300
different SaaS APIs and databases that Meltano can load data from or put data into. But that's
a good start if you want to figure out your first data pipeline and whether you have some data in
one of those SaaS applications that you want to build some dashboards around.
Cool.
And if people want to communicate about Meltano, I guess the Slack you mentioned seems like
a clear place.
Is there a presence on Twitter or any other social media?
Is Slack really the place that people should be on?
If you want to talk with people about Meltano, then Slack is the place to be.
Of course, we have Twitter as well,
twitter.com/MeltanoData,
where you can learn about what we're up to
and the new releases we have
and just chat about Meltano with the community.
But definitely, if you want to speak to the experts,
then Slack is the way to go.
Cool. All right, Patrick,
you don't need to get a Twitter account.
You're off the hook.
You just need to get a Slack account.
Actually, Slack isn't an account thing, right?
It's an account per workspace.
Per workspace.
Yeah, exactly.
So if you go to meltano.com/slack,
it will take you through the signup flow
where you create a Meltano-specific account
to interact with all of us.
Cool.
Great.
Yeah, folks out there, definitely do that.
Definitely check out the repository.
This is a no-brainer,
something you can
set up yourself easily. We talked earlier about EKS and how Amazon has an incredibly generous free tier for college students, so you could easily run Meltano on, like, a Kubernetes cluster of t1 micros or something. Totally doable. And yeah, if you do use Meltano for anything, definitely shoot us an email and we will pass it along, or we'll even post it on Twitter and, you know, kind of highlight what you've been up to. It'd be really great. It's always great to see people using the technology that we talked about on the show. Yeah, we'd love to see that. And like I said,
I would not be where I am today
if it wasn't for all of this open source technology
that was available from a very young age
and being able to become really great
and hireable in a field
just through your own perusing of the internet
and open source projects
is an amazing way into the tech industry.
So reading open source code,
contributing it
and making the most of all the free content online
is the best possible start to a tech career
as far as I'm concerned.
Yeah, that is really special.
Any last words for the audience out there?
Any last bits of advice?
Let's say somebody went to college not for CS, so they went to college for economics or something like that. How would you recommend that person get into the field? Yeah, so for me it was always just a matter of picking a problem in my life, some kind of itch I wanted to scratch, something that I knew could be done with code but didn't exist yet, and just not giving up until it was done, and learning all the technologies along the way. Yeah. But that has definitely become more complicated, because when I started, all you needed to build a website was HTML, CSS, and PHP, and now you've got to learn 20 different JavaScript frameworks just to get, you know, Hello World to show up. So I would these days strongly recommend following some kind of course, but there's a lot of amazing free content as well, just so you start off feeling a little more confident, rather than every website you find mentioning five terms you've never heard of before and getting discouraged that way pretty quickly. But in terms of just writing really great code, learning from the best is one of the great things about open source, and you can become a really great engineer just by reading code other people have written and seeing how it's done. Yeah, that makes a ton of sense. So yeah, I think we can put a bookmark in that.
Thank you, Douwe, for coming on the show. We really appreciate it. Really interesting episode.
I think it's a new topic. It's on the vanguard. It's something that people are going to be,
it's going to be sort of a household name in a few years. So we were able to catch this really early, which is really special.
And thanks, everybody, for supporting us out there on Patreon and through Audible.
Thanks for subscribing to the show.
The subscriber count is growing a ton, which is amazing.
I actually am guilty of not doing very good data ops.
So I honestly don't know why the subscriber count is growing.
That's on me.
But I should definitely get a Meltano instance up and running so I can figure this out.
But regardless, we have a lot of folks, new folks to the show and welcome.
I really appreciate all of you kind of coming in, listening, sending us emails, offering
your support.
And we will catch everyone in two weeks.
Music by Eric Barndollar. Programming Throwdown is distributed under a Creative Commons Attribution license.