Programming Throwdown - 142: Data Ops with Douwe Maan
Episode Date: September 12, 2022

Douwe Maan's journey sounds too fantastic to be true, yet the tale that Meltano's founder shares with Jason and Patrick today is very, very real. Whether it's about doing software development by 11, joining GitLab while juggling college responsibilities, or building his own company during today's challenging times, he has quite the story to tell. In today's episode, he speaks on Twitter, his perspective on remote work, and why data operations are a critical part of developer stacks in today's world.

00:01:00 Introductions
00:03:44 Hustling online at 11
00:08:08 From iOS to web-based development
00:10:20 How Douwe balanced school and work
00:12:05 Sid Sijbrandij
00:19:13 Why Twitter was integral in Douwe's journey
00:21:01 What Meltano offers for data teams
00:22:01 Remote work
00:30:59 GitLab's data team and what they do
00:44:40 What tools do data engineers use
00:47:40 Singer
00:50:26 Game designer travails
00:58:59 Where data operations come in
01:05:12 Getting started with Meltano
01:12:00 Meltano as a company
01:22:09 Farewells

Resources mentioned in this episode:

Douwe Maan:
Website: https://douwe.me/
Twitter: https://twitter.com/douwem
GitHub: https://github.com/DouweM

Meltano:
Website: https://meltano.com/
Careers: https://boards.greenhouse.io/meltano

Singer:
Website: https://www.singer.io/

Mergify:
Website: https://mergify.com/

If you've enjoyed this episode, you can listen to more on Programming Throwdown's website: https://www.programmingthrowdown.com/
Reach out to us via email: programmingthrowdown@gmail.com
You can also follow Programming Throwdown on Facebook | Apple Podcasts | Spotify | Player.FM
Join the discussion on our Discord
Help support Programming Throwdown through our Patreon

★ Support this podcast on Patreon ★
Transcript
Hey everybody! Today we're talking data ops with Douwe Maan. Take it away, Jason. Hey, everybody. So pretty excited. We have an
awesome guest to talk to us about a really interesting topic that is really on the vanguard
of things. It's data ops, which a lot of folks might have not heard of, but we're going to
cover that topic in depth right here today. And I'm so excited that I have Douwe Maan on the show
to, you know, he's really an expert in this field to go through it with us.
Thanks so much for coming on the show, Douwe.
Thanks for having me. Excited to help your audience learn about this cool new merge of data and DevOps.
Cool. Great. Really excited to have you here.
Before we get started, why don't we jump into your background as an engineer
and technologist. How did you get into the field? And how did you get your first job? And how did
you get to where you're at right now? All right. Yeah, we can go back quite a way. So when I was
growing up, there were always computers around the house. My father had been the first person
in his family to have a computer.
And this wasn't just a Windows machine, Internet Explorer.
It was always Linux of some kind.
So I grew up with Mandrake and Red Hat and SUSE Linux.
And there's a whole list of distributions that were tried out.
And very quickly, I learned that computers were not just this device that you use for email or word processing,
but it's something malleable. It's something to debug at times, you know, Linux being Linux.
And it was something that I saw as this whole world that potentially opened up for me as an avenue for creativity. Yeah. Were you worried about bricking your computer? Because
I mean, that was one thing that I was worried about when I was in high school, that I would
somehow irrevocably damage this machine.
Did you have that fear or were you kind of inoculated from that?
I think that because we had all these different Linuxes at different times anyway, the fear of just messing up the operating system wasn't that high, because we were used to reinstalling stuff every couple of weeks or months anyway. So as long as the hard drive with the data was fine, we would be fine.
And actually bricking a computer is still pretty hard,
or at least it was back then.
I guess in today's slightly more closed,
controlled environment there,
that possibility is larger,
but also less anyway.
Anyway, I know that was never a fear,
but I learned really quickly that there was something
where I could unleash my creativity.
So when I was nine years old,
a friend realized that when you go into Microsoft Word and you hit the save as HTML button, it would then open in
your web browser rather than the Microsoft Word application. And this unlocked a whole world for
us because this web browser, which had previously been the domain of big brands and professionals
and adults, was suddenly something that we, being little nine-year-old kids, could manipulate and
get our own stuff to show up.
So this went from local HTML files,
and then we realized that there's such a thing as a web host.
And of course, to us, it was crazy to ask my father for permission to spend like five bucks a year or something for a domain name,
but he relented.
And we were able to actually have a website
that we could share with our friends,
so they could sort of see our musings.
And pretty quickly, this went from HTML and CSS to learning PHP through some tutorial that was available online. And pretty much every day after high school, or primary school initially, when we would come home and our friends would go and play video games, we would go and program and learn new technologies and build little nifty tools to scratch our own itch. And this led, pretty quickly, at age 11, so just at the beginning of high school, to finding this online community where people in the Netherlands, which is where I grew up, were posting kind of projects that they wanted to solicit bids on, basically. They wanted to find developers to implement whatever their company needed, or some small PHP script. Is this like a physical bulletin board, or is there a website with job listings? Like, how did this actually materialize? So it was a website, a web forum called Sitedeals, where one of the sub-forums was about, like, I need this to be done, and then people could just respond with, I'll do it for this amount at this time. And us being 11-year-olds, we could basically undercut everyone with our prices. So we ended up working for, you know,
the equivalent of about $5 an hour
all the way through high school.
How did you masquerade as adults?
Like, how did that work?
I mean, there's no way they were actually fooled
because we didn't have any kind of tax ID.
It was just, you know, money being wired
straight into our bank account.
We were 11-year-olds,
and I'm sure that our writing
reflected that to a degree.
But we still had to
learn how to do contract negotiation and requirements and, you know, due dates, deliverables, multiple stages of delivery and payment, and everything else. So we learned pretty quickly how to be pretty convincing, not so much at being adults as at being professionals who could actually build stuff that worked, and that was the only requirement people had for us at the time. So all through high school,
it was a lot of that freelance web development. And then at some point when the iPhone came out
and the iPad came out, I was really intrigued by that platform. And I taught myself Objective-C
as well and Cocoa Touch. I saw something really interesting about Objective-C. I don't know if
this will blow your mind, like it blew my mind, but you know how in iOS there's NS everything, like NS object,
NS socket. Next step, right? Yeah, that's right. It's from next step. So short tangent here, but
Steve Jobs left Apple, was kind of kicked out of Apple, started a company called Next and built
this machine. Was the machine called the Next Step?
I mean, it's starting to hit the limits of what I know here.
But anyway, so they started calling everything NS
because of Next Step, because of the Next machines.
And then when Apple acquired Next
and Steve Jobs got back in,
that has stuck around to this day.
So that blew my mind.
I've seen NS object everywhere,
but I never knew why there's NS on everything.
Yeah, it's really interesting.
Like some of the, you know, of course, Apple, during those years Steve Jobs was away, was not doing as well as they are today, for example.
And the technology that really was at the foundation of everything from the modern Mac
OS to the iPhone and everything, definitely it came from Steve Jobs' other company, Next
Step.
Super, super interesting.
But yeah, I taught myself that.
I taught myself all the NS prefixes and this really wacky, you know, square bracket syntax
that Objective-C came with.
But this led to me building, you know, an iPhone app for my high school that pretty
much every student and teacher ended up having installed on their phones, which allowed them
to really quickly check the schedule, where they had to go for the next class, but especially also the changes in schedule when, you know, the room was changed, or when some kind of
class fell through, because the teacher was out sick. And I didn't, I never did freelance iPhone
development. But being sort of active in this Twitter, iPhone development, teenager space,
led me to getting into contact with some people in the
Netherlands who were setting up a Mac and iPhone development studio where they were going to be
building productivity applications basically for the Mac and the iPhone on the App Store.
And I ended up joining them and then really quickly becoming the lead developer at age 16,
that must have been. So for a few years there, I was after school, every Monday after school,
I would go into the office and then
work with them there until like 3 a.m. Um, and then during the week, try to fit in a few more hours of work. And that was really exciting. A few of our applications got featured on the App Store. But after a few years, we realized that it was actually pretty hard to make a lot of money on the App Store. In the beginning, when your application is new, you can drive a lot of traffic, you might get featured, blogs might cover you. But then after that, it's hard to keep that up, especially if your first version is already pretty much exactly what you had in mind and there isn't a ton of surface area left for improvements or additional features. So they decided to wind that company down. And then one of my bosses at the time, he had decided, uh, he was exploring this new startup with a previous business partner of theirs, and they decided that they wanted me to join as co-founder and CTO. So at age 18, I co-founded a company with them called Stingo, where we built a web-based platform for bed and breakfast owners to manage their entire online presence. So that includes their website, their guest communication, their reservation calendar, and basically providing a whole tool set for what an old-school bed and breakfast, not just a room on Airbnb, but one of these mom-and-pop shops that really care about their house style, and they make you breakfast and all that stuff, everything they need to sort of enter into this modern era. And I was
the CTO and built the entire Ruby on Rails monolithic application that ended up powering
this platform.
So that was super exciting.
I did at the same time, you know, coming out of high school,
decide to go to college.
I'd already realized that I didn't necessarily need it to get a job,
but I did still want that sort of college opportunity.
And the nice thing was that most of what I had learned already meant that I could skip a lot of studying for stuff in college.
And I did get to sort of deepen my knowledge in areas like cryptography and database internals
and big O notation, data structures, data algorithms, and all of that stuff,
which hasn't necessarily been useful every single day in my later life in industry.
But it did give me the confidence that I wasn't just making sort of newbie mistakes in writing
my code without any kind of understanding of the fundamentals that made one path slow-performing and another path much more optimized.
That makes sense.
How did you deal with that?
I mean, so you're CTO of this company, and the company is at its early stage, you're trying to find product-market fit and a bigger market, right, and everything like that. And you have to balance that with a full course load. And I mean, at some point you're probably thinking, if this company fails because I really wanted to pass Calc 3, like, I kind of messed up. But on the other hand, it's like, you know, once you have the college diploma, you have it for life, and you can go through a hundred companies, you know, and you'll always have that college diploma. You won't always have that job, right? So it's like, how do you strike that balance? I mean, I feel like it's a really difficult situation. Yeah, that's a really good question. Although my situation with that company, Stingo,
it wasn't really the typical startup, let's co-found it, let's give it 100 hours a week,
let's go for the stars. The situation was such that I was 18 and my co-founders were early 30s
and early 50s. They were at very different stages of their lives. The 50-year-old, who was essentially the CEO, had a big background in this whole bed and breakfast world. He had a ton of connections in the Amsterdam bed and
breakfast scene, essentially. So we were able to onboard a lot of users and customers through his
network immediately. But since we were all in these very different places in our lives,
this ended up being initially bootstrapped by this guy I was just talking about. We didn't
raise any venture funding or anything like that.
And we all saw it a little bit more as this is a lifestyle business.
We're going to spend as much time on this as we can get away with,
which in my case was about 20 to 30 hours each week.
And we were just going to build something for this market
that we knew existed in Amsterdam.
But actually, because of these very different places in our lives
and me being eager to like, okay, let's go for it.
Let's make this take off.
Let's build a billion dollar company.
I was eager to go all in.
And they were in different situations.
One of them was pretty close to retirement and starting to think about, okay, how can I cash out and like do other stuff?
The other guy had some kids, so he was more interested in some sort of stable income, right? Rather than the high-risk, high-reward startup situation.
And that's actually what led us to decide
after about three years running that company
to wind it down.
And I was still in college at that time.
I was also pretty active
in the sort of Ruby on Rails meetups in Amsterdam
and Utrecht, this college town where I went.
And through one of these
and a conference that I went to in Athens, in Greece,
I ended up running into this guy.
His name is Sid Sijbrandij.
If you're very aware of companies that IPO'd last year, you might know where this story is going.
But this was a guy who was building essentially a company around an open source clone of GitHub called GitLab, which was this open source project that had come up in
Ukraine in around 2011. And he realized that there was a business opportunity around it to start
offering support or sort of a hosted edition or enterprise editions. And at the time I met him,
this was just four or five people in the Netherlands. And we kept running into each
other a few times. And interestingly enough, his parents had a bed and breakfast
in the north of the Netherlands.
So his parents became customers
of this platform I had built single-handedly
from the code side of things.
What happened to that platform?
So when you say you wound it down,
did you sell it to another private equity
or what happened?
We did briefly consider looking for buyers,
but we ultimately decided
that since the platform was pretty much done,
people were really happy with that.
We could just keep those paying customers going for a number of years.
So we stopped all development and we stopped onboarding new users,
but it stayed up for a while.
And it was only last year, about five years after we sort of stopped investment,
that we wound it down when the user count had dropped below,
you know, a mark
where it wasn't sustainable anymore.
But for a number of years,
this was passive income,
which was, of course, very nice for all of us
while we were working on new projects
and new job opportunities.
Oh, that makes sense.
Very cool.
So I met this Sid Sijbrandij
who was building this company called GitLab.
We kept running into each other
at different meetups around Europe.
And eventually when I was in this position
where I was looking for new projects again,
I reached out to him.
He had already previously told me like,
hey, Dawa, if you're ever interested,
come see if you can come join us at GitLab.
At that point, it was just one startup out of many
that were sort of in my vicinity.
I don't even remember exactly what it was
that attracted me to the project,
although open source and building something
for developers was of course massively attractive being a developer myself
And I ended up joining GitLab as employee number 10, just when they were going through Y Combinator. Uh, that's 2015, I think, while the entire team was in the Mountain View house. And I was, you know, I wasn't able to join because my professors wouldn't allow me to take that much time off. But I did get to join the company at that stage.
Well, let me see if I can understand this.
So at this point, are you a bachelor student or a master student?
Bachelor student.
Yeah.
So I originally joined GitLab part time.
Got it.
And so when you say you're a professor, you mean like one of the people who's teaching a course that you were taking?
Yeah.
One of the lecturers.
Exactly.
Got it.
Okay.
He wouldn't let you move to Mountain View and do everything remote? Yeah, so they were pretty understanding when it came to, oh, you know, I had to go to San Francisco for a week at some point, for a course that was fine. But to say, I'm going to take two or three months off and, like, I'll have to catch up on all of the tests afterwards that we've done remotely, that definitely wasn't an option at the time. So, you know, in academics in the Netherlands as well, there's a little bit of, they don't quite understand sort of the tech startup industry, Silicon Valley world. Of course they look up to it to a degree, but they also think of it as less-than, because they are, like, professors pushing the limits, and it's different. Um, oh no, no, we're not going to give you three months off to work on some startup, that's not real anyway, you're just going to work for some kind of consultancy in the Netherlands.
That was unfortunately very much the mindset of some of the lecturers there.
Not to say anything bad generally about the education.
It's really great.
It's just a very different attitude and focus.
Yeah.
I want to jump into one thing really quick.
I'm assuming, so like, you know, you finish college and then you go to Mountain View to
work for GitLab?
No.
So GitLab from day one had been an all remote company.
It started as an open source project
that was founded in Ukraine.
And then they attracted open source contributors
from all around the world.
So there were already hundreds of active contributors
by the time that this Dutch guy Sid decided,
hey, there's an opportunity to build a business
around this project.
And at that point,
it didn't make sense to start an office somewhere
and then hire some of these contributors and ask them to move. Rather, we were a tiny little pocket of like six full-time people in this community of hundreds, and we were working with them. So we were on GitHub initially, using all of these sort of asynchronous tools to collaborate with these people, no matter where they might be. And from that day, GitLab never changed from its remote work policy. And GitLab, just before the pandemic at least, was the largest all remote company in
the world with 2000 people across 68 different countries and territories. So I joined GitLab
when I was still in college, still in the Netherlands, but half my team, or half the team I worked on, was also in the Netherlands. The rest of them were located elsewhere.
And I was able to combine that with
college for about a year and a half. By the time that that was done, GitLab had grown from 10 when
I joined to probably 100 or so. And I stayed working there for a while from the Netherlands.
But since it was all remote, I also jumped on the opportunity to go and travel and meet these
colleagues around the world. So in 2016, actually, like a month after I finished my college bachelor's degree
and was sort of no longer moored to one particular place,
I went on a six-month trip where I visited 49 of these colleagues
in 20 different cities in 14 countries on five continents
in the space of six months, which was just an amazing experience for myself,
but also the people being visited.
Getting to work with them for five days,
seeing their natural habitat,
their local coffee place, their home, their kids.
And then taking Saturday to see the city that they lived in
with an actual local tour guide.
And then Sunday we would be on to the next location,
which was an amazing experience.
Yeah, so when you were still moored to the Netherlands, we have people listening to our show from all over the world. And a lot of them will write in and say something like, you know, I don't live in a tech capital. You know, I don't live in San Francisco or Miami or Austin or New York or any of this. You know, how do I meet people locally, you know, and have that face-to-face with people who are interested in these kinds of things? I think you kind of, uh, touched on a little bit of that, but what sort of advice would you give to those people who live in, and I'm not putting Kansas City down, Kansas City could be amazing, but I was just like, Kansas City, you know, what advice would you give to, uh, to people who live in Nebraska or something? Well, I mean, I'm not going to touch, uh, any prejudices about US states, but you'll find those like-minded people if you just look hard enough.
What made a really big difference for me is finding this community in the Netherlands called Young Creators, which was basically all high-school-aged kids who had somehow been a little bit more entrepreneurial than their peers. Either they'd been designers or programmers, or they'd been starting little lemonade-stand companies
from a very young age.
And it really wasn't until I found that group
who had, I think, monthly meetups in Amsterdam at the time
that I had to take like a one-hour train ride to get to
until I felt, okay, I'm surrounded by people like me
and we can chat about the same stuff
and we have the same sort of dreams that go beyond
just the space where we grew up. And how did you find that group? Twitter, honestly. Like I said earlier, this job opportunity at this iPhone and Mac development studio came just from being active in the sort of Twitter iOS development space. And then through that, I came into contact with some Dutch people. Of course, there's a good amount of, like, really great Dutch iOS developers as well. So just by following those and interacting with that sort of part of the internet, I met those like-minded people.
Now, Patrick, you have to make a Twitter account. See, this is why.
Oh, I don't use Twitter anymore.
Oh, no.
He doesn't need it anymore.
I'm off the hook.
Oh, I just shot myself in the foot. So, okay, what would you do now? Let's say you were
16 years old over again, but in 2022,
how would you go find those young creators nowadays?
Snapchat.
TikTok, apparently.
I was talking to some Gen Z kids last week, and it just was, I mean, I'm 28.
I don't think of myself as old, but I was so out of touch
with whatever the world today looks like to them.
So probably just TikTok.
But honestly, sort of tying this back again
to how GitLab also worked
and how GitLab built this massive
all remote team around the world,
contributing to open source projects
is an amazing way to interact with
and sort of learn from the best
and show your work to people
who might actually hire you remotely.
Like the first couple dozen engineers at GitLab
were all open source contributors
who either they applied or we reached out and said like,
hey, do you want to do this full time?
Because by the time that they had proven themselves
with high quality contributions
and really great sort of async written communication skills,
it was a known quantity.
And it was so much easier for us to bring some of those in
than to go through a whole hiring process
and not know what you get.
So I would say if you don't just want to meet like-minded people, but also find a way to
potentially get a job opportunity out of it, or at least build code that can sort of form your
portfolio, joining the open source community for a commercial open source project is a really great
way to start. And GitLab is a great example of that. But Meltano, just to sort of, you know, throw in the name of my company, is another one.
It's an open source platform for data teams to work more effectively on all of their data movement and data transformation challenges with the software development best practices built in.
And it's an open source Python project.
And a good amount of our current team of about 17 people started out as open source contributors or generally users of this tool.
And they, too, are based around the world.
We don't have an office either.
So we're very much sort of following in GitLab's footsteps there.
And that's a way in which the world today is also very different from when I started,
you know, in terms of job experience almost 19 years ago, when remote work was still weird
and people didn't really trust paying someone across the internet they only saw through Zoom.
GitLab was a pioneer in that, the pandemic of course has made that extremely
mainstream and there really is no reason for you wherever you're born not to tap into those kinds
of job opportunities if you are able to get on their radar and show your work and open source
contributions are an amazing way to do that. Yeah, I mean, the whole remote work thing is still
unfolding in really interesting ways.
There was an article Malcolm Gladwell wrote, I think yesterday, I mean, very recently,
where he was saying, basically, he was saying remote work is unhealthy, unproductive, etc,
etc.
I don't agree with that at all.
So let me just put my own opinion out there.
But I mean, that's somebody who has a ton of respect around the world and has clearly
fallen on one side of the fence there.
We haven't achieved any type of consensus.
We're still, I think, as a global community trying to figure that out.
But I do think it's fascinating.
I think that, as you said, the pandemic really forced the question on the world because I
think there were a lot of companies that were
very against remote work and the pandemic forced them to embrace it for at least a year
and so that now, it's like, well, they're on much more shaky ground with that argument. So, um, so yeah, the whole thing is fascinating. Yeah, and I mean, I can talk about this some more. Like, GitLab was and is one of the companies
that had made remote work work at scale by far the best.
And anyone who has worked remote during the pandemic
should not think that they now know
what remote work is like.
It's extremely different if you set up a company
intentionally from day one to be remote.
You design all your processes
in this sort of async compatible text-based way primarily
with all kinds of processes to get that social interaction despite the geographical distribution.
Very different from if a company has to suddenly change.
It's not super motivated to change its ways because it thinks it's just temporary.
The people that are forced to do it didn't choose to do it.
That also makes a big difference.
Remote work doesn't work for everyone. It also definitely doesn't work for every industry. But if you're a company that is building software, all of the tools we use, the GitHubs, the GitLabs, the Slacks, the, you know, whatever else we have on our computers, it's all built around this async stuff anyway. And even if you're working on a computer in an office, you're probably talking with, you know, the person one office over over all of these anyway. So it is an industry that is particularly well suited to it.
But then also on the people side, it's really important that the people know what they're
getting into and know that this would work for them.
If you need that constant social interaction in order to feel connected to your company
or to feel productive, then it's never going to be a good experience.
But if people apply for a remote work job intentionally, and they know I prefer to work
from home, I want to be able to spend more time with my kids,
I don't want to have a commute, and I know
by myself that I'm pretty good at just working on a computer
for a number of hours without seeing people,
then it makes a massive difference too
if everyone is sort of into it and
equally bought into the concept.
And then also the intentionality
of the processes you design and the
things you introduce to balance
out some of that
lack of face-to-face interaction
make a huge difference.
Like at GitLab and also now at Meltano,
it's so important to still have people meet occasionally.
So every nine months,
we fly the entire company into one place
and we get like five days or so
of quality in-person time,
which really builds those relationships
and that rapport
and that sort of foundation of understanding
on top of which you can work productively and feel like you can give each other constructive feedback
without being unsure about how that will be received.
And then also at the same time, like I mentioned earlier,
allowing people to travel and visit each other makes a huge difference.
Those six months that I spent traveling around, visiting all of these colleagues,
made a massive difference for me and the people being visited.
And this actually turned into a policy at GitLab where if you travel to visit your colleagues, the company will, most of the time, pay most of your travel expenses.
That's a really good point. Yeah, it's a really good point. You know, if a company isn't paying to keep a desk for you somewhere, that money can get redirected to these types of events, which are probably much better, uh, you know, because you're going with intent. Exactly, it's intentional. So yeah, you save money from the office, you save money from, you know, all the amenities you need to have there, you save people time because of the lack of commutes, and you can make up for all of this by doing these one-off events and funding the social interaction that just does need to take place.
So that six-month trip is something that became a policy and a lot of people in GitLab started
doing where they did like a Euro trip and then they visited, you know, however many
people over the course of three months.
And in every city where we ended up having a significant amount of people, like five
plus, we ended up having monthly co-working days where everyone from that region would
just come in one Friday of the month, work together for the whole day, and then have a shared group dinner. And this also just meant that the people in our region felt great. Yeah, where would they go? We would rent a co-working space or some kind of meeting room somewhere and just all be sitting around a meeting space together, and then, you know, go to a restaurant. And of course, this would be covered by the company as well, because it is part of that social fabric that is really valuable to making a company work at scale.
And then combining these two things, the travel and these co-working days, you ended up having people do a Euro trip where they would try to hit all of the co-working days they could within the time that they were traveling. Amsterdam, Brussels, Lisbon, you know, Rome. Um, and visit seven different European countries with a local tour guide, basically, to show you around, and this opportunity to immediately meet 10-plus people there. And that's the stuff that made it work, which of course during the pandemic was impossible, because no one could see each other at all. So all you folks out there who are going for your bachelor's right now, and you're taking algorithms and data structures, and you're learning about traveling salesmen, and you're like, I will never use the traveling salesman problem, you might find yourself at GitLab needing to go, like, uh, drink beers in four different cities in a month. Yeah, yeah, exactly. It's, uh, it's not bad. But of course, it's not only GitLab anymore, which is great. There's so many remote-only companies now, or remote-first, or, you know, remote-compatible. Although we do believe that the hybrid model doesn't work nearly as well as the all-remote model or the all-in-office model, because with the hybrid thing, you do have sort of a tier system, where the people who get to actually see their boss face to face every day just do have a higher chance of promotions and stuff, because it does make a difference for that foundation of trust.
But if everyone is in the same exact spot,
it works much better.
But there's a lot of companies now,
and if you look at like the most recent YC, Y Combinator batch of startups,
a good percentage of those are just all remote
because why not?
And why limit yourself to only the talent
that is willing to move to one particular metro area
if you could hire everyone from around the US
or even from around the globe
or from around a certain time zone.
At Meltano, our team is distributed
between Mexico where I'm based,
the US, Canada, the UK, Germany.
And we're willing to keep adding places to that
as long as we have enough time zone overlap
where we can make it work.
And there's a lot of companies doing that now,
including a number of them founded
by fellow GitLab alumni like myself. And actually Sid, his next project essentially after GitLab is this open core
venture capital firm where they specifically invest in open core, which means open source
projects with a commercial business model around them.
And a good amount of those are starting out remote as well.
So yeah, it's a fascinating space.
And GitLab has, you know, we've been front runners on both the fields of commercial open
source and remote work for years now.
And that's all stuff that we bring into Meltano as well.
Cool.
So that's a good way to pivot to the topic at hand, which is data ops.
So you went from GitLab to Meltano, but when you were at GitLab, you must have seen something that made you say DataOps is really important.
Was DataOps like a cornerstone to GitLab success or what's the connection there?
Yeah, the connection is a strong connection.
So it was actually not me who realized the opportunity in DataOps. It was a number of people in GitLab who were seeing how the data team at GitLab was working
and the kind of tools they were using and how far away that seemed to be from the types
of workflows and collaboration tools available to the software developers working at GitLab.
And of course, also using GitLab itself, since it was a product, you know, it's a software development collaboration platform, essentially a DevOps platform.
And it was realized that in this data space, these people are technical enough, and they
have similar enough needs of collaboration and high reliability and stability and being
confident in the results of their work, that applying more of these software development
best practices to the work that data teams do could make them a lot more efficient, effective,
and really sort of level up that whole profession.
And it was uniquely GitLab who realized the opportunity for open source and DevOps best
practices applied to the data space.
And it was in 2018 that a dedicated team was set up inside GitLab to essentially build
a better data platform for the GitLab data team, initially as an open source project, that would maybe eventually become an open source business in its own right. So let's unpack that. What is the data team at GitLab? What does that
mean? What is the skill set? What do people do?
Yeah, good question. So people have been saying things like data is the new oil or whatever else people have been saying in these very nondescript phrases that do sort of signal the importance of
it. If you're a company, you're trying to, or if you're any organization, really, you're trying to
accomplish something, you have goals. And you want to know what the right way
or the best way is to get to that goal.
You also want to see how well are we currently doing
or how am I even going to measure my success?
And being able to measure your success
and being able to come up with a plan
where you're pretty confident
that this is probably the best way to get there
requires you to look at the data
and make predictions based on where you are now
and what you see in the data as big opportunities and giving yourself some kind of goal to go after.
And this is the same whether you are an educational institution or you're a non-profit or you're
a massive company that's just looking at the bottom line.
In any case, the more you learn from the data available to you to tell you how well you're
currently doing and how much better you could be doing, the better.
So in a company like GitLab, which is an online SaaS platform with a ton of users and of course
also a product and it has a marketing arm, some of the data you're talking about is just
usage of the product itself.
Like which features are people using?
If they use this feature, do they make it to the next step?
Are there features that are not seeing any use at all?
What are the flows people take through the product to find particular corners of the
product?
How can we surface those more effectively to make people get more value out of them?
Or something like A-B testing, where you don't know quite yet which way of talking about
something or which way of presenting a feature is going to get most people value out of it.
You want to be able to compare those.
Or with marketing, if you have a Facebook ads campaign or you have Google ads or whatever
you might have, you want to be able to measure the impact of those.
Like, did this specific phrasing actually lead to more users?
Did advertising alongside this type of TikTok content or whatever work better than other
type of content?
You want to be able to compare that and use that data.
And similarly, when you're talking about your customer base, you want to learn, is there
a particular segment of the industry that has a
far shorter time to close than another? And should we double down on those? Or do we see that people
with particular characteristics are far more likely to churn and stop paying us at some point?
So this means that you have this data that could help you as a company be successful
in all of these different tools like Zendesk or HubSpot when it's about support or CRM stuff, or in your own product and you might be using a platform like Mixpanel to track that data, or even if you want to track the efficiency or the happiness of the employees within your company just to sort of make sure that you're not falling short there.
These are a ton of data sources, and it's up to the data team to build all the pipelines that get the data from all of these different APIs or databases and bring them into a place where analysis can take place.
And analysis uses a tool, a BI, business intelligence tool,
that allows you, usually with SQL, to write these queries
and build these little dashboards that show you
how you're doing on certain metrics.
But these first steps in the process are data movement
and data transformation.
Getting this data from all the places where it's currently hidden, getting it out usually through APIs or file dumps in FTP folders or S3 buckets or something.
And then transformation means taking that raw data, whose schema was optimized for an application, and turning it into a schema that is more appropriate for analysis. And for the types of questions you want to ask it and the types of queries you want to run against it, being able to do those effectively often requires you to change the schema and transform the data in a way to aggregate things, or to anonymize things also, for example, in case you don't want to mess with PII.
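To make that PII point concrete, here is a minimal sketch of what hashing an identifying column during transformation might look like. The field names, salt handling, and in-memory records are made up for illustration; in a real pipeline this would usually be done in SQL inside the warehouse.

```python
import hashlib

def pseudonymize(value: str, salt: str) -> str:
    """Hash a PII value so rows can still be grouped per user
    without storing who the user actually is."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

# Raw rows as they might come out of an application database (made-up fields).
raw_events = [
    {"email": "alice@example.com", "screen": "screen_2"},
    {"email": "alice@example.com", "screen": "screen_4"},
    {"email": "bob@example.com", "screen": "screen_2"},
]

SALT = "rotate-me-regularly"  # assumption: a salt kept outside the warehouse

transformed = [
    {"user_hash": pseudonymize(r["email"], SALT), "screen": r["screen"]}
    for r in raw_events
]

# Distinct users are still countable, but no email addresses are stored.
print(len({r["user_hash"] for r in transformed}))  # -> 2
```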
So this challenge of data movement and transformation has been known for decades, but the tools used to do it are, from the eyes of a software developer,
still pretty... I mean, it feels like legacy tooling. It feels like when I was nine years old
and I was building a website and I would FTP into the web server and make live changes to PHP files
and then go check in the browser just to see I didn't break something. So you're always working in production. Every change you make could immediately affect the user.
And in the case of the data world, that means that if you hit the save button, you might
accidentally break the dashboard that your CFO is about to present to the board, or the thing that the CEO is looking at to see what they should be worried about today.
And that approach of just hitting the save button and crossing your fingers, as a software development community, we've moved past that with DevOps, version control, continuous integration and deployment.
And all of these best practices could also really help data teams.
And it was GitLab that saw that opportunity and started building that for its own data team, who, of course, had been especially exposed to how software development teams worked and
realized like, hey, we want some of that.
So in 2018, the team in GitLab started working on this tooling.
In 2019, I ended up joining that project as development lead when there were four engineers
and a general manager on the Meltano project.
And in 2020, the headcount of that project was brought down from six down to one because
we hadn't quite been hitting the numbers and the growth numbers and the contributor activity that we were looking for
So I was left by myself on the Meltano team, and throughout 2020, I managed to identify a more narrow sort of description of the problem we solved and a way to really reach the audience we wanted to find. And it was through 2020 that this started taking off as an open source project with hundreds
and thousands of users. And in early
2021, we spun it out as
an independent startup from GitLab with
seed funding from GV, formerly
known as Google Ventures. And since then,
we've been on our own independent
startup journey. We raised money again this
year, and we are really sort of
building out this massive vision of
building data tooling that that
adopts from the ground up all of these things that have made the development teams software teams
so effective and we're seeing a lot of interest in that fortunately yeah so so let's double click
on this so it's a data engineer or is that is that a orthogonal thing no so data engineer is
one of the titles that's most common
for these people that are challenged or tasked with these data movement and transformation
challenges so our target audience are data engineers exactly got it i see so so someone
writes an app like we can go back to your days making iphone apps right someone makes some iphone
app they're writing all this objective c they're writing they're creating ns objects they're
wondering why there's NSs everywhere.
And then they send their app out and it gets, you know, three stars or two stars or one
star.
And people say, oh, you know, it's crashing all the time.
So now, some of that data the platform will provide.
So Apple has the crash handler and you can go and look at the logs and everything.
So you fix all the crashes.
You say, OK, I'm done.
Now you get instead of getting one star,
you get two stars.
They're like, okay, the app is stable,
but it's just not what I wanted.
And so you have to sift through these reviews.
It's pretty painful.
And you've already kind of lost a big opportunity there, right?
So that is bad.
That is not the way to launch an app.
But a better way to launch an app
would be to start some really
limited beta, call it a dark launch, where you don't do a lot of advertising or anything.
You get a few people, and now you don't just want their reviews, because a lot of people
won't write reviews. You might get everyone writing about the same complaint. You want to
get really in-depth data. Did they see which of my screens in my app did they look at? If nobody
looked at screen four, well, why did I even build it, right? Or maybe I can't get to screen four,
you know, whatever. You know, if everyone's spending all their time on screen two, that's
what I need to spend my time as a developer focused on, right? And so you can continue to
like subdivide and subdivide and
subdivide down to more and more nuanced data. What are people in Mexico City doing? What are people
in Texas doing? And so you end up asking all of these questions and needing to slice in all these
different axes, right? And so that is the sort of business value of kind of data ops and data engineering.
I'll say of the whole data engineering, data science kind of process, right?
And so then data engineers and data scientists are going off and trying to build the infrastructure and sort of the semantics to answer those questions.
But you're saying they're doing the equivalent of, like, us, you know, SSHing into a machine and doing all the coding in nano or something, right? So yeah, I'm touching the production code, a production system, all the time. Like, oops, I forgot a semicolon, and now everyone who goes to my website just gets an error 500 for the next, like, three minutes until I realize it, right? And so Meltano is an attempt
to make it more productive and safer
to answer all of those questions we just talked about.
Today's sponsor is Mergify.
Mergify is a tool for GitHub that prioritizes,
tunes, automatically merges, comments,
rebases, updates, labels, backports,
closes and assigns your pull requests.
Mergify features allow you to automate what you would normally do manually.
You can secure your code using a merge queue, automatically merge it, and many more features.
By saving time, you and your team can focus on projects that matter.
Mergify can coordinate with any CI and is fully integrated into GitHub.
They have a startup program that could give your company a 12-month credit to leverage Mergify.
That's up to $21,000 of value.
Start saving time.
Visit Mergify.com to sign up for a demo and get started.
Or just follow the link in the show notes.
Back to the episode.
Is that a good summary?
Yeah, that's a really great summary. I would say, you know, you started talking about, you know, new app, dark launch.
Of course, the more data you're working with, the more valuable it will be to try to automate this with processes and have data engineers that write pipelines rather than just going in manually and doing the things that don't scale.
But yeah, definitely. By the time you have a lot of data like that and you actually have processes or people in your company relying on it, you also want to make sure that when you make a change, you don't accidentally break the thing that's live.
And you don't then want to have to scramble to fix it live because then you're even more likely to make mistakes
or not fix it in the best way.
And data teams that are becoming more and more relied on
by their business,
and sometimes this data is also fed back
into machine learning processes
or just goes back into other company systems
and it kind of spreads within the network
of all the connected parts of the business.
If there's a mistake in there, it's a problem.
And if you are hitting save and crossing
your fingers and hoping you don't break something, that just doesn't work anymore in this day and age.
And in software development, a lot of this has been solved. We know a lot of the things you can
introduce, workflows, practices, and tools, into teams to help. And the goals are the same. You want to
have high confidence in whatever is live. You want to be able to experiment safely,
try stuff out locally, have the feature branches in your Git repo with experiments,
and not feel like you have to limit yourself from making changes because you might accidentally do
something that cannot work anymore. Even being able to roll back to a previous version
is not a given in a lot of the data tooling that exists today. So if you are building a pipeline
today, especially if you are coming from a software development background, it seems like a
no-brainer to have something that can be version controlled and something where you can have CI/CD
tell you if you are accidentally going to break something. And this increases the efficiency,
effectiveness, velocity, and innovation of the team that is working to solve these challenges,
same as we've seen in software. And Meltano's approach is one in which we have identified that there are a lot of really
great open source tools for different steps of the data lifecycle.
But these tools themselves don't necessarily embrace the software development best practices.
And at the level of your entire data platform or your data stack, when you're putting together
these three or four tools that together solve the data movement and transformation problem, there's no way to manage their configuration
in a version controlled way or to manage changes that span multiple tools.
If you need to update the configuration for how to get data out of some SaaS API, and
at the same time you want to modify your transformation script, you want to do that at the same time
and you want to be able to validate that that combined change didn't break something.
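As a rough illustration of what "CI tells you before you break something" can mean for a data pipeline, here is a sketch of a test that runs a version-controlled transformation query against a tiny fixture database. The table and column names are invented, and sqlite3 stands in for whatever warehouse is actually in use.

```python
import sqlite3

# The transformation under version control (invented schema; in a real
# project this might live in its own .sql file next to the extractor config).
TRANSFORM_SQL = """
CREATE TABLE analytics_signups AS
SELECT campaign, COUNT(*) AS signups
FROM raw_users
GROUP BY campaign;
"""

def test_transform_produces_expected_columns():
    conn = sqlite3.connect(":memory:")  # sqlite stands in for the warehouse
    conn.execute("CREATE TABLE raw_users (id INTEGER, campaign TEXT)")
    conn.executemany(
        "INSERT INTO raw_users VALUES (?, ?)",
        [(1, "google_ads"), (2, "google_ads"), (3, "facebook_ads")],
    )
    conn.executescript(TRANSFORM_SQL)

    rows = conn.execute(
        "SELECT campaign, signups FROM analytics_signups ORDER BY campaign"
    ).fetchall()
    # If someone edits the transformation and drops a column or changes the
    # grouping, this fails in CI instead of breaking a live dashboard.
    assert rows == [("facebook_ads", 1), ("google_ads", 2)]

if __name__ == "__main__":
    test_transform_produces_expected_columns()
    print("transformation test passed")
```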
You cannot see those in isolation, but the current world in data engineering expects those
changes to be fully separate from themselves from each other and that is just not the reality so
meltano allows you to bring together every aspect of your data platform from start to finish and all
these different tools that you use to solve uh sort of the incremental problems along the way
in one place with a consistent approach to version control, configuration management,
and end-to-end testing. So if I'm talking to a data engineer, I would explain all of this in
context of the specific benefits because we cannot expect them to already know all of these software
development terms. But if you are a software developer and at your new company, you see your
data team scrambling with their work, if you want to make them as effective as you have been as a software developer,
then Meltano is the tool that will help them do so. Yeah, that makes sense. So going back to our
app example, so we have, you know, let's say thousands of people running our app, right?
And so we want the data from all of them, you know, in one place so that we could do all this analysis. So kind of walk us through, you know, what that looks like. So how do we go from NSLog in my app to, you know, we have some dashboard with everyone's data on it. What are the tools that data engineers
use? What does that look like? Yeah, so I haven't been in the iOS ecosystem recently enough to exactly know how to get that NSLog statement into some kind of SaaS product. But where it usually starts, from the Meltano perspective, or from the data engineer perspective, is that this information we want already lives in some kind of SaaS system. So that might be, you know, Apple's own crash report sort of interface, which hopefully has an API, and if not, that's a feature request.
And then the next step is,
how do I get that out?
So you can, of course,
learn the API documentation,
write your own little Python scripts
or whatever language you like,
but Python is sort of the lingua franca
of the data world.
And then pull out that data.
And then you want to get into it,
into a data warehouse,
which is not just a database like Postgres or MySQL,
but it is specifically optimized
for analytical workloads and use cases.
And we call those OLAP databases.
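For a rough idea of what the hand-rolled version of this looks like before reaching for a connector library: pull JSON out of some SaaS API and load the raw records into a warehouse table. The endpoint and fields are hypothetical, `requests` is assumed to be available, and sqlite3 stands in for what would really be Snowflake, BigQuery, or Redshift.

```python
import sqlite3
import requests  # assumption: installed; any HTTP client would do

API_URL = "https://api.example.com/v1/crash_reports"  # hypothetical endpoint
API_TOKEN = "..."  # deliberately left out

def extract():
    """Pull raw records out of the SaaS API, page by page."""
    page = 1
    while True:
        resp = requests.get(
            API_URL,
            params={"page": page},
            headers={"Authorization": f"Bearer {API_TOKEN}"},
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            return
        yield from batch
        page += 1

def load(records):
    """Write raw records into the warehouse (sqlite as a stand-in)."""
    conn = sqlite3.connect("warehouse.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_crash_reports (id TEXT, payload TEXT)"
    )
    conn.executemany(
        "INSERT INTO raw_crash_reports VALUES (?, ?)",
        [(str(r.get("id")), str(r)) for r in records],
    )
    conn.commit()

if __name__ == "__main__":
    load(extract())
```

This is exactly the kind of script that works once and then becomes a maintenance burden, which is the gap the connector libraries discussed below are meant to fill.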
What's an example of a data warehouse?
Yeah, so BigQuery, Google BigQuery is a big one.
Amazon Redshift is another.
But then Snowflake has really taken the world by storm over the last decade or so.
And that is what we see our users using most.
So it is a database that's optimized for running analytical queries that require a lot more
compute.
They're really complicated.
Tons and tons of joins, for example.
It's still SQL, but under the hood, it's expecting more complicated SQL.
And so the engine is better.
They're columnar data stores.
So they don't store data in rows.
They store data per column.
So if you do aggregation over a column,
all of that data is already,
the memory locality is much higher
because it's literally all packed together
instead of being spread out over these rows
with different offsets into each row.
It has sort of the columns instead of the rows
as the primary way of storing
things. But SQL with some extensions, which are then dependent on the specific framework you're
using or platform you're using, is still the main language to pull this data out. Yes.
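To illustrate why the columnar layout helps for analytics, here is a toy comparison in plain Python, with made-up fields: the same data stored as rows versus as columns. An aggregate over one field only has to touch one contiguous array in the columnar case, which is very roughly what the warehouses above do under the hood.

```python
# Row-oriented: every row carries every field, like a typical OLTP database.
rows = [
    {"user_id": 1, "country": "MX", "session_seconds": 310},
    {"user_id": 2, "country": "US", "session_seconds": 125},
    {"user_id": 3, "country": "MX", "session_seconds": 742},
]

# Column-oriented: one array per field, loosely how an OLAP store lays data out.
columns = {
    "user_id":         [1, 2, 3],
    "country":         ["MX", "US", "MX"],
    "session_seconds": [310, 125, 742],
}

# Aggregating one field from the row layout means walking every whole row...
total_row_layout = sum(r["session_seconds"] for r in rows)

# ...while in the column layout the values are already packed together.
total_col_layout = sum(columns["session_seconds"])

assert total_row_layout == total_col_layout == 1177
```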
Got it. I see. So we have some way of getting the data from, you know, it could be a website,
an app, from whatever, a video game, from whatever it is into this data warehouse, which is similar to a database as an end user, but under the hood, much more efficient for what we want to do next.
And then what happens once we have everything in the data warehouse?
Yeah, so to be clear, you could, as I just described, write your own little script to get the data from the Apple Crash Report API and load it into your data warehouse.
But this is a problem called extract and load or data movement, data integration, data ingestion.
Those are sort of the terms you want to Google, which a lot of companies have tried to solve by building these connectors for different data warehouses, different SaaS APIs, and having you pay some kind of subscription fee to do that work.
So one of the things that makes Meltano different
is that we have embraced an open source library
of connectors for data sources,
which today counts more than 300 different sources
and destinations that are supported.
And this standard is called Singer,
named after the sewing machine, actually.
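Roughly speaking, a Singer tap is just a program that writes JSON messages to stdout, one per line, and a target reads them on stdin. Here is a minimal hand-written sketch of the three main message types; the stream and field names are invented for illustration.

```python
import json
import sys

def emit(message):
    # Singer taps and targets communicate as JSON messages, one per line,
    # over stdout/stdin.
    sys.stdout.write(json.dumps(message) + "\n")

# Describe the stream and its schema (JSON Schema).
emit({
    "type": "SCHEMA",
    "stream": "users",
    "key_properties": ["id"],
    "schema": {
        "type": "object",
        "properties": {"id": {"type": "integer"}, "email": {"type": "string"}},
    },
})

# One message per extracted record.
emit({"type": "RECORD", "stream": "users",
      "record": {"id": 1, "email": "alice@example.com"}})

# Bookmark how far extraction got, so the next run can be incremental.
emit({"type": "STATE", "value": {"users": {"last_id": 1}}})
```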
And it is this Singer library of connectors
that we with Meltano have embraced
and built a platform around
so that you don't have to do this work
of writing your own Python script again each time.
And you can leverage the work
that the wider community has already done.
And you are able to self-manage this
and improve the open source scripts
instead of fully relying on some kind of
proprietary SaaS extractor and loader offering. So that's the data ingestion step. The next step then is to take that raw data, which usually matches the database schema structure of the application, which means, you know, columns, tables, rows, primary keys, and turning it into something that gets closer to the types of questions you want to ask for the analytical side. So if you just have a single database or a single data source, usually the original schema is going to be good enough. You might want to drop some PII columns, which have personally identifying information, or you want to hash them,
for example, so that you can still tell the difference between different people without
actually being able to tell who they were, and then run queries against it.
But especially once you start having data from multiple sources, like you were saying earlier,
when you want to find out whether someone from Mexico City gets stuck earlier or something,
or whether some marketing campaign on Google Ads or Facebook Ads gets people further into
the user flow or makes them more likely to convert into paying customers than some other campaign,
you got to be able to combine this data.
And those database joins are not something you want to do every single time you ask a question of the data warehouse, because joins are expensive.
So you want to do some amount of pre-aggregation
at the data transformation layer
where you combine these tables into analytics data tables
that have exactly the data already together in the same table
that you're going to want to compare or combine.
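As a sketch of what such a pre-aggregated analytics table can look like, here is an invented transformation that joins ad-campaign data onto signups once, so dashboards can query the combined table without re-running the join every time. Every table and column name here is made up, and in practice this would often live as a standalone SQL model rather than a Python string.

```python
# A transformation step, expressed as SQL the warehouse runs once per
# pipeline run (illustrative names only).
ANALYTICS_SIGNUPS_BY_CAMPAIGN = """
CREATE OR REPLACE TABLE analytics.signups_by_campaign AS
SELECT
    c.campaign_name,
    c.channel,                           -- e.g. google_ads, facebook_ads
    COUNT(u.id)          AS signups,
    SUM(u.became_paying) AS conversions
FROM raw.users u
JOIN raw.ad_campaigns c
  ON u.acquisition_campaign_id = c.id
GROUP BY c.campaign_name, c.channel;
"""
```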
So in the challenge of answering questions about any kind of data
you have in various APIs, data extraction and loading,
then data transformation, and then writing the actual analytics queries
are sort of the steps of the process.
And in all of these steps,
you're essentially dealing with code,
whether it is the SQL queries
that handle your analytics questions
or the Python scripts or the SQL queries
that define the transformations
or just the configuration of the extract-and-load pipeline,
which might change over time
as you change the schema you want.
These are things that we can think of
as different versions or iterations
or revisions of the code.
So version control sort of directly applies.
Yeah, that makes sense.
Yeah, I want to pause just for one quick moment here
and talk about how unbelievably important this is.
I was listening to a game development podcast, sorry, a game design podcast, and these are like pure game designers. A lot of them don't write any code or know how to program or anything like that; they're purely on the design side. There was one person talking about how, from testing, they found something about when the hero, you know, sort of hit the enemy. They originally had it so the enemy was this knight, this fully armored knight.
And when you hit the shield of the knight, you know, the knight didn't take any damage.
It just sort of blocked it.
And you made this, you heard this little clink sound.
But then when you hit the knight, the knight like flashed red, you know, showed that you bypassed the shield.
You got through the shield.
But they had the same clink sound, because it's like all metal, right? The shield's metal, the armor's metal. And what they found through A/B testing was that that was unbelievably unsatisfying. Like, people wanted an "ah" or something like that, you know? And when they switched from using the same sound for both of those, which is, you know, physically accurate, to a different sound. Maybe it was a different type of clank or a grunt. I don't know. The user engagement went way up.
And so it's one of these things, it's like nobody could foresee that, right?
It's only something you can see post hoc.
And they basically started down this path because they found this one boss to be much less satisfying than the rest from data, right?
So no matter what you're building, you're building it for other people.
And those people have a collective consciousness that you cannot be fully aware of, right?
It's impossible.
Even Steve Jobs, as much as people talk about Steve Jobs, you know, was constantly relying on data and iteration and feedback loops. And so you're going to have to learn all of these things we're talking about to build anything that's successful, especially nowadays, when it's so incredibly competitive. There are so many great apps out there in the App Store, the Unity store is so full of games, right?
So this is absolutely critical to understand.
And I think, Douwe, you're doing an amazing job kind of walking us through this.
So we have this ETL system in place.
We've got everything in a data warehouse, and then we've transformed it to something that allows us to do the analysis we want efficiently.
How do we go from that to a website with a pie chart on it?
You just threw out a new term, ETL, which might be new to the audience. So that stands for extract, transform, load. And I'm calling this out because the sort of more modern version of this is actually ELT, where the transformation happens after the load process. And this is relevant because, you know, for those who are interested in the history, if you do the extract, you have a script that basically loads all of the data from the sources into memory. You can do the transformation in memory before even writing into the database. But that has a lot of, you know, heavy compute and memory requirements. It is sort of the traditional model, where you use Python code or all kinds of algorithms to change that schema before it ever
lands in the database. But one of the things that these new analytics databases are just also really
good at is those kinds of transformations efficiently. So if you can define your
transformation, not in terms of a Python script, but in terms of a SQL query, where it targets the
raw data, and then the select query you write
outputs the new query
or the new schema rather.
You can define your transformations in SQL
in a way that's easily version controllable
and an analyst can actually help with
and do it all in the data warehouse
so that you can change your transformation over time.
And instead of having to redo
the entire extract pipeline,
you can just run it against the raw data again and again and iterate quicker that way.
So we think of it as ELT, not ETL.
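A small sketch of why that ordering helps with iteration, assuming a hypothetical raw table that has already been loaded: changing the transformation is just re-running SQL inside the warehouse, with no call back out to the source API.

```python
# ELT iteration sketch: the raw table is already loaded, so changing a transformation
# is just re-running SQL inside the warehouse, with no need to hit the source API again.
# Table and column names are hypothetical.
import psycopg2

TRANSFORM_V2 = """
CREATE SCHEMA IF NOT EXISTS analytics;
DROP VIEW IF EXISTS analytics.crash_reports;
CREATE VIEW analytics.crash_reports AS
SELECT id,
       occurred_at::date AS crash_date,  -- v2 of the transform: bucket by day instead of raw timestamp
       lower(device)     AS device
FROM raw_crash_reports;
"""

with psycopg2.connect("postgresql://user:pass@warehouse:5432/analytics") as conn:
    with conn.cursor() as cur:
        cur.execute(TRANSFORM_V2)
```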
But then, yes, we've now talked about a tool for EL, extract and load. Something for transformation, which in our open source land is typically dbt. Like I mentioned, for the EL side, Singer is this really great, amazing technology for extract-and-load connectors. dbt stands for data build tool, which is this way of defining transformations with SQL.
And then the last step
that you'll definitely need
is like you said,
getting some kind of pie chart somewhere.
Well, if you're going to do
a pie chart somewhere,
you can just write your,
you know, Python code
or your Jupyter notebook
and directly target the data warehouse.
What's more typical
for data organizations
is that they have a BI tool of some
sort, which might be Looker or Tableau
or Power BI or
Superset and Metabase.
BI is business intelligence, by the way.
Yeah, business intelligence, exactly.
Those are the types of tools that allow
you to define dashboards and reports and give
you all kinds of choices and visualization methods.
And you don't need to write code for it, so there's no pie chart code to write. It's sort of point and click. Because in most cases, these data analysts are less technical than the programmers who are
at the beginning of the data journey. Data engineers are quite technical. They know code.
They're pretty comfortable or getting comfortable with version control. Data analysts most of the
time come from an Excel world where they are used to just looking at the data in a tabular form and then writing the queries to get the sort of the metrics and aggregates that we're looking for.
And in the open source space, Superset and Metabase are two really great BI tools.
Outside of that, like I mentioned earlier, Tableau, Looker, Mode, there's a whole long list of them that teams use.
But yes, by then you are at the point where you have a dashboard and it shows you the number
and you can see whether this month
was better than last month
or you can see whether campaign A
did better than campaign B
or like you were saying
in the A-B testing scenario
with the video game,
you can see how much more time
people spent on the game
with the grunt versus the,
you know, the clink sounds on the sword.
But all of these things,
they are obviously really intertwined. Like, if you want to ask a new type of question of your data, or you want to have new data involved, you've got to go to the data engineer and ask them to write a new EL pipeline, to do a new transformation, and then you can write a new query. But these currently live in siloed-off little environments. The people that use these tools might not actually be talking all that much; it's more throw a request over the wall, and then two days later, hopefully it's solved.
And we think of this very much as one change set
that happens to be spread across different tools,
but it's something that you should be thinking of
as one change to your data platform
rather than separate changes to each tool involved.
And we bring those changes together in one repository.
We bring those applications,
those different tools together in one repository.
And we essentially bring the entire data team together in one place so that they can also collaborate more effectively and know what the other is doing
and give feedback on each other's work and get to a place where a data analyst feels
confident actually suggesting a change to the data pipeline through a pull request,
knowing that they can make dumb mistakes, but they won't accidentally break production
because it's going to have to go through that code review and CICD process
anyway.
So the sort of siloing we're seeing or the situation where a data team might not even
allow a junior engineer to go into the system and make changes because the chance of accidentally
breaking something that's just too high is, of course, limiting the effectiveness of these
teams and their ability to quickly iterate and improve.
And like you called out earlier, your data platform is a massive competitive advantage.
If you know better how you can make your customers or your users paying customers or make them
stick with you for longer or do exactly the new feature that they want, you're going to
do better than your competitors.
And right now, data tooling is not built in a way that actually allows data teams to get the most out of it. And that's where Meltano comes in.
And like I said, it's open source.
So everyone can just download it and give it a try today.
And we are building commercial offering around it,
but we want to make the barrier
for people getting value really low.
And that also means that we have a large community
of users and contributors that, as I said earlier,
you listening today can also become part of
if you want to get job opportunities potentially through this field, or you want to be at the vanguard, the forefront, of this modern approach to data engineering. Cool, yeah, that makes sense. So yeah, one thing that's not totally clear is how you handle the fact that there's data in the loop. So we talked about sort of parallels to software development, and in software development you have version control,
you have all of these things.
It's not clear how data fits in.
Like if I'm writing a software application, I have a bunch of .ini files or .json files that specify my config, and they just go straight into source control
with everything else.
Yeah, I can't imagine putting all the user data
into Git or I don't think it's going to work.
So yeah, how do you handle the fact
that now there's data involved with this?
Like, how do you do versioning here?
That's a really good question.
And DataOps is sort of orthogonal
in that there is the part
where we're applying these iteration strategies
and tools to the actual code or technology
that powers all the data flows,
which is what we've been focused on so far,
because that's where a ton of gain is to be had.
We have not ourselves looked into versioning the data itself
or data validation.
Although data validation,
adding some kind of testing pipeline that will tell you
if the data is suddenly looking different
than you expected it to,
or if instead of a hundred records per day, you're only seeing 20.
There's other tools that focus on that,
and we support those on Meltano as components
that can be brought into your Meltano project.
And we can then help you version the testing criteria themselves.
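A toy version of that kind of volume check might look like the following; the table name and threshold are made up, and dedicated data-quality tools do this far more thoroughly, but it shows the shape of a test that can live in the same repository and CI pipeline as the rest of the platform.

```python
# Toy volume check of the kind described: if yesterday's load is far below the usual
# daily record count, fail loudly so the pipeline (or CI job) goes red.
# Table name and threshold are hypothetical.
import sys
import psycopg2

EXPECTED_DAILY_MIN = 80  # we normally see ~100 records per day

with psycopg2.connect("postgresql://user:pass@warehouse:5432/analytics") as conn:
    with conn.cursor() as cur:
        cur.execute("""
            SELECT COUNT(*) FROM raw_crash_reports
            WHERE occurred_at >= now() - interval '1 day'
        """)
        (count,) = cur.fetchone()

if count < EXPECTED_DAILY_MIN:
    sys.exit(f"Data volume anomaly: only {count} records in the last day (expected ~100)")
print(f"OK: {count} records in the last day")
```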
The data itself is not typically versioned,
in large part because you don't always need the old versions,
and it can be extremely expensive
if you have a lot of that data.
And part of the point of these data platforms
is that you want them to be sort of reproducible,
where your data pipeline defines the entire flow
from the API to the desired format and graphs,
which means that the data itself,
as long as it's still in the SaaS APIs,
you can upgrade your platform
and just rerun the pipeline again,
and you'll pull out all the new data and be at the latest state. If you want to version your data, that's
not something we are focused on. There's other tools that have been working on that. But we
think that that's the next step. Once people are comfortable with this concept of versioning and
different feature branch pipelines, et cetera, versioning the data itself is the next step, which we will probably get to
eventually. But what we will likely do is look for the most promising open source technology that is working on that problem and adopt those as supported components on top of the Meltano platform. That makes sense. So if we do this ELT, where we get the data into the data warehouse,
and then we apply transformations to it.
How does that handle the changes on the software side? So I load today's data and I rename a field.
And so my field is called time, but I spelt it T-H-Y-M-E.
And then I realized like three months in, oh, shoot, it should be T-I-M-E.
And so I change it.
And so now I have this data that has two different schemas, right?
And so how do you handle the fact that the app developers are constantly making changes?
What happens in that case is that you need to modify your data transformation SQL queries
at the same time as the upstream changes, essentially.
So your analytics queries are written to an analytics schema,
which is derived from the raw schema
that comes from the transactional database
that your application developers are working on.
So when the application developers change their schema,
then the data pipelines would break
because the source they're targeting doesn't match anymore.
But you can update the data transformation SQL queries, still have the same output format,
so that your analytics queries and your dashboards don't need to change,
and just modify the way that that same output schema is derived from the input data.
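Picking up the thyme-versus-time example from the question, here is a hedged sketch of absorbing the rename in the transformation layer so the analytics schema stays stable; the table and column names are hypothetical, and a real project would express this as a versioned dbt model rather than an ad hoc script.

```python
# Sketch of absorbing an upstream rename (thyme -> time) in the transformation layer,
# so the analytics schema that dashboards read stays identical. Names are hypothetical.
import psycopg2

TRANSFORM = """
CREATE SCHEMA IF NOT EXISTS analytics;
DROP VIEW IF EXISTS analytics.events;
CREATE VIEW analytics.events AS
SELECT id,
       -- old rows only populated "thyme", new rows only populate "time";
       -- downstream queries keep seeing a single "event_time" column either way
       COALESCE("time", thyme) AS event_time,
       payload
FROM raw.events;
"""

with psycopg2.connect("postgresql://user:pass@warehouse:5432/analytics") as conn:
    with conn.cursor() as cur:
        cur.execute(TRANSFORM)
```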
But this is also where a lot of data teams run into problems,
because they are dependent on upstream data providers, which might be APIs,
which are relatively slow moving.
They have clear change logs.
They have different versions in many cases.
But you might also have application developers within your own company who are not really
aware of what's happening downstream of that data.
So one of the things that Meltano allows is for the CICD pipeline of the main application code and the CICD pipeline of the data platform
to also be connected so that the application developers can be informed when their change
would break something so that the data team gets a chance to accommodate that change in
the same pull request or in a related pull request to the upstream change so that they
can be deployed at the same time.
Instead of the data engineer finding out that suddenly their production dashboard is broken, and then having to scramble to
fix it while their CFO is telling them like, hey, I've got this work meeting tomorrow,
like, why isn't this done yet?
So being able to combine or to bring the data platform into the same development workflow
where the complete downstream impact of any changes is validated before stuff goes live
is part of the value you get from
building your data platform like a software project and bringing in these CICD benefits.
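One way such a connected CI job could catch this, sketched with hypothetical table names and not Meltano's actual API: after the pipeline runs against a test database, assert that the columns the dashboards depend on still exist, so the upstream pull request fails instead of the production dashboard.

```python
# Toy CI check: after the pipeline runs against a test database, verify the columns the
# dashboards rely on still exist, so a breaking upstream change fails the pull request.
# The contract contents and connection string are hypothetical.
import sys
import psycopg2

DASHBOARD_CONTRACT = {"analytics.events": {"id", "event_time", "payload"}}

with psycopg2.connect("postgresql://user:pass@ci-warehouse:5432/analytics") as conn:
    with conn.cursor() as cur:
        for table, required in DASHBOARD_CONTRACT.items():
            schema, name = table.split(".")
            cur.execute(
                "SELECT column_name FROM information_schema.columns "
                "WHERE table_schema = %s AND table_name = %s",
                (schema, name),
            )
            present = {row[0] for row in cur.fetchall()}
            missing = required - present
            if missing:
                sys.exit(f"{table} is missing columns the dashboards expect: {sorted(missing)}")
print("Downstream schema contract satisfied")
```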
Got it. I see. That makes sense. So you have a pull request that fixes a column name
in your application. That pull request kicks off some kind of continuous integration job, which is going to run your app, generate some data, then
try to ingest that data and produce, let's say, a really sparse dashboard.
And when it does that, it's going to fail because the part that's producing that dashboard
is expecting a different column name.
And so then now you realize, oh, this change actually
requires a complementary change on the data engineering side to say, you know, if the date
is newer than this, then or if the version of the software is newer than this, then use this name
for the column, otherwise that name. Now you get the dashboard working again. And maybe after a
year or so, you can go and get rid of that if statement
once you've purged all that old data or something.
Yeah, that's exactly right.
And the goal here, just like on the software development world,
is for production to never be broken,
or at least for if production is broken,
having the ability to roll back quickly.
And on the data side,
the more this becomes the brain of the organization
and the more people rely on this, the more important it is that they have the same confidence in those data dashboards never breaking as the organization has about their main, you know, web-based product or the app that they have live somewhere.
Cool.
So, okay.
So if somebody is, let's say, a college student, I mean, you know, you can use your early self as an example. Someone is a college student, they're starting a small company, or they have a senior design project, and that involves sort of a closed loop with a lot of customers out there. What is the bare bones, or not bare bones, I'd say, what is the sort of cheapest way that a college student can get a pipeline like this and data ops off the ground?
The answer is definitely Meltano.
And specifically, you can run Meltano anywhere you like.
Like it's open source software.
We are building some commercial functionality around it, but definitely if you're a college student or you're working on a small startup,
you can download the code
and get a pipeline running on your local machine
in a matter of 30 minutes,
all the way to having a dashboard up and running
based on some data
that was previously just hidden in some SaaS API.
And we have a number of demos
and speed run examples of this on our website.
Running it on your local machine,
super easy and cheap.
Then if you want to actually run it continuously somewhere, you can use even something like GitHub Actions or GitLab CI to run that pipeline in a recurring fashion. If you want to host the
dashboard somewhere, you do need some way of spinning up a Docker container and exposing
that in a web interface. But that's where Amazon and GCP and Azure
have all of their own container scheduling functionality.
And you can even just take a Linux box somewhere
and use Docker Compose to spin up the web interface,
which is also something I have on my own home lab, for example.
Yeah, so what connectors would you recommend?
So we talked about Singer.
So if someone's making a video game, let's say,
so they're writing code in Unity.
I'm assuming there's like a Singer.
Singer has a way of getting,
you know, I guess, blobs of JSON
from Unity into a data warehouse.
I mean, I don't know well enough
where Unity would store that data.
If that data is on an API somewhere,
you can write a connector or one might already exist. If you can get that data into S3 or into some kind of FTP folder, then you can use the existing connectors for those.
But the expectation is that the data is currently somewhere where you can get it out of it, usually with an API or some kind of database query.
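To make the Singer model concrete: a tap is essentially a program that writes schema, record, and state messages as JSON lines to stdout, and a target reads that stream and loads it into the destination. Here is a toy sketch using the singer-python helper library, with a made-up stream; real taps, such as those built with the Meltano SDK, add configuration, catalog discovery, and incremental state on top.

```python
# Minimal toy Singer "tap": emit a schema plus a few records as JSON lines on stdout.
# A Singer "target" (e.g. a Postgres or Snowflake loader) reads that stream and does the load.
# Stream name and records are made up; real taps also handle config, catalog discovery, and state.
import singer  # from the singer-python helper library

SCHEMA = {
    "properties": {
        "id": {"type": "integer"},
        "event": {"type": "string"},
        "occurred_at": {"type": "string", "format": "date-time"},
    }
}

singer.write_schema("game_events", SCHEMA, key_properties=["id"])
singer.write_records("game_events", [
    {"id": 1, "event": "boss_hit_shield", "occurred_at": "2022-09-01T12:00:00Z"},
    {"id": 2, "event": "boss_defeated", "occurred_at": "2022-09-01T12:05:00Z"},
])
singer.write_state({"game_events": {"last_id": 2}})
```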
Okay, got it, got it, got it.
So, okay, let me take a step back here.
Okay, so someone needs to use like Amazon Kinesis or one of these things,
you know, you have a million users out there
who are all running something,
website, app, game, et cetera.
You need to get it in one place.
And so that will be outside of Meltano.
There'll be something and there'll be Kinesis
or there'll be some, you know, Kafka,
some type of like endpoint
where you could put this data in.
It doesn't even need
to be that low level.
Like there's a lot of,
you don't even need to pick
a technology like that per se.
There's a lot of products
that have libraries for iOS
and JavaScript and whatever else
that allow you to track
these user events.
And those tools, like Mixpanel or Segment, those all have APIs that you can pull the information from into your data pipeline. Yeah, okay, got it, I got it. Okay, so there's some, like, yeah, let's say, user breadcrumb. If you're a college student, go out there, punch in, like, you know, "user breadcrumb" and then whatever you're doing, like video game, app, whatever. There's some service that you can use that's free that will allow you to kind of put that data there. And then Singer will connect to
that and put it into a data warehouse. Correct. And for the data warehouse right now, the most
popular solutions in sort of real life production use cases are some of these paid products like
Snowflake, BigQuery, Redshift.
But you can start simple with just a Postgres database, which, like I just said earlier,
is not necessarily going to handle massive analytical workloads. But for your college
project, that's more than enough. And you can get a full data pipeline end-to-end with the dashboard
and pulling stuff from various APIs up and running, like I mentioned, in a matter of 30 minutes to an hour. But what I would suggest,
if this is something that interests you and you want to play around with it,
is to identify maybe not even a business source,
but something fun in your life.
Like if you do a lot of cycling,
you could use Strava
or you can use whatever other fitness app you're using,
or you could even start with tracking your personal finances
and see if your bank has an API
or a crypto platform, why not?
But if you find a data source in your own life, some kind of tool you use that has an API,
you can build a Singer connector for that using our Meltano SDK, and then build a pipeline with Meltano and Singer and dbt and some of these other projects I've mentioned to create a customized
dashboard for whatever metric in your life you're trying to track. But there's a lot of hobbyists using it for these sort of quantified life, quantified self
use cases, which is honestly a far more fun demo than yet another business platform or business
use case. Yeah, that makes sense. One caveat there, one word of caution, if you're going to
use this to track the value of your NFTs, you really have to watch out for those underflow
errors. An integer can only hold,
what is it, negative 2 billion. So once you start losing more money than that, you're in trouble.
What about on the dashboard side? You mentioned Metabase and Superset. Do you have a particular
preference, especially if you're a beginner? Is there one that you'd recommend more than another?
Yeah. So Superset and Metabase, they're both open source business intelligence solutions.
Superset, I think, is a little bit more mature when it comes to competing with some of these
paid BI products. And there's a lot of different visualization methods. But the user interface is
also a little bit more difficult. The learning curve is slightly higher. With Metabase, it's
really easy to start
exploring your data and seeing if you can pull some graphs out of it. And it definitely has
enough visualization functionality for any sort of hobbyist or student use cases.
But large businesses, we see using Superset more often. Smaller projects, Metabase is a really
great start. Cool. That's awesome.
And so Meltano is a platform that underpins a lot of these things. So you could start with Meltano and you start adding these connectors.
As you do, they become kind of pull requests that grow into this amalgam that you end up with,
where you can now run your app maybe yourself.
And then you can look at the pie chart and see your own data and say, okay, this is working, and now I can send this out to a bigger audience. Yeah, exactly. Meltano is the foundation, essentially, of your data platform. It is the project that lets you build this repository that then brings together all
of these components that you can add one by one as you need them, from the connectors to the transformation tool like dbt, and then to a visualization tool like Superset. And then what you end up with is one repository that holds every aspect of your end-to-end data story, which can be deployed as a single Docker container onto any sort of Docker-compatible platform, including your local machine if you're using something like Docker Compose.
And it's Meltano that standardizes
the configuration of all these components,
allows all of their assets and configuration
to be version-controlled,
and helps with the deployment of the entire thing as well.
Cool, that makes sense.
So Meltano, the company,
why don't you tell us a little bit about
how that got started?
I understand it was a spin-out of GitLab, but at what point was there sort of a decision made that, yes, this should be its own company?
How does a company decide on spinning something out?
And what was that story all about?
Yeah, good question.
So, like I said earlier, from the beginning, GitLab realized that there's a big opportunity here. This is a product that by itself could revolutionize a lot of the data industry and how people think about building their data platforms.
But there was always this idea of either this is going to be an internal business unit, a second product of the big GitLab company, or it will spin out at some point to kind of go its own way. And by the time that Meltano was starting to show the kind of traction and growth
and the community activity that warranted really growing out the team again and spreading its wings,
GitLab was a 2,000-person organization
where every single person, except for myself,
was working on this one product,
one customer persona, one everything.
And there were all kinds of tools in place
that are really appropriate for a 2,000-person company,
but not for a tiny little open-source project
that was still before product-market fit,
needed to start building out its team,
needed to figure out a business model
around this open-source technology.
So we realized pretty quickly
that a lot of the process
and administrative overhead in GitLab
that was appropriate for a massive,
you know, about to IPO organization
was holding us back, slowing us down.
And from GitLab's perspective,
the estimation was made that the value of Meltano
as a spun out company in which it maintained the stake
would be larger in terms of expected value over time
than keeping it inside
and sort of limiting its ability to spread its wings and grow.
So this was really, yeah, a decision between myself and the GitLab CEO, Sid, who I referred
to earlier.
And considering that we had seen some of the difficulty around hiring people, and the types of compensation that are typical for an early startup versus a later-stage company like GitLab, that made us think this was the only route.
And that's when we started looking for outside funding
to get a seed round together,
which we completed in June last year.
So we're now a year and two months or so
into our independent journey.
And we've grown to 17 people right now.
Like I mentioned earlier,
we raised funding again earlier this year,
just before the sort of market downturn,
so that was really good timing.
I was wondering about that.
And now we're in a position where
our runway extends into 2024
so we've got a good amount of time to
work out exactly how we can
convert a good amount of our open
source user base into paying customers
as well because we love
the open source and the fact that this is by the people, for the people.
Engineers can help it out.
The barrier to using it is super low.
So a lot of our audience today will just be able to give this a try.
But in order for that growth and reaching as many people as possible, we, of course,
need some money to keep building this too.
So the commercial open source challenge is one that we have to focus on in the coming
months. And we
are going to be launching a managed
version of Meltano for those people
that don't want to self-manage their deployments.
They don't want to learn Terraform or Kubernetes
or Docker. They don't want to have
to deal with how do you have a
production environment and a staging environment?
How do you have feature branch deployments?
That's all stuff that we can automate away
and charge for,
which will be our sort of first foray into being a commercial open source business
more than just a commercial open source project.
Cool. That makes sense.
Yeah, I feel like liability is an area where if you're an open source project
and you can reduce liability, that's something that companies will pay for, and simultaneously, you know, hackers and individual developers aren't as interested in, so you're not really taking a lot away from them and you're still providing a lot to the people who want that. Yeah, exactly. There's a big difference between what small teams, organizations, individuals need
and the sort of requirements that every enterprise
will have.
Any kind of company larger than 100 people or so will need different stuff than the open
source.
And you could reimplement some of that stuff yourself if you really want to go with the
open source and the self-managed approach forever.
But of course, there's a lot of things we can make significantly easier than you having
to hire your own engineer to build all this infrastructure around Meltano
if we can handle it for you.
Yeah, definitely.
Cool.
So is Meltano hiring?
If so, are they hiring like full-time, intern, both, neither?
What's the status there?
Good question.
It's not a full hiring freeze,
but we have slowed down our hiring plan a little bit
considering the broader climate.
We do have two roles on the meltano.com/jobs page. One of them is an SRE, a platform architecture SRE, SRE standing for Site Reliability Engineer, who will help us build out this managed platform into something super reliable that we can build the business around.
And we are also looking for a UI and UX designer because so far we've had a very
developer first approach where everything is in the CLI command line interface and
YAML files, and we want to invest more on the user interface side of things as we build out this web-based sort of interface around Meltano.
And this is full time. It's all remote.
So if this sounds like you, then a
great way to sort of show off your skills is through the Open Source project. But if you
already meet some of those requirements of the roles we're looking for, then we'd also love to
talk. But generally, I would recommend that you join our Slack community, which has more than
2,600 people right now, which you can find through meltano.com/slack, which is where you can
learn about Meltano, ask questions from other experts in the space, and also get ideas. You
might want to contribute yourself to make Meltano even better and start essentially building that
portfolio of real-life code that actual companies use to power their data pipelines, which
might get you a job at Meltano or any of Meltano's users, of which, like I said, there are thousands.
Very, very cool.
Any other places that people should go to
if they're interested?
I mean, we'll definitely post the Meltano website
and the Slack.
The GitHub repo shouldn't be missed.
github.com/meltano/meltano. That'll find you all the code.
And also, if you're curious about all of the data sources
that Meltano supports, you can find these on the Hub, which is at hub.meltano.com, which has more than 300
different SaaS APIs and databases that Meltano can load data from or put data into. But that's
a good start if you want to figure out your first data pipeline and whether you have some data in
one of those SaaS applications that you want to build some dashboards around.
Cool.
And if people want to communicate about Meltano, I guess the Slack you mentioned seems like
a clear place.
Is there a presence on Twitter or any other social media?
Is Slack really the place that people should be on?
If you want to talk with people about Meltano, then Slack is the place to be.
Of course, we have Twitter as well,
twitter.com/MeltanoData,
where you can learn about what we're up to
and the new releases we have
and just chat about Meltano with the community.
But definitely, if you want to speak to the experts,
then Slack is the way to go.
Cool. All right, Patrick,
you don't need to get a Twitter account.
You're off the hook.
You just need to get a Slack account.
Actually, Slack isn't an account thing, right?
It's an account per workspace.
Per workspace.
Yeah, exactly.
So if you go to meltano.com/slack,
it will take you through the signup flow
where you create a Meltano-specific account
to interact with all of us.
Cool.
Great.
Yeah, folks out there, definitely do that.
Definitely check out the repository.
This is a no-brainer,
something you can
set up yourself easily. We talked earlier about EKS and how Amazon has an incredibly generous free tier for college students, so you could easily run Meltano on, like, a Kubernetes cluster of t1 micros or something. Totally doable. And yeah, if you do use Meltano for anything, definitely shoot us an email and we will pass it along, or we'll even post it on Twitter and, you know, kind of highlight what you've been up to. It'd be really great. It's always great to see people using the technology that we talked about on the show. Yeah, we'd love to see that. And like I said,
I would not be where I am today
if it wasn't for all of this open source technology
that was available from a very young age
and being able to become really great
and hireable in a field
just through your own perusing of the internet
and open source projects
is an amazing way into the tech industry.
So reading open source code,
contributing it
and making the most of all the free content online
is the best possible start to a tech career
as far as I'm concerned.
Yeah, that is really special.
Any last words for the audience out there?
Any last bits of advice?
Let's say somebody went to college not for CS, so they went to college for economics or something like that. How would you recommend that person get into the field? Yeah, so for me it was always just a matter of picking a problem in my life, some kind of itch I wanted to scratch, something that I knew could be done with code but didn't exist yet, and just not giving up until it was done, and learning all the technologies along the way. Yeah. But that has definitely become more complicated, because when I started, all you needed to build a website was HTML, CSS, and PHP, and now you've got to learn 20 different JavaScript frameworks just to get, you know, Hello World to show up. So I would these days strongly recommend following some kind of course, but there's a lot of amazing free content as well, just so you start off feeling a little more confident, rather than every website you find mentioning five terms you've never heard of before and getting discouraged that way pretty quickly. But in terms of just writing really great code, learning from the best is one of the great things about open source, and you can become a really great engineer just by reading code other people have written and seeing how it's done. Yeah, that makes a ton of sense. So yeah, I think we can put a bookmark in that.
Thank you, Douwe, for coming on the show. We really appreciate it. Really interesting episode.
I think it's a new topic. It's on the vanguard. It's something that people are going to be,
it's going to be sort of a household name in a few years. So we were able to catch this really early, which is really special.
And thanks, everybody, for supporting us out there on Patreon and through Audible.
Thanks for subscribing to the show.
The subscriber count is growing a ton, which is amazing.
I actually am guilty of not doing very good data ops.
So I honestly don't know why the subscriber count is growing.
That's on me.
But I should definitely get a Meltano instance up and running so I can figure this out.
But regardless, we have a lot of folks, new folks to the show and welcome.
I really appreciate all of you kind of coming in, listening, sending us emails, offering
your support.
And we will catch everyone in two weeks.
Music by Eric Barndollar. Programming Throwdown is distributed under a Creative Commons Attribution license.