The Data Stack Show - 49: MLops - The Finalization of the Data Stack with Ben Rogojan of Facebook
Episode Date: August 18, 2021
Topics in this conversation include:
Ben's background and his shift to data engineering (2:19)
Trends in the data space: finding the most efficient tools, the Snowflake phenomenon, and keeping up with... new functionalities (5:33)
Key differences in data practices in small companies and Facebook-sized companies (12:38)
Having to build tools specifically designed for Facebook because of SaaS product limitations (16:00)
Team structure at Facebook (18:17)
Developing more robust systems that are resistant to pipeline failure (19:50)
Defining data stacks (24:01)
A sample data stack for a young company (28:37)
Why Redshift and Snowflake have trended in the opposite direction (33:02)
BigQuery and Snowflake comparisons (36:06)
MLOps and whose responsibility is it (39:12)
Feast, Tecton, and feature stores (45:40)
Having a good community around an open source product (49:30)
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are
run at top companies.
The Data Stack Show is brought to you by RudderStack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome back to the show. Today's guest is Ben Rogojan, and he is known online as the Seattle
Data Guy. Many of you may follow him on Twitter. He has lots of followers, and he has done a lot
of work with data in his career, and he has a really interesting set
of things that he's doing now.
So he works as a consultant.
So he helps companies figure out their data stacks.
And then he's also a data engineer at Facebook.
And so that leads me right into my big question.
You know, I love asking consultants about what they're seeing on the ground just because
they have a wide field of view.
But I want to hear about the difference between what he sees on the ground
as a consultant with smaller companies and then what he's dealing with at Facebook.
Facebook is one of the FAANG companies.
It's so large that I think a lot of us have a hard time comprehending
just even the types of problems that they deal with.
So I just want to hear him talk about the differences there.
And I think it'll be really interesting to hear
just about the difference between those
two experiences.
Well, I don't think that I'm going to differentiate much from your questions, Eric.
I'll just maybe go a little bit deeper on the technical side of things.
But yeah, I find it very, very interesting to see what the differences are between
an organization like Facebook and the rest of the companies out there. And I think that Ben is the perfect person to talk about that stuff.
Great. Well, let's jump in and chat with Ben.
Let's do it.
Ben, thanks so much for joining us on the Data Stack Show. We've followed you on social for a while,
love your content, have learned a lot from your content, and are really excited to have you as a
guest on the show.
Thank you. Thanks so much, Eric. I really appreciate you guys having me on today.
Cool. Well, let's start where we always start: tell us a little bit about your
background and what led you to what you're doing today, and then talk about what you are doing
today.
Yeah. So just to give a background of where things started in my
journey in data, I guess: it really started back in college, before I even knew
data engineering, or even data science in some regards, was a thing.
And I took like an epidemiology course. And then I think I was also taking some computer
science courses at the time. And I was kind of enthralled by like just how you could use data
to, like, just drive results. Like you learn about John Snow in epidemiology.
I think that's the first thing most people learn about, how he kind of
used data to figure out cholera and things of that nature.
And I was like, man, if only there was a way you could combine
statistics and programming.
And then I think it took like three more months for me to figure out that
that was a whole kind of rising, I think field at the time, right?
Like 2012 was the year of that whole Harvard article coming out about the sexiest
job being data science. And so that's where I originally started was like, I was like really
into data science. And then eventually I think as I started working and kind of maturing and
figuring out what I liked, I tended to like more of the engineering side of things.
So I started flowing more towards the like data engineering, data architecture side
of building things. And so then from there, it was like working at some healthcare companies and startups, working on their kind of data flows and data stacks
and data pipelines. I at the same time had like started a consulting company in data that was,
again, initially started in data science, but kind of shifted over to more data engineering
and data architecture, and then eventually shifted over into working in big tech at Facebook as a
data engineer. So that's kind of been the whole kind of quick flow for me from school to kind of career.
And then again, now I'm kind of doing the Facebook thing while also having a consulting
company that I've been operating for a little bit at this point.
And we've got a few clients where, again, we kind of just help them develop their whole
data stack and whatever that, again, we can kind of discuss what that means here in a
second, but helping them figure out what tools fit best for them, whether
it be data viz, data storage, kind of going through different options, especially right
now with so many tools that are really coming out.
And I think changing a lot of the game, Snowflake's been out for a while, but I think the more
and more I play with it, the more and more I see people use it.
So things like that, just other data warehousing tools, tools like RudderStack and things of
that nature.
So yeah, I think that's kind of where I'm at right now, where I'm really enjoying both of those
fields or both of those kind of roles and getting a chance just to see a lot of different perspectives
and how people are using data and trying to use data.
Awesome. And it's interesting starting out
in the healthcare space: there are lots of data concerns there that are the sharp end of dealing
with some of the issues around governance and compliance.
And so I'm sure that was just a really good
entree into dealing with some of the more difficult challenges around data.
I'd love to dig into actually both sides of your work, what you're doing as a consultant
and then at Facebook. We've talked with several consultants on the show before, and we love
hearing about what they're seeing on the ground.
As a consultant, you get a breadth of view across many companies who are trying to solve
problems that are similar or at least contiguous in the data space.
So why don't we start with what are you seeing on the ground?
What are trends?
I mean, you mentioned that more and more companies are using Snowflake, but even outside of tools, are there architectures, team structures,
problems or solutions that are really interesting that have popped up over the last couple months?
Yeah, I mean, like, I think one of the big things that I'm seeing is people trying to readjust,
or trying to figure out how to do more with less or it's becoming, I think, for most companies,
just apparent that data is growing at a rate that if you were to continually hire data engineers
at the rate required based off the current solutions that you might be using, it will
become just vastly expensive to pay enough data engineers to manage all of the various
data sources that you have, right?
I mean, small companies these days that are in the, you know, eight-figure, seven-figure
range will have 20 to 50 data sources, possibly. And hiring one data engineer for that,
who's going to be 150 to 200K depending on where you are, is barely feasible. And having
to hire more is not. So I think a lot of companies are looking to different solutions in that
regard, whatever that might be. Again, Fivetran's a popular one, or a different one;
it depends where in that data flow
you're looking to fill in that role.
But I think that's one thing is like people
are trying to figure out how to do more with less.
So it's not that, in a weird way,
I think some people feel like data engineering
is going to go away in that regard.
But in my aspect, it's just like,
you're just going to have to,
data engineers are just going to be more capable,
kind of like how software engineers,
I think have been amplified over the last couple of years with like cloud becoming more
efficient in terms of how you can actually like deploy code and things of that nature
much easier rather than having to like get a server and have like five people just to
stand up a little bit of code.
Now you can have one or two really solid engineers kind of manage a whole flow.
So I think that's kind of where things are shifting in data is we're trying to figure
out how do we have one really solid data engineer manage a lot more with tools and the right solutions rather than spending tons of time putting together patchwork systems built up of like Cron six combination of clients and proposals that I've written all around that. So that's been like interesting in terms of more
like tools that people are selecting. I think it's just one of those interesting things because
Redshift I think has had this like kind of started us off. And I think then BigQuery and Snowflake
have kind of had this like quiet popularity where even if they're doing better marketing or I don't
know what it is about those two that tend to just get decent traction,
at least in people's minds.
So I'm still trying to figure out that whole thing
more from like a perspective of like someone
constantly coming into these projects
and seeing those same two things
rather than seeing something like Redshift
or like, well, Microsoft products
kind of been a little newer.
So that could also be a reason
why Microsoft's a little behind.
But yeah, why those two seem to be doing kind of well overall is something else that I'm
kind of thinking about and trying to figure out how to kind of work with as a consultant.
Yeah, it's super interesting.
You know, and I'd love to, before the show, we chatted a little bit about ML.
And so I want to dig into that subject a little bit later in the show.
But we're seeing a lot of companies leverage BigQuery for some of the native ML functionality,
which is super interesting. It kind of, in some ways, allows teams to punch above their weight
class when doing certain things. But one final question on what you're seeing, and I agree,
it's super interesting. The idea of data engineers going away is a fascinating conversation in and of
itself. There's no way that's going to happen, of course, in my opinion. But I'd love your opinion on what I will call the gap between companies who are
actively figuring out the frontier of data using these new tools and processes,
and then a lot of companies who just
don't even know that some of this stuff exists. And it seems like that gap is widening and the companies that aren't adapting, it seems like they will really, I mean,
depending on the business model, they'll really struggle just because operationalizing data is
really the way that, you know, we shape the experiences that modern consumers have come to
expect. Do you see that happening where there's a lot of companies where it's like, man, I've never
even, I didn't even know you could connect all this stuff.
Yeah, no, I do kind of get that feeling where it's like, and I think part of this is due to the fact that there's kind of information overload, right?
Like it's like, what is the right product?
So I think sometimes people might be focused on older products that have existed forever,
and they're still looking at that as their solution. Or maybe they're viewing the products that they've known forever, seeing the limitations that they had maybe 10 years ago, and thinking that they still have those limitations. And I think that's maybe what's holding people back. Even with Tableau, I remember there was a feature, I don't remember which one, that they had recently added and that for a while I didn't even realize they had added. But it was one that was pretty pivotal in terms of whether you would pick Tableau
as a good data viz solution or not.
And yeah, if you're not staying on top of all of the stuff that's changing in
every space, I think that's what can really keep people behind.
I did a video on like a data engineering roadmap.
And I kind of had a joke within the first five seconds, or maybe it was like 15 seconds, where I picked a picture, an infographic, of the data space,
tool-wise, right? It was one of those VC kind of infographics of
all the tools broken down by space, the ones that all the VC firms have been making to visualize their
architectures. Yeah. But you literally couldn't tell what you were looking
at. There were just so many.
And I think that's, that's currently the, the, one of the major problems
where like data ingestion alone has like 50 tools you could pick from.
Right.
And they range all from like tools that have been around since like
1995 and tools that came out just last year.
And so like, I think that alone can make it very difficult in terms of like
knowing what exists and
knowing what each of these things do and why you might want
to use one or the other. So I think that's one of the
big things: it's just so hard to keep up. And then the
other thing is, I think a lot of the bigger companies
are catching
up, and some of the more mid-size and small companies
are where the big gap
is. There's this, like,
some of them like either are reading all the Medium articles
or whatever they're doing
and then they'll contact me
and be like, we want machine learning
and they're too far ahead
or something like that.
And then there's some people
who are just so busy
in their day to day
and in their ops
that they just don't feel
like they have the time to keep up,
or maybe they don't even feel
like it could help them.
So I think that's another hard place
for some people: they don't realize that maybe there's something that could
help them. So I think that's probably one of the gaps.
Yeah. Super interesting. Okay. Let's change
gears a little bit. So that's what you're seeing on the ground on the consulting side of things,
but you also work as a data engineer at Facebook. And what really intrigues me about that is you
get to work with companies as a consultant
who are just orders of magnitude smaller than Facebook and then you also get to see,
okay, what does this look like at scale? And I know we could probably do a whole episode on that,
but I'd love for, especially the people in our audience who are maybe working at
a company in the mid-market or even a small enterprise company, to hear what are the
unique things that you've experienced as a data engineer at a company that's the size of Facebook,
a true international enterprise with a massive engineering team?
Yeah, I mean, I think you just stated one of the
major differences right there: even, like you said, a mid-size company or some large Fortune 500 companies have an engineering staff of, you know, 2,000 people, maybe, for a larger Fortune 500.
And if you're small to mid-size or something like that, maybe it's a few hundred people that are part of your engineering staff.
And then you're trying to compete with companies like Google, Facebook, Amazon that have engineering staffs of 15, 20, 30,000 engineers, right?
And 5,000 of them might be focused on more of the enterprise side,
developing enterprise systems and architectures and
just services in general that make your life so much easier if you
work at those companies. Versus, again, at small and mid-cap companies,
you're likely either relying on pre-bought products
or maybe you're trying to put out
your own internal solutions.
But obviously it's just hard to commit the same time,
the same amount of time towards this.
And then you add in the fact that many of those mid,
small and even larger Fortune 500 companies
have been around so long.
So they're still relying on like source systems
and operational systems that are maybe
super functional and super developed amazingly, but maybe their analytic systems are just antiquated
or developed in such a way that it was developed for 20 years ago, data sets or something of that
nature. So then you're also having to remigrate and do all this other stuff just to get yourself
to a point where you can actually build systems that act like a larger company like Facebook, Amazon,
and so on. And so I think that's been the big difference I've noticed. And I've talked to
people like, I don't know if you know Veronica Zhai from Fivetran. She comes from finance
and banking at JP Morgan. And she's brought that up as well, where their ops side
is amazing, but their analytics side was pretty terrible when she first started. Yeah.
And she spearheaded that whole kind of development on developing kind of their whole ETL and
whatever.
And that's what eventually brought her over to Fivetran was like that whole like dealing
with all this terribleness and then seeing Fivetran as a possible solution for herself.
So that's like one kind of area that I think companies are going to continue to deal with,
right?
Like you've dealt with these systems and they work well, but they're old or they're not, they're not as
easy to work with because they were developed for a different thing. Yeah. I think that's one
of the major differences. Yeah. That's interesting. And I want to just touch base on one thing you
said that is, is pretty mind blowing. You said maybe you have, I mean, I know you're just
spitballing here, but 5,000 engineers who are focused on building tools that make the engineering team's life better.
And it's just crazy.
You know, there are companies that IPO with far fewer total employees than the number that are working on a very specific set of things inside of Facebook.
And that scale is just mind-blowing. I'd love to know,
I mean, just out of personal curiosity, but I think for our audience as well,
at that scale, you're probably building a lot of things internally because you outstrip the ability of even what we would say like enterprise-grade SaaS products can manage at scale.
And I know that different teams for
different use cases will probably use like various SaaS products, but I would be surprised if you
didn't have a lot of homegrown solutions just because there isn't SaaS that's built to manage
what you're facing because not a lot of companies have faced that before.
Yeah. And I think that's something that when you join Facebook, they kind of bring up
just like you think about when Facebook came about, and you think about Amazon and AWS and when it was developing its things: there was no AWS, or not to the same degree at least, when Facebook was dealing with their problems, right?
They had to develop all of their own solutions, essentially.
So, yeah, I mean, I think that's the thing.
There weren't even options, to some degree. And so a lot of companies, especially ones that deal with that
size of data, have to develop their own, whether it's Google, Facebook, Amazon. And I
think that's why Amazon developed their products originally: for themselves, and
then eventually realizing they could sell them and develop their own cloud service.
But yeah, it's a combination of both. Facebook probably can build their own, or at least better integrate.
I think it's another thing, right?
Like regardless of the SaaS product, there's usually
limitations to integrations, regardless of how well they're developed.
It's just hard.
And you're always going to be limited by the SaaS provider, right?
Like if you're working with Salesforce, you're only going to be able to do so much.
Like it's pretty customizable, but there might be a point where you're like, oh,
I just wish I could do this one thing.
And there's no, there's no engineer you can go out to and be like,
Hey, can you do this thing for me?
But if you build it yourself, you've got a whole team.
You'd be like, Hey, can we get this feature going?
And at least that's a little more feasible, obviously then you've got
other problems with internal people making choices on what features to work
on, but at least there's a little more control where you can go to that team and be like,
hey, could we get this feature?
We think it really would change the workflow.
So I think that's another reason.
It's not just about scale.
It's also about having that ability to integrate
at a very different level than most other companies.
This is great, Ben.
Can you give us a little bit more information
about the teams of the engineers
and what the work looks
like in a company the size of Facebook, especially for data engineers?
I guess I'm curious on what
specifically you're looking for.
Yeah, it's more of an organizational question, to be honest. I mean,
we are used to thinking of data engineering teams as small teams relative to the
product engineering teams that you usually have, right?
And of course, they don't reach the scale
of what Facebook has.
How does the scale affect how
the teams are structured?
That's the essence of the question.
Yeah, I mean, I'm gonna guess it's very similar
to plenty of other companies.
Oftentimes,
I think Facebook tries to support a product with a team of data engineers, right? That way you've got a good integration with
both sides, both the software side and the analysts and data scientists side. So depending on how big
the product is, that could change how big the team is. But overall you're trying to support some
product with some data engineering team. That way it can be one pipeline where
they have a good relationship with the software engineers, they have a good relationship with their
XFNs on the other side, and everything runs smoothly. I mean, I think there's always going to
be a problem, and this is something I see regardless of whether you work at
Facebook or other companies: software engineers, I think, always tend to be focused on functionality,
in terms of we want to make sure it works, and care less about data. I mean, obviously
they need data in terms of making sure
their product is up to date, right?
Like if someone clicks or posts something, they want to make sure that
information gets stored, but logging and things of that nature is kind of secondary.
Right?
Like if the, if the product works, do you need to log things?
So I think that's generally the one interesting thing that I'll often see.
Yeah.
That's super interesting.
Uh, do you also, I mean, we have in our minds
that like data engineers,
they are mainly building and maintaining data pipelines, right?
What else do you see getting done by the data engineers?
Like, do you see them like building
like internal tooling, for example?
Is this something that you see?
And if yes, how is this managed?
And like how much of the work of the data engineer in the future you think it's going
to be something like that?
But yeah, no, I think there's definitely always going to be kind of a need to build internal
tooling to kind of abstract as much as you can away in terms of building data pipelines,
right?
Like trying to, again, it's being a balance between abstraction and building maintainable
systems. But I think
that's always kind of a goal in general of data engineers is not just to build pipelines, because
they could build that with some Python scripts, but also figuring out how to build more pipelines
more effectively in terms of, right, like if you have to manage 1000 pipelines, how can you manage
1000 pipelines easily, because that's, it doesn't take much for a pipeline to fail. Like I think,
I think that's the one thing I found interesting about, regardless of the company that I've worked at is it doesn't take much for most pipelines to
fail, it could be one column changes, one data type changes and, and regardless of
how much you maybe make some component inside that whole pipeline, maybe a little
more robust, so it doesn't get impacted.
There's always somewhere downstream that maybe does get impacted.
Maybe a table
in MySQL or something of that nature or
in your data warehouse.
So I think trying to figure out how to
develop systems that are more robust
in that sense or at least can
make it easier to manage when things do
go wrong or provide
better notifications, whatever it might be.
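To make the point about one column or one data type change breaking a pipeline concrete, here is a minimal sketch of a pre-load schema check. It assumes rows arrive as Python dictionaries; the table layout, the column names, and the `find_schema_drift` helper are all hypothetical and only illustrate the idea of catching drift early and reporting it clearly, not any particular company's tooling.

```python
# Hypothetical expected schema for an illustrative events table.
EXPECTED_SCHEMA = {"user_id": int, "event_name": str, "amount": float}


def find_schema_drift(rows, expected):
    """Return a sorted list of human-readable schema problems found in rows."""
    problems = set()
    for row in rows:
        # Columns the pipeline expects but the source no longer sends.
        for column in expected.keys() - row.keys():
            problems.add(f"missing column: {column}")
        # Columns the source started sending that nobody planned for.
        for column in row.keys() - expected.keys():
            problems.add(f"unexpected column: {column}")
        # Columns whose values no longer have the expected type.
        for column, expected_type in expected.items():
            if column in row and not isinstance(row[column], expected_type):
                actual = type(row[column]).__name__
                problems.add(
                    f"type change in {column}: "
                    f"expected {expected_type.__name__}, got {actual}"
                )
    return sorted(problems)


# Made-up batch: the second row has a string user_id and an extra column.
batch = [
    {"user_id": 1, "event_name": "click", "amount": 9.99},
    {"user_id": "2", "event_name": "post", "amount": 5.0, "source": "web"},
]

for problem in find_schema_drift(batch, EXPECTED_SCHEMA):
    print(problem)
```

Running a check like this before loading means the failure shows up as a readable notification at the edge of the pipeline rather than as a broken table somewhere downstream.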
I think it's kind of a role of a data engineer
because we tend to know what we'll need. Again, I think it does depend on the company you work at as well.
I think if you work at a larger company like Facebook, you've got, again, tons of software engineers that are
probably building a lot of those products. If you work at smaller companies elsewhere, even places like
Lyft, from people I've talked to, you tend to play a little more of a software engineering role and
not purely focus on just data pipelines. So yeah, I think it also just depends on the company and how much support you have from software
teams that maybe purely develop that kind of data instrumentation or something of that nature.
Yeah. Yeah, I think that was, by the way, an excellent point, what you said about pipelines
failing and that's like regardless of the size of the company. And I think that's also a space where there are many opportunities
for products also to probably be created,
exactly because the concept of observability, let's say,
that we used to have in typical software products,
it's not so well-defined when it comes to data pipelines and data in general
and probably needs some kind of like different approach. So it'll be super interesting to learn more about how do you
manage this problem and like what kind of lessons you have learned from like trying to do that.
But before we go deeper into this, a couple of months ago, we had another episode with someone who came actually
from Facebook. He was working there and his name is Ivan. And he left to start a company called
Slapdash. And actually he took what he learned inside Facebook and the problems that he had to
solve there and in a way productize it, right? Like he came up with a product. I don't want you
to tell me like exactly,
but do you have the feeling
that we might see something similar
coming out from Facebook
also for data-related products?
Yeah, I mean, obviously there's multiple reasons
I probably can't speak to that.
Yeah, I mean, I think overall the answer
is I have no idea, right?
Like I kind of said earlier, right,
Amazon built all this
stuff internally and then started selling it. Facebook, in a sense, has built a lot of similar
things but has never sold it. Why, I'm unaware of. It's not part of my purview. Also,
even if it was, I imagine that would be something I would have to double check with someone before I
would say anything.
Yeah, of course, of course. Makes sense. Okay, cool. So
let's get a little bit more technical. And let's discuss a little bit about data stacks. And based
on your experience, because you have experienced like both extremes, probably through your
consulting career, but also like working in a huge organization like Facebook, you have seen many different, let's say,
versions of what we keep calling data stack.
So based on your experience,
first of all, what is the data stack?
What part of the software that the company is using
to operate should be called the data stack of the company?
Sure. I mean, I think it's, again, like you said,
it's pretty broad.
If you're talking purely about maybe the analytics data stack, I think you
start with raw data and you go all the way to like data storage, data viz, and maybe
some light data analytics.
I mean, if you want to include some ML stuff in there, you could, if you really want to
go that far.
I think it's definitely like that tip of the iceberg kind of data stack stuff.
So that's why I'm not as focused on that when I refer to data stack.
Yeah, I really focus on that: raw data, data ingestion, data storage, data transformation,
and then some sort of data viz or whatever your final data product could be. Because I think
there's data viz, but I also think data products don't always have to be a dashboard.
I think there's plenty of examples of like, other forms of reporting that you could consider
kind of your final product.
I think one of the things I've been recently trying to work on is building something like
NerdWallet has this calculator that is a cost of living calculator.
And they basically scrape a bunch of information from different sources, put it all together.
And then you can now put in, I want to move to LA.
I'm currently making 150K and it'll calculate how much you should make.
And it kind of gives you some other information about like how pleasant
it is to live there based on like some walking scores and other information
you can pull from APIs and then like cost of rentals and things, other things
that they've kind of pulled and aggregated together.
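As a rough illustration of the arithmetic behind a cost of living calculator like the one described, here is a minimal sketch. It assumes the simplest possible model, scaling a salary by the ratio of two cost-of-living indexes; the index values are made up, the `equivalent_salary` helper is hypothetical, and none of this is claimed to be how NerdWallet actually computes its numbers.

```python
def equivalent_salary(current_salary, current_index, target_index):
    """Scale a salary by the ratio of target to current cost-of-living index."""
    if current_index <= 0 or target_index <= 0:
        raise ValueError("cost-of-living indexes must be positive")
    return current_salary * target_index / current_index


# Made-up index values (100 would be some baseline); not real data.
COST_INDEX = {"current city": 150.0, "Los Angeles": 165.0}

needed = equivalent_salary(
    150_000, COST_INDEX["current city"], COST_INDEX["Los Angeles"]
)
print(f"Comparable salary in Los Angeles: ${needed:,.0f}")
```

In a real data product the indexes would themselves be the output of the stack: scraped and API-sourced inputs such as rental costs, ingested, stored, and aggregated before a calculation like this ever runs.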
So I think that's also less of a data viz and more of a data
product side, where I put that in the data stack as well, because
it's part of the whole flow.
So yeah, I think those are the steps that most people will reference, and what I consider it, right,
In terms of like, what is the data stack?
And, and it's, and it's so broad in terms of like, even starting all the way to raw
data, like, what does it mean?
It's like, well, that raw data can come from everywhere, right?
Event logging, SFTPs from other companies and like external files from other companies,
scraping things from online,
pulling government data from online,
pulling from your various APIs and marketing tools
and Salesforce and things of that nature,
streaming data that maybe you're getting in.
So it's just so broad that like even that alone,
it's like, that's where it all starts
in terms of like complexity
and like probably the hardest part and why so many
companies I think right now are focused on data ingestion layer because it's like if
you can do well and develop a good product for data ingestion, you'll do okay in terms
of a product.
Yeah, it makes a lot of sense.
Did you see something changing on the definition, the broad definition of data stack based on
the size of the company?
I think in a weird way, a lot of companies are getting access to a lot of
tools that they never had access to before. Like, I don't think the term data stack was used the same way
for smaller companies up until now, at least from what I feel, right? Like at a lot of companies up until now,
you had your 30 Python scripts, or shell scripts, or whatever you prefer to script in, that you managed in cron, and it worked
fine because you only had four or five data sources. Now that companies, like, all of their
products are SaaS, and all of them have APIs, or at least a good portion of them have APIs, being able
to switch over to some more
well-defined components, I think, is what I'm personally seeing more of a switch to.
We can actually switch over to something, just so I don't say the same product over and over,
like Rivery.io or Airbyte. I think those are two other tools that are looking to
fit into the data connector, data ingestion layer. You can use those instead of having to, again, develop a bunch of custom scripts.
Because again, I've created four Salesforce connectors
in my life already, right?
So yeah, it's like we've all created the same things
over and over again.
And it makes a lot of sense that someone tries to sell that
and productionize it.
So yeah.
Yeah.
So if you were like to advise, let's say, a young startup that has their first customers,
they create some revenue, they have to do some reporting, they have to understand a little bit
better how their customers interact with their product. What would be, let's say, not an ideal,
but a data stack that would make sense for a young company.
And the reason I'm asking that is because many times
we tend to over-engineer solutions, right?
It's not like you need to operate a Kafka cluster
just to move your data around, right?
And it's a very common mistake that people make.
And it costs a lot, both in technical debt and time.
And at the end, they end up with results that are pretty much noise, right?
So can you give some advice how you would structure, let's say, the data stack again
for a young company?
Sure.
I mean, I think raw data, it's, again, hard to say how you're going to pull that all just
because it depends how you've developed
or what tools you use. But yeah, beyond that, right, I think tools like RudderStack or
Segment can work well in terms of trying to log and just getting a lot of that information
out there initially and getting it to the right place. I also usually
switch between Fivetran and Airflow, depending on maybe the company's technical knowledge as well as
maybe price sensitivity. I think, for example, Fivetran can end up being very expensive, but if
you can afford it, or if it really does help you because you've got enough data sources that
you're trying to manage, that can be very helpful. But I also think Airflow is kind of great
because it's overall decently simple and you can automate pretty well. I think the one hard thing there is some people have a hard time managing
Airflow because it does tend to be a little bit finicky for some people,
but that tends to be more of the coding side of what I'll use rather
than trying to develop my own thing.
In terms of data storage, I think at this point I will
probably switch between BigQuery and Snowflake. I think those are kind of
two favorites. Or if they're not using tons and tons of data, like if it's really small, I'll even use
Postgres, just because if it's small, it's like, okay, look, this is fine.
We don't need to do anything crazy or go pay huge costs for crazy optimizations.
But if you've got a lot of big data, Snowflake and BigQuery are great. Also, Snowflake is just,
I like to say, I feel like Snowflake is the Apple of
data warehouses.
It just has this feel to it.
Like, I don't know why I like using my Mac or my Apple in terms of my laptop.
I just do.
I don't know why I like using Snowflake compared to some of its counterparts.
I just do.
It's easier.
I don't know why.
I'm still sitting here like, I don't know what it is.
I just like it better.
Maybe it's just the branding. I don't know. They've got something there.
And then data viz. I think I still
generally, as much as I think Looker is what is often named as the modern data stack
tool, I think I still just prefer Tableau's usability. It's just so much easier to build anything very quickly. And if you know what you're doing, I think it's fine.
I think Looker, the one thing is it has LookML, not machine learning, but it has its models and
things of that nature that you can kind of define things a little more, which some people prefer.
And I get that. But I think if you're safe with Tableau, I think Tableau's just got
easier usability and you can build up something so quickly.
And yeah, that's generally what I still
prefer in terms of data viz.
Yeah. Makes sense. Makes sense. I actually found it very interesting
that you mentioned Postgres. Especially when I talk with young companies, the first
thing that I ask them when they're trying to figure out their data stack, let's say, is, okay,
are you a B2B or a B2C company?
And that makes a huge difference in terms of at what stage of the company data,
especially the volume of the data, becomes an issue.
Like a B2B company, I mean, you can pretty much grow a lot
and still just use Excel documents in some cases,
especially if you're focusing on large enterprises and stuff like that,
which is, of course, completely different compared to building a marketplace or building an app
like DoorDash, right?
Even at the early stages of DoorDash, the amounts of data generated might be
huge.
And it's a very common advice that I also give, like, just use Postgres.
I mean, Postgres can scale to quite a lot of data without having to go and get yourself into using something like
Snowflake, which, of course, you put it very well. I think that this parallel with
Apple is amazing. I love it. It feels nice. But at the end, it's like, I mean, come on, dude, you don't
need that to answer a few small queries, right? Yeah. Like, just
use Postgres, as you would do also for your product. So that's super interesting.
You mentioned Snowflake and BigQuery, right? There's also Redshift, and it's a very interesting
story, because Redshift is the first product in the cloud data warehouse space, right?
But we tend to not talk that much about them today.
In your opinion, what do you think went wrong with Redshift?
And you touched a little bit about Snowflake,
but what do you think Snowflake did really, really well?
Yeah, I mean, this is my personal opinion.
I think Redshift is just, it's not that different from most data warehouses, but it's almost
too different.
I feel like, in my own mind, there's so many nuances on how it works
and how, like, you have to be just that much more technical in using it and making sure
you're using it properly than I think maybe with some of the previous tools, or at least
like people were technical using like Oracle and my SQL server in terms of like building their data warehouses back whenever, but like, I don't know, it just felt like such a shift in how you thought and how you design, right?
Like you couldn't run updates, right?
Like that was like a weird thing.
I think they might have recently added that, but there were little things where classic data warehouse modeling wouldn't necessarily work well.
And I think that kind of took its toll. People are like, okay, so if
I want to run an insert merge and do slowly changing dimensions, oh, I can't.
Or I've got to do this weird thing and do two tables, kind of
thing, like have a staging table, have the current table, and then create
a new table based off of that.
And so I think that might've been a little bit clunky. I think that's the biggest thing. I think it's clunky. And again, going back
to Snowflake, Snowflake just operates how you think it should work. And I think
that's what makes it different. It's the same way, like, why do people prefer Macs
over Windows? Windows can sometimes be clunky because, I don't know. I don't
know what it is about it.
It just feels a little clunkier than it would with Apple.
I'm not a designer, so even for me,
sometimes the reasoning eludes me.
It's just like, that's usually my description.
It's like, well, this one feels clunky,
that one feels smooth.
That's what I can tell you.
I like using one, I don't like using the other.
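The staging-table workaround Ben describes for Redshift (a staging table, a current table, and a new table built from the two) can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 rather than Redshift, the table and column names (customers, customers_staging, city) are hypothetical, and it overwrites changed rows rather than keeping history the way a full slowly changing dimension would.

```python
import sqlite3


def rebuild_from_staging(conn):
    """Build a new table from staging plus untouched current rows, then swap it in."""
    # Staged rows win; current rows survive only if staging has no row with the same id.
    conn.execute(
        """
        CREATE TABLE customers_new AS
        SELECT id, city FROM customers_staging
        UNION ALL
        SELECT id, city FROM customers
        WHERE id NOT IN (SELECT id FROM customers_staging)
        """
    )
    # Swap the rebuilt table in place of the old current table.
    conn.execute("DROP TABLE customers")
    conn.execute("ALTER TABLE customers_new RENAME TO customers")
    # Empty staging so the next load starts clean.
    conn.execute("DELETE FROM customers_staging")
    conn.commit()


# Hypothetical sample data: one changed customer (id 2) and one new customer (id 3).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, city TEXT)")
conn.execute("CREATE TABLE customers_staging (id INTEGER, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)", [(1, "Seattle"), (2, "Denver")]
)
conn.executemany(
    "INSERT INTO customers_staging VALUES (?, ?)", [(2, "Los Angeles"), (3, "Austin")]
)
conn.commit()

rebuild_from_staging(conn)
rows = conn.execute("SELECT id, city FROM customers ORDER BY id").fetchall()
staged_left = conn.execute("SELECT COUNT(*) FROM customers_staging").fetchone()[0]
print(rows)
```

Warehouses that support a merge or upsert statement can do this in one step, which is the kind of extra ceremony Ben is calling clunky.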
Yeah, yeah.
Yeah, I totally agree.
I think that the feeling that you get from Snowflake
is that it just works, right?
And if you have to scale up or scale down,
again, it just works.
I remember having to deal with Redshift.
I don't know how it is today because, okay,
I think they have changed quite a few things
and the product has matured a lot
and has some kind of parity
with the rest of the data warehouses out there.
But having to rescale your cluster, it was a nightmare.
You had downtimes, for example, right?
Or having to vacuum your data, which is, okay, it's something, some kind of like relic,
because they built the distributed system on top of Postgres.
Postgres has this concept of vacuuming.
And then they had also to introduce stuff like deep copying, and then you had vacuum.
Anyway, it was a lot of, let's say, not unnecessary work, but it could be very inconvenient when
it shouldn't be inconvenient.
And that's something that, from a product perspective, I think Snowflake did
really, really well. And that's amazing. But on the other hand, BigQuery is not that different,
right? When I first tried BigQuery, it has pretty much the same feeling. It just works, right?
Why do you think that Snowflake is that much more successful, in a way, or that we hear more about it
than BigQuery?
I don't know if
their marketing is better. I think maybe that's possibly one side of it. I think their
branding in general has been, like, I recall back in, oh, it must have been
like 2015 or something, I went to a meetup assuming it was some sort of tech talk, and
about 30 minutes in, I realized it was basically just a sales guy trying to sell Snowflake. But even that was all interesting,
because they were talking about the design and how they made it different and things of that
nature. So I think they've just been building on it for so long that I think that's kind of
helped. I just think it's been a lot more of a branding thing. And I think it's easier to
brand than it is to brand BigQuery, which is connected to Google Cloud. And so it's hard to
maybe separate from that. Whereas Snowflake, it's like, it's just Snowflake. There's nothing else
connected to it.
I agree with that. I was going to say, if you think about it, Redshift had the first mover advantage, right? And so the product itself, in terms of being a new solution in the market, was sort of
groundbreaking just because of the nature of the product and having the first mover
advantage.
Google is such a huge beast.
And so I wonder, and this is complete conjecture, but you kind of have a strange feeling about
using your free Gmail account and then dumping all of your customer data into BigQuery and it's
the same company. That's just a little bit of a weird perception. Like you said, where it's like,
I mean, BigQuery, it's an awesome tool. But from a branding standpoint, I think it's really hard to overcome offering
lots of free consumer products and then building a brand around corporate enterprise trust when
it comes to your most valuable asset. Whereas Snowflake, that was all they did. And so it was
a much more straightforward branding exercise.
Yeah. I'm not a marketer, but I assume
that's it. I think that plays a role. I mean,
again, obviously you've noticed that I do some marketing, but I think even then,
for me, it's just sheer force of marketing. Like, I'll just keep putting out content, but I
don't exactly have a marketing strategy.
Yeah. Well, if there's anyone out there in the
audience who has an informed opinion on this, we would love to have you on the show to discuss it
because we love talking about the battle of the warehouses.
Then one other thing that we had chatted about
before the show was what we call,
and I loved how you said it,
the other side of the stack.
And I love that visual, right?
Like there's a lot of the core components of the stack
and however you want to architect your data stack. There's this other side of the stack that doesn't
get discussed a lot, I think for a number of reasons, but it's machine learning and the
ecosystem and process and tooling around machine learning. And that's been of
interest to you lately. Can you tell us, just start out by saying, when you think
about machine learning and MLOps, what are the types of things you're talking about, and why don't you
think that they're at the forefront of the conversation with the data stack?
Sure. So yeah, when I refer to MLOps, it's almost like everything but the model.
I mean, in some ways it is the model, obviously it plays a role, but there's so much other stuff
that, right, encompasses getting some form of model out into production. And I think this is
such a constant problem for anyone who's ever built a model that you
realize you understand how to build a model.
And maybe that's what you learned in school.
Maybe that's what you learned at a bootcamp, but you never learned the other side, which
was okay.
But now how does it go into production?
How does it like maintain?
How do we, how do we make sure it's still operating correctly over time?
How do we deal with the various problems that you can deal with in terms of like
keeping models up to date?
What if problems start occurring?
What if data drift starts occurring?
Things of that nature become kind of challenging.
So I think that's kind of what I refer to as MLOps, which is all of the
stuff around the ML model, which is, to me, very similar to data
engineering, which is like you have the data pipeline, but then around the data
pipeline, you have so much infrastructure. That's just there to make sure that that data pipeline
operates smoothly and get notifications when things go wrong. And I think that's a space that
I think is going to continue to grow over the next few years, just because we've had now a decade or
so of big tech companies and other, you know, tech companies developing their pipelines, developing
good or best practices.
Like you said earlier,
you have enough people with entrepreneurial drive
that have worked at those companies
that will now probably take those learnings
and develop products to send out to other companies
in the next couple of years.
So that's one reason or one area that I'm kind of focusing
in terms of like MLOps.
In terms of why it doesn't necessarily get discussed
is I think the term itself is still kind of coming into its own. There was a paper, I think
in 2015, that kind of kicked off the idea. But I think, if I recall, the term started getting
used like 2018, 2019, in terms of MLOps. And I think that's one reason. I think people
are trying to just now get attention to this idea of, okay, in order to actually get that model out into production, we need to have a system. We can't just push it out there
and not think about how it lives out there in the wild. It has to have something more.
And so I think it's just more as companies mature in their data understanding and data
like infrastructure, they'll eventually get to that point. But I think a lot of companies aren't
there yet, right? Like they're still trying to gather their data and manage it in such a way that
makes sense. And so the next step after that would be like, okay, now that we have it all,
we've done more, we've done dashboards, we've gotten all the value out of this quote, unquote,
low hanging fruit, how do we really drive that next level? And that will be where they'll learn
about, okay, I want to develop this machine learning model. Okay, wait, I've developed it.
Now what? And so I think that's generally what most new people come up to.
It's like, okay, now what?
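Ben mentions data drift as one of the problems that shows up once a model is in production. As a minimal sketch of what such a check might look like, the snippet below compares the mean of a recent batch of one feature against a training baseline. The function names, the sample values, and the threshold of 2.0 are all hypothetical; a real monitoring setup would likely use more robust statistical tests.

```python
from statistics import mean, stdev


def drift_score(baseline, recent):
    """Shift of the recent mean from the baseline mean, in baseline standard deviations."""
    spread = stdev(baseline)
    shift = abs(mean(recent) - mean(baseline))
    if spread == 0:
        # A constant baseline: any shift at all counts as unbounded drift.
        return 0.0 if shift == 0 else float("inf")
    return shift / spread


def has_drifted(baseline, recent, threshold=2.0):
    """Flag the batch when its mean moved more than `threshold` standard deviations."""
    return drift_score(baseline, recent) > threshold


# Hypothetical feature values seen at training time and in two later batches.
training_values = [10.0, 11.0, 9.0, 10.5, 9.5]
stable_batch = [10.2, 9.8, 10.1]
shifted_batch = [14.0, 15.0, 16.0]

print(has_drifted(training_values, stable_batch))
print(has_drifted(training_values, shifted_batch))
```

A check like this would typically run on a schedule next to the model and send a notification when it fires, which is the surrounding infrastructure Ben compares to data pipeline monitoring.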
Ben, quick question about ML Ops.
It has like, the word comes from ML and Ops, okay?
And the reason that I'm saying that is because
usually with ML, we're associating data scientists
and ML engineers, right?
How do you see the overlap between data engineering
and data engineers
and like these other engineering roles participating in MLOps and whose problem is MLOps at the end?
That's definitely kind of an interesting question because, right, you have ML engineers
that develop models and put them out into systems. And I think they've definitely been
doing that for a while at larger tech companies. But I think I've also seen a lot of people kind of implement machine learning models in things like Airflow, right?
Like if you're doing a batch job for your machine learning model, it's like, okay, well,
what's one way we could kick this off? Well, let's just use Airflow, right? That's one thing we
could do to get out whatever the output is. And obviously, that's not for live models, but if
you're doing something that's batch focused. So I think that's kind of the similarity, where you
kind of have the same thing, where you're dealing with either batch jobs in ML
or you're dealing with live, more streaming-like jobs. And you probably come up
with similar optimization problems and performance problems that you would
run into in data engineering when you're doing transformations and things of that nature. So
that's kind of why I think they're somewhat related. Whose problem it is, I think, will
depend on tooling in the future.
I think for now, like you'll still have ML engineers
kind of taking care of a lot of the implementation
or software engineers in general,
just because again, you're trying to optimize something
that maybe you might not have someone
that's both strong in ML and in implementation.
So you might have to kind of find this happy medium.
But once I think you start getting hopefully better tooling,
you can get to a point where maybe ML engineer
or machine learning, like researchers,
or maybe data scientists can figure out a way
that they can deploy it,
maybe without having to have a whole extra person
required for that.
Well, I mean, again,
that'll depend on where tooling goes though.
That's great.
I mean, okay, it's probably also like a little bit early.
It's still, like, under definition
and like trying to figure out exactly
how it fits in the organization in general.
But can you give us like a list of tools
that you think are like core
or like important for MLOps today
that are very commonly used?
I think there's a few
that I've been like looking into myself.
I think I've personally been looking into
things like Azure ML and the different features that it has, because obviously it's got some things that are more focused on helping you find the right model.
But it also is, I think, going towards that drag-and-drop tool, very similar to an SSIS feel.
I don't know what it's called, where you can kind of run models using things of that nature.
I think also things like feature stores tend to be something that can play a role in MLOps. Also, I'm looking at DataRobot right now in terms of how
it's going to kind of play its role. So I don't know if there's specific components. I think
there's specific tools I'm looking at and trying to figure out what their role is,
what works best where, right? I think you're going to deal with a similar problem
that we have in the data engineering space,
which is there's just a lot. So I'm still trying
to figure out which tools will work
best.
Yeah.
Yeah. Actually, that's
a very interesting topic which is
feature stores. The reason like I
find feature stores very interesting
from a product perspective more and
not like the machine learning or the engineering
perspective is that you hear a lot about them, but actually there aren't that many of them out there.
I mean, you have the big companies that have built their own,
and even companies that traditionally open source many things, like Netflix, for example,
they haven't done it yet.
And you pretty much have something like Tecton,
which is something that, at least until recently,
wasn't publicly open.
It was a very enterprise kind of product.
And Feast, which is the open source one.
And that's all.
I mean, is there anything else out there?
I think I recall doing something a while back
where maybe I saw one or two more, but
those are definitely the two that I recall.
In fact, Tecton is one of the Slack channels I'm on.
So yeah, yeah.
Those are the two that I'm well aware of, but I'm sure there might be some more, maybe
more open source-y style projects out there.
But yeah, it really has been kind of, you know, people haven't really tried to productionize
it and make it into a product. Are you thinking about that as your next product, Costas?
I don't know. I
mean, it's a very interesting data problem that feature stores are trying to solve.
We had an episode with someone from Tecton, actually, about feature stores, and
it was very interesting. It was like the first time that I talked with someone
about feature stores, and it was very fascinating.
But I find it very interesting that we don't see
that many products yet.
That's one.
And the other is that we don't see open source projects,
which is another thing.
Like, for example, let's take data lakes, right?
We have Delta Lake, which has been open sourced.
We have Iceberg. We have Hudi. You can do your own things, probably with more vanilla stuff, by
just using something like Parquet files and running Athena or something like that. But I would put that
closer to the products that are related to data warehouses and data lakes, right?
But you have quite a few open source projects there.
But that's not the case with feature stores, which I find it very interesting.
Maybe it has to do with the nature of the problem or the products or the scale.
That's another thing.
What kind of company do you believe needs a feature store?
And when does it become something
important?
I don't think I have a good answer for that at this point.
Yeah, I just don't think I have
a good answer, because I had this conversation with a guy from Tecton, and he was saying that
a feature store is something that you need when it is going to affect the productivity of
a team, right? Like, you need to have a sizable team to need that. It's not something where,
just because you have someone who's creating a model or two or trying to do some
prediction internally, you're going to need a feature store. Maybe that's also another reason.
Maybe it's a product where you need to have a certain scale and above to actually need it,
or it's just too early, and this whole product category is still under definition.
I don't know, but it's very fascinating. I'm very interested to observe how
feature stores are going to progress as products. I guess at Facebook you have
similar technologies that you are using, but these are all built internally, right?
Yeah, yeah. I mean,
I think a lot of the stuff that's like MLOps, it's funny. Like, now I take for granted a lot of things, I think, at this point,
when you talk about my work internally at Facebook,
just because it is the positive and negative, I think,
of working at a big tech company, right?
Like, you don't understand all of the problems.
So before, in order to get my way through college,
I worked in the culinary field, and the first restaurant I worked at was one of the top
restaurants in the city. And I eventually went to a slightly lesser one, and they kind of
pointed out, they're like, yeah, you used to work somewhere where you just basically had to
slice a tomato and serve it, because you've got such good ingredients, and now you actually have
to work hard to make those ingredients something. So same thing here. It's like, some places,
you just start with, it's such a good place.
That's like, that's just so easy.
I mean, in comparison, right?
Like when you have a lot of the harder problems solved,
it's not to say there aren't problems,
it's just different.
Yeah.
Yeah, yeah, makes total sense.
All right, one last question from me.
What do you think is the importance of open source
in this whole category of data related technologies that we see around us?
Yeah. I mean, I think definitely open source always will play a role, right? There's so many
things we already kind of rely on in one way or the other that are open source. Like I think even
things like Hive, although it started at Facebook, went open source. It just benefits a lot from people being
able to improve the overall solution and not being limited to the, again, the 10 engineers
that could possibly be working on it.
And I think, I think it just gives a lot more perspective to the problems that you run into
in that code base, right?
Like you're not having, you're not being forced to wait for someone to fix the problem.
You can fix the problem.
And I think especially as engineers, that tends to be our mindset anyways, right? Like, just give
us the code, right? We'll fix it. Just give me the code, I'll fix it, and then we can go
forward with this and make this better together. So I think that's kind of the
important thing in terms of the benefits of open source, right? We can, in
theory, move faster if you have a good community around a product.
Yeah. You mentioned Airflow a few times, which is an open source project.
Are there any other projects that you really, I don't know, like, you love, that are open source?
I don't know if I'd say that I have huge ones I love. Like, I keep tabs on things, right? Like
Airbyte's something I've been keeping tabs on.
It's like an interesting idea
in terms of like open sourcing data connectors.
Yeah, I think that's the other one
that I've kind of currently been paying attention to.
Yeah, I think that's currently my focus.
I don't know, again,
I don't think I have a love for anything.
I think if something's open source,
if something costs money,
I think it just depends on the tool.
Like if I like it, I like it.
If I, like Snowflake, like Snowflake is not a cheap thing, but I like it. So I would,
I would enjoy using it, but yeah, I don't, I don't think there's like a preference.
Really an interesting conversation. Loved hearing about your experience as a consultant,
loved hearing about Facebook. And then of course the other side of the stack,
which is a whole fascinating conversation in and of itself and something I think we should do an episode on probably here soon, Costas,
because I agree it's the next wave of what's going to happen to data once everyone gets the analytics
and the data unification sorted out. So Ben, really appreciate the conversation. We'd love
to have you back on the show sometime soon.
Yeah, no, thank you guys. I
really appreciate your time. I enjoyed this conversation. Yeah. Let me know.
That was a
really fun show, and I'm going to be a broken record here and restate something that Ben stated
and then that I also restated, but it's amazing to me, just thinking about the fact that they have more
engineers working on internal tooling for engineers than many companies have total employees
when they IPO. And that's just incredible to me, just thinking about having that level of
resources and the types of things that you can build, the speed with which you can build them.
Of course, working in a large organization, there's process and bureaucracy,
but that kind of leverage is pretty mind-blowing.
Absolutely.
I don't think that we can understand
the scale of a tech company like Facebook.
And I'm not talking about the technology.
Forget about the technology.
I think the most fascinating thing
is the organizational scale.
How you can get all these people
and all these thousands of engineers
and create such a consistent product experience
at the end internally and externally.
It's amazing.
And I don't think that it's something
that you can easily experience.
It was a very interesting and very fun episode.
I also want to, outside of what you said,
there are two things that I want to keep
from our conversation.
One is that the problems at the end are the same,
regardless of how big or small of a company you are.
What changes is the scale, actually.
And that might change the tools that you might be using.
That's an interesting part of the conversation
where we were saying that, okay, just use Postgres at the end.
You don't really need necessarily a huge data warehouse
that is super ultra scalable like Snowflake, right?
That's one thing.
And the other thing that I really liked was what Ben said about Snowflake, that it's the Apple of data warehouses.
I found...
I loved that.
Yeah, I loved it too.
I think he's like very to the point about the product experience that people get from Snowflake.
So that was also amazing.
And yeah, hopefully we will have him again for another episode soon.
I really enjoyed hearing that because I think especially in the world of data, we have a
requirement to be very precise in our work and be very descriptive, require very specific
features. And there's this intangible component
of really great products that make people say, I just like to use it. And that's kind of hard
to describe. And I love that he brought that up and said, it's expensive, but I just really like
it. And I think that's, I think that's a big testament to Snowflake and what they've built.
That's true. That's true.
All right. Well, thanks for joining us and we will catch you on the next one.
We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe on your favorite podcast app to get notified about new episodes every
week.
We'd also love your feedback.
You can email me, Eric Dodds, at eric at datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought
to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse