The Data Stack Show - 190: Aligning Data Teams and Data Tools With Business Needs Featuring Ben Rogojan, the Seattle Data Guy
Episode Date: May 22, 2024
Highlights from this week’s conversation include:
Ben’s background and journey in data (0:18)
Relating data to business outcomes (2:33)
Facebook's approach to data-driven business outcomes (4:43)
Subjectivity and data-driven business outcomes (8:43)
Infrastructure and data collection at Facebook (12:04)
The importance of first-party data and the death of third-party cookies (16:27)
Facebook's Data and Attribution Challenges (20:08)
Facebook's Infrastructure and Tooling (23:41)
Differences in Data Approaches (28:26)
Challenges of Data Project Alignment with Business Outcomes (32:58)
Integration of Data into Tools and Partnerships (35:12)
Building Alliances with Embedded Data Analysts (38:08)
Budgeting for Data Teams (40:02)
Healthy Team Dynamics and Budgeting (44:18)
Data Team Reporting Structure (46:23)
Connecting with Ben and More Content (50:55)
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by RudderStack, the CDP for developers.
You can learn more at rudderstack.com.
We are here with Ben Rogojan.
Ben, you were on the show, actually, this is crazy, a couple of years ago.
It's crazy to say that.
So you were one of our very early guests.
And it's so great to have you back on.
Thanks for joining us.
Yeah, no, thank you.
Thanks so much for having me jump on.
All right.
Well, for those new listeners to the show who didn't hear your original episode,
tell us a brief background. So where did you come from and what do you do today?
Yeah. So, hey, everyone. Thanks so much for joining the show. But my name is Ben Rogojan.
A lot of people know me as the Seattle Data Guy online. Currently, I help companies kind of set
up their end-to-end data infrastructure,
help them, you know, figure out which solutions to pick. There's just so many these days and sometimes help implement. And before that, you know, I've worked as a data engineer for the
past near decade. My last job was at Facebook and working as a data engineer. Before that,
working at kind of a healthcare analytics startup, doing a lot of similar work. And honestly, I started out at a
hospital doing a lot of like programming and dashboarding and things of that nature. But
that's really kind of where I started my data journey. So Ben, we talked a little bit before
the show about your background with Facebook and then the journey into consulting. So I'm really
curious to dig in a little bit on that of
like, what problems did you work on at Facebook? And then now that you're consulting, working for
various companies, like which, which of those problems, you know, really cross between the two
and you're like, yeah, this works well. Or which of them are like, man, that's more of a Facebook,
you know, bigger tech company problem. And it's not applicable. I'm really excited to dig into
that topic. What about you? What are you excited to talk about?
Yeah, no, I think that's definitely one of the subjects I'm kind of interested in talking about
is like really comparing some of the differences, you know, like in some ways the similarities in
terms of outcomes you're always trying to get to, but also in terms of like how you get there,
maybe the amount of data you're dealing with or just the complexity and the various challenges
that companies of different sizes and data maturities kind of face.
Yeah. Awesome. All right, let's dig in.
Let's do it. Ben, I'm super pumped to have this conversation with you about relating data
to business outcomes, which is a huge topic. I think it's become much more acute of late,
actually, just because, you know, with the nature of many things, the macro environment, there have
really been a lot of layoffs, actually. I mean, we hear all the time, and I'm sure you hear,
and John, I'm sure you hear, especially as consultants, you know, our data team isn't as
big as it used to be, right?
And so we're, you know, there are a lot of things to figure out.
And one topic that John and I've been talking a lot about is, you know, how do you relate
data stuff to some sort of business outcome?
And that sounds a little bit like a tired cliche, but it's really not as straightforward as you would think it is, especially as we think about how far upstream some of the data stuff can
sit from a number moving in the right direction in some executive BI dashboard, right? So I'd love
to dig into that in today's show. And John, you were really interested in what that looks like at Facebook, which I think is a really interesting topic because of the complexity.
And if we think about that supply chain, you know, of data and how far stuff can sit upstream at a company like that, it's going to be a totally different world than, say, you know, a mid-market type company.
Yeah, and some context here. One of my first, like, true programming analytics courses was,
I think it was Udacity or Udemy, one of those.
And it was the Facebook analyst engineer that taught me.
No way.
So, yeah, it was a great course.
Learned a lot from it.
But I'm curious on the business outcomes thing.
Maybe talk about Facebook, some business outcomes that you worked on there and then how you got there, and then maybe we could
talk about the same or a similar outcome and how maybe you would get there now in a consulting role.
And I'm imagining they're not always, or probably often not, the same path.
Yeah, no, it is always dependent, right?
Like one of the nice things about Facebook
is that their infrastructure
was arguably very mature, right?
And well integrated, which in terms of like the data
and as well as the solutions where, you know,
when I often work for a company
or work for like as a consultant,
you'll like come in and they'll be like, hey, we've got, you know, let's say a marketing funnel,
but it's like across seven different solutions. Maybe I'm exaggerating. It's probably like three
or four, but it's across multiple solutions. They have multiple steps going throughout all
those solutions. Sometimes, you know, maybe one of those steps isn't captured or is kind of like
skipped. And so you kind of have to put it all together. Whereas, you know, at Facebook,
a lot of that data is generally pretty well integrated, right?
Like it generally has a flow, right?
Like I think that was something that I was impressed with
when I first started there was like,
just how like, as soon as you signed up for one application
or one like internal system,
like you were basically proliferated through all of it.
And you had an ID that kind of, you know, went through all of it. And, you know, it's kind of interesting in that
way. So, you know, in terms of business outcomes, you know, some of it was even very similar to that
where we would, I worked very close on like the HR recruiting kind of data teams. And so like,
especially at that time, right, when we were like hiring very heavily, you know, we were often
looking at the recruiting funnel and figuring out, okay, where are we winning? Where are we losing? You know, how are people actually going through the questions? Trying to figure out how, you know, different interviewers kind of do in terms of like, do they have higher or lower kind of acceptance rates, and just seeing if there's ways we can improve, maybe, you know, how we teach interviewers to make sure that they do a good job of actually, like, helping, you know, who
they're interviewing in the right ways. To make sure, like, hey, if you do have a candidate that
could have gotten through, but it was maybe something you didn't do in terms of setting a
good kind of set of, whatever you want to say, like hints or whatever might have
been required, like, how do we improve that? So there was a lot of focus on that, especially at that time. I think one of the,
it's not necessarily a business outcome, but like one of the first projects I did
was honestly all around data modeling. So at the time we, you know, like most companies, you had
proliferated these multiple data models around recruiting and HR. And, you know, multiple teams had taken them on.
And there just was this kind of lack of standards across all of them, right?
Like everyone's kind of doing their own thing.
And eventually that starts impacting the ability for analysts to essentially work, right?
Because it's like, okay, when I work with this data set,
they've got multiple IDs that all can kind of join to each other,
but we don't really know what the master ID is here.
So that can cause a problem.
You know, there was some challenges
for some people dealing with
certain types of data formats.
And these seem like small things,
but like that was kind of my first project was like,
okay, we've got all these different data models.
How do we create like one that we can all own
that like help analysts, you know,
create insights faster or get to data and not reach out to data engineers as much? And that was really my goal, was
like, how do we make it so they don't have to reach out as much, they just can work on, you know, it's
very clear when they look at it, like, this is the data we're looking at, we understand what
IDs to join to. And that kind of just helps build confidence and build those results much faster.
I have to ask a question about the HR project, because I think that's really interesting. At a
company, you know, as large as Facebook, especially in a phase where there's a ton of hiring,
you have an entire sort of business unit or data practice that's dedicated to that, right? Whereas
at a lot of companies, you know, you have to be pretty large to get to that point. But I think it draws an interesting
dynamic out, which is that, and I think this relates directly to the question around the
relationship between data stuff and business outcomes, in that there's a high level of
subjectivity there to some extent, right? So, of course, like, what, you know, what are you measuring as part of that? Well, you know, what's our close rate on hiring for key positions? Okay, maybe that's a way the HR team is, you know number of metrics there, right? But when you're interviewing, there's a level of subjectivity there that's actually pretty hard to capture with data, even though quantifying the pipeline is really important to drive the accountability to set priorities to be very different, right? And have different styles, and that's okay.
So how did you think about, or how did Facebook think about, you know, data as an input to this
with some sort of hard business outcome, which is, you know, are we hitting a certain close rate on
key positions? And then the subjectivity element of it, right? Because that's a very, I just think
that's such a good example of, there's a huge data aspect to this. But it's also, you know, it's the marriage of all these different inputs that sort of create, you know, an outcome where the sum of the parts is greater than the whole.
If I can hopefully answer that question, I think, you know, especially
at that point, like, there were some clear goals that I want to say were pretty public in terms
of like how much Facebook wanted to grow, right? Like maybe that was, to some degree,
partially, you know, to make people on the stock side happy, to show, hey, we're constantly growing,
we're growing, you know, in all different aspects. But like, I don't remember what the
targets were, but, you know, there was at some point maybe even someone saying something
about doubling it. I'd have to find articles, but I feel like I remember seeing, like, certain goals
being pretty, pretty public. And, you know, because we've done all this research, we know that, hey,
if we interview 100,000 people, we're going to likely, you know, land, whatever, a thousand employees, right? And so now, you know, like,
okay, if our goal is to hit 80,000, you know, in the next three years, we need to interview X amount
of people, and then kind of just keep walking that back and like, okay, well, if we need to do,
that means every day, we need to run X amount of interviews, that means, you know,
just it kind of just, you know, keeps building upon itself, because you have that information
now, like, if this is our target, we also know how much it takes us to get there.
We can now kind of kind of get there, you know, break down exactly what we should be doing
long term. And then if you see things throughout that whole process, right, like, okay, we're
now we're seeing like a reduction, maybe in the numbers, because eventually you will,
that was something that we talked about a lot. It's like, okay, we're seeing kind of reduction
in percentage. Is that because we've interviewed everyone? Which is, I think, part of it, where we'd be like, have we
already interviewed most people that fit, and they've either, you know, passed or didn't want the role,
and that's why they're not here? You know, should we now reach out to them again? Right, okay, these,
you know, these people didn't want the role. Okay, let's set up something to send them an email
again and be like, hey, we know you didn't want this six months ago or a year ago.
Are you interested now?
So that kind of gives you that information.
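The walk-back described in this answer is simple arithmetic, and a rough sketch may make it concrete. It uses the ratio quoted in the conversation (about 100,000 interviews for about 1,000 hires); the hiring goal, the three-year window, and the 250 working days per year are illustrative assumptions, and the function names are hypothetical.

```python
import math


def interviews_needed(target_hires: int, interviews_per_hire: float) -> int:
    """Total interviews required to land the target number of hires."""
    return math.ceil(target_hires * interviews_per_hire)


def interviews_per_day(total_interviews: int, years: float,
                       working_days_per_year: int = 250) -> int:
    """Spread the total evenly over the planning window (assumed working days)."""
    return math.ceil(total_interviews / (years * working_days_per_year))


# Ratio quoted in the conversation: ~100,000 interviews for ~1,000 hires.
ratio = 100_000 / 1_000

# Hypothetical goal for illustration only: 30,000 new hires over 3 years.
total = interviews_needed(30_000, ratio)
daily = interviews_per_day(total, years=3)
print(f"{total:,} interviews in total, about {daily:,} per working day")
```

If the conversion rate drops over time, as described above, the same calculation with a higher interviews-per-hire ratio shows how quickly the daily interview load grows.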
Yep.
Super interesting.
How about the infrastructure side?
John, you were asking about heavy-duty infrastructure that you would use to drive something at Facebook, right?
Which can be very expensive, both in terms of hard costs and headcount.
Yeah, I think before we jump into infrastructure,
one thing I'm also curious about is the collection.
Because I think people really gloss over,
because you mentioned that, and this is some very like
fuzzy things you're potentially trying to capture.
So any like creative things you all did around the collection of,
and it could be as simple as like, all right, we have the managers like fill out this form at this step or like, I don't know.
Oh, interesting.
You know, they chatted with the Slack bot.
I don't know.
Like, 'cause that, I think people skip over that. They completely ignore data that they don't have, and that can be really valuable data. If you just
collect it, like, first party, like one or two things, you can really benefit downstream. So I don't
know if anything comes to mind, but that's, yeah, something that I was thinking about.
Yeah, I feel like it's one of those things, I'm like, I feel like there were interesting things, like,
in ways that we captured information. I'm just spacing on it, but I do remember kind of like
throughout the flow, there's obviously all these ways that we would kind of capture information, including like, again, after you interviewed someone, you'd go through, you'd write your notes.
They'd usually, they'd yell at you, they'd have systems that would be like, hey, it's been like, you know, four hours since you interviewed this person, like the more you wait, the worse your memory is going to get on this.
Right.
And they'd also just have a clear form where it was like, okay, like, where they do good, where they do bad, how many questions they get through, which questions they get through, which we basically had a pretty preset of questions, which was, you know, you could basically, I think, just find on Glassdoor.
And we would also have like information on like, hey, this person's interviewed before.
So, you know, before you even interview this person,
you'd already know like, hey, they've interviewed before.
They've seen question A.
They've seen question B.
So you need to make sure you don't ask that same question.
Yeah.
So there was definitely like a lot of those things
throughout the process.
Because like Facebook's interview process at the time,
it might have changed at this point.
It's been like now more than five years, I want to say.
It's like six years since I interviewed.
And even when I was interviewing or doing the actual interviews, it's been like three years.
But it was very much like we had a system.
It was very standardized.
I think in the goal of being that if it's standardized, you kind of remove some bias out of it and you have more of a process.
So that was kind of the goal, but yeah,
kind of maybe some ways we would capture it.
It was just, you know, as you're going through the process,
it'd be like, Hey,
time for you to like review and give your perspective.
And they definitely hound you.
If it took more than like, I don't know what it was, like,
they give you like 72 hours.
If you didn't fill it out,
I think your score didn't count or something, if I remember correctly.
Ooh.
I like that.
That's data governance.
I do think that's
a really good point though
in that
and I didn't even
think about this, but when we
think about tying
a data project to some sort of outcome, thinking about
the datasets that are important to that is huge. Not just being biased on what you have.
Right, exactly. Because to the point, okay, you can quantify a funnel. I mean, that's not
rocket science. But are you using all of the available inputs?
I think it's a great question, because in the e-com space, for
example, like, quizzes were awesome. If you could get people to take a quiz, to get just even halfway
through a quiz, that's nice, like, first-party, high-intent, useful data that might
not natively be in your data
warehouse, so marketing might be doing that.
And then the data team
just didn't even think of it.
Not to
flip it around too much on y'all,
but you're talking about
first-party data. Obviously, one of the
discussions going around the data world
is the death of the cookie, which we still haven't seen. It's a forever-dying
cookie. Rosary 2032, you know. Yeah, yeah. So like, do you guys see any, like, people kind of being
like, we have to collect first-party data even more now so you can kind of understand who your
customer is? Because I feel like you guys deal with that more on the event
side. Yeah, for sure. It's certainly a big topic. I think a lot of companies are, they're thinking
critically about how they adapt to that future when it comes. And I would say increasingly, we have seen data teams who are really trying to
adopt, I guess I would call it like a first party first. Is that even? That's a nice way to do it.
First party first. You heard it here on the data side. Yeah.
Approach, right? And I think the big question there is the sacrifices that you make. So they
fall into a couple of categories that actually I'll do another flip and ask both of you because
I think you're seeing a lot of this on the ground as well. There are a couple of areas that we see.
So one is advertising, obviously, right? So we talk about Facebook, which is
advertising on Meta, you know, through the ecosystem of their apps, right? And so that's a
big concern for companies who have a lot of revenue that's heavily reliant on third-party
scripts and cookies being on their site. Now, one thing that is
very interesting is that, you know, no one likes change, right? And so if that's changing,
and the third-party cookie's going away, and we, you know, could expect X revenue from
advertising on, you know, Google search or whatever it is, there's also this sense of,
man, it's going to be great not to rely on this black box that
we're beholden to, right? Because whenever they change the rules or their conversion logic or
their attribution logic, you're beholden to that actually. And that can be a really big challenge.
So that's sort of one area. And then the other area would be, you know, just any
sort of like operational tooling. So, you know, you can think about, of course, Google Analytics
is a huge one, but there's all sorts of scripts running on everyone's, you know, websites and apps.
And so when you, that's in many ways more of like an operational thing, right?
Like, are those tools going to face limitations if they can't store a cookie?
And so I'm going to lose functionality for some operational tool.
I mean, it's all sorts of stuff, right?
From, you know, screen recording to analytics to whatever it is, right?
Personalization tools.
So I mean, but what are you guys seeing?
Yeah, I think, so I kind of started at an
interesting time. So I started in, you know, the Google Analytics, like, web space around 2015, 2016,
and the general attitude was, well, like, this is what it is. Like, this is what you use. You
use Google Analytics. We're beholden to Google. Like, we hate them some days, we like them okay other days. Yep. Like, that was just what was
available for the vast majority of people. Yeah. And I think people, I guess,
were happy enough. And then, like, you've got some evolution of tooling, and you've got some, probably, further skepticism, like, just around Google and Facebook both. Then you had the
big thing with Apple and Facebook that really, you know, hit some e-commerce
companies, with basically Facebook not being able to target as well. And then I think people
reacted with, like, I need to dig into
this more and be able to control this more. Yeah. And I think from there, then you have,
for, like, Facebook and Google, really, like, for e-commerce, that's what, you know, drives a lot of
the traffic for people. So I think then you have this attitude of, like, okay, well,
what could we do if we, like, controlled this? And you get some data people
involved, and then you end up with, like,
oh wow,
like, this actually opens up a lot of opportunity,
not the least of which
was just the very basics,
two basic things.
One, site speed.
Like, there's so many things you can A/B test.
If you just make the site faster,
like, that's one of the best things to do for people,
because you just get these marketing teams that would just pile pixels on, and they'd have like
27 pixels with like three-second, four-second, you know, page load times. And then the attribution was
the other thing, at least especially when I was getting involved. It's like, oh,
because i have an email tool and we use Shopify, like Shopify and then some Google,
and you compare attributions and it would add up to like 200%. And you're like, well,
because they're each trying to, you know, grab and say, well, yeah, I contributed to that.
So like having that like objective, like first party data to do some objective
attribution was another.
I was going to actually ask both of you about that as sort of a follow
up question. I mean, one of the things that we've seen with a lot of our customers when we think about business outcomes is that as the warehouse has increasingly become the center of the data stack and you have a first-party-first approach, it seems like it's been way easier for a lot of companies to create a business case for
the data side of things. Because you're not having to explain, you're not having to defend
the ad platforms or marketing platforms interpretation of conversion, which you
then have to do some sort of mapping, right? So if you think
about like the data team is collecting some sort of data from websites, apps, wherever, right? And
let's say you have transactional data, right? So you have purchases or whatever those are,
add to carts, right? Way downstream, that maps to some sort of business KPI, right? It's number
of orders, which is revenue, which there's margin and you sort of apply
all that.
But it's this really interesting dynamic where a lot of times it's almost like, well, we
have to defend our interpretation of what's happening in the ad platform as opposed to
saying, this was raw data and we modeled it to reflect the actual reality of the business
and you can prove that
which is pretty interesting, right? Yeah. Are you seeing that, Ben?
I'm definitely, like, I think I
haven't had to spend too much time in, like, purely advertising, like, recently. I think most of my
projects, thinking back, were like very, what am I trying to think of, like, very domain specific, like working with like a casino
and then like analyzing their gaming or working with like a telecom company and analyzing like
calls and things like that. So a little less focus on, like, how are you converting
somebody, and more focused on like how are people just using our product or using the thing that we do?
So it's been very domain specific there.
Yep, makes sense.
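The attribution point raised earlier, where each platform's claimed conversions add up to roughly 200% of real orders, can be illustrated with a small sketch. Everything here is made up for illustration: the order IDs, the platform names, the touch data, and the choice of last-touch as the first-party model (one of several reasonable models, not the one any speaker says they used).

```python
from collections import Counter

# Hypothetical orders from the first-party source of truth.
orders = ["o1", "o2", "o3", "o4"]

# Hypothetical self-reported claims: each tool counts every order it touched.
platform_claims = {
    "email_tool": {"o1", "o2", "o3"},
    "storefront": {"o1", "o2", "o4"},
    "search_ads": {"o3", "o4"},
}


def claimed_share(claims: dict[str, set[str]], real_orders: list[str]) -> float:
    """Sum of platform-claimed orders divided by the real order count."""
    real = set(real_orders)
    claimed = sum(len(ids & real) for ids in claims.values())
    return claimed / len(real)


def last_touch(touches: list[tuple[str, str, int]]) -> Counter:
    """Credit each order once, to the channel with the latest timestamp."""
    latest: dict[str, tuple[str, int]] = {}
    for order_id, channel, ts in touches:
        if order_id not in latest or ts > latest[order_id][1]:
            latest[order_id] = (channel, ts)
    return Counter(channel for channel, _ in latest.values())


# Hypothetical first-party touch log: (order_id, channel, timestamp).
touches = [
    ("o1", "email", 1), ("o1", "paid_search", 2),
    ("o2", "email", 3),
    ("o3", "paid_search", 1), ("o3", "email", 4),
    ("o4", "paid_search", 2),
]

print(f"platform-claimed share: {claimed_share(platform_claims, orders):.0%}")
print(f"first-party last-touch credit: {dict(last_touch(touches))}")
```

Because the first-party model assigns each order to exactly one channel, the credited orders always sum to the real order count, which is the property the platform-reported numbers lack.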
Okay, how about the infrastructure question?
I'm dying to hear about this because it can get really spendy.
And I think in today's environment,
it's a good topic to discuss.
Yeah, I'd love to talk Facebook first,
some infrastructure and tooling, and then
like, what are you using now day to day with like consulting clients? And I'm expecting the answers
typically, they're pretty different, but I'm curious. Yeah, I mean, you know, at the time,
and I'm sure this is somewhat similar even now, but obviously they're investing tons
more on the, like, gen AI side and, like,
hardware and things on that side, and probably making solutions and tooling, like internal
tooling, to make even that development easier for developers. But that's something I think Facebook's
always done well. Like, when I was at Facebook, it's like they made your job very easy, like to the
point that, like, I would work with certain data engineers that would then pull
me aside, like, a few months in and be like, I'm bored, right? Because, like, your job has been made
like easy. You know, for example, they've got something internally that's very
similar to, like, Airflow, or like workflow orchestration, and really all you're doing
is making this kind of half, or more like 75% SQL,
you know, 25% kind of Python configuration file, that you then just push somewhere and, like, it
runs and, you know, it kind of just works, right? Like, there's no need to, like, spin up your
own, like, Kubernetes cluster or something to, like, spin up all of these various things. It's
like someone else is
managing the actual infrastructure. You're literally just dropping, you know, and committing files
somewhere, which obviously I think is very FAANG-specific. They actually had a whole team that was
dedicated to it. It was called Dataswarm. And just developing that and managing that, so they were
constantly making it better as well as, like, maintaining it on a daily basis. So
if it went down, you weren't like, I need to solve this problem. It was like, well, I have nothing to
do for the next hour, because someone else is solving that problem. And that's not my problem.
And I can't like, I can't even solve it, right? Like, it's not even accessible for me to solve
this problem. So I think there's that aspect of it. I think the interesting thing is that Facebook
was doing the whole, and I think
probably a lot of the big data or big tech companies were doing this before, more recently,
they were doing the whole, like, hey, we're gonna put our data in kind of this open format, right?
Like, it's just gonna kind of exist in, you know, this data lake, data warehouse state somewhere,
and then we're gonna use whatever engine on top of it. You know, you can
specify that engine, you know, later on. And you're seeing that now, I think, like, with Iceberg, or
people are putting things in S3 and then, you know, using whatever engine they want to sit on top of
it, if it's more cost effective or if it just makes more sense for that specific job. So I do remember
that kind of being the thing when I left, was like, okay, hey, you want to use Presto, use Presto. You want to use Spark, you want to use, you know, something else, you know, you can kind of pull that off the shelf and use that to run the specific job on that data set. And it's very abstracted away, where it's like, literally, just again, that's that configuration, like, this job is gonna be Spark, this job is gonna be Presto. And you just call it out early on. But again, I think you're starting to see that now. I think like it makes sense, right?
Like as people are trying to control costs
to try to figure out, okay,
sometimes it's about cost,
sometimes it's about performance.
I do imagine there'll be a line
where like certain companies,
it'll just make sense to stick with,
you know, one.
You know, I see that with most of my clients
that are more in that mid, small size.
It's like, you're not going to try to juggle
BigQuery and Databricks and Snowflake.
You're going to pick one and
try to do that really well and make sure it fits.
But when I look at
the larger organizations I work with, they
already are using all of
the above, and it's more about
maybe trying to coordinate it longer
term to try to figure out what makes the most sense
for various teams.
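To make the "this job is gonna be Spark, this job is gonna be Presto" idea concrete, here is a minimal sketch of a job definition where the engine is just a declared field. This is not Dataswarm's actual format, which the conversation does not show; `JobSpec`, `render_submission`, the table names, and the supported-engine list are all hypothetical.

```python
from dataclasses import dataclass

# Assumed set of engines a platform team might expose; purely illustrative.
SUPPORTED_ENGINES = {"presto", "spark"}


@dataclass(frozen=True)
class JobSpec:
    """Mostly SQL plus a little configuration, as described in the conversation."""
    name: str
    engine: str
    sql: str

    def __post_init__(self) -> None:
        # Fail early if the declared engine is not one the platform offers.
        if self.engine not in SUPPORTED_ENGINES:
            raise ValueError(f"unsupported engine: {self.engine!r}")


def render_submission(job: JobSpec) -> dict[str, str]:
    """Build the payload a (hypothetical) platform would route to the engine."""
    return {"job": job.name, "engine": job.engine, "statement": job.sql.strip()}


daily_funnel = JobSpec(
    name="daily_recruiting_funnel",
    engine="presto",  # interactive-style query; switching engines is a one-word change
    sql="""
        SELECT stage, COUNT(*) AS candidates
        FROM recruiting.funnel_events
        GROUP BY stage
    """,
)

print(render_submission(daily_funnel)["engine"])
```

The design point is that the person writing the job only declares intent. Which cluster runs it, and how, stays with whoever owns the platform.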
Yep.
That's just touching the iceberg.
I think that question can go
multiple different directions, so feel free to
keep digging in.
I think maybe this is what you were thinking of,
John. So that's Facebook.
They have all of this.
What a luxury to have an entire team
work on this internal
tooling.
But as we've seen in the data space so often,
the FAANGs are really pushing the boundaries on inventing stuff because you have teams that are
solving problems that very few other companies have faced. Have you seen there be sort of like,
okay, so in the mid-market, like you said, okay, we're sticking with sort of one cloud,
like we're a Google shop, Snowflake shop, Databricks shop, whatever.
We're going to do that really well.
What about some of the other tooling?
Like, I mean, it seems like there's a lot of SaaS popping up that can help sort of act
as that dedicated data team to sort of take care of, you know, those pieces for teams
that don't have, you know, the resources to have like a bespoke
solution? Are there areas in particular where you see, like, okay, there's a ton of really great tooling
that's making this sort of more streamlined and accessible to smaller companies
that don't have resources? Like, what areas of the stack are there sort of efficiencies due to new
tooling?
Yeah, I think, you know, it's interesting
because I think Ethan Aaron posted about this.
It was like, 2015, data teams were like one person,
especially like mid-sized companies.
Then like 2020, they were like 30 or 50 or whatever.
They blew up pretty big.
And now we're like, you know, in 2024
and we're looking at like three to five people again on
these teams. And so it's interesting that we got to that point. You know, back in 2020, I think what
happened is people found out very quickly that if you built 100 data pipelines, you had to maintain
100 data pipelines. So the faster you built, which, you know, a lot of these tools could
kind of give you, the more you had to maintain. And then you just kept having to kind of build bigger and bigger teams to kind of...
20 of them, and only 20 of them actually got used, you know?
Yes, exactly.
And only half of them get used or 5% of them get used or whatever.
I'm sure you could find some interesting statistics around that.
But there's definitely a lot of tooling that I do think can make things easier, you know?
I think what's interesting about the solutions that have existed,
now that I've, like, you know, been working in this space for a while, is like, we've somehow still
recreated the same problems we had before. And when I say that: okay, we have a tool, whether, you know,
it be Portable, Fivetran, Estuary, to do data extraction. Great. Now we need to, like,
okay, now we have to get a tool for transformations. Great.
And now we're doing the same thing we were doing before, which was like, okay, someone created a
cron script to do data extraction.
Great. Okay, someone created the cron script
that called a stored procedure somewhere.
And it's a separate script.
And so now we have to set up, like,
you know, I say cron, but I mean, like, Python script
managed by cron. Now we have to, like,
set up these two things to run, like, about an hour and a half apart,
because that's like the optimal timing.
And it feels like, in some way, we've recreated that in this world.
It's like, okay, it's easier now, but we still have the same problem where it's like,
your Fivetran or Estuary job runs a certain time.
And now you hopefully run your dbt job or Coalesce or whatever your transformation tool is
at the same time.
And then hopefully, you know, you've got your next, you know, your Power BI dashboard updating at the exact same time or at the right time.
So it's funny how that's happened.
And like now, again, we have all these orchestrators that have been developed to like kind of go around that.
Where, like, you know, it was what Airflow was to, like, Python scripts and SQL, you know, kind of one-off jobs back in 2015. It's just like,
it's the same thing. It's like we created the same problem. You'd think we would have built this
solution into it or had this in mind, but I find that, I think, interesting. But again, all these
tools do help. I do see them, like, actually, like, I had a client, one of the first clients
I had when I quit, that I built up their solution with a few tools.
And like every once in a while, I reach out to them like, hey, how are you guys doing?
Anything?
And every once in a while, they'll reach out to me like, hey, we think we might need you to help on something.
And then like 24 hours goes by and like, never mind, we solved it.
And, you know, it's just like one data person, essentially, who's kind of managing it all.
And it kind of handled it.
So, yeah, I do think a lot of this has helped.
But it is always interesting how we've kind of recreated some of the same problems we've had for a long time now.
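[Editor's illustration, not part of the recorded conversation.] The scheduling problem described above, where extraction, transformation, and the dashboard refresh are each started by the clock and hoped to line up, can be contrasted with dependency-based ordering, which is the general idea behind orchestrators. The sketch below is a minimal, assumed example using only the Python standard library; the task names and their bodies are hypothetical placeholders and do not represent any specific tool mentioned in the episode.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline steps; the bodies are placeholders for real work.
def extract():
    return "raw rows loaded"

def transform():
    return "models rebuilt"

def refresh_dashboard():
    return "dashboard refreshed"

TASKS = {
    "extract": extract,
    "transform": transform,
    "refresh_dashboard": refresh_dashboard,
}

# Each task maps to the set of tasks that must finish before it starts.
# This replaces "run the second job 90 minutes after the first" with
# "run the second job once the first has actually completed".
DEPENDENCIES = {
    "extract": set(),
    "transform": {"extract"},
    "refresh_dashboard": {"transform"},
}

def run_pipeline(tasks, dependencies):
    """Run tasks in dependency order and return the order they ran in."""
    order = []
    for name in TopologicalSorter(dependencies).static_order():
        tasks[name]()  # an exception here stops everything downstream
        order.append(name)
    return order

if __name__ == "__main__":
    print(run_pipeline(TASKS, DEPENDENCIES))
```

With this shape, a slow extraction simply delays the transformation instead of letting it run against stale data, which is the failure mode the fixed time offset tends to produce.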
Yeah, it's like a system that allows for innovation in individual problem areas creates a more complex system, right?
But these systems have to operate, like, as a system, if that makes sense, right? And so, yeah.
Yeah, it's super interesting. Okay, I have a question I was thinking a little bit more about.
Earlier we had sort of discussed, like, this distance of data, project, data team, whatever, from like the business outcome, right?
So interested, this is a question for both of you.
Where have you seen that become a problem, right?
And so when I say become a problem, to put a sharper point on that, you know, funding
gets cut or the data team comes under scrutiny because it's like,
well, this is just a cost center. What value are they adding? Right. But, you know, and to some extent there is a bunch of infrastructure that runs upstream of, you know, what's sort of happening
downstream that shows up in the executive, you know, BI dashboard. What are the like symptoms
of that distance becoming a problem, right? Where it's like, okay, you're in a realm now where things are getting dangerous or there may be issues because even though on the ground,
you know, well, all this stuff we're doing, all this infrastructure, whatever,
is making this stuff possible downstream, but perception is reality.
Right. That list of like, you might be in trouble if.
Yeah, exactly.
Like as a data team.
Because a lot of times it's, those things are not a problem until they become a problem,
if that makes sense, right? Like, you know, that dynamic can persist for a while until whatever,
right? The company has a bad quarter, you know, a new VP comes in who's like, you know, going
through every line item, you know,
on the budget and inquiring about every single thing, right? Like those things happen. And so
those things, sometimes those dynamics can persist where a perception doesn't come to light until
there's some sort of event that brings it to light. And then at that point, it becomes a problem.
So how do you think about, like, what are those dynamics that you could catch earlier, like symptoms of that?
Yeah, I think, look, a quick one for me is like, you might be in trouble as a data team if you just
produce reports and dashboards. Because if you've got your data warehouse integrated into
pushing things out to key partners, like via integrations to tools that people already use,
like you're pushing data back into Salesforce, back into ERPs, those data teams, I think,
are, like, seen as indispensable, because that sales team is like, oh, well, you know, I use that thing,
it's in Salesforce, it's useful to me. Whereas if you're just doing dashboards, you know, I think
dashboards can be useful and reports can be useful, but those can be in trouble, because those can be
things where it's like, well, I don't remember my login, or like, I used to check that, but the data
was wrong one time and I don't look at it anymore. So that would be my number one thing: are you
integrating into the tools
people already use?
And then are you integrating in with like partners that do really useful
things with data?
Yeah.
I think something like along those lines where you like,
if you start having clear disconnects where your business like doesn't seem
to care, because sometimes, like,
I think he referenced, like, that apathy, where it's like, okay,
we ask them for things, it's wrong, or it breaks eventually.
Like, I had a client a while back who was like, oh, yeah, we, like, don't use the data warehouse anymore because, you know, this one report broke.
And, you know, now I just, we just don't do it.
You know, we use other options.
You know, we just manually create it.
So, you know, if you start having that apathy, I think that's one way. I think that can
also, like, manifest itself in, like, if you're sitting there and you're not, like, you're building
things because you think it's the right way to build things and no one in the business is,
like, asking, like, where things are going to go. I think that's never a great sign, right? Like, if
you're, like, oh yeah, like, if you're really building, you know, and just
building, as Ethan Aaron kind of quoted it, infrastructure for infrastructure's sake, and
no one at any point is stopping you, like, they're not, like, hey, yeah, what are we doing this for?
Like, that's some concern there, more just in maturity than anything else. Like, there should
be, hopefully, that maturity of, like, you know, the business hopefully understands, like, hey, this
should probably come in stages. Like, at this stage, like, when can we expect, like, to at least be able to, like,
play with the data and understand it? Because I think the more you can, like,
give them some tangibility, the more they'll, like, see that they can do things. Because on
the flip side, when I do like, let's say, you know, like clients, as I do start creating their
data warehouse, like they have this like initial vision of what they do, right? Because they've had Excel, they've got like their initial world of what they think.
And then if you give them a little more access, suddenly, they're like, Oh, my gosh, right? Like,
I've got 20,000 things I want to do suddenly, because I can see all this data, I can play with
it, I can poke at it. And then the game becomes more of like, hey, we need to now have a process
to like, you know, what's going to what needs to be prioritized, right?
Like that becomes a discussion, not like what's going to be created and create all the things you can.
It's like, OK, now that you have all this access, now you have all these ideas because you finally do, you know, see it all.
You know, how do we funnel that into an actual process?
So that's what you want.
You want to get to that point where it's like the business is like super excited.
And if anything, you're like having to spend time prioritizing what actually should be done and like also spending time maybe getting rid of old things and things like that. So.
Yeah, I think organizational structure is also a big piece here, because I've found if I can find
or make embedded data analysts so find them like maybe there's already like a financial analyst or
something or like maybe there's somebody just interested in analytics that's already
embedded in a marketing team or an ops team.
Like those can be some of the best people.
And then as far as driving adoption inside those teams,
like they can do way more than I could ever do like in a data team seat
because they just know they're there every day. They can say, Oh, Hey,
you know, you've got this problem. They can, you know,
take the data, apply it to a problem in the moment because they're on the ground.
Can we dig into that a little bit? So when you say, so you find an analyst, say, and find it,
because you were a CTO, right? And so you oversaw like the data practice, all the technical side of
things. So you're saying there's like an analyst who works in finance. And so are you essentially building an alliance with that person, making sure you're serving them with, you know,
things that they need so that they're almost an advocate for the data team in there? Or are you
like trying to poach them? Oh, no. Yeah. That's a good clarification. No, like these people are,
I did poach one or two, but in general, the good ones, but in general, it's, they stay in their
current seat. And these, then these people are like typically highly analytical, especially
finance is great because if you've got that accounting background and maybe you're like
a financial analyst and like, I've done this at two companies now, like financial analysts that
take, I mean, they take days, hours and hours to close out books
for the month before. It just, it's so much work all in Excel. And there's actually been
two companies now where that analyst has gotten the right access to data in a data warehouse.
And then they've self-taught SQL and have been some of the fastest learners,
most motivated learners to learn SQL. And they've reduced the close times by days at both companies
just because they were eager and hungry
and then had somebody to give them the right access to the data.
So that's just one simple example.
And then other analysts, maybe ops analysts often can too,
get really bogged down in manually tracking things,
having to spend hours and hours in
Excel. If you're already putting the time in, and then, because I think Ben mentioned the automation,
you'd always look for jobs like SQL automation, that automation thing could be really crucial
for those analysts that are already like just spending hours doing stuff manually.
Yeah. One question, I'm laughing here because our friend Matt
here just sent us a message and said, you might be in trouble if all finance seems to care about
is your CapEx number, which is true, I would say, across the board. But that brings up an interesting point about the way that data teams are budgeted
or projects are budgeted, because that can vary a lot. And I'll give you an extreme example,
but I think this actually also relates to how the organization views a data team.
So I was talking to someone the other day, and it's a very large company, and they work on the data platform team.
And they actually do not have a budget for this team that they're on.
That team tracks usage of the data product that they built internally, and they have chargebacks that go to the teams. And so, which is a little bit weird
because, I mean, that's slightly perverse
in that, you know, you want people to use more data,
but you get, you know, your budget, you know.
But like, so that's kind of an extreme example.
But we'd love to hear, like,
there are different ways the data teams get funded.
They're an independent organization.
They get their own budget, right?
There can be chargebacks.
There can be, I mean, what are the different, you know,
maybe just think through some of the situations where, or maybe like a healthy example, Ben,
an example of a healthy dynamic and an example where it's not as healthy,
just in terms of how the budgeting works around that stuff.
Yeah. Like in terms of like unhealthy, like obviously you can go in both directions, right? Like on one side, like I said, 2020, let's keep just adding more. And because we have added more, let's add more without truly trying to connect with, you know, does this help? Right? Like, does adding like, like, because we've added in these new systems, will our business do better?
There was a ton of startups or companies that went from startups to IPO.
I don't remember, probably in the 2022, right?
Either went bankrupt or, you know,
their stock price is doing terrible.
I don't, actually, before I say this company name,
let me just see.
Let me just check out before.
Yeah.
So like, let's say for example, and this is not to talk ill of any company, but like
if you're talking about a company that like, hey, their data infrastructure is amazing.
Like people would like look at it.
Stitch Fix, I think is a great example, right?
Like they had this, like, it was cool to go to the website
like as a data person
and see what they're doing.
And like, you know,
and it's not saying that it's unhealthy,
but it's like,
is that over like fascination with data?
Is that helpful or not in the long term?
And that I can't answer.
I don't know their internal,
but I think that can happen.
You look at a company like that.
You're like, hey, they're like cool.
They're doing data.
Then you think your company needs to be that.
And it kind of becomes this cargo culting moment.
Yeah.
Yeah, again, it's not to say that, like, it's just to say that, like, data isn't everything, you know. Just
because you have cool models, just because you've done all that, your business can still do poorly.
And so I think that happened a lot in 2020. We had all these businesses just grow, and they were like,
let's hire more data people, that seems to be what everyone's doing. And then, you know, you end up struggling because you're
spending, you know, if you've got 20 people that you're spending 150, 200K, you know, a year on,
like, that's a significant amount of your budget, especially if you're a startup. On the flip side,
you can also be in this, like, point where I often hear people say, like, if your data team
rolls up to the CFO, you're gonna have a time. So like that's kind of the other side where
it's like, yeah, you can be like very like treated like you're just a cost. And to some businesses,
I say like, you might be, you might just be a cost and that might just be your role. And you
have to understand that sometimes. But if you think you can do more, it's going to be really
hard in that situation. That's
unhealthy on the other side, where it's like, you just don't get enough attention, or you don't get
enough budget. And so you're only ever going to be able to do just enough to keep them, you know,
from having maybe an advantage, if they could have it. Yep. I'd say a healthy situation,
you know, hopefully, you're not growing, you know, your team unless there's like a specific reason,
like a business reason to be like, yes, we need a data engineer, because, you know, maybe you
had a data analyst, because I think a lot of people start with a data analyst. You had a data analyst,
they've been building all this nice stuff, but now it's getting hard to maintain, right? Because it's
like, okay, they've kind of got these three or four reports or four or five reports, they're having to
manually create them, it's taking a long time.
Is there a way we can automate this?
And is there a way we can justify, you know, hiring a 150 to 200K person to do that, right?
Like, does it actually add that to our bottom line?
Or does it still just make sense to have this data analyst kind of manage it?
Right.
So I think that like having like a healthy team would have those discussions.
And they wouldn't just be like, we need to hire a data engineer because that would solve the problem. It's like, well, these reports are only saving X amount,
it might not make sense in the long term. So something along those lines.
Yeah, yeah, makes sense. John, thoughts?
Yeah, I think when you see, like, the infrastructure, data engineering
stuff as a productivity driver, I think typically for
more than one analyst, like, maybe just one analyst, but more than one analyst. So like, man, we have
teams of analysts doing X, Y, and Z, like, every single week. And then they have to go through
the mental exercises of, like, cool, what if they did less? Do we actually need this
report? Like, all those things that need to happen prior to, like, oh no, we need this, it drives value, here's why. And the business
goes through that exercise, and then they get to the point where, like, okay, I think we need data
engineering help. It's enablement for these analysts, they'll be more productive, it's useful. I think
that's a really good exercise. Whereas, like, to what you're saying, versus, like, oh, like, yeah, we should
hire a data engineer, we need a data engineer, we need
a data warehouse, we need this, like, we need AI, right? Like, all the cargo culting.
Yes. Yeah.
But I think that process is super helpful. And then the finance thing was interesting too.
The safest data teams, if you want job security, be on a data team that reports
to a CFO. Okay. But if you want to work on really cool stuff,
because the CFO is going to think typically,
not all CFOs, think more accounting, right?
It's going to think more cost.
And the good news is the CFO's in charge of the budget,
and usually they stick up for their people,
which is good news.
But you might not get to work on the most interesting things, right? And your team is going to be small and you're going to have to work hard.
Yeah.
That's, yeah, stereotypical, but
I think that's, like, kind of the general.
Yeah, that's interesting. Okay. Oh, sorry, go ahead, Ben.
No, I think it's super interesting. That's kind of...
Okay. We're close to the buzzer here, but interested, you know, just changing gears a little
bit with the last couple of minutes we have for both of you, what are some, you see a ton of
different companies working on a ton of different stuff. What's like one of the most fun, cool
projects you've seen recently, you know, or done recently with a company?
John, why don't we start with you and then Ben can take us home.
I think at least in the e-comm space, I haven't gotten to do a project on this yet, but I'm really excited about search.
We got to talk to a really neat company, Marqo, that's working on this space. And, like, the biggest problem, like data-related
problem, for e-comm, in my opinion, is this discoverability thing. Like, if you need a part, type in a part,
Amazon works great. If you're, like, not sure what you're looking for, the search experience is really
difficult. And the only way discovery works well is if you have a really small SKU count.
So if you only sell like 10 things, then, like, that's fine, it's easy.
Yeah.
I think that, and then, like, incorporating data into search,
yeah, and, like, search intent and signals, I think that's, like, a really interesting space.
But I haven't gotten to do one of those yet, but we'll see.
All right, Ben.
Yeah, you know, I think a
lot of my projects end up being migrations, which aren't necessarily boring, but they're not the most thrilling.
Like last year, I was just proving out for one client who was spending like upwards of like, I think it was like $35,000 a month on their infrastructure.
Just kind of proving out a simpler version and helping them move to that.
Which, you know, it's cool to always hear those numbers and hear the reductions that you can do
in that regard, right? Like, okay, this is totally possible to reduce, let's do it. Recently, like,
this is just, like, more of just, I think, kind of a nice project. I like it when it has, like, this
realness to it. Again, $35,000 to, you know, bringing it down to, like, $10,000,
that's real. I think the other thing that was real was, like, I have this client, I've
had them for a while, where we work off and on. And they always have kind of interesting ideas.
And this most recent one was like, they're basically a logistics company, like they deal
with like busing and like, like people rent them for various reasons. And they're like, Hey, one of
the things that we do is like, during the summer, we do this kind of like specific sets of bus
routes. And one of my employees essentially has to wake up really early to like,
you know,
we have all the bus routes,
we have all the pickup stops and has to like plan that out.
And that takes,
you know,
they wake up at 1am just to manually do this whole process.
And I don't want them to ever quit because I don't think anyone else can do
it.
We like try to automate like even 70% of it. And so basically we've kind of
developed a system to, like, just automate that process. And that's been really cool because, again,
like, we are in the end saving someone from having to wake up at 1 a.m. to kind of develop this whole
thing. And it, you know, just, it feels good in that regard. So, it really isn't like a complex ML model.
It really is just like a rules engine that we created.
Part of the client was like,
we really want to go down this, like, GenAI route.
And I was like, I don't think it's going to work.
Like maybe, but I know we can definitely get something to work.
Maybe in more of a rules engine kind of fashion.
So we went down that route.
So yeah, I think that's always kind of cool.
Makes me think of another kind of instance where like we ended up doing a migration that
helped avoid some analysts having to wake up like on Saturday and Sunday to do this
one report because they have to report it every day.
So anything like that, it's always kind of kind of cool just to help someone out that
has some real problem like that.
Yeah, I love it.
Well, really quickly before we hop, remind us where we can find your information, where listeners can connect with you, see all your content.
Yeah, I mean, you can look up Seattle Data Guy.
I'm on YouTube, Substack, LinkedIn, probably a few other places.
But yeah, you can pretty much find most of my content. So if you want to watch videos on like becoming a data engineer or even some more specific
topics like data modeling, I've got a few pieces on that.
And same thing in the Substack.
I've got a pretty good plethora of content that ranges from beginner content to, you
know, organizational kind of how you should set up your organization and things like that.
So yeah.
That's great stuff.
We read it all the time.
Well, Ben, thank you so much for joining us on the show.
Great conversation. I learned a ton, tons to think about. And we will have you back on again soon now
that you are a multi-time guest. Yeah. Yeah. Thank you. Thanks so much. I appreciate it.
We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite
podcast app to get notified about new episodes every week.
We'd also love your feedback.
You can email me, Eric Dodds, at eric@datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by Rudderstack, the CDP for developers.
Learn how to build a CDP on your data warehouse at
rudderstack.com.