The Data Stack Show - 190: Aligning Data Teams and Data Tools With Business Needs Featuring Ben Rogojan, the Seattle Data Guy

Episode Date: May 22, 2024

Highlights from this week’s conversation include:
- Ben’s background and journey in data (0:18)
- Relating data to business outcomes (2:33)
- Facebook's approach to data-driven business outcomes (4:43)
- Subjectivity and data-driven business outcomes (8:43)
- Infrastructure and data collection at Facebook (12:04)
- The importance of first-party data and the death of third-party cookies (16:27)
- Facebook's Data and Attribution Challenges (20:08)
- Facebook's Infrastructure and Tooling (23:41)
- Differences in Data Approaches (28:26)
- Challenges of Data Project Alignment with Business Outcomes (32:58)
- Integration of Data into Tools and Partnerships (35:12)
- Building Alliances with Embedded Data Analysts (38:08)
- Budgeting for Data Teams (40:02)
- Healthy Team Dynamics and Budgeting (44:18)
- Data Team Reporting Structure (46:23)
- Connecting with Ben and More Content (50:55)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Discussion (0)
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. We are here with Ben Rogojan. Ben, you were on the show, actually, this is crazy, a couple of years ago. It's crazy to say that.
Starting point is 00:00:33 So you were one of our very early guests. And it's so great to have you back on. Thanks for joining us. Yeah, no, thank you. Thanks so much for having me jump on. All right. Well, for those new listeners to the show who didn't hear your original episode, tell us a brief background. So where did you come from and what do you do today?
Starting point is 00:00:53 Yeah. So, hey, everyone. Thanks so much for joining the show. But my name is Ben Rogajon. A lot of people know me as the Seattle Data Guy online. Currently, I help companies kind of set up their end-to-end data infrastructure, help them, you know, figure out which solutions to pick. There's just so many these days and sometimes help implement. And before that, you know, I've worked as a data engineer for the past near decade. My last job was at Facebook and working as a data engineer. Before that, working at kind of a healthcare analytics startup, doing a lot of similar work. And honestly, I started out at a hospital doing a lot of like programming and dashboarding and things of that nature. But that's really kind of where I started my data journey. So Ben, we talked a little bit before
Starting point is 00:01:37 the show about your background with Facebook and then the journey into consulting. So I'm really curious to dig in a little bit on that of like, what problems did you work on at Facebook? And then now that you're consulting, working for various companies, like which, which of those problems, you know, really cross between the two and you're like, yeah, this works well. Or which of them are like, man, that's more of a Facebook, you know, bigger tech company problem. And it's not applicable. I'm really excited to dig into that topic. What about you? What are you excited to talk about? Yeah, no, I think that's definitely one of the subjects I'm kind of interested in talking about
Starting point is 00:02:11 is like really comparing some of the differences, you know, like in some ways the similarities in terms of outcomes you're always trying to get to, but also in terms of like how you get there, maybe the amount of data you're dealing with or just the complexity and the various challenges that companies of different sizes and data maturities kind of face. Yeah. Awesome. All right, let's dig in. Let's do it. Ben, I'm super pumped to have this conversation with you about relating data to business outcomes, which is a huge topic. I think it's become much more acute of late, actually, just because, you know, with the nature of many things, the macro environment, there have
Starting point is 00:02:55 really been a lot of layoffs, actually. I mean, we hear all the time, and I'm sure you hear, and John, I'm sure you hear, especially as consultants, you know, our data team isn't as big as it used to be, right? And so we're, you know, there are a lot of things to figure out. And one topic that John and I've been talking a lot about is, you know, how do you relate data stuff to some sort of business outcome? And that sounds a little bit like a tired cliche, but it's really not as straightforward as you would think it is, especially as we think about how far upstream some of the data stuff can sit from a number moving in the right direction in some executive BI dashboard, right? So I'd love
Starting point is 00:03:41 to dig into that in today's show. And John, you were really interested in what that looks like at Facebook, which I think is a really interesting topic because the complexity. And if we think about that supply chain, you know, of data and how far stuff can set upstream at one of the things, it's going to be a totally different world than say, like, you know, in a mid-market type company. Yeah. you know, in a mid-market type company. Yeah, and some context here. I actually took one of my first, like, true programming analytics courses was, I think it was Udacity or Udemy, one of those. And it was the Facebook analyst engineer that taught me. No way. So, yeah, it was a great course. Learned a lot from it.
Starting point is 00:04:21 But I'm curious on the business outcomes thing. Maybe talk about Facebook, some business outcomes thing maybe talk about facebook some business outcomes that you worked on there and then how you got there and then maybe we could talk about the same or a similar outcome and how maybe you would get there now in a consulting role and i'm imagining they're not always or probably often not the same path. Yeah, no, it is always dependent, right? Like one of the nice things about Facebook is that their infrastructure was arguably very mature, right?
Starting point is 00:04:54 And well integrated, which in terms of like the data and as well as the solutions where, you know, when I often work for a company or work for like as a consultant, you'll like come in and they'll be like, hey, we've got, you know, let's say a marketing funnel, but it's like across seven different solutions. Maybe I'm exaggerating. It's probably like three or four, but it's across multiple solutions. They have multiple steps going throughout all those solutions. Sometimes, you know, maybe one of those steps isn't captured or is kind of like
Starting point is 00:05:20 skipped. And so you kind of have to put it all together. Whereas, you know, at Facebook, a lot of that data is generally pretty well integrated, right? Like it generally has a flow, right? Like I think that was something that I was impressed with when I first started there was like, just how like, as soon as you signed up for one application or one like internal system, like you were basically proliferated through all of it.
Starting point is 00:05:44 And you had an ID that kind of, you know, went through all of it. And, you know, it's kind of interesting in that way. So, you know, in terms of business outcomes, you know, some of it was even very similar to that where we would, I worked very close on like the HR recruiting kind of data teams. And so like, especially at that time, right, when we were like hiring very heavily, you know, we were often looking at the recruiting funnel and figuring out, okay, where are we winning? Where are we losing? Where are you know, how are people actually going through the questions? Are they, you know, trying to figure out how, you know, different interviewees kind of kind of do in terms of like, do they have higher or lower kind of acceptance rates and just seeing if there's ways we can improve maybe you know how we teach interviewees to make sure that they do a good job of actually like helping you know who they're interviewing in the right ways to make sure like hey if you do have a candidate that could have gotten through but it was maybe something you didn't do in terms of setting a good kind of got a kind of set of whatever you want to say like hints or whatever might have
Starting point is 00:06:42 been required like how do we improve that so there was a lot of focus on that, especially at that time. I think one of the, it's not necessarily a business outcome, but like one of the first projects I did was honestly all around data modeling. So at the time we, you know, like most companies, you had proliferated these multiple data models around recruiting and HR and And, you know, multiple teams had taken them on. And there just was this kind of lack of standards across all of them, right? Like everyone's kind of doing their own thing. And eventually that starts impacting the ability for analysts to essentially work, right? Because it's like, okay, when I work with this data set,
Starting point is 00:07:21 they've got multiple IDs that all can kind of join to each other, but we don't really know what the master ID is here. So that can cause a problem. You know, there was some challenges for some people dealing with certain types of data formats. And these seem like small things, but like that was kind of my first project was like,
Starting point is 00:07:38 okay, we've got all these different data models. How do we create like one that we can all own that like help analysts, you know, create insights faster or get to data and don't reach out to data engineers as fast and that was really my goal was like how do we make it so they don't have to reach out as much they just can work on you know it's very clear when they look at it like this is the data we're looking at where we understand what ids to join to and that kind of just helps build confidence and build those results much faster. I have to ask a question about the HR project, because I think that's really interesting. At a
Starting point is 00:08:12 company, you know, as large as Facebook, especially in a phase where there's a ton of hiring, you have an entire sort of business unit or data practice that's dedicated to that, right? Whereas at a lot of companies, you know, you have to be pretty large to get to that point. But I think it draws an interesting dynamic out, which is that, and I think this relates directly to the question around the relationship between data stuff and business outcomes, in that there's a high level of subjectivity there to some extent, right? So of of course, like, what, you know, what are you measuring as part of that? Well, you know, what's our close rate on hiring for key positions? Okay, maybe that's a way the HR team is, you know number of metrics there, right? But when you're interviewing, there's a level of subjectivity there that's actually pretty hard to capture with data, even though quantifying the pipeline is really important to drive the accountability to set priorities to be very different, right? And have different styles, and that's okay. So how did you think about, or how did Facebook think about, you know, data as an input to this with some sort of hard business outcome, which is, you know, are we hitting a certain close rate on
Starting point is 00:09:38 key positions? And then the subjectivity element of it, right? Because that's a very, I just think that's such a good example of, there's a huge data aspect to this. But it's also, you know, it's the marriage of all these different inputs that sort of create, you know, an outcome where the sum of the parts is greater than the whole if i can hopefully answer that question i i think you know especially at that point like there were some clear goals that i want to say were pretty public in terms of like how much facebook wanted to grow right like maybe that was somewhat to some degree partially you know to make people on the stock side happy to show hey we're constantly growing we're growing you know in all different aspects but like i remember i don't remember what the targets were but you know there was at some point, maybe even someone saying something to double it, I'd have to find articles, but I feel like I remember seeing like, certain goals
Starting point is 00:10:31 being pretty, pretty public. And, you know, because we've done all this research, we know that, hey, if we interview 100,000 people, we're going to likely, you know, land whatever, 1000 employees, right. And so now, you know, like, okay, if our goal is to hit 80,000, you know, land, whatever, a thousand employees, right? And so now, you know, like, okay, if our goal is to hit 80,000, you know, in the next three years, we need to interview X amount of people, and then kind of just keep walking that back and like, okay, well, if we need to do, that means every day, we need to run X amount of interviews, that means, you know, just it kind of just, you know, keeps building upon itself, because you have that information now, like, if this is our target, we also know how much it takes us to get there.
Starting point is 00:11:08 We can now kind of kind of get there, you know, break down exactly what we should be doing long term. And then if you see things throughout that whole process, right, like, okay, we're now we're seeing like a reduction, maybe in the numbers, because eventually you will, that was something that we talked about a lot. It's like, okay, we're seeing kind of reduction in percentage. Is that because we've interviewed everyone which is i think parts of it where we'd be like can we either interview most people that fit and they've either you know passed or didn't want the role and that's why they're not here you know should we now reach out to them again right okay these you know these people didn't want to roll okay let's set up something to send them an email
Starting point is 00:11:42 again and be like hey we know you didn't want this six months ago or a year ago. Are you interested now? So that kind of gives you that information. Yep. Super interesting. How about the infrastructure side? John, you were asking about heavy-duty infrastructure that you would use to drive something at Facebook, right? Which can be very expensive, both in terms of hard costs and headcount.
Starting point is 00:12:03 Yeah. can be very expensive, both in terms of like hard costs and headcount. Yeah, I think before we jump into infrastructure, one thing I'm also curious about is the collection. Because I think people really gloss over, because you mentioned that, and this is some very like fuzzy things you're potentially trying to capture. So any like creative things you all did around the collection of, and it could be as simple as like, all right, we have the managers like fill out this form at this step or like, I don't know. Oh, interesting.
Starting point is 00:12:31 You know, they chatted with the Slack bot. I don't know. Like, cause, cause that, I think people skip over that. completely ignore data that you don't have and that can be really valuable data that if you just collect it like first party like one or two things you can really like benefit downstream so i don't know if anything comes to mind but that's yeah something that i was thinking about yeah i like i feel like it's one of those things i'm like i feel like there were interesting things like in ways that we captured information i'm just spacing on it but i do remember kind of like throughout the flow there's obviously all these ways that we captured information, I'm just spacing on it, but I do remember kind of like throughout the flow, there's obviously all these ways that we would kind of capture information, including like, again, after you interviewed someone, you'd go through, you'd write your notes.
Starting point is 00:13:16 They'd usually, they'd yell at you, they'd have systems that would be like, hey, it's been like, you know, four hours since you interviewed this person, like the more you wait, the worse your memory is going to get on this. Right. And they'd also just have a clear form where it was like, okay, like, where they do good, where they do bad, how many questions they get through, which questions they get through, which we basically had a pretty preset of questions, which was, you know, you could basically, I think, just find on Glassdoor. And we would also have like information on like, hey, this person's interviewed before. So, you know, before you even interview this person, you'd already know like, hey, they've interviewed before. They've seen questions at A. They've seen questions at B.
Starting point is 00:13:52 So you need to make sure you don't ask that same question. Yeah. So there was definitely like a lot of those things throughout the process. Because like Facebook's interview process at the time, it might have changed at this point. It's been like now more than five years, I want to say. It's like six years since I interviewed.
Starting point is 00:14:09 And even when I was interviewing or doing the actual interviews, it's been like three years. But it was very much like we had a system. It was very standardized. I think in the goal of being that if it's standardized, you kind of remove some bias out of it and you have more of a process. So that was kind of the goal, but yeah, kind of maybe some ways we would capture it. It was just, you know, as you're going through the process, it'd be like, Hey,
Starting point is 00:14:35 time for you to like review and give your perspective. And they definitely hound you. If it took more than like, I don't know what it was like. They give you like 72 hours. If you didn't fill it out, I think your score didn't count or something. I remember correctly. Ooh, I don't remember what it was, they give you like 72 hours if you didn't fill it out. I think your score didn't count or something. I remember correctly. I like that.
Starting point is 00:14:49 That's data governance. I do think that's a really good point though in that and I didn't even think about this, but when we think about tying a data project to some sort of outcome, thinking about
Starting point is 00:15:08 the datasets that are important to that is huge. Not just being biased on what you have. Right, exactly. Because to the point, okay, you can quantify a funnel. That's not rocket science. But are you using all of the available inputs? i mean that's not rocket science but are the inputs are you using all of the available inputs i think it's a great question because in the e-com space for example like um quizzes were awesome if you could get people to take a quiz to get just even halfway through a quiz that nice like first party high intentintent, useful data that might not natively be in your data warehouse, so marketing might be doing that.
Starting point is 00:15:50 And then the data team just didn't even think of it. Not to flip it around too much on y'all, but you're talking about first-party data. Obviously, one of the discussions going around the data world is the death of the cookie, which we still haven't seen uh it's it's a forever dying
Starting point is 00:16:10 cookie rosary 2032 you know yeah yeah so like are you do you guys see any like people kind of being like we have to collect first party data even more now so you can kind of understand who your customer is because i feel like you guys deal with that more on the event side. Yeah, for sure. It's certainly a big topic. I think a lot of companies are, they're thinking critically about how they adapt to that future when it comes. And I would say increasingly, we have seen data teams who are really trying to adopt, I guess I would call it like a first party first. Is that even? That's a nice way to do it. First party first. You heard it here on the data side. Yeah. Approach, right? And I think the big question there is the sacrifices that you make. So they
Starting point is 00:17:11 fall into a couple of categories that actually I'll do another flip and ask both of you because I think you're seeing a lot of this on the ground as well. There are a couple of areas that we see. So one is advertising, obviously, right? So we talk about Facebook, which are advertising on meta and is, you know, through the ecosystem of their apps, right? And so that's a big concern for companies who have a lot of revenue that's heavily reliant on the third part of the script and cookies being on their site. Now, one thing that is very interesting is that, you know, no one likes change, right? And so if that's changing, and the third party cookies going away, and we, you know, could expect X revenue from
Starting point is 00:17:56 advertising on, you know, Google search or whatever it is, there's also this sense of, man, it's going to be great not to rely on this black box that we're beholden to, right? Because whenever they change the rules or their conversion logic or their attribution logic, you're beholden to that actually. And that can be a really big challenge. So that's sort of one area. And then the other area would be, you know, just any sort of like operational tooling. So, you know, you can think about, of course, Google Analytics is a huge one, but there's all sorts of scripts running on everyone's, you know, websites and apps. And so when you, that's in many ways more of like an operational thing, right?
Starting point is 00:18:45 Like, are those tools going to face limitations if they can't store a cookie? And so I'm going to lose functionality for some operational tool. I mean, it's all sorts of stuff, right? From, you know, screen recording to analytics to whatever it is, right? Personalization tools. So I mean, but what are you guys seeing? Yeah, I think so i kind of started interesting time so i started you know the google analytics like web space around 2015 2016
Starting point is 00:19:12 and the general attitude was well like this is what it is like this is what you use you use just you use google analytics we're beholden to google like we hate them some days we like them okay other days yep like that was just the that was what was available for the vast majority of people yeah and i think i don't know and i think people i guess were happy enough and then like you've got some evolution of tooling and you've got some probably further skepticism of like just around google and facebook both then you had the big thing with apple and facebook that really you know e-commerce really hit some e-commerce companies with some basically facebook not being able to target as well and then i think people reacted with like i need more i need to like dig into this more and be able to control this more yeah and i think from there then you have
Starting point is 00:20:11 for like facebook and google really like for e-commerce that's what you know drives a lot of the traffic for people so i think then you have this attitude of like okay well if i did like what could we do if we like control this and you get some data people involved and then you end up with like, oh wow, like this actually opens up a lot of opportunity, not the least of which, which was just the very basics,
Starting point is 00:20:36 two basic things. One site speed. Like there's so many things you can AB test. If you just make the site faster, like that's one of the best things to do for people because you just get these marketing teams that would just pile pixel on and they'd have like 27 pixels with like three second four second you know page load times and then the attribution was the other thing that like at least especially when i was getting involved in like like oh
Starting point is 00:21:00 because i have an email tool and we use Shopify, like Shopify and then some Google, and you compare attributions and it would add up to like 200%. And you're like, well, because they're each trying to, you know, grab and say, well, yeah, I contributed to that. So like having that like objective, like first party data to do some objective attribution was another. I was going to actually ask both of you about that as sort of a follow up question. I mean, one of the things that we've seen with a lot of our customers when we think about business outcomes is that as the warehouse has increasingly become the center of the data stack and you have a first-party-first approach, it seems like it's been way easier for a lot of companies to create a business case for the data side of things. Because you're not having to explain, you're not having to defend the ad platforms or marketing platforms interpretation of conversion, which you
Starting point is 00:22:03 then have to do some sort of mapping, right? So if you think about like the data team is collecting some sort of data from websites at wherever, right? And let's say you have transactional data, right? So you have purchases or whatever those are, add to carts, right? Way downstream, that maps to some sort of business KPI, right? It's number of orders, which is revenue, which there's margin and you sort of apply all that. But it's this really interesting dynamic where a lot of times it's almost like, well, we have to defend our interpretation of what's happening in the ad platform as opposed to
Starting point is 00:22:37 saying, this was raw data and we modeled it to reflect the actual reality of the business and you can prove that which is pretty interesting right yeah you're saying that ben i'm definitely like i think i haven't had to spend too much time in like purely advertising like recently i think most of my projects thinking back were like very what am i trying to think of it like very domain specific like working with like a casino and then like analyzing their gaming or working with like a telecom company and analyzing like calls and and things like that so a little less on focus on like how are you converting um somebody uh and more focused on like how are people just using our product or using the thing that we do?
Starting point is 00:23:28 So it's been very domain specific there. Yep, makes sense. Okay, how about the infrastructure question? I'm dying to hear about this because it can get really spendy. And I think in today's environment, it's a good topic to discuss. Yeah, I'd love to talk Facebook first, some infrastructure and tooling, and then
Starting point is 00:23:46 like, what are you using now day to day with like consulting clients? And I'm expecting the answers typically, they're pretty different, but I'm curious. Yeah, I mean, you know, at the time, and I'm sure this is somewhat similar, even even now, but obviously, they're investing tons into more on the like gen ai side and like hardware and things on that side and probably making solutions and and tooling like internal tooling to make even that development easier for developers but that's something i think facebook's always done well like when i was in facebook it's like they made your job very easy like to the point that like i would work with certain data engineers that would then pull
Starting point is 00:24:25 me aside like a few months in and be like i'm bored right because like your job has been made like easy you know the for example you know they've got something internally that's very similar to like airflow or like workflow orchestration and really all you're doing is making this kind of half or or more like 75% SQL, you know, 25% kind of Python configuration file, that you then just push somewhere and like it runs and you know, you're kind of just works, right? Like there's no need to like spin up your own like Kubernetes cluster or something to like spin up, like all of these various things. It's like someone else is
Starting point is 00:25:05 managing the actual infrastructure you're literally just dropping you know and committing files somewhere which obviously i think is very facing specific they actually had a whole team that was dedicated to it was called data swarm and just developing that and managing that so they were constantly making it better as well as like, maintaining it on a daily basis. So if it went down, you weren't like, I need to solve this problem. It was like, well, I have nothing to do for the next hour, because someone else is solving that problem. And that's not my problem. And I can't like, I can't even solve it, right? Like, it's not even accessible for me to solve this problem. So I think there's that aspect of it. I think the interesting thing is that Facebook
Starting point is 00:25:43 was doing the whole and i think probably a lot of the big data or big tech companies were doing this before more recently they were doing the whole like hey we're gonna put our data in kind of this open format right like like it's just gonna kind of exist in you know this data lake data warehouse states somewhere and then we're gonna use whatever engine on top of it you know you can specify that engine you know later on and now i'm you're seeing that now i think like iceberg or people are putting things in s3 and then you know using whatever engine they want to sit on top of it if it's more cost effective or if it just makes more sense for that specific job so i do remember
Starting point is 00:26:21 that kind of being the thing when i left was like okay hey you want to use presto use presto, use Presto, you want to use Spark, you want to use, you know, something else, you know, you can kind of pull that off the shelf and use that to run the specific job on that data set. And it's very abstracted away where it's like, literally, just again, that's that configuration, like, this job is gonna be Spark, this job is gonna be Presto. And you just call it out early on. But again, I think you're starting to see that now. I think like it makes sense, right? Like as people are trying to control costs to try to figure out, okay, sometimes it's about cost, sometimes it's about performance. I do imagine there'll be a line where like certain companies, it'll just make sense to stick with,
Starting point is 00:26:56 you know, one. You know, I see that with most of my clients that are more in that mid, small size. It's like, you're not going to try to juggle BigQuery and Databricks and Snowflake. You're going to pick one and try to do that really well and make sure it fits. But when I look at
Starting point is 00:27:09 the larger organizations I work with, they already are using all of the above, and it's more about maybe trying to coordinate it longer term to try to figure out what makes the most sense for various teams. Yep. That's just touching the iceberg.
Starting point is 00:27:25 I think that question can go multiple different directions, so feel free to keep digging in. I think maybe this is what you were thinking of, John. So that's Facebook. They have all of this. What a luxury to have an entire team work on this internal
Starting point is 00:27:41 tooling. But as we've seen in the data space so often, the fangs are really pushing the boundaries on inventing stuff because you have teams that are solving problems that very few other companies have faced. Have you seen there be sort of like, okay, so in the mid-market, like you said, okay, we're sticking with sort of one cloud, like we're a Google shop, Snowflake shop, data brick shop, whatever. We're going to do that really well. What about some of the other tooling?
Starting point is 00:28:10 Like, I mean, it seems like there's a lot of SaaS popping up that can help sort of act as that dedicated data team to sort of take care of, you know, those pieces for teams that don't have, you know, the resources to have like a bespoke solution are there areas in particular where you see like okay there's a ton of really great tooling that's making this sort of more streamlined and accessible to smaller companies that don't have resources like what areas of the stack are their sort of efficiencies due to new tooling yeah i think know, it's interesting because I think Ethan Aaron posted about this.
Starting point is 00:28:49 It was like 2015, his data teams were like one person, especially like mid-sized companies. Then like 2020, they were like 30 or 50 or whatever. They blew up pretty big. And now we're like, you know, in 2024 and we're looking at like three to five people again on these teams and so it's interesting that we got to that point you know back in 2020 i think what happened is people found out very quickly that if you built 100 data pipelines you had to maintain
Starting point is 00:29:17 100 data pipelines so as the fact the faster you built which you know a lot of these tools could kind of give you the more you had to maintain And then you just kept having to kind of build bigger and bigger teams to kind of... 20 of them, and only 20 of them actually got used, you know? Yes, exactly. And only half of them get used or 5% of them get used or whatever. I'm sure you could find some interesting statistics around that. But there's definitely a lot of tooling that I do think can make things easier, you know? I think what's interesting about the solutions that have existed
Starting point is 00:29:45 now that i've like you know been working in this space for a while is like we've somehow still recreated the same problems we had before and when i say that okay we have a tool whether you know be portable five-train estuary to do data extraction great now we needed to write like okay now we have to get a tool for transformations. Great. And now we're doing the same thing we were doing before, which was like, okay, someone created a cron script to do data extraction. Great. Okay, someone created the cron script that called a stored procedure somewhere.
Starting point is 00:30:16 And it's a separate script. And so now we have to set up, like, you know, I say cron, but I mean, like, Python script managed by cron. Now we have to, like, set up these two things to run, like like about an hour and a half apart, because that's like the optimal timing. And it feels like some in some way, we've recreated that in this world. It's like, okay, it's easier now, but we still have the same problem where it's like,
Starting point is 00:30:35 your Fivetran or Estuary job runs a certain time. And now you hopefully run your dbt job or coalesce or whatever your transformation tool is at the same time. And then hopefully, you know, you've got your next, you hopefully run your dbt job or coalesce or whatever your transformation tool is at the same time. And then hopefully, you know, you've got your next, you know, your Power BI dashboard updating at the exact same time or at the right time. So it's funny how that's happened. And like now, again, we have all these orchestrators that have been developed to like kind of go around that. We're like, you know, it was what Airflow was to like Python scripts and SQL, you know, kind of one-off jobs back in 2015 it's just like it's the same thing it's like we created the same problem you think we would have built this
Starting point is 00:31:11 solution into it or had this in mind but maybe find that i think interesting but again all these tools do help i do see them like actually like i have i had a client one of the first clients i had when i quit that i built up their solution with a few tools. And like every once in a while, I reach out to them like, hey, how are you guys doing? Anything? And every once in a while, they'll reach out to me like, hey, we think we might need you to help on something. And then like 24 hours goes by and like, never mind, we solved it. And, you know, it's just like one data person, essentially, who's kind of managing it all.
Starting point is 00:31:45 And it kind of handled it. So, yeah, I do think a lot of this has helped. But it is always interesting how we've kind of recreated some of the same problems we've had for a long time now. Yeah, it's like a system that allows for innovation in individual problem areas creates a more complex system right and but these systems have to operate like as a system if that makes sense right and so yeah yeah it's super interesting okay i have a question i was thinking a little bit more about earlier we had sort of discussed like this distance of data, project, data team, whatever, from like the business outcome, right? So interested, this is a question for both of you.
Starting point is 00:32:34 Where have you seen that become a problem, right? And so when I say become a problem, to put a sharper point on that, you know, funding gets cut or the data team comes under scrutiny because it's like, well, this is just a cost center. What value are they adding? Right. But, you know, and to some extent there is a bunch of infrastructure that runs upstream of, you know, what's sort of happening downstream that shows up in the executive, you know, BI dashboard. What are the like symptoms of that distance becoming a problem, right? Where it's like, okay, you're in a realm now where things are getting dangerous or there may be issues because even though on the ground, you know, well, all this stuff we're doing, all this infrastructure, whatever, is making this stuff possible downstream, but perception is reality.
Starting point is 00:33:20 Right. That list of like, you might be in trouble if. Yeah, exactly. Like as a data team. Because a lot of times it's, those things are not a problem until they become a problem, if that makes sense, right? Like, you know, that dynamic can persist for a while until whatever, right? The company has a bad quarter, you know, a new VP comes in who's like, you know, going through every line item, you know, on the budget and inquiring about every single thing, right? Like those things happen. And so
Starting point is 00:33:51 those things, sometimes those dynamics can persist where a perception doesn't come to light until there's some sort of event that brings it to light. And then at that point, it becomes a problem. So how do you think about like, what are those dynamics that can you could catch earlier like symptoms of that yeah i think look a quick one for me is like you might be in trouble as a data team if you just produce reports and dashboards because if you are if you've got your data warehouse integrated into pushing things out to key partners like via integrations to tools that people already use like you're pushing data back into salesforce back into erps back into that those data teams i think are like seen as indispensable because that sales team is like oh well you know i use that thing
Starting point is 00:34:46 it's in salesforce it's useful to me whereas if you're just doing dashboards if you know i think dashboards can be useful and reports can be useful but those can be in trouble because those can be things where it's like well i don't remember my login or like i used to check that but the data was wrong one time and i don't look at it anymore so that would be my number one thing is are you integrating into the tools people are to use? And then are you integrating in with like partners that do really useful things with data?
Starting point is 00:35:12 Yeah. I think something like along those lines where you like, if you start having clear disconnects where your business like doesn't seem to care because of sometimes like, I think he referenced like that apathy where it's like, okay, we ask them for things it's's wrong, or it breaks eventually. Like, I had a client a while back who was like, oh, yeah, we, like, don't use the data warehouse anymore because, you know, this one report broke. And, you know, now I just, we just don't do it.
Starting point is 00:35:38 You know, we use other options. You know, we just manually create it. So, you know, if you start having that apathy i think that's one way i think that can also like manifest itself in like if you're sitting there and you're not like you're building things because you think it's the right way to build things and no one in the business is like asking like where things are going to go i think that's never a great sign right like if you're like oh yeah like if you're really building you know and just building as ethan aaron kind of quoted it infrastructure for infrastructure sake and
Starting point is 00:36:09 no one at any point is stopping you like they're like not like hey yeah what are we doing this for like that there's some concern they're more just in maturity than anything else like there should be hopefully that maturity of like you know they the business hopefully understands like hey this should probably come in stages like at this stage would stage, would like, when can we expect, like, to at least be able to like, play with the data and understand it? Because I think the more you can, like, give them some tangibility, the more they'll, like, see that they can do things. Because on the flip side, when I do like, let's say, you know, like clients, as I do start creating their data warehouse, like they have this like initial vision of what they do, right? Because they've had Excel, they've got like their initial world of what they think.
Starting point is 00:36:48 And then if you give them a little more access, suddenly, they're like, Oh, my gosh, right? Like, I've got 20,000 things I want to do suddenly, because I can see all this data, I can play with it, I can poke at it. And then the game becomes more of like, hey, we need to now have a process to like, you know, what's going to what needs to be prioritized, right? Like that becomes a discussion, not like what's going to be created and create all the things you can. It's like, OK, now that you have all this access, now you have all these ideas because you finally do, you know, see it all. You know, how do we funnel that into an actual process? So that's what you want.
Starting point is 00:37:17 You want to get to that point where it's like the business is like super excited. And if anything, you're like having to spend time prioritizing what actually should be done and like also spending time maybe getting rid of old things and then things like that so yeah i i think organizational structure is also a big piece here because i've found if i can find or make embedded data analysts so find them like maybe there's already like a financial analyst or something or like maybe there's somebody just interested in analytics that's already embedded in a marketing team or an ops team. Like those can be some of the best people. And then as far as driving adoption inside those teams,
Starting point is 00:37:54 like they can do way more than I could ever do like in a data team seat because they just know they're there every day. They can say, Oh, Hey, you know, you've got this problem. They can, you know, take the data, apply it to a problem in the moment because they're on the ground. Can we dig into that a little bit? So when you say, so you find an analyst, say, and find it, because you were a CTO, right? And so you oversaw like the data practice, all the technical side of things. So you're saying there's like an analyst who works in finance. And so are you essentially building an alliance with that person, making sure you're serving them with, you know, things that they need so that they're almost an advocate for the data team in there? Or are you
Starting point is 00:38:34 like trying to poach them? Oh, no. Yeah. That's a good clarification. No, like these people are, I did poach one or two, but in general, the good ones, but in general, it's, they stay in their current seat. And these, then these people are like typically highly analytical, especially finance is great because if you've got that accounting background and maybe you're like a financial analyst and like, I've done this at two companies now, like financial analysts that take, I mean, they take days, hours and hours to close out books for the month before. It just, it's so much work all in Excel. And there's actually been two companies now where that analyst has gotten the right access to data in a data warehouse.
Starting point is 00:39:16 And then they've self-taught SQL and have been some of the fastest learners, most motivated learners to learn SQL. And they've reduced the close times by days at both companies just because they were eager and hungry and then had somebody to give them the right access to the data. So that's just one simple example. And then other analysts, maybe ops analysts often can too, get really bogged down in manually tracking things, having to spend hours and hours in
Starting point is 00:39:45 excel if you're already putting the time in and then you because i think ben mentioned the automation you'd always look for jobs like sql automation that automation thing could be really crucial for those analysts that are already like just spending hours doing stuff manually yeah one manually. Yeah. One question, I'm laughing here because our friend Matt here just sent us a message and said, you might be in trouble if all finance seems to care about is your CapEx number, which is true, I would say, across the board. But that brings up an interesting point about the way that data teams are budgeted or projects are budgeted, because that can vary a lot. And I'll give you an extreme example, but I think this actually also relates to how the organization views a data team. So I was talking to someone the other day, and it's a very large company, and they work on the data platform team.
Starting point is 00:40:47 And they actually do not have a budget for this team that they're on. That team tracks usage of the data product that they built internally, and they have chargebacks that go to the teams. And so, which is a little bit weird because, I mean, that's slightly perverse and that, you know, you want people to use more data, but you get, you know, your budget, you know. But like, so that's kind of an extreme example. But we'd love to hear, like, there are different ways the data teams get funded.
Starting point is 00:41:20 They're an independent organization. They get their own budget, right? There can be chargebacks. There can be, I mean, what are the different, you know, maybe just think through some of the situations where, or maybe like a healthy example, Ben, an example of a healthy dynamic and an example where it's not as healthy, just in terms of how the budgeting works around that stuff. Yeah. Like in terms of like unhealthy, like obviously you can go in both directions, right? Like on one side, like I said, 2020, let's keep just adding more. And because we have added more, let's add more without truly trying to connect with, you know, does this help? Right? Like, does adding like, like, because we've added in these new systems, will our business do better?
Starting point is 00:42:07 There was a ton of startups or companies that went from startups to IPO. I don't remember, probably in the 2022, right? Either went bankrupt or, you know, their stock price is doing terrible. I don't, actually, before I say this company name, let me just see. Let me just check out before. Yeah.
Starting point is 00:42:29 So like, let's say for example, and this is not to talk ill of any company, but like if you're talking about a company that like, hey, their data infrastructure is amazing. Like people would like look at it. Stitch Fix, I think is a great example, right? Like they had this like, they like like it was cool to go to the website like as a data person and see what they're doing. And like, you know,
Starting point is 00:42:48 and it's not saying that it's unhealthy, but it's like, is that over like fascination with data? Is that helpful or not in the long term? And that I can't answer. I don't know their internal, but I think that can happen. You look at a company like that.
Starting point is 00:43:01 You're like, hey, they're like cool. They're doing data. Then you think your company needs to be that. And it kind of becomes this cargo culting moment yeah um yeah again it's not to say that like it's just to say that like data isn't everything you know just because you have cool models just because you've done all that your business can still do poorly and so i think that happened a lot in 2020 we have all these businesses just grow and they were like let's hire more data people that seems to be what everyone's doing and then you know you end up struggling because you're spending you know if you've got 20 people that you're spending 150 200k and you know a year on
Starting point is 00:43:34 like that's a significant amount of your budget especially if you're a startup on the flip side you can also be in this like point where i often hear people say like if your cfo if your data team rolls up to the cfo you're gonna have a time. So like that's kind of the other side where it's like, yeah, you can be like very like treated like you're just a cost. And to some businesses, I say like, you might be, you might just be a cost and that might just be your role. And you have to understand that sometimes. But if you think you can do more, it's going to be really hard in that situation that's unhealthy on the other side, where it's like, you just don't get enough attention, or you don't get
Starting point is 00:44:07 enough budget. And so you're only ever going to be able to do just enough to keep them, you know, from having maybe an advantage, if they could have it. Yep. I'd say a healthy situation, you know, hopefully, you're not growing, you know, know your team unless there's like a specific reason like like a business reason to be like yes we need a data engineer because you know maybe you had a data analyst because i think a lot of people start with a data analyst you had a data analyst they've been building all this nice stuff but now it's getting hard to maintain right because it's like okay they've kind of got these three or four reports or four or five reports they're having to manually create them it's taking a long time.
Starting point is 00:44:45 Is there a way we can automate this? And is there a way we can justify, you know, hiring 150 to 200K person to do that, right? Like, does it actually add that to our bottom line? Or does it still just make sense to have this data analyst kind of manage it? Right. So I think that like having like a healthy team would have those discussions. And they wouldn't just be like, we need to hire a data engineer because that would solve the problem it's like well these reports are only saving x amount it might not make sense in the long term so something along those lines yeah yeah makes
Starting point is 00:45:14 sense john thoughts yeah i think when you get when you see like the infrastructure data engineering stuff as a productivity driver i think typically for more than one analyst like maybe just one analyst but more than one analyst so like man we have teams of analysts doing x y and z like every single week and then they have to go through the mental exercises like of like cool what if they did less what if we do we actually need this report like all those things that need to happen prior to like oh no we need this it drives value here's why and the business goes through that exercise and then they get to the point where like okay i think we need data engineering help it's enablement for these analysts they'll be more productive it's useful i think
Starting point is 00:45:58 that's a really good exercise whereas like to what you're saying versus like oh like yeah we should hire data engineer we need a data engineer we need a data warehouse we need this we like we need ai right like all the cargo culting yes yeah but i think that process is super helpful and then the finance thing was interesting too is the safest data teams if you want to if you want job security be on a data team that reports to cfl okay yeah but if you want like to work on really cool stuff and you want job security, be on a data team that reports to a CFO. Okay. But if you want to work on really cool stuff, because the CFO is going to think typically, not all CFOs think more accounting, right?
Starting point is 00:46:34 It's going to think more cost. And the good news is CFOs in charge of the budget and usually they stick up for their people, which is good news. But you might not get to work on the most interesting right things and your team is going to be small and you're going to have to work hard yeah um that yeah stereotypical but i think that's like kind of the general yeah that's interesting okay oh sorry go ahead ben no i think it's super interesting that's kind of. Okay. We're close to the buzzer here, but interested, you know, just changing gears a little bit with the last couple of minutes we have for both of you, what are some, you see a ton of
Starting point is 00:47:17 different companies working on a ton of different stuff. What's like one of the most fun, cool projects you've seen recently, you know, or done recently with a company? John, why don't we start with you and then Ben can take us home. I think at least in the e-comm space, I haven't gotten to do a project on this yet, but I'm really excited about search. We got to talk to a really neat company, Marco, that's working on this space and like the biggest problem like data related problem for ecom in my opinion is this discoverability thing like if you need a part type in a part amazon works great if you're like not sure what you're looking for the search experience is really difficult and and the only way discovery works well is if you have a really small skew count
Starting point is 00:48:02 so if you only sell like 10 things then like that's fine it's easy yeah i think that and then like incorporating data into search yeah and like search intent and signals i think that's like a really interesting space but i haven't gotten to do one of those yet but we'll see all right then yeah you know i think a lot of my projects end up being migrations, which aren't necessarily boring, but they're not the most thrilling. Like last year, I was just proving out for one client who was spending like upwards of like, I think it was like $35,000 a month on their infrastructure. Just kind of proving out a simpler version and helping them move to that. Which, you know, it's cool to always hear those numbers and hear the reductions that you can do in that regard right like okay this is totally possible to reduce let's do it recently like
Starting point is 00:48:50 this is just like more of just i think kind of a nice project i like it when it has like this realness to it again thirty five thousand dollars to you know bringing it down to like ten thousand dollars that's real i think the other thing that's that was real was like i i have this client i've had them for a while where we work off and on. And they always have kind of interesting ideas. And this most recent one was like, they're basically a logistics company, like they deal with like busing and like, like people rent them for various reasons. And they're like, Hey, one of the things that we do is like, during the summer, we do this kind of like specific sets of bus routes. And one of my employees essentially has to wake up really early to like,
Starting point is 00:49:28 you know, we have all the bus routes, we have all the pickup stops and has to like plan that out. And that takes, you know, they wake up at 1am just to manually do this whole process. And I don't want them to ever quit because I don't think anyone else can do it.
Starting point is 00:49:41 We like try to automate like even 70 of it and so basically we've kind of developed a system to like just automate that process and that's been really cool because again like we are in the end saving someone from having to wake up at 1 a.m to kind of develop this whole thing and it you know just it feels good in that regard so that's. So it really isn't like a complex ML model. It really is just like a rules engine that we created. Part of the client was like, we really want to go down this like Jenny I route. And I was like, I don't think it's going to work.
Starting point is 00:50:17 Like maybe, but I know we can definitely get something to work. Maybe in more of a rules engine kind of fashion. So we went down that route. So yeah, I think that's always kind of cool. Makes me think of another kind of instance where like we ended up doing a migration that helped avoid some analysts having to wake up like on Saturday and Sunday to do this one report because they have to report it every day. So anything like that, it's always kind of kind of cool just to help someone out that
Starting point is 00:50:42 has some real problem like that. Yeah, I love it. Well, really quickly before we hop, remind us where we can find your information, where listeners can connect with you, see all your content. Yeah, I mean, you can look up CL Data Guy. I'm on YouTube, Substack, LinkedIn, probably a few other places. But yeah, you can pretty much find most of my content. So if you want to watch videos on like becoming a data engineer or even some more specific topics like data modeling, I've got a few pieces on that. And same thing in the sub stack.
Starting point is 00:51:12 I've got a pretty good plethora of content that ranges from beginner content to, you know, organizational kind of how you should set up your organization and things like that. So yeah. That's great stuff. We read it all the time. Well, Ben, thank you so much for joining us on the show. Great conversation. I learned a ton, tons to think about. And we will have you back on again soon now that you are a multi-time guest. Yeah. Yeah. Thank you. Thanks so much. I appreciate it.
Starting point is 00:51:39 We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, ericdodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
