The Data Stack Show - 226: Building Trust in Marketing Data: An Engineer's Guide to Attribution Architecture with Lew Dawson of Momentum Consulting

Episode Date: January 29, 2025

Highlights from this week’s conversation include:
Lew’s Background and Journey in Data (1:06)
Attribution Challenges (2:16)
Attribution War Stories (8:09)
Defining Attribution (12:32)
Complexities of Attribution (16:08)
Multi-Touch Attribution Challenges (21:31)
Campaign Creation Difficulties (23:27)
UTM Parameters Explained (26:01)
Challenges in Data Extraction (31:17)
Transforming and Merging Data (36:28)
Behavioral Data and Identity Resolution (40:29)
Hierarchical Structure of Campaigns (44:03)
Challenges of Data Consistency (49:38)
Mitigating Freeform Data Issues (52:21)
Creating Unique Join Keys (55:30)
The Importance of Defining Requirements (58:42)
Final Thoughts and Takeaways (1:00:39)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Hi, I'm Eric Dodds. And I'm John Wessel. Welcome to the Data Stack Show. The Data Stack Show is a podcast where we talk about the technical, business, and human challenges involved in data work. Join our casual conversations with innovators and data professionals to learn about new data technologies and how data teams are run at top companies. Welcome back to the Data Stack Show.
Starting point is 00:00:34 We have a special guest today, Lew Dawson, from Momentum Consulting. Lew, you have such an interesting background and have done lots of different things. We met when you were a RudderStack customer, now you're a RudderStack partner. And so you and I have actually talked many times about one of our favorite subjects, which is attribution and all of the related data and reporting stuff. And so I am pumped to spend a whole hour talking with you about that. So welcome to the show and give us just a high level background of your journey in data. Yeah, thanks, Eric.
Starting point is 00:01:05 Awesome to be on the show. Thanks for letting me come on. In short, my background real quickly. I started writing code back in the late '90s, websites. So I got started really early, loved it, and been doing it over 25 years now. Got started early in the data warehousing space, spent a long time doing that. Then moved over to the marketing space and doing early days of MarTech and implementing a lot of MarTech technologies from scratch for companies. And did a little cybersecurity in there, and back to really solving MarTech full time. That's the niche I found that businesses really need help with and really can use my consulting services for: how do you really implement a proper and great marketing, MarTech ecosystem? So that's where we are today.
Starting point is 00:01:57 And that's how I got here. Awesome. So Lew, you were talking before the show about attribution. And we're going to dig deep today. We're going to be pulling out wires, like, you know, where did that go? It's going to be fun. Yeah. So what attribution topic are you most excited to jump into? Oh, man. Attribution is a deep and wide topic. I think this one interests me immensely because it's a hard business problem and a hard data problem to solve.
Starting point is 00:02:27 So it's just, it touches every facet of a business and every facet of data from coordinating with leadership, product, marketing. So yes, like you have to deal with people. It's scary, right? All the way to those really scary, like down in the basement engineers. Then, you know, talking about the data side for a second, you have to figure out how do I model my data? How do I make sure my data is accurate? And how do I accurately represent it to the people who care about that data so you can make good marketing decisions. And it's just a cycle that continues over and over. And hopefully, if done right, you optimize your ad and retention ecosystem, and you keep
Starting point is 00:03:14 getting better and better and better. And you continue to grow conversions by using that data, that attribution data. How hard it is to get there. You made it sound so simple. We're going to break it down today. So let's dig in. Yeah, let's do it. Lew, I'm so pumped to have you on the show. And I'm kicking myself because we talk, we have talked a lot, you know, I guess over the last year plus. Like a week, a week, every week you talk. Every week, every single week. And just somehow I haven't invited you on the
Starting point is 00:03:46 show. I've been keeping the secret of our conversations, but now we're going to expose that to the world. So you gave us a brief overview in the intro, but go back maybe just a couple of roles. So we met when you were doing data and MarTech stuff at Allbirds, who's a RudderStack customer. And so that's how we met. That's how we connected. We've maintained our friendship. Now you're doing consulting. So go back a couple of roles,
Starting point is 00:04:13 maybe prior to Allbirds and tell us kind of about that story. And then an overview of your consultancy, Momentum, and the types of projects that you work on. Yeah, of course. In short, I really got started in the entire data ecosystem back before I got out of college. I worked at Teradata for a long time.
Starting point is 00:04:33 The data warehousing company, probably a lot of viewers are familiar with it to some degree. And they were one of the primary vendors at that time for data warehousing. And so I got a lot of exposure to data warehousing there, large-scale processing of data. And then somehow, I don't recall the details, doesn't really matter, I moved over to Intuit for a while and early on was tasked with rewriting the personalization engine
Starting point is 00:05:02 on the marketing website. And so a lot of that was: how do we optimize what the customer sees on the marketing website? So the engagement part of it. When they come, how do we really optimize for conversion when they get here? So how do we get them into the product? And so that was my big, big exposure and my big realization that this is a cool technology. This is a cool area to focus on. And I like it. Like, it's really interesting, interesting problems to solve.
Starting point is 00:05:35 So built that for a while. Then, like I mentioned, cybersecurity for a while. I've always been interested in security. That's less interesting on this show, probably. And then ultimately ended up at Rudderstack through an acquisition. Sorry, not Rudderstack. My apologies at Allbirds.
Starting point is 00:05:52 And I ended up on the data team. And that was early on. We wanted to implement Customer 360 so we could improve our acquisition and retention campaigns, but especially retention. And so we developed out,
Starting point is 00:06:07 well, we partnered with RudderStack early on. I think y'all were a really, really early stage startup at that point, if I remember correctly. I think we were one of your earlier customers. And so we worked with you on a lot of stuff. I think if I remember correctly, some of what we ended up implementing fed directly into
Starting point is 00:06:25 some of your requirements so that, you know, we built off of each other. Yeah. And ultimately, awesome relationship. Yeah. Yeah. Ultimately that cultivated into a somewhat successful Customer 360 at Allbirds. And then me realizing that I really enjoy doing this for a lot of different customers and I enjoy the data space. And that got me to consulting for multiple different companies doing a myriad of different things in the MarTech space. But I do always love talking about acquisition because it's such a challenging problem. Yeah. And so, yeah, that's what I'm doing today with Momentum Consulting: anything MarTech related. I do other stuff outside as well, but it's generally marketing-focused MarTech. And, you know, the niche I've really carved out for myself, in short,
Starting point is 00:07:10 is a combination of providing solutions or providing a strategy for how to implement MarTech for folks from marketing and product all the way up to leadership, so communicating with them, getting their requirements, etc., to occasionally actually implementing solutions, working with them, either me or a team of people working to implement solutions. So that's what we do at Momentum Consulting. Love it. Okay, I want to start out, there's so much to cover, but I want to start out with, for both of you, this question is for you too, John, maybe brief anecdotes about, like, an attribution war story. Okay. That was either wildly successful or a huge failure. So either one, but it needs to be sort of, you know, on either
Starting point is 00:08:02 end of the spectrum. So John, why don't you go first? So, attribution war story, huge win or huge failure? Oh, I'm definitely going huge failure because it was the most fun. It was kind of a two-part failure. Those are kind of more frequent. Yeah.
Starting point is 00:08:18 Those come to mind more quickly, too. This was a fun one. It came to mind during prep, actually. And so you picture yourself, this was several years ago now. The board meeting, I'm sitting at the time in an IT spot. I eventually started managing marketing and IT. But I'm sitting from an IT spot, had a marketing leader in there, board meeting, presenting. And they're presenting just the overall performance, acquisition performance, and talking through that. So presenting the return, you know, the ROAS, the return on ad spend, super common metric. And they're saying it's so good, like, things are great. It's 800 percent
Starting point is 00:08:57 return on ad spend, which is quite high. That's, yeah, quite high. And my data brain starts churning a little bit. Like, you know, go forward from there and fast forward a little bit: I ended up taking over that group and digging in really deep on the attribution and all of that. And we found two major problems, one of which I think was already there and one of which I think we created. The first one was the most obvious problem, but it happens a lot: conversion events were firing twice. So that eight was a four. And that is a massive financial difference if you're trying to understand your ROI on ads. And your willingness to deploy budget. Yeah, right.
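A minimal Python sketch of the double-firing problem described here, using made-up numbers. The event shape (`order_id`, `revenue`) and the spend figure are assumptions for illustration, not data from the episode; the idea is only that deduplicating conversions by order ID before computing ROAS turns an inflated 800% into 400%.

```python
from typing import Iterable


def roas(revenue: float, ad_spend: float) -> float:
    """Return on ad spend as a ratio (8.0 means 800%)."""
    if ad_spend <= 0:
        raise ValueError("ad_spend must be positive")
    return revenue / ad_spend


def dedupe_conversions(events: Iterable[dict]) -> list:
    """Keep only the first conversion event seen for each order_id."""
    seen = set()
    unique = []
    for event in events:
        if event["order_id"] not in seen:
            seen.add(event["order_id"])
            unique.append(event)
    return unique


# Hypothetical events: every order fired its conversion event twice.
events = [
    {"order_id": "A1", "revenue": 400.0},
    {"order_id": "A1", "revenue": 400.0},
    {"order_id": "B2", "revenue": 400.0},
    {"order_id": "B2", "revenue": 400.0},
]
ad_spend = 200.0  # hypothetical spend for the same period

reported = roas(sum(e["revenue"] for e in events), ad_spend)
actual = roas(sum(e["revenue"] for e in dedupe_conversions(events)), ad_spend)
print(f"reported ROAS: {reported:.0%}, deduplicated ROAS: {actual:.0%}")
```

In a real pipeline the dedup key would need to be whatever uniquely identifies a conversion in the source system; an order ID is only the most common candidate.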
Starting point is 00:09:47 So that was like an early on find of like, ooh, this is not good. And then the second one, which was just a bizarre one and was hard to find. So this was a B2B site. We had some larger orders, but not every order was large. So there was some bizarre bug where orders over a thousand dollars didn't get captured correctly. It had something to do with, like, placement. So, you know, a typical day, we would have several over a thousand, but not a lot. And it was just off. And it was the hardest thing to find because, you know, odds are, you pick a
Starting point is 00:10:25 random order, pick a sample, it's not over a thousand dollars. Yeah. There was enough to where it was a big problem overall. So those are two attribution data challenges where, yeah, it was tough. All right, your turn, Lew. Yeah, I could think of, like John was saying, the failure comes to mind the quickest. And the one that comes to mind immediately is attribution, specifically conversion. Well, attribution stopped working, specifically conversion stopped firing in a lot of cases.
Starting point is 00:10:57 And no one noticed this for a while, right? So you basically see a massive drop-off. And theoretically they're depending on this data in order to make decisions on how to reallocate ads. But for some reason, there's a failure on both the data side and the marketing side. Data side: we didn't notice, we weren't alerting on it. And marketing side: were you guys actually using the data and paying attention to the data? How did you not notice a massive drop-off? Right. So that's definitely the one that always sticks with me. It's like, you really need robust alerting and monitoring mechanisms on data, which is one of the many, many problems for acquisition you have to solve. Yeah, yeah, totally.
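One simple form the alerting mentioned here could take is a check that compares the latest day's conversion count against a trailing average. This is a hedged Python sketch; the function name, the 7-day window, and the 50% threshold are all assumptions chosen for illustration, not something described in the episode.

```python
from statistics import mean


def conversion_drop_alert(daily_counts, window=7, threshold=0.5):
    """Return True when the latest count is below threshold * trailing mean.

    daily_counts is ordered oldest to newest; the last item is "today".
    """
    if len(daily_counts) < window + 1:
        return False  # not enough history to form a baseline
    baseline = mean(daily_counts[-(window + 1):-1])
    latest = daily_counts[-1]
    return baseline > 0 and latest < threshold * baseline


# Hypothetical daily conversion counts: a healthy week, then a collapse.
history = [100, 104, 98, 101, 97, 103, 99, 12]
if conversion_drop_alert(history):
    print("ALERT: conversions dropped sharply versus the trailing week")
```

A production check would likely account for weekday seasonality and run per channel, but even a crude threshold like this would have flagged a conversion event that silently stopped firing.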
Starting point is 00:11:47 Okay, well, let's start our deep dive. And we're sort of at the edge of the hill here. And I'll nudge the car towards the slope. Towards the precipice, yes. Okay, I want to talk about why attribution is hard. But Lew, can you just give us a high-level definition of attribution? What is the business problem that you're trying to solve using data? Because I think this is probably something that a lot of our listeners have exposure to, but perhaps some of them don't, and the levels of exposure may differ, and it can look so different at different
Starting point is 00:12:24 companies. So kind of level set us with just a really high level definition. What problem are we trying to solve when it comes to the subject of attribution? Yeah, I think this is actually one of the challenges, defining this. This is like one of those very early challenges among the many challenges of attribution. But in short, attribution is taking all of the traffic that you receive into your engagement properties, so where the customers are coming to actually do their final conversion. So taking e-commerce, for example,
Starting point is 00:12:56 like a website that's selling things. You want to understand, if they converted, so they bought a product, they checked out: where did they come from? You want to attribute where they came from to an order to understand essentially your customer acquisition cost, understand how well your ads are performing. So you mentioned earlier ROAS, John, things like that. You want to understand at the end of the day: how efficiently am I spending my dollars? Number one. And how well are my customers converting across my various channels? Number two. And then tangentially, number three, it's like: how well am I retaining customers across different channels? So that's the highest level. Last thing I'll say is it's a challenging problem because every business is a little different in how they want to look at it when you dig into the details, and then further down
Starting point is 00:13:54 different businesses have different stakeholders with different weight in that. Giving you a slightly more specific example: at some businesses, let's say the acquisition team is kind of the driver, so their leader has more weight than the engagement or the retention leader. Or let's say the engagement leadership has greater weight; they might care more about conversion. Right. So especially at larger companies where it's like KPI-driven development, let's say. So people care about getting promotions. So they care about boosting their KPIs. Yeah.
Starting point is 00:14:35 They're going to potentially care more about prioritizing their KPIs and boosting their metrics. So like conversion and engagement versus maximizing revenue. So that's just, I think that's one of the challenges of just defining the problem is like, what are you trying to optimize for? What are you trying to measure? So I think that's such a good point,
Starting point is 00:15:01 but let's, okay. And I think your definition is great. You have customers coming, you have customers coming through channels. You use the example of an e-commerce website, but it could be a store. Actually, more and more, you have e-commerce companies that started online who are actually launching brick-and-mortar presence.
Starting point is 00:15:16 You have these channels, and you want to know we're trying all of these things to get more people to walk into the store, to come to our website, and then ultimately make some sort of purchase. So when I hear that definition, I think before I actually had to face this challenge, it's really easy to think, okay, I'm pretty savvy with technology and with data,
Starting point is 00:15:43 and so we have a set of channels, and so we need a measurement mechanism. We need to see the conversion. I'm pretty good at math. I feel like I can tackle that. That is not untrue, but I think
Starting point is 00:16:00 it's easy to start out with an idea of like, okay, that doesn't seem like that hard of a problem. And it actually turns out to be a very difficult problem. But why is that? Break down for us the different dimensions of why actually putting that math problem together is really challenging.
Starting point is 00:16:25 Because there is an entire multi-billion dollar industry of software focused on this, and that doesn't even include all of the effort and time and compute that goes into companies that are hand-rolling this on their own stack and their own infra. Yeah, absolutely. It's a multifaceted challenge, like I said. So I'll keep the, like, from the
Starting point is 00:16:47 business perspective, all the people involved is one challenge. Just to give you one super quick example, and we can dig into this later: you need to structure campaigns a certain way, like the wording, how you define them, etc. So, right, that's a people problem, but that then becomes a data problem. So then getting to the data side, there's massive amounts of data challenges to actually make this work. So again, using that same example, you get the data on the other side. Well, what if the campaign name is not the same every single time, even when it comes from the same source, right? So what if someone's browser mangles it? Well, now you can't attribute without additional logic. You can't attribute 100% accurately every single person coming in
Starting point is 00:17:38 and every single conversion, right? So data challenges are, I don't want to say immense here, but there are a lot of them and they're complex. So basically, from the data side, there are a few challenges. So I just highlighted data accuracy: getting the data in fully accurately and correctly. So that's acquiring it, transforming it, and spitting it out correctly. Yep. Then, we talked about getting the data in; part of it is just generating the data. So on your engagement portion, so on your website, your mobile app, it's: am I even generating the data necessary to track where someone came from and also what they purchased? Right. So again,
Starting point is 00:18:27 I came in, I checked out on my Shopify cart. How do we get the data that says I, Lew, purchased this product and I came in from these channels? How do I then merge that data with two things: session data, so behavioral data, and then also all the ad spend data that I then pulled in? How do I merge those together to say, oh yeah, for Lew, I spent 30 cents showing him an ad, I spent 10 showing him that ad? Right. So you have to connect all that data. There's sort of a huge data connection problem. That's way more complex than it seems on the surface. Next, there's a "what do I do with that data" problem. So it's, that's cool. Like you've connected it.
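The merge described here can be sketched in a few lines of Python. Everything below is a simplified assumption for illustration: the record shapes, field names, and the choice to credit each order to the campaign on its own session (a single-touch join) are hypothetical, and real platform schemas differ considerably.

```python
from collections import defaultdict

# Hypothetical, simplified records; real schemas vary by platform.
sessions = [
    {"session_id": "s1", "user_id": "lew", "utm_campaign": "spring_sale"},
    {"session_id": "s2", "user_id": "lew", "utm_campaign": "retargeting"},
]
orders = [{"order_id": "o1", "session_id": "s2", "revenue": 120.0}]
ad_spend = {"spring_sale": 300.0, "retargeting": 100.0}  # campaign -> spend


def revenue_by_campaign(orders, sessions):
    """Credit each order to the campaign recorded on its own session."""
    campaign_of = {s["session_id"]: s["utm_campaign"] for s in sessions}
    totals = defaultdict(float)
    for order in orders:
        campaign = campaign_of.get(order["session_id"], "unattributed")
        totals[campaign] += order["revenue"]
    return dict(totals)


def campaign_report(orders, sessions, ad_spend):
    """Join attributed revenue with spend to get a per-campaign ROAS."""
    revenue = revenue_by_campaign(orders, sessions)
    report = {}
    for campaign, spend in ad_spend.items():
        earned = revenue.get(campaign, 0.0)
        report[campaign] = {
            "spend": spend,
            "revenue": earned,
            "roas": earned / spend if spend else None,
        }
    return report


report = campaign_report(orders, sessions, ad_spend)
for campaign, row in report.items():
    print(campaign, row)
```

The join only works when the campaign value on the session matches the campaign key in the spend data exactly, which is why inconsistent campaign naming becomes a data problem downstream.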
Starting point is 00:19:17 You now have data, but having the presence of data alone doesn't help you. Now you have to figure out what do I do with that data in order to give me data that I can go take action on to evolve my business, to improve my conversion, to improve my revenue profit. And that's a challenge on its own. It's like, how do you first figure out what's important? And then what hands do I get that into so the correct decisions can be made, so that we can evolve these campaigns, so boost the good ones, kill the bad ones. And then lastly, again, it's a people problem. It's like, how do I coordinate everyone to do all the things correctly across all these technologies we just talked about
Starting point is 00:20:08 to make sure that nothing breaks and that everything is done in a normalized enough fashion that we can continue to do this over and over? Did I miss anything? No. Lew, I want to expand on the people problem of this because I think this is really fascinating. Like, a hundred percent, I think you hit
Starting point is 00:20:32 all of the major components there. But there's this additional people problem: people have to do the right things, like you said, name the campaigns the same, you know, every time for the same campaign. So there's people problems like that. There's also this people problem of, at its fundamental level, we are taking this big pile of money, of revenue for the company, and trying to figure out who gets credit for what. And that creates drama in most companies, right? Like you said, if you're driving hard on, all right, you're the Amazon channel or you're the inside sales team or you're the whatever,
Starting point is 00:21:11 each of them wants their fair share, their fair credit or attribution for whatever they contributed. Many of them have financial incentives. That adds a whole other dimension to this problem besides extremely big technical problems. Yep. Yeah. And that's especially prevalent.
Starting point is 00:21:34 Again, it's like you figure out what you're measuring. That adds a whole layer of problems, especially when you get into multi-touch attribution, which we'll talk about later in greater detail. But in short, it's like people get partial credit. Yeah, that becomes a huge problem when a business partner or stakeholder decides, I disagree with that. Like, I think I should have gotten more credit for that one.
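To make "partial credit" concrete, here is a small Python sketch of linear multi-touch attribution, one common model in which every touchpoint on the path to a conversion receives an equal share of the revenue. The episode does not say which model is being referred to at this point, so the model choice, the function name, and the channel names are assumptions for illustration.

```python
def linear_attribution(touchpoints, revenue):
    """Split revenue equally across every touchpoint in a conversion path.

    A channel that appears more than once accumulates one share per touch.
    """
    if not touchpoints:
        return {}
    share = revenue / len(touchpoints)
    credit = {}
    for channel in touchpoints:
        credit[channel] = credit.get(channel, 0.0) + share
    return credit


# Hypothetical path for one $100 order.
path = ["paid_social", "email", "paid_search", "email"]
credit = linear_attribution(path, 100.0)
print(credit)
```

Swapping in a different weighting (first touch, last touch, time decay) changes each channel's share without changing the total, which is exactly where the stakeholder disagreements come from.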
Starting point is 00:21:56 Yeah, there's all sorts of people issues here. I think they're almost as prevalent as the technology issues. Okay, let's dig into the tech stack a little bit. And Lew, let's walk through the sequence that you discussed. Because I want to dig into the people side more a little bit later. Because I think that's arguably, to your point, probably.
Starting point is 00:22:29 If you can solve the people side, then that actually paves a pathway for the tech side. But let's talk about the stack really quick, I think just to orient everyone. So we talked about collecting the data. Are you even collecting the data? So let's start there before we even get to accuracy. So where is this data coming from? What are the data sources, and what mechanisms are you using for this capture? If you're going to go in and put together a strategy, just describe the types of, you know, pipelines, I guess, or data sources. But actually, take it up even one level from that, crazy enough. And it's, you know, it feels like table stakes, but having been in a number of the platforms, I have to say this, like,
Starting point is 00:23:08 even being able to create those campaigns, it comes before that, right? So it's like, yeah, it sounds stupid to say, but some of those platforms are actually a little bit on the harder side to even create
Starting point is 00:23:24 campaigns successfully, to get them started. And you're talking about someone going into an advertising platform. You have to create some entity that's a campaign. It has to target some subset of users. You have to have some, you're sending something,
Starting point is 00:23:41 text or images or something that's going out to reach these people. It has to go to a valid landing page. Theoretically, it should be a tailored landing page. But, like, it is the easier part, but nonetheless, this still is a barrier in itself. Someone who's new to this whole paradigm of, let's say, an e-commerce website, that's the first barrier that they have to hop over: how do I even run an ad? And that on its own would take time to learn one platform, let alone, you know, Facebook, Google; there are a number of different platforms, right? So I would say that's first.
Starting point is 00:24:21 Yeah. Speak to the listeners who are on the other end of the pipeline where the campaigns and landing pages are generating data, but they're on the other end of the pipeline, so they're seeing this come through, and probably see it as tables of data. Speak to them a little bit about what are the things that you would say, here are things to keep in mind
Starting point is 00:24:52 about that process of even, let's just call them assets. You have to have some sort of assets that are actually going to generate this data. There's a campaign that's being served, someone's clicking on something, they land on some landing page or something like that, right? Which sort of ultimately generates the data. What is the data professional on the receiving end of the pipeline? What are the main things they need to know about that whole process?
Starting point is 00:25:17 You're referring specifically to, like, all that data flowing in on the other end? Yeah, totally. Understanding. Okay. There's a number of things that have to be orchestrated on that end. Let me know if this doesn't completely answer your question. Yeah, yeah. But there are a number of different areas that have to be orchestrated together to get all that data right, which we'll talk about in a second. But effectively, that data only flows in if you enable the campaigns, and that data only flows in if, further, you are collecting either behavioral data manually or your platform is in some fashion collecting
Starting point is 00:25:55 the data, especially UTM params that are in the URL. Yep. And really quickly, just for those who don't know what UTM parameters are, give us a quick breakdown on UTM parameters, because I think that will become important later in the conversation. Yeah, it's kind of an antiquated paradigm and technology at this point. But in short, well, two things. So query params in a URL: you'll see, after the question mark, key-value pairs. So the key is some sort of text, and then an equal sign, and then you'll see more text, and then possibly an ampersand. You'll see that over and over repeating.
Starting point is 00:26:37 That's a query param. That gives you the ability to essentially add additional data and/or metadata that modifies behavior of the experience the customer is seeing in a lot of cases, or just tracks data. So UTM is Urchin Tracking Metrics, I think. I can't remember the M. But nonetheless, it's a company who kind of, I would say to a degree, was the initial starter of a lot of
Starting point is 00:27:06 what we would say is modern analytics. So they were the company that developed what is now Google Analytics. Google actually bought them and turned it into Google Analytics. So in short, there's a specific set of UTM params. So UTM name for, like, the campaign, or is it UTM campaign? It's UTM campaign.
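The key-value structure described above is easy to see in code. This Python sketch pulls the five standard UTM parameters (`utm_source`, `utm_medium`, `utm_campaign`, `utm_term`, `utm_content`) out of a landing-page URL using only the standard library. The URL and its values are made up for illustration.

```python
from urllib.parse import parse_qs, urlparse

UTM_KEYS = ("utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content")


def extract_utm(url):
    """Return the UTM parameters present in a URL's query string.

    Keys that are missing (or present with an empty value) are omitted,
    and only the first value is kept if a key is repeated.
    """
    query = parse_qs(urlparse(url).query)
    return {key: query[key][0] for key in UTM_KEYS if key in query}


# Hypothetical landing-page URL as it might arrive from an ad click.
url = (
    "https://example.com/landing"
    "?utm_source=facebook&utm_medium=paid_social"
    "&utm_campaign=spring_sale&color=blue"
)
print(extract_utm(url))
```

Note that `color=blue` is ignored: it is an ordinary query parameter, and nothing in the URL itself distinguishes it from the UTM keys other than the naming convention.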
Starting point is 00:27:28 There's a few of those, and those are standard, and those are used to track various dimensions of a specific campaign. Yeah. So those ideally come in on every channel and every time a user comes from an external site or an external entity into your engagement experience. I say ideally because that doesn't always happen due to a myriad of reasons, and it's yet another reason why this is challenging. Yeah, I think that's one of the
Starting point is 00:27:59 fascinating things. You know, I mean, query params are used for all sorts of things in software, right? I mean, it can filter a list, it can do whatever, right? But, you know, I think actually when Urchin decided to use that back in the day as essentially a way to capture metadata about the source of where a user's coming from, it was a very elegant way to solve a pretty tricky problem in a ubiquitous manner. Then Google Analytics as a free tool gets worldwide mass adoption as the go-to way to track web analytics, which means UTMs, for better and now probably for worse, are cemented as a way. So you have five dimensions as key value pairs
Starting point is 00:28:51 that drive marketing reporting for most of the world. And they're five arbitrary dimensions. They're completely made up. This is something I didn't know, but they're completely made up. You can type whatever you want, and you can have as many as you want. But, you know, like you said, because of the Google Analytics adoption, yeah, these are the five that somebody at Urchin, like
Starting point is 00:29:14 you said, like 20 years ago decided, and we've kind of standardized on that. Yeah, I think the other part to that, like you were saying, John, is in addition to people being able to decide what goes in there, each platform suggests you use certain UTM params differently, too. Yeah, that's right. Yeah. To make it extra challenging. So, like, here's how we generally do it on here, but you can do it whatever way you want. Yeah, it is the worst kind of standard because it's completely unenforceable and interpreted differently, right? So while there is a standard as far as these five things, people use them so wildly differently, it's almost not worth having this. Right, right. Well, and that's kind of why I wanted to, like,
Starting point is 00:29:55 speak to that a little bit for the person who's on the receiving end of that, because my gut is to say, like, come on, we have, like, five. Okay, actually it even reinforces, like, we have five dimensions here, this can't be that hard. But it actually is. Like, yeah, it is pretty tricky to actually get things tight, even just from tagging those five dimensions as metadata. I think at the end of the day, this foreshadows a conversation we'll have later, so we'll build up a little bit here, but there are ways that you can do this. Like, you can make it work across all these paradigms. And we'll unpack some of those
Starting point is 00:30:32 just to let the reader know like some better ways. Yes. Ooh, yes. Ooh, I like that, Lew, foreshadowing. Yes, actually, Lew, I'm excited. You have some immensely helpful methodologies here to help overcome that. Okay, so then we have to collect the data.
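As a generic illustration of why freeform UTM values are hard to join on (and not the methodology teased in the conversation), here is a Python sketch that normalizes campaign names before they are used as keys. The specific rules, URL-decoding, lowercasing, and collapsing separators to underscores, are assumptions; a real team would pick rules to match its own naming convention.

```python
import re
from urllib.parse import unquote_plus


def normalize_campaign(raw):
    """Decode, lowercase, and collapse separators so variants compare equal."""
    value = unquote_plus(raw).strip().lower()
    value = re.sub(r"[^a-z0-9]+", "_", value)  # any run of separators -> "_"
    return value.strip("_")


# Hypothetical variants of one campaign name as they might arrive in URLs.
variants = ["Spring%20Sale", "spring-sale ", "SPRING+SALE", "spring_sale"]
print({normalize_campaign(v) for v in variants})
```

Normalization like this only papers over formatting drift; it cannot recover a campaign name that was typed differently in the first place.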
Starting point is 00:30:47 And so you have to create the campaigns and the assets, then we're collecting data. And so you're using pipelines to do that. So there's probably behavioral data and structured data that's coming in. Well, yeah, so collecting the data. There's kind of two phases to collecting the data. So it's getting the data out of the source system. So out of Google Ads, Facebook Ads, which again, this whole thing is crazy,
Starting point is 00:31:14 but there's a myriad of challenges there. So again, everyone does it differently, number one. So the schema is a different data structure, completely different. And then number two, some of these platforms make it really challenging to get the data out, both because the naming is convoluted and complex, but also throttling. Facebook is a great example of this. Their
Starting point is 00:31:38 paradigm of like how much data you can get out within a time frame is completely dependent on your audience size like the the the audience that you reach in facebook so like the larger audience you reach the more data you can get out at a time which makes sense when you when i say that loud at a high level but it it creates some pretty tough challenges when it's like yeah we're always getting throttled like we're so far behind collecting the data. So that's one thing, just like getting the data out of the source system. And then the other challenge, which is a little bit easier,
Starting point is 00:32:15 but it's getting that data then into a place where you can transform it, where you can do this actual attribution. Generally, that's going to be a data warehouse. Sometimes people favor data lakes, get a data lake, and then sometimes they'll do data lake to, so like S3 and to data warehouse, but nonetheless, wherever you store your data, you have to get it into there, right? Which is, we're talking some pretty large volume of data for some of these companies. Like it's not, it's not trivial.
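The throttling problem described above is commonly handled with retries and exponential backoff. Here is a minimal Python sketch of that idea. It assumes a hypothetical `fetch_page` callable and a hypothetical `ThrottledError` exception; neither comes from a real ad platform SDK, and real platforms signal rate limits in their own ways.

```python
import time


class ThrottledError(Exception):
    """Hypothetical error: the ad platform rejected the request due to rate limiting."""


def fetch_with_backoff(fetch_page, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch_page(), waiting exponentially longer after each throttled attempt."""
    for attempt in range(max_retries + 1):
        try:
            return fetch_page()
        except ThrottledError:
            if attempt == max_retries:
                raise  # out of retries: surface the throttling to the caller
            # Wait base_delay, then 2x, then 4x, ... before trying again.
            sleep(base_delay * 2 ** attempt)
```

Passing `sleep` as a parameter is only a convenience so the wait can be replaced in tests. A production extractor would likely also add jitter and honor any retry hints the platform returns.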
Starting point is 00:32:41 It's not data that'll just take, like, 30 seconds to, yeah, stream. We're talking about impressions. Go ahead. Yeah, I mean, there's also just this bad alignment with some of these companies, between your interest and Google's, Meta's, whoever's interest, as far as, like, they don't want you to get the data out. They just want you to trust. Like, they're like, oh, your, you know, ROI is this or whatever. Like, they don't really want you to dig into it. I mean, let's face it.
Starting point is 00:33:11 It's A, it's better for them because, you know, it's costly to be streaming all that data out of their system. That costs them money. And then B, for the bigger thing of, like, yeah, just trust us. Like, we'll tell you if it's going well or not. Yeah. Yeah.
Starting point is 00:33:26 That's a fantastic, like foreshadowing point too, that we'll have to touch on. It's like, yeah, well, how does, how does Facebook,
Starting point is 00:33:32 how does Google track a conversion? Are they tracking the same way? Are they tracking like every single user who came to your site? Does that count as a conversion? Like they, they say they don't, but it is a black box. And when you go and calculate some of these and you compare them,
Starting point is 00:33:48 they're wildly different. Like your calculation with your RudderStack behavioral data versus their calculations. So sometimes you question, like, is the fox guarding the henhouse? Because they're incentivized to boost the conversion you're seeing
Starting point is 00:34:07 because then it will theoretically grow their ad revenue, because you'll be like, oh yeah, I'm going to spend more. Yeah. Well, yep. So that's an interesting call, John. Good. Yeah. And then you have the attribution fighting problem too. If you've got multiple platforms you're using for advertising, multiple for retention, you've got this kind of war of, like, oh, I want to take credit for this one,
Starting point is 00:34:35 and there's some kind of retention tool saying, I'm going to take credit for it. And in reality, rarely does the number end up adding up to, say, $100. It adds up to $200. Like, well, I only got $100, but this attribution data adds up to $200. All of these can't be right.
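The $100 versus $200 mismatch can be checked with simple arithmetic: sum what every platform claims and compare it with the revenue you actually booked. A small Python sketch follows. The platform names and dollar amounts are made up for illustration, chosen only to match the $100 and $200 figures mentioned above.

```python
def overclaim_ratio(platform_claims, actual_revenue):
    """Ratio of revenue claimed across all platforms to revenue actually booked.

    A value above 1.0 means the platforms are collectively taking credit
    for more revenue than exists, i.e. the same orders are double-counted.
    """
    return sum(platform_claims.values()) / actual_revenue


# Hypothetical figures: each tool reports the revenue it attributes to itself.
claims = {"google_ads": 100.0, "facebook_ads": 60.0, "retention_tool": 40.0}
ratio = overclaim_ratio(claims, actual_revenue=100.0)
```

Here the claims total $200 against $100 of real revenue, so the ratio is 2.0. The check does not say which platform is wrong, only that they cannot all be right.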
Starting point is 00:34:52 It's just another challenge. Okay, so we're collecting data from source systems on the advertising side. We need to collect data from the website or the digital property. So Allbirds used RudderStack for that. So that's the behavioral data. This is capturing page view data,
Starting point is 00:35:09 conversion data, etc. And so you're streaming that to the data store. So a data lake or a data warehouse. Okay. So now we are with the person who's on the receiving end of that and they have probably
Starting point is 00:35:24 a lot of different tables. That's an understatement. So what do we do now? Tables both in terms of numbers, and then a lot of data within those tables. Yes. Yeah. Yeah, go ahead. Well, I'm saying, okay, what do we do now? Yeah. What do we do? Yeah. So at that point now, the data has to be transformed, which we'll unpack. And ultimately it all has to be merged together. The precursor to all that, which a lot of engineering folks especially struggle with, is, like, okay, what's the end state?
Starting point is 00:36:09 Like, what are we trying to accomplish here? Because it's very tough to actually merge the data together and figure out what we're trying to get out of this if you can't really say what the end state is. So that's usually the first step, which we'll unpack in a minute. But talking directly to your point: once you figure that out, and once you say, okay, I want to actually understand, for my website, across all the channels that we're advertising on, for example Facebook, Google, et cetera, how well are users converting on each one of those? Let's just say channel level to start with.
Starting point is 00:36:53 Keep it easy. So Facebook is a channel. Google Ads is a channel. Just to clarify, how well am I converting there? So I spend advertising dollars. People are clicking on ads. They come to my site. When we say converting, it's just like, okay,
Starting point is 00:37:08 how many people who come from Facebook actually buy something where I make money on the purchase based on the advertising dollar that I put towards the ad that they clicked on. Yeah. So I spent $10 on the ad. How much did the user purchase? Like, did they purchase, first of all? And then how much did they purchase?
Starting point is 00:37:35 Essentially, did I get more back than I put in? Right. That's ultimately the question you want to answer. Yep. Yep. And that, that then ladders up to all sorts of different interesting things. The other thing I mentioned too is like, you might want to measure conversion as the other fairly big thing.
Starting point is 00:37:52 Now I'm not a huge believer in measuring conversion because that can be gamed. We can talk about that later. But nonetheless, like those are kind of the two main things. Yeah. So basically what you have to do there, it's a transformation problem. So you have to get all that behavioral data. You have to get all of that that you've collected on the
Starting point is 00:38:10 website. So that's got UTM params, user conversions, things like that. Yep. Usually what you have to do as well is get all your order data. So that gives you your conversions, the amount that user spent. Sometimes you merge those two to a degree to make sure they align closely. So obviously, as John said early on, sometimes you can't get 100% of the data in behavioral. So that's why you'd want to merge in your actual e-commerce data, like, let's say, Shopify or whatever. Then you have to merge in your ad spending data. So we're talking Google Ads and Facebook Ads here. So you have to actually then figure out, okay, how do I normalize that data to figure out
Starting point is 00:38:51 per channel, how much did I spend? And usually this is temporal data. So you do it per day or week, month or year, et cetera. Yeah, yeah. Same with those other two, I should mention, right? And then lastly, once you've merged all that together, you have to generate data from that, like metrics, measurements. And that's, you know, like I talked about a minute ago, it's that conversion, it's that
Starting point is 00:39:21 that revenue, etc. Okay, so I want to ask two things. One of them is that I'm going to play dumb and ask about the keys that you join on at a very high level. And then the second is I actually want to circle back to your way of thinking about UTM parameters and how to solve some of the problems around that, because you have a couple of ways, and we've actually talked for a long time about some ways of overcoming some of the challenges there. But, okay, one join key, and I'm massively oversimplifying this, but I think it's fun. I think it's important to get into the details. Hopefully helpful. One of the join keys that makes sense to me is that you have behavioral data from the
Starting point is 00:40:05 website that contains the UTM values from a page view. So someone clicks on an ad, they come to the site. We'll use RudderStack as an example. As you and I talked about a ton with the Allbirds stuff, it fires a page call that goes into your warehouse, it gets flattened into a table, and there's a column that says UTM campaign from that page view that has the timestamp on that table. Then the data that comes from the source advertising systems, there's some campaign and ad, a row of data, however it is, you have to clean it probably.
Starting point is 00:40:43 Not probably, you do actually. I know that from experience. I can't play completely dumb here. You clean it up and you essentially get some clean tables that are rows of data where there's a URL that you input into the source advertising system when you deploy the ad so that when they click on it,
Starting point is 00:41:00 the user goes there. So at a very high level, you can join on UTM keys or sort of the components of the URL in order to tie like, okay, I spent this much money on this ad, and then I see this many UTMs in the behavioral data, you know, and then you can sort of correlate that to conversion. Now, what makes this really gnarly is that you have to do that on a unique user level, right? Like, because you have to tie the purchase and the page view and the conversion and all of that to like a unique user so that you can say, okay, well, this page view is associated with this user is associated with
Starting point is 00:41:42 this, like, actual transaction that has a dollar value tied to it. And so there's almost like a user reconciliation, identity resolution type element to this too, where you have to make sure that you're reconciling that cleanly from a user standpoint. Am I thinking about that correctly? Yes, you absolutely are. And there's even more to it as well. The plot thickens. So you're spot on. It actually is the identity resolution problem. And that identity is basically, we're going to say channel for right now because we're doing channel level, but it depends on what level you're doing, right? So, like, channel, ad set, ad. At each level, it's an identity resolution problem. Oh, right. Yes. Yeah, like at each entity, right? Because you have to reconcile all the different disparate data from the source system, actually, to whatever key you're
Starting point is 00:42:36 going to join on so that you can... Yeah. Yeah. So, like, taking channel, you have to do identity resolution on, what are all the channels? So in this case, theoretically, it's Facebook, Google. Then you have to figure out, okay, for each one of those channels, what are the order values that we talked about? And so your join key in the end is those two channels. But then there's the part that I was saying, there's a little bit more to it. You also have to figure out your spending in the ad platform, which again is a join key. And ultimately it's a combination of what did I spend
Starting point is 00:43:25 at a channel level, and then joining that with the other two to get channel, orders, and spending, right? Ad spending, conversion dollars, and channel. And so the combination of those three, at a high level, that's how your joining works. And again, think about that. That gets more complex each level you go down. Just touching on ad set for a second: for listeners out there who may know a little bit less, an ad set is the step below a campaign. So within a campaign, you have an ad set. And an ad set or an ad group is,
Starting point is 00:44:10 it's basically, it can be multiple ads. We'll unpack later why you'd want to do that. But for now, just think multiple ads. And so now your join key is ad set and campaign. So campaign would be like, you know, overstock sale. Or channel, sorry. Yeah, that's fine. You could have, like, you know, so you have overstock sale, but that could be a campaign in Google, a campaign in Facebook. Then you could have an ad set that's like, you know,
Starting point is 00:44:36 shirts and an ad set that's like shoes or pants or whatever that are like these sort of logical groupings. Yeah. And then you may have ads within an ad set that are like blue shirts or green shirts or something. And so you have like a pretty complex hierarchy even to try to triangulate all of that. But spend your, yeah.
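To make the hierarchy concrete, here is a rough Python sketch of joining ad spend to order revenue at a chosen level of the channel, campaign, ad set, ad hierarchy. The row layout, field names, and numbers are assumptions for illustration only. It also assumes each order has already been tied back to the ad that drove it, which is the identity resolution work discussed above.

```python
from collections import defaultdict

# Hierarchy from coarsest to finest; the join key is a prefix of this tuple.
LEVELS = ("channel", "campaign", "ad_set", "ad")


def join_key(row, level):
    """Build the composite key for a row at the requested level."""
    depth = LEVELS.index(level) + 1
    return tuple(row[dim] for dim in LEVELS[:depth])


def rollup(rows, level, value_field):
    """Sum value_field per join key at the requested level."""
    totals = defaultdict(float)
    for row in rows:
        totals[join_key(row, level)] += row[value_field]
    return dict(totals)


def return_on_spend(spend_rows, order_rows, level):
    """Revenue divided by spend for every key that has non-zero spend."""
    spend = rollup(spend_rows, level, "spend")
    revenue = rollup(order_rows, level, "revenue")
    return {key: revenue.get(key, 0.0) / amount for key, amount in spend.items() if amount}


# Hypothetical sample data echoing the overstock sale example.
spend_rows = [
    {"channel": "facebook", "campaign": "overstock_sale", "ad_set": "shirts", "ad": "blue_shirts", "spend": 10.0},
    {"channel": "facebook", "campaign": "overstock_sale", "ad_set": "shirts", "ad": "green_shirts", "spend": 10.0},
    {"channel": "google", "campaign": "overstock_sale", "ad_set": "shoes", "ad": "runners", "spend": 20.0},
]
order_rows = [
    {"channel": "facebook", "campaign": "overstock_sale", "ad_set": "shirts", "ad": "blue_shirts", "revenue": 30.0},
    {"channel": "google", "campaign": "overstock_sale", "ad_set": "shoes", "ad": "runners", "revenue": 10.0},
]

by_channel = return_on_spend(spend_rows, order_rows, "channel")
by_ad = return_on_spend(spend_rows, order_rows, "ad")
```

At the channel level the key has one component, and at the ad level it has four. The deeper the level, the more fields must match exactly across sources, which is why lower levels are harder to get right.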
Starting point is 00:44:57 And so that's your join key, right? So your join key is the combination of all those things. So at whatever altitude you want to look at. Wow. And so again, this gets back to, what does the business want to measure? What's the altitude? You have to decide that up front,
Starting point is 00:45:15 but a lot of people don't understand that you have to decide that up front. I mean, I guess you don't technically have to. You can always do it later, but to really do it well, you should decide it up front. Yeah. That may be actually, sorry to interrupt you there, Lou. Not at all. I just want to reiterate, that may actually be one of the most helpful things I've ever heard about attribution, where it's like, decide what you want up front, because there are so many ways to slice this. And altitude, I think, is a great
Starting point is 00:45:48 word for that. Like, you can go so granular and get so close to the ground with a magnifying glass, right? Or you can be at 30,000 feet, and none of those are wrong. But, like, trying to do every level of altitude is impossible. Yeah. At least a bad idea. At a minimum, rarely ever worth the effort. Right. But I think that gets to the second part exactly of sure.
Starting point is 00:46:15 You can do any altitude, but a naive, a naive individual might be like, Oh, let's just go all the way down. Like, and then we'll have the data all the way up. Sure. You can do that, but that actually is the hardest to implement it gets
Starting point is 00:46:28 it's harder to implement the deeper you go. But then also the data, it's harder to gain information that you can use to make actionable decisions the lower you go. In a lot of cases I equate this to stock trading a little bit. The more information you have, possibly the better decisions you can make, but also the worse decisions you can make. So if you're trying to pick a stock, or you're trying to pick between two stocks, it's an optimization problem, like stock A or stock B. And similarly, you're trying to pick advertisement A versus advertisement B, because you're at the ad level, you're trying to figure out which one do I do. There are
Starting point is 00:47:16 a lot of different ways, like data points, that you can decide on. It's not always going to be a straight answer, I should always go with A, or I should always go with B. Same with stock trading, right? Because stock trading is economics based, it's news based. So there's a myriad of different things you have to look at in order to actually decide, like, which ad should I boost, which ad should I kill, or should I do nothing? And so the decisions get more complicated the lower you go. Because you have more data and you have to decide across more ads: which ones do I want to keep? Which ones do I want to get out? Same with stocks. The more stocks you're looking at, the more it's like, which
Starting point is 00:47:58 ones do I trade more of, which ones do I get out of, et cetera. Right? Like, it's a Kelly criterion optimization problem, whether it's stocks or ads. You could apply it kind of the same way. Yeah. And so those are your join keys, just taking that back. And then also, if you think about it for a second, the other challenge of just generating the join keys, which I want to call out to people like I highlighted earlier, is the data is not consistent. I think that's actually one of the biggest challenges
Starting point is 00:48:30 at any level, but especially as you get lower, because the join keys get more complex. Say 100 different users came to my website through Facebook. For 95, the campaign name was correct. But the campaign had a space in it. And so for 5, the space isn't represented as %20, it's represented as
Starting point is 00:48:57 a plus, right? So now, theoretically, if you're doing a direct string match in order to create your join keys, you now have two different campaign names, even though they were the same campaign. Yeah, but the characters were different. Yeah. So that creates a whole different set of
Starting point is 00:49:25 challenges. It's basically creating standardized keys, and you have to standardize those names. You have to figure out which ones are the same, but also which ones actually are different even though they look similar. Yeah. Yeah. Well, and totally, because I think it's easy to conceive of the modeling problem. It's like, okay, multiple levels of altitude, yes, that can get complex. But if you don't assume that you're going to have dirty data, it's like, okay, that can get complex, but that's doable, right? But the dirty data problem compounds
Starting point is 00:50:04 because you have the different levels of altitude within each platform. You have all the different platforms. You have the fact that the data is actually delivered differently in all these different platforms. And because they're all different tech, the conventions can break in all sorts of different ways. And so the long tail becomes, like, absolutely insane. Well, and even if you have your team
Starting point is 00:50:29 completely aligned, marketing, data team, you know, the whole team aligned, all of your stuff is named perfectly correctly every time on every platform, which never happens. Even if that were the case, this is, like, free-form data. Any user, yeah, if you're an evil person and you want to mess with some marketing people, let me give you some tips. No, but really, any user can advertently or inadvertently, like you said, introduce a little space, any of the millions of people that may be on your website. And then all of a sudden you have two campaigns for that one little record. And so it is an unsolvable problem to get to perfect. Yep. Yeah. Or John Doe decides he wanted to do something different because he's new to the company and he doesn't really know or understand,
Starting point is 00:51:15 or he's like, I don't want to read all the material, and he names the campaign differently, or he modifies the currently named campaign because there's, like, a spelling error. Yeah, yeah, sure. Yeah, it's a million. Now you've splintered your campaign, right? Yeah, yeah. Yeah, and to highlight, John, that's a great point. Like, it's freeform.
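The space-encoded-two-ways case mentioned earlier (`%20` versus `+`) is one of the few freeform issues that can be normalized mechanically. A minimal Python sketch, assuming the raw, still-encoded UTM value is available; the function name and the lowercase-with-underscores convention are illustrative choices, not a standard.

```python
from urllib.parse import unquote_plus


def normalize_campaign(raw):
    """Collapse encoding, case, and whitespace differences in a campaign name."""
    decoded = unquote_plus(raw)  # decodes both "%20" and "+" to a space
    # Lowercase, then squeeze any run of whitespace into a single underscore.
    return "_".join(decoded.lower().split())


same = normalize_campaign("Overstock%20Sale") == normalize_campaign("overstock+sale")
```

This only handles encoding, case, and whitespace. It will not reconcile a campaign that someone renamed or misspelled, and it would also turn a literal plus sign in a name into a space, so it is a partial mitigation at best.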
Starting point is 00:51:35 It's an absolute nightmare. Yeah. Okay, so we're clearly gonna have to turn this into a two-part series because we are maybe 5% of the way through the conversation. At least two parts, if not more. One thing I do want to cover really quickly, because this is great.
Starting point is 00:51:55 I actually think we've gotten pretty deep down into the stack and into the data. But Lou, talk us through some of the ways that you mitigate some of that freeform data challenge and the inherent limitations of the prevailing five-dimension metadata methodology that is so ubiquitous because of Google Analytics. So what are some ways when you think about the system design?
Starting point is 00:52:24 And one thing I love about the way that you think about this approach that we've talked about many times is that this is sort of a holistic way of thinking about the problem both in terms of the inputs and then also in terms of join keys even, right? And sort of the way that you even think about solving the modeling problem. So just walk us through a different way to think about that that can help you move beyond being beholden to five free-form dimensions that are, you know, impossible to solve for. Yeah. Two things
Starting point is 00:52:59 before I get into that. So number one, you know, this isn't perfect. First of all, right. Like, as John eloquently put it, it's still free form. Yeah. It's impossible to get perfect,
Starting point is 00:53:12 but, I mean, this is an improvement. Number one. And then number two, I think, credit where credit's due. I've been kicking a general idea like this around for a while. And I was talking with Eric about how to do this better, and Eric mentioned his old firm had come up with a way to do this as well, and they'd come up
Starting point is 00:53:34 with a pretty good way. And it was a combination of, you know, my thinking and that conversation. So, Eric, thank you. You actually helped out a lot in this space, you and your team of folks, like you and Benji. So this is definitely not just me, right? This is far from me coming up with this. Many conversations over 30 months. Yeah. But in short, you know, it is a key, right? At the end of the day, if you think about it from that perspective. And actually, I'm sorry, one more thing super quick that I wanted to highlight before. I think what I'm about to say in a second is so important. It's so important, again, to define what you're trying to do up front, because doing this up front
Starting point is 00:54:17 will save you so much trouble and will enable you to do historical merging of your data. Versus if you don't do this until later on, it's going to be tough to nearly impossible to go back and do your historical attribution. So getting into the meat of it: to be more successful at this, and take out a lot of the, like, hey, UTM params, especially at lower altitudes, are really hard to merge together and create a merge key from, it's just: create that merge key up front, at the end of the day.
Starting point is 00:54:53 So create that merge key up front and attach it to every single campaign. So every single campaign, every single ad set, every single ad has a unique key, and it's a spaceless key, right? Like, it's a key that's going to be tough for a browser to munge. I'm not saying it's impossible, but it's going to be very tough. And you attach that to, essentially, every single ad. And that unique join key is a query param. And then there are some nuances to that,
Starting point is 00:55:26 obviously, which you and I, Eric, have talked about before, and we can unpack here. But basically, if you do that job up front, you can use that join key to skip all of the challenges we just talked about and just join on that key, right? Yep. And you're generating that usually as some sort of hash, correct? So what are the inputs to that hash? Because one interesting thing about this that you and I have talked about, Lou, is that if you limit yourself to five dimensions, well, at a base level, just from a strict technical standpoint, you only have five dimensions, and you don't want to add spaces and other things like that.
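Attaching the join key as a query param, and reading it back out of the landing-page URL in the behavioral data, might look roughly like this in Python. The parameter name `jk` and the example URL are hypothetical. The conversation does not name the parameter, so treat this as a sketch of the idea rather than the approach actually used.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

JOIN_PARAM = "jk"  # hypothetical query parameter name for the join key


def tag_url(url, join_key):
    """Return url with the join key set as a query parameter (replacing any old one)."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != JOIN_PARAM]
    query.append((JOIN_PARAM, join_key))
    return urlunsplit(parts._replace(query=urlencode(query)))


def extract_join_key(url):
    """Pull the join key back out of a page-view URL, or None if it is absent."""
    return dict(parse_qsl(urlsplit(url).query)).get(JOIN_PARAM)


tagged = tag_url("https://example.com/sale?utm_source=facebook", "a1b2c3")
```

Because the key contains no spaces or special characters, it survives URL encoding unchanged, which is the property described above. The existing UTM params are left in place, so the key supplements them rather than replacing them.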
Starting point is 00:56:11 And so practically what ends up happening is probably the best way to say it would be that the people who are creating the key value pairs, generally who are marketers, get very creative in how they package information into those five dimensions. Yeah. Well, and I think just like for people that are less technical, you're talking about key value pair and such, like it can be as simple. And I think we've done this before. I've done this before of like, hey, we're going to start at one and we're going to
Starting point is 00:56:43 put the number one in there. And then in a reference sheet, number one equals that trade show we went to that was in London. You can say whatever you want. And then you can categorize it in 12 different ways for later groupings. And then when somebody changes their mind, you go change all those categories. And it works. Yeah. And I think the key there is you change them only in your system of record, like internally. Yes, exactly. Or you augment it. Right. So like what you're asking. But you don't reuse that number again. One is toast. Like, do not reuse it. Exactly. Right. Like, there are a couple of nuances to this to unpack, and you just hit on one, John. But basically, you don't necessarily need to hash all that data, Eric, every single thing you're interested in. Basically, you're hashing on an agreed-upon set of columns. So it could even be the five UTM params if you
Starting point is 00:57:31 want, if you want to keep it simpler. And you're just hashing that, and you're hashing it, like John said, one and done. Meaning once you generate your hash, you never change it. Even if you change the UTM params, you keep a stable hash, because it's your join key. Right. So that's one gotcha. One key piece is, you have to be diligent about not changing your hashes when you change things internally. Another is, you have to be diligent about tracking this. So you have to have a system of record.
Starting point is 00:58:03 So sometimes that gets a little complicated. A simple way to do it should be a spreadsheet that you feed into your data warehouse. People make mistakes, so you just have to be careful. By and large, yeah, I would say the hash is highly resilient to collisions, meaning, you know, the same input should always generate the same output.
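A minimal Python sketch of the "hash once, never change it" idea, using the standard library's `hashlib`. The column list, the 16-character truncation, and the in-memory `registry` dictionary (standing in for the spreadsheet or warehouse table that serves as the system of record) are all assumptions made for illustration.

```python
import hashlib

# Assumed agreed-upon set of columns: here, the five UTM params.
HASH_COLUMNS = ("source", "medium", "campaign", "term", "content")


def make_join_key(record):
    """Hash the agreed-upon columns into a short, spaceless key."""
    # Join with a separator unlikely to appear in values, so ("ab", "c")
    # and ("a", "bc") do not produce the same canonical string.
    canonical = "\x1f".join(str(record.get(col, "")) for col in HASH_COLUMNS)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


registry = {}  # stand-in for the system of record: join key -> current metadata


def register(record):
    """Generate the key once and store the metadata under it."""
    key = make_join_key(record)
    registry[key] = dict(record)
    return key


def rename_campaign(key, new_name):
    """Change metadata in the system of record; the key itself never changes."""
    registry[key]["campaign"] = new_name
```

The key is generated a single time at registration. Later edits go through the registry only, so ads already in the wild keep joining on the original key. Truncating the digest keeps URLs short at the cost of a somewhat higher, though still small, collision risk.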
Starting point is 00:58:27 And any variation in the input should generate a wildly different output. You know, the internet is very broken, if that's not true, with modern hashing algorithms. So that's why you would, hashing is probably the best way to do it, generally, because that, I mean, that fits that paradigm very well. And Lou, one thing I love, just to circle back to what you mentioned earlier and which I called out, but I'm really saying this
Starting point is 00:58:49 to myself, almost to assuage my pain from past life. This is me doing a little self-therapy. You're helping people out. That's why we do this show. I think it's good to get those out there and help people out.
Starting point is 00:59:10 The hash thing, as we've talked about it, really can be a game changer, because it just solves so many different issues. But one thing about it that you have to be careful of is, like, you can pack as much information as you want into the hash, right? So I could have a thousand columns of data that I want to pack into a hash and this system of record and whatever, and then I have the ability to unpack all of that, right? But to your point, Lou, the thing is, what do you need to hash? It's the requirements that you defined up front.
Starting point is 00:59:42 That's what you actually need to hash, right? Those requirements. And so, man, that's just such good advice in terms of getting super sharp on that, because that determines the level of complexity that the system needs to serve. Not that that can't be changed over time, but in all of these things, there's really no limit to how much you can add. And of course our tendency is to just say, well, we might need to use that. And so you tend to add more and more and more, you know, or do what you said,
Starting point is 01:00:09 which is, like, let's just do every level of altitude, right? Right. So changing it over time: the reason why I say it's really important, ideally, to define this up front, define what you're trying to accomplish, is that while changing it over time is not impossible, it adds a massive layer of complexity, when undoubtedly you have to do a full refresh of your data ecosystem, like, say, if you're using dbt. So it just generates a lot of complexity if you ever have to go back and regenerate historical data. This is the think-like-an-accountant part of the show, right?
Starting point is 01:00:44 Because if you put that accounting hat on, you're like, oh, I'm going to have to regenerate all these financials and do this to the bank. Like, if you grab an accountant, pull them into your team, they would do this perfectly. Maybe that's the strategy we've all been missing. Yes, totally. Yeah. Okay, well, unfortunately we are over time, but Lou, let's get you back on as soon as we can. Because, okay, we're at the point now where we're deep in the stack. We understand at a high level, like, the inputs, some of the complexities, why this turns into a really gnarly problem.
Starting point is 01:01:21 And we have a way to do this way better with a hash. We just scratched the surface there. I think there's a lot more to talk about, but we literally have not even talked about like, okay, you're producing a metric and that is the other side of it that gets even crazier. So, so come back on and we'll start where we left off. We'll dig back into the hash and talk about some specific methodologies here. I think this has already been super helpful. I've got a teaser for the next show.
Starting point is 01:01:49 The other thing, like zoomed way back out. Like if I'm just listening in, like it's like, man, that sounds really complicated. Like when does it make sense to do this? Like we gotta answer that question. Yes. Okay, so agenda for next show. Deeper into the hash, attribution models, right?
Starting point is 01:02:06 And then especially when to apply advanced techniques that include machine learning. And then, Lou, also I think another thing that would be really helpful is how is, I mean, this sounds cliche, but legitimately how is AI shaping this, right? I mean, there are some things around that that I think are super important as well.
Starting point is 01:02:26 So stay tuned for part two. I already can't wait because this is so fun. The Data Stack Show is brought to you by RudderStack, the warehouse native customer data platform. RudderStack is purpose-built to help data teams turn customer data into competitive advantage. Learn more at rudderstack.com.
