The Data Stack Show - 131: How Data Teams Interact With Marketing Tools with Jason Davis of Simon Data

Episode Date: March 22, 2023

Highlights from this week’s conversation include:
Defining CDPs (2:28)
The data team's role in marketing (7:41)
Leveraging commonalities across businesses (12:49)
Building a CDP with customer data (18:0...5)
Challenges in identity modeling (23:00)
CDP lifecycle and one-to-one data (30:06)
Segmentation and optimization (33:23)
Real-time data in the cloud (40:37)
The future of AI and machine learning (43:02)
Final thoughts and takeaways (46:42)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 The Data Stack Show is brought to you by RudderStack. As we said before, March is Transformations Month at RudderStack, and you could win a $1,000 cash prize, a feature on this podcast talking to Kostas and me, and more, by just contributing a transformation to our open source library. Go to our Twitter page at RudderStack for more details. Good luck. Welcome back to the Data Stack Show. Kostas, we have a special episode today because we're going to talk with someone who is building a tool that has a lot of really interesting data componentry to it, but is ultimately intended for marketers.
Starting point is 00:00:46 So I think these are a little bit uncharted waters for us. Yeah, I mean, it's going to be interesting, I think. But haven't we talked about CDPs before? Is this like the first time? I think we have talked about CDPs, but I think it was in a shop talk. But today we're going to talk with Jason from Simon Data, and they call themselves
Starting point is 00:01:05 a CDP. So I think this is the first sort of official marketing flavored CDP. And he has a background actually in machine learning at a PhD level. So that's even more interesting to me, because he obviously understands data on a deep level, but is building a tool for marketers. So I'm going to ask him about that background and some of the, you know, the ways that he thinks about marketing tools and the way they interact with data teams, because I think he has a unique perspective. Yeah, absolutely. And I'm very interested in, like, talking with him and, like, learning more about what exactly, like, a CDP has to do with the data to offer its services, right? It's very easy to just focus on the user interface and think only in terms of like, oh, okay, we're just creating audiences.
Starting point is 00:01:57 But this might be at the end a very complicated process that not only it's complicated like to describe it in SQL but also something that it's let's say it has to be driven by someone who has no idea about like data or like writing SQL or writing code. So I think that's what makes
Starting point is 00:02:18 like the problem like even more challenging and it would be great like to chat with him about all these challenges and see exactly like how they can be addressed by a platform like Simon Data. Yeah, I agree. Well, let's dig in and chat with Jason. Yeah, let's do it.
Starting point is 00:02:33 Jason, welcome to the Data Stack Show. Super excited to chat today. Thanks, Eric. Pleasure to be here. All right. Well, give us your background. So we want to hear about Simon Data, but you actually have a background in working with data. So tell us, you know, give us [...] founder, Matt Walker, and CTO at Simon Data today. We've been working together for over 19 years now. It's pretty hard to believe. Anniversary 20 will be coming next fall. But I always joke, it took me about five years into my PhD to realize the value in data isn't in the algorithms for machine learning. It's how the data is actually used in practice. My previous business was an ad tech product that was acquired by Etsy. And through that experience, I really just saw the power of enterprise data, centralized data, and how big
Starting point is 00:03:32 data can really be a disruptive force. And the core thesis behind Simon really brings that to today's cloud-enabled environment. Cloud-enabled data is a huge force. I certainly was not expecting this 70 years ago when we first started the business. And today, our thesis at Simon is really that of being the application layer for a next generation of data-driven marketing, to really rethink what a CDP is and what their data requirements are to affect better lifetime value, better ROAS, and better conversion rates. Yeah, super helpful. Okay, a couple of terms in there that I think would be super helpful.
Starting point is 00:04:11 So let's start by breaking down what a CDP is, because this show is all about data. And customer data platform is a term that is not new. It's been around for quite some time, but it's really easy for people, you know, when the term CDP comes up to think of different things, right? On one extreme end of the spectrum, people, you know, may think about this as a tool that sends marketing messages to users, right?
Starting point is 00:04:41 Like a push notification or an email. On the other end of the spectrum, people would, you know, may think this is just infrastructure that processes customer data. And then of course, there's a huge spectrum in between. Can you help provide some clarity to our listeners on the term CDP and maybe even help us understand like Simon's philosophy and where you fit into the spectrum? Yeah, it's a great question. At the end of the day, the category is undoubtedly wide. I was talking the other day with Soumyadeb, RudderStack's CEO. We were just talking about how our joint strategy and vision are actually fairly complementary, which is unusual for two
Starting point is 00:05:23 vendors in the category to get together. I'm very close with Michael Katz, mParticle's CEO, again, another vendor in the CDP category where we actually share quite a few customers in common. When we look at CDP, it really starts with asking,
Starting point is 00:05:39 how do you enable end business stakeholders, marketers in particular, to be data-driven? And with this, what are the marketing activities that require deep, bespoke, and specific access to an evolving world of data? That starts with segmentation, but for us at Simon, that also includes personalization, that includes experimentation, and that finally includes thinking about all the marketing channels that exist today and how do you optimize across them in an asynchronous
Starting point is 00:06:05 way via something that marketers call orchestration, which is very different than data orchestration. Yeah, absolutely. And so how do you think about the data side or how does Simon operate on the data side? Because if we think, you know, or maybe actually a better way to ask the question would be, let's say I'm on a data team and one of my internal customers at my company is marketing, right? And let's say they're using Simon. What does my relationship with them look like? And how does Simon, you know, sort of, how do I interact with Simon? Can you break that down for us a little bit? Yeah. So, you know, really the way to think about Simon as someone, as a data practitioner, data engineer, data analyst, data scientist, who has a cloud data warehouse, Redshift, BigQuery, Snowflake set up, is we provide the infrastructure and the core ETL tooling to help your marketing team get started
Starting point is 00:07:01 around the problems they need to solve. That starts with holistic modeling around identity. That starts with thinking about treating batch data in your warehouse and real-time data separately. And that ends with building a customer 360, not for your warehouse, but for your marketing teams to really have that view of the customer relative to the applications they need to affect as a marketing organization. Yeah, got it.
Starting point is 00:07:29 That makes total sense. And so Simon's actually doing the building or augmenting the build of that 360 degree view of the customer on behalf of the marketer. That's 100% right. Yeah, got it. Super interesting. Okay. I have a question for you here. Because you're so familiar with and are building your products for the use cases that these marketers want, to your point, data is in and of itself, is it valuable? No, right? Like, what do you actually do with it? That drives value, right? For our listeners who work on a data team,
Starting point is 00:08:12 and maybe even the ones that serve marketing teams as an internal customer, but maybe aren't as familiar with like, what is happening at the end of the line with the data, because maybe their role is more around modeling, packaging, cleaning, whatever those pieces are, and then delivering this data product to a team, to an endpoint, to a tool. What are the top things that you think are important for someone in that role on a data team to know about what's happening at the end of the line or sort of the last mile as marketing teams are using this data? 100%.
Starting point is 00:08:51 And I'll answer this question, Eric, by throwing out a term that marketing talks about all the time and then mapping it back into data terms that the listeners of the show may know with a bit more familiarity. So what marketers care about is something called the customer journey. The customer journey is the interactions that an individual has with a brand and business. From that first touchpoint, you see an ad on Facebook or you see an ad in the open web. A month later, you might click through a different ad and you interface with a website and you read about it for 15 minutes. You then listen to the company's podcast for an hour.
Starting point is 00:09:32 And then maybe a few weeks later, you finally dive into some of the documentation or material. And then eventually you might sign up and be a paying customer. And then there's sort of the entire engagement path downstream from there. And the first problem that we saw from a data perspective is really thinking about how data is modeled from... Sorry, how the user's identity is modeled. The first interactions are second-party data. These are interactions that aren't even anonymous users, they're non-users. They happen completely outside the realm of data as we have it today for most folks who are running a data warehouse. With that first touchpoint on the website,
Starting point is 00:10:11 that would be a fully anonymous customer. The second touchpoint might be a fully anonymous customer from a different device, so a different cookie. And then at some point, these paths may converge together. And those anonymous browsers may link to a single known user. And then there are all sorts of considerations downstream there around householding and beyond. And what Simon does is it stitches together the customer journey from a data perspective, builds those identity
Starting point is 00:10:35 associations. And it does so in a way that's actually modeled directly in your warehouse. We are built natively on top of Snowflake. And those models are made available for our customers to have full visibility. And then our platform deploys directly on top of that to enable our marketing teams to see and to have that full continuity across the entire journey. Oh, fascinating. Okay, so I'm just going to say this back so I make sure I understand it because this is really interesting. So because, well, actually, let me sit back and say, you know, having been involved in this kind of work, you know, hand rolling it, generally, when you talk about building an identity in a warehouse, which, you know, is, I'm a huge fan of because you have visibility, and you can manage the edge cases for your specific business and, you know, patterns of
Starting point is 00:11:20 customer usage or whatever. But ultimately, you're talking about an unbelievable amount of SQL that ends up being really hard to maintain over time. And the trap that I've fallen into multiple times is that inevitably, there's someone in the organization or a group of people who have tribal knowledge about how this thing works and, you know, the output and, you know, whatever. And so it sounds like you actually sort of remove the need for data teams to, you know, to hammer through all that SQL and have something that's difficult to maintain, but you maintain the visibility like in Snowflake so I can see all of that. It's 100% right. And the way we view the world is we break data problems down into one of two buckets.
Starting point is 00:12:06 And this directly comes from my experience building data teams and dealing with all sorts of complex data challenges that come from, as anyone who's listening to this show has seen either on the front lines or managing teams or data functions. There are the problems that are bespoke to the business around collection, around aggregation, around core metric definition. And then there are problems that really have a degree of consistency across any brand. And our strategy is to leverage the latter and to sit on top of all the great work that's happening that we as a CDP couldn't possibly own, while bringing efficiencies and generalization capabilities
Starting point is 00:12:47 to the latter. That's really how we think about things at a high level. The number one point of value that we bring to our end customers is speed. You have data in
Starting point is 00:13:03 your warehouse. It may not be perfect, but guess what? You can get to value in a few days, maybe a few weeks at worst. Let's not build out all your infrastructure. And let's look at where you are today. And look, data is not an end state, it's a journey. No matter where you are today, let's get to value. And then as you improve, as you bring on a platform like RudderStack and have better granularity around the data that you're collecting from your website and mobile application, that's another step up in the data capabilities that our customers have, while providing incremental use cases all along to our end users. Super interesting. I want to dig into the difference between, you mentioned sort of bespoke business needs versus commonalities across businesses. And Kostas knows that for some time I've brought up, he may be tired of hearing it.
Starting point is 00:14:07 I have this theory that when it comes to data models, there's probably less than 10 that every business could use, right? With light modification, right? So like e-commerce or B2B SaaS or whatever. And of course there's differences, right? But when you talk about just sort of the core model that you use as a starting point, there actually is a lot of commonality, right? And a lot of the differences that businesses create when they get bespoke are actually
Starting point is 00:14:36 more around syntax. Is that kind of what you're getting at in that you can provide time to value or help teams move faster when it comes to it because you're acting on some of those commonalities? Yeah. And look, I think 100%, and I'll add a couple of caveats to that. The other dimension isn't even business-model specific, it's just marketing dynamics specifically. Let's look at all the funnels across each of your core marketing channels across paid and owned and direct mail and email and push. There's a
Starting point is 00:15:12 degree of commonality here, especially across tools as well. Look, if 9 out of 10 customers are extracting data into their Snowflake with Fivetran, it all looks the same. And piecing that together, and interestingly, Fivetran is building canonical ad tech views, but our views are all very hyper-focused on mapping back to our own application to get to end value as fast as possible.
Starting point is 00:15:39 The other dimension to this, which I think I pushed back on a little bit, Eric, in terms of consistency across business models, is when you look at a lot of enterprises today, they have data which is as complex. And anyone who follows Chad Sanderson on LinkedIn, yeah, there's a huge movement around data collection. And for RudderStack customers, clean slate, you're recollecting data. But the fact of the matter is, if you have a system, which is 25 years old, actually being able to go and re-instrument the code, it just ain't happening. You can talk about it on LinkedIn all you want,
Starting point is 00:16:17 but it's a multi-year build-out. And quite frankly, I think reverse engineering the data as it comes in is sort of state-of-the-art for many of these larger businesses today. And it really is the only path. So I think there is, when we sort of look at, and I respond to your point, I think, in some sense, our strategy is if you think about how data teams at large enterprises have done all that work. You know, and the data is certainly not perfect, you know, but there are large teams of data people where the data team is specifically tasked to, you know, make it one step better every single day. Yeah. Let's take that as an input, you know, and then, you know, and then apply a lot of our standard transformations in a way that, you know,
Starting point is 00:17:01 aligns directly with the core applications that, you know, our end users need to affect. Yep, super helpful. One last question for me, because with the mention of transformation, Kostas has a lot of questions about the data model. But one question I have is, do you loop back into the warehouse?
Starting point is 00:17:23 Because one interesting thing about, you know, sort of last mile tooling is that it's actually creating touch points on the customer journey. But a lot of times those can be a terminal destination. So do you loop back into the warehouse to feed the model in a loop? Yeah, 100%. I mean, it's, I mean, look, like I think,
Starting point is 00:17:38 you know, ultimately, you know, as someone who's run data teams in the past, I couldn't imagine building an application that did, you know, that wasn't anything but a good citizen of data on both sides. So let's make sure that the modeling integration paths are out of the box and straightforward and extensible. Let's make sure that any and all data that the platform collects or creates or reports on for that matter is then shared back into the environment.
Starting point is 00:18:06 Very cool. All right, Kostas, all yours. Thank you, Eric. All right, Jason, let's start the conversation by talking a little bit about the data that is used by a CDP, right? Like if we want to build like a CDP, what kind of data are we looking for?
Starting point is 00:18:25 You mentioned earlier that at the end, what the marketer wants, what they really care about, is the journey that the user has with the brand and the company. But how is this represented in data? Everyone understands what a journey is, but what kind of data do we need to recreate this journey digitally? Look, when we think about data to marketing applications, there are two types of data. There's data that directly has a customer identifier, and there's data that does not. I think one of the poorly understood points that marketers understand but don't really communicate back to data teams is how critical customer data is in this broader marketing journey. Ultimately, it's not about what the customer does, it's how the customer interacts with inventory. It's about how all the
Starting point is 00:19:25 metadata around that inventory, whether the customer is browsing homes on the web or buying widgets in the e-commerce context, what's the property, what category is the widget in, what's the price point of the home, what geo is the home in? Is the home in a geo that's primarily, you know, vacation homes, or is it in a large, you know, suburban development with great, you know... that kind of data is critical to understand the journey. And it really, when it comes down to segmenting and, you know, identifying, you know, audiences, it's critical for that as well. You know, and I think this is sort of when I sort of think about a lot of the rich data that, you know, drives some of the really interesting use cases for Simon Data's customers.
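The point about inventory metadata can be made concrete with a small sketch. It assumes two hypothetical tables that do not come from the conversation: an `inventory` table keyed by listing (no customer identifier) and an `events` table that carries both a customer identifier and an inventory key. The field names and the "vacation" label are illustrative assumptions, not Simon Data's schema.

```python
from collections import defaultdict

# Hypothetical inventory table: keyed by listing, not by customer.
inventory = {
    "home-1": {"geo": "lakeside", "price": 450_000, "geo_type": "vacation"},
    "home-2": {"geo": "maplewood", "price": 320_000, "geo_type": "suburban"},
    "home-3": {"geo": "lakeside", "price": 610_000, "geo_type": "vacation"},
}

# Hypothetical event table: each row has a customer id and an inventory key.
events = [
    {"customer_id": "c1", "action": "view", "item_id": "home-1"},
    {"customer_id": "c1", "action": "view", "item_id": "home-3"},
    {"customer_id": "c2", "action": "view", "item_id": "home-2"},
    {"customer_id": "c3", "action": "view", "item_id": "home-1"},
    {"customer_id": "c3", "action": "view", "item_id": "home-2"},
]


def vacation_home_browsers(events, inventory, min_views=2):
    """Return customers who viewed at least `min_views` vacation-geo homes."""
    counts = defaultdict(int)
    for event in events:
        # Join the event to inventory metadata through the item key.
        item = inventory.get(event["item_id"])
        if item and event["action"] == "view" and item["geo_type"] == "vacation":
            counts[event["customer_id"]] += 1
    return sorted(c for c, n in counts.items() if n >= min_views)


segment = vacation_home_browsers(events, inventory)
print(segment)
```

The segment condition lives entirely on the inventory side (`geo_type`), yet the result is a list of customers, which is the join the speaker is describing.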
Starting point is 00:20:10 It's data that actually doesn't even originate on the customer. It's data that joins into the customer. And then I think the challenge is how do you build a UI that allows the end user to access this data in a way that's nine out of 10 times no code and low code when necessary. Because ultimately, I think the name of the game today with so many rich cloud data warehouse enabled environments is speed. The question isn't, can a data team build a segment?
Starting point is 00:20:42 Because the answer is yes. The question is how long? And furthermore, who's actually responsible for building the segment? And are they enabled to do it in a way that can take a few minutes instead of a few weeks? All right. First of all,
Starting point is 00:20:56 let's talk a little bit about some definitions, because you used the term inventory, right? And here, a big part of our audience is, like, engineers and data engineers, so they might not, you know, know all the marketing terminology. So what is inventory? Like, when you're talking about inventory, what is this? It's any database object that doesn't key into a customer, you know, and anything that can ultimately have an interaction with the customer.
Starting point is 00:21:30 That's really what it is in generality and practicality. It's what the customer can buy, what the customer can browse, what the customer can view, the content that the customer might read. And here we are talking about, let's say, assets that are only digital, right? Or is there also data that might be coming, let's say, from, I don't know, like physical stores and the interactions that the user might have there? Is this also something that is happening? A thousand percent. I mean, to give a marketing application, even though I know you're trying to bring it back up to the data use case. But you can identify a set of customers who have an outstanding support ticket in the last month. Or you can ask, let's find everyone who has an outstanding support ticket of type X, where type X is something that you really messed up on and you want to be able to
Starting point is 00:22:15 remediate quickly. The support ticket might have a category or classification, and X is the classification, and F is the business requirement of identifying every user, you know, for which, when you do this two-way join, I guess, yeah, it satisfies that condition. Uh-huh. And you mentioned also something else. You talked about, like, there's a great distinction: there's anonymous data and data that has an identity, right? Like, that we can attribute to a specific user, that we know some information about that person. Can you tell us a little bit more of how each one of these two categories of data is used, if there is some difference there?
Starting point is 00:22:58 And do these data ever, let's say, merge? Is it part of the process to connect an identity to the anonymous data? That's the hardest part of the whole process, is actually thinking about how identity merges and evolves. Look, ultimately, it's not a linear process. You can have two objects that have identity type anonymous that can merge into identity type known. And then you can have a third identity type known that can merge into that as well. And the identities can change. And then there's all sorts of corner cases that have to be dealt with.
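The non-linear merging described here, where two anonymous identities fold into a known one and a further known identity joins later, can be sketched with a union-find (disjoint-set) structure. This is an illustrative assumption about how such bookkeeping could be done, not a description of Simon Data's implementation; the identifier prefixes (`cookie:`, `user:`, `email:`) and the `IdentityGraph` class are hypothetical.

```python
# Identifier types treated as "known" in this sketch (an assumption).
KNOWN_PREFIXES = ("user:", "email:")


class IdentityGraph:
    """Union-find over identifiers; each connected set is one profile."""

    def __init__(self):
        self.parent = {}

    def find(self, ident):
        """Return the representative identifier for ident's profile."""
        self.parent.setdefault(ident, ident)
        root = ident
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node straight at the root.
        while self.parent[ident] != root:
            next_ident = self.parent[ident]
            self.parent[ident] = root
            ident = next_ident
        return root

    def link(self, a, b):
        """Record that identifiers a and b belong to the same person."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def members(self, ident):
        """All identifiers currently merged into ident's profile."""
        root = self.find(ident)
        return sorted(i for i in list(self.parent) if self.find(i) == root)

    def is_known(self, ident):
        """A profile is known once any member has a known-type identifier."""
        return any(m.startswith(KNOWN_PREFIXES) for m in self.members(ident))


graph = IdentityGraph()
graph.link("cookie:laptop", "cookie:phone")   # two anonymous browsers merge
anonymous_before = graph.is_known("cookie:phone")
graph.link("cookie:laptop", "user:42")        # a login ties them to a known user
graph.link("user:42", "email:jo@example.com")  # another known identifier joins
profile = graph.members("cookie:phone")
```

A sketch like this handles only the merge path. Splitting identities apart again, householding, and the business-specific corner cases mentioned in the conversation are exactly the parts it leaves out.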
Starting point is 00:23:33 And there are all sorts of generalized cases that are required to actually do the problem properly. But 100%, there's real complexity here. And the bookkeeping requires some meticulous domain-specific understanding. This is part of the CDP responsibility to reconcile the
Starting point is 00:23:53 identity and create this identity database or graph. We'll talk more about how it looks, but whose responsibility is it to construct and maintain these identities? So this is the million-dollar question, Kostas. And I think if you ask the reverse ETL guys... Kashish, I saw him the other day at a conference in the Bay Area.
Starting point is 00:24:17 We were talking about this at length. Look, I think five years from now, I think the world is going to look a lot different. But let me tell you how it is today. Today, I'd imagine 9 out of 10 listeners, if not 49 out of 50 listeners, they have the data in the warehouse. And it's relatively
Starting point is 00:24:35 clean, they probably have some reasonable metric definitions, and they're probably outgrowing their Looker models and trying to move it upstream, and they're adopting dbt and using best practices. The fact of the matter is, maturity around identity modeling today, and by the way, you can't build a customer 360 because this, by definition, is an integrated view of your
Starting point is 00:25:00 data plus your identity to enable marketing. If the identity isn't done properly, then the marketing application of customer 360 can't happen. And maturity today across data engineers, data analysts, data scientists, and certainly open source tooling, along with any sort of dedicated providers that do this is incredibly low.
Starting point is 00:25:18 And the challenge is, and this doesn't mean that a motivated data engineering team can't take this on as their H1 project and devote a set of folks and figure it out and ship it, you know, at some point next year. But what it does mean is that it's a big effort, science experiment. There's a lot of risk. At the end of the day, there's a lot of detail that's still, you know, the unknown unknowns that lie ahead, you know, for so many folks who roll this on their own. You know, so our strategy is, look, we understand this. We want to get everyone from zero to one.
Starting point is 00:25:51 One can be a small step or a big step, depending on the eyes of the beholder. And then the key there is extensibility and enabling. Look, every one of these corner cases can change from business to business. The general approach, there's a high degree of consistency, but when you really get into how things work, there is a level of bespokeness. If you can't go from zero to one, don't try to go from nine to ten. Yeah, no, makes total sense. All right, so, okay.
Starting point is 00:26:19 We have talked about the data a little bit, and the identity. So you mentioned the data warehouse, and my question is, all this data and the identity, how is it represented inside the data warehouse? Let's say if I set up a data warehouse today and put some data on top of it, and look inside the data warehouse, what am I going to see there? I mean, it's all a matter of what you have. Yeah, and what you have is most likely a reflection of what's important for the business.
Starting point is 00:26:54 And where you're going is going to be a reflection of how your business teams put pressure and align strategy with your initiatives to further build your data warehouse. So again, I think I'd turn that question around and ask, what should the data journey be as a business is looking to evolve where they are today, which is probably all the data is there, some metrics are defined, but some of the aggregates, some of the nuance and the specifics around various
Starting point is 00:27:26 aspects of the business are still on the one or two-year roadmap. How do you prioritize that roadmap? How do you take what you have and drive value today? And how do you align the interests of the business stakeholders with the strategic priorities across the data team to make sure that you're being implemented? Certainly going into, you know, next year in this macroeconomic climate, I think there's going to be very little patience, you know, for, you know, for big science experiments and wandering strategies that don't align
Starting point is 00:27:55 with what absolutely needs to happen to show clear revenue. Are there some minimal requirements or like best practices in terms of like what data should exist in the data warehouse before someone starts the journey of building a CDP on top of that data? I mean, the first question to ask is, there's our strategy. Yeah, there, yeah. And other CDPs have other strategies as well. Look, our strategy starts with our customers looking at the warehouse as a source of truth for what they're trying to do. If you are a Salesforce shop, this is irrelevant probably to nine out of 10 people on the podcast,
Starting point is 00:28:38 but if you're a Salesforce shop, you're going to buy Salesforce CDP because it collects other Salesforce data. If you have big gaps around data collection from web and mobile, then you're going to look at a solution like RudderStack and that will populate data in your warehouse. And while RudderStack activates, our perspective is to, in some sense,
Starting point is 00:29:00 take a view of data that's well outside of what RudderStack might be collecting and look at a broader view of data that exists within the warehouse that might touch offline context and beyond. If you come to us and you have nothing today, you have no cloud data warehousing, no cloud data warehouse strategy beyond that, it's not a fit. You're going to want something else, something that's out of the box, that can just provide end-to-end value that starts with data collection and ends with activation. But if building a data strategy that's extensible is core to what you're trying to affect,
Starting point is 00:29:33 then we have a story that can at least be considered. All right. And okay, let's assume here that I have my data in the data warehouse. The data looks good, clean. We make sure that we have all the identities there. Like we can act upon this data. Like now we have to do something, right? With this data. So what's next?
Starting point is 00:29:55 Like, how does, let's say, the lifecycle of a CDP look? What happens after we have the data that we need and we can access it with, like, Simon Data, right? What are we going to do next? What is the marketer going to do next with this data? 100%. So in marketing terms, there's a buzzword, and I know you guys are going to beat me up for even going here, around something called one-to-one personalization. And I'd bet that actually 10 out of 10 people on the podcast are familiar with all the marketing mumbo-jumbo. We like to use a term that we call one-to-one data.
Starting point is 00:30:34 Okay, what does this mean? This means that if I'm a marketer, I want to have access to the data that I need to build segments and to personalize. I want it in a one-to-one context. I want an application that is actually designed to integrate and ingest and affect segmentation on the data at the granularity at which the data exists.
Starting point is 00:30:56 One of the challenges with approaches like reverse ETL is you have your data in Snowflake. Congratulations. High fidelity. It's fully clean. It is number four point, but it's fully clean. Yeah. And you have rich schemas that represent event history, online, offline, object metadata,
Starting point is 00:31:17 inventory, you name it. And suddenly you need to reverse ETL that data into your marketing tool. But guess what? Your marketing tool is built on MongoDB. And suddenly you're faced with a set of pretty difficult design trade-offs around what data am I now throwing out? And then you go to your marketing team and you have these lengthy conversations around, well, what are you trying to do?
Starting point is 00:31:35 And the marketing doesn't know, okay, make the phone be agile. And they're also not data engineers. So it's incredibly hard for them to have a productive conversation. So to answer your question directly, Costas, really our vision is to put the data in front of the end business stakeholder. And we've invested materially around incredibly flexible schemas and powerful segmentation capabilities that allow our end users to access the data and use the data in the finest of granularities. I'll give an example here around what this loss of data fidelity or throwing out data actually can look like. If you have a segmentation layer that can only represent, say, the number of purchases or the dollar value of the purchases that you've made over the last year. But if you have a question that is effectively, let's say, I want to identify anyone who's bought a full price item over $100 in the last 12 months, well, suddenly that's going
Starting point is 00:32:41 to require going back to the source and doing that analysis. But if instead you can actually have the interfaces in the application that allow for that data to be queried directly by the end business stakeholder without SQL, then suddenly you've saved an entire round trip, which if you have a functioning feedback loop between marketing and data, it can be within a day. But for most enterprises, it's a sprint period, which is a couple of weeks or a month. And then you have to ask, how often does this happen? And the answer is it happens all the time.
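The loss of fidelity described above can be sketched in a few lines of Python. This is an illustrative sketch only, not how Simon Data or any marketing tool is implemented: the `Purchase` record, its field names, and both helper functions are hypothetical, and the summary is not windowed to a year for brevity. It shows that two customers can look identical in a summary of purchase count and dollar value while only one of them matches "bought a full price item over $100 in the last 12 months".

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Purchase:
    # Hypothetical event-level record; field names are assumptions.
    customer_id: str
    amount: float        # price paid for the item, in dollars
    full_price: bool     # False if the item was discounted
    purchased_on: date


def summarize(purchases):
    """Collapse events into per-customer (count, total); item-level detail is lost."""
    summary = {}
    for p in purchases:
        count, total = summary.get(p.customer_id, (0, 0.0))
        summary[p.customer_id] = (count + 1, total + p.amount)
    return summary


def full_price_buyers(purchases, today, min_amount=100.0, window_days=365):
    """Customers with a full-price purchase over min_amount inside the window."""
    cutoff = today - timedelta(days=window_days)
    return {
        p.customer_id
        for p in purchases
        if p.full_price and p.amount > min_amount and p.purchased_on >= cutoff
    }


today = date(2023, 3, 22)
purchases = [
    Purchase("alice", 150.0, True, date(2022, 11, 5)),
    Purchase("bob", 150.0, False, date(2022, 11, 5)),   # discounted item
    Purchase("carol", 150.0, True, date(2021, 1, 10)),  # outside the window
]

summary = summarize(purchases)
audience = full_price_buyers(purchases, today)
print(summary)   # alice and bob have the same (count, total)
print(audience)  # only the event-level data can tell them apart
```

With only the summary available, the question cannot be answered without going back to the source data, which is the round trip the speaker describes.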
Starting point is 00:33:11 And this is really where a lot of the friction comes into play. We believe in a world where data teams and marketing teams collaborate in a very productive way. But we also believe in a world where when you look at roles and responsibilities and workflows, they should be separated in a way that allows each of them to do their jobs independently. Okay, that makes total sense. That's a very good example. What is segmentation?
Starting point is 00:33:36 I want to make sure that, and try also to talk about segmentation first from the marketing perspective. And then I'll ask the same question also from the data engineering perspective. And try to communicate this to both audiences out there. But let's start with the marketeer. What does it mean? I'm a marketeer and I want to segment my data. What does this mean? So the inputs to a segmentation interface are properties
Starting point is 00:34:06 on the customer or fields that relate to the customer. We've gone through enough examples over the last 35 minutes here that I won't rehash them again. The outputs are a subset of your customers that display a set of properties. Imagine behind segmentation
Starting point is 00:34:22 manifests in terms of a powerful UI that allows end business stakeholders to filter and refine that set of 1003% of customers who experienced or exhibited behaviors X, Y, or Z, or any conditions that are specifiable within the UI. And the basic optimization problem around segmentation is to provide a data model and an interface which is as powerful as possible. Out of the last 100 questions
Starting point is 00:35:02 that a marketing team has tried to do in their segmentation UI, how many were they able to actually figure out and do on their own? How many did they just say, oh, well, I give up too hard? And how many did they actually then have to go and escalate to the data team
Starting point is 00:35:15 to add new fields to get it done? Because the degenerate case of segmentation is to have a segmentation UI with one field, which is the latest field that your data team put in. And when you want to segment, you ask your data team to build some thousand line query. They build the thousand line query. It's ready two weeks later.
Starting point is 00:35:33 Hopefully it's correct. You create a new segment with condition X and then you're done. Yeah. And why is this such a hard problem? Why do we need a user interface that is so sophisticated
Starting point is 00:35:47 for the marketer to create these segments? It all comes down to use cases. Especially in today's macro environment,
Starting point is 00:36:04 we're understanding, look, customer behaviors are customer behaviors. You know, they're changing all the time. You know, when COVID first hit, everyone went indoors, you know, and then everyone thought it was over, you know, and then Delta came, you know, then Omicron came, and now everyone has RSV apparently, and the hospitals are overflowing. You know, and on top of that, the economy is going south, you know, and all the data and the assumptions around the 12 fields of customer data that existed in the fall of 2019, those assumptions are gone. They're violated, I should say. And today's world requires much, much deeper access to data to really better understand and respond to the needs of the customers.
Starting point is 00:36:46 And if you look at the composition of 99% of operational marketers today, they're non-technical. They don't know SQL. They need a composable and reusable construct that's easy to use so the rest of their team can have visibility in it, so their CMO can look and be like, what are we actually doing here? It's fundamentally a non-technical. Yeah, it makes a lot of sense.
Starting point is 00:37:11 I've heard both from you and Eric during the conversation today about complex queries. Just a few moments ago, you talked about the data team that will go and build a query of a thousand lines. Let's say it will take about a week and all these things. What makes the process of creating these queries on the data
Starting point is 00:37:31 warehouse for the particular work that a CDP does so complicated and hard? And I'm talking from the... We can assume here a technical person who has to do this job, right? We're not talking about marketers. Because obviously, when we are talking about non-technical personas, they shouldn't have to write any code, right? But still, it seems that there is intrinsic, let's say, complexity in representing the processing for the data warehouses and executing this logic over there. So why is this happening?
Starting point is 00:38:08 Why is it challenging? Yeah, I mean, I think also to maybe provide a mini segmentation 101 tutorial over the next two minutes to... Okay, that would be awesome. Let's do that. Yeah, you can build a segment of anyone who bought in the last year.
Starting point is 00:38:24 Yeah, but in reality, that's not how it works. Marketers, for one, have a notion of personas. Personas might be early adopters of your product. And the definition might be anyone who's bought a product within seven days of launch. Personas might be longtime customers. They may be folks who have, they might be high margin customers. Anyone who's purchased over $300 across a set of high margin products with margins above 40%. Every business is different.
Starting point is 00:38:52 BarkBox was one of our first customers. Actually, it's a customer we share with Rudderstack. They have a core persona around heavy chewers: people who have dogs that very aggressively chew their toys. There are not many businesses out there with that kind of segment, but that's a core persona for them. It defines their brand, and everything they do, in some sense, considers that. The first layer is around what we call base segments in our platform, or core personas. On top of that, there are exclusions.
Starting point is 00:39:22 These are people who you don't want to market to. If someone is actively engaging with your support team and they're really not happy with the business, you don't want to send them promotional offers. There sometimes can be compliance issues or legal issues where you need to exclude people from audiences as well. So here are two sets of segments that on top of anything you might want to do need to be considered and overlaid. And then on top of that, you have all the examples I just went through that require segments to consider behavioral, non-customer objects, and then bespoke customer behavior, either in the last few minutes or in the last few years. Okay, that makes total sense.
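The layering described above (a base segment or core persona, exclusions on top, then campaign-specific conditions) can be sketched with plain Python sets. This is a minimal sketch under stated assumptions: the customer records, the field names, and the `segment` helper are all hypothetical and do not reflect how any particular CDP stores or evaluates segments.

```python
# Hypothetical customer profiles; every field name here is an assumption.
customers = {
    "c1": {"heavy_chewer": True, "open_ticket": False, "opted_out": False, "days_since_purchase": 12},
    "c2": {"heavy_chewer": True, "open_ticket": True, "opted_out": False, "days_since_purchase": 5},
    "c3": {"heavy_chewer": True, "open_ticket": False, "opted_out": True, "days_since_purchase": 9},
    "c4": {"heavy_chewer": False, "open_ticket": False, "opted_out": False, "days_since_purchase": 3},
    "c5": {"heavy_chewer": True, "open_ticket": False, "opted_out": False, "days_since_purchase": 200},
}


def segment(predicate):
    """Return the set of customer ids whose profile satisfies the predicate."""
    return {cid for cid, profile in customers.items() if predicate(profile)}


# Layer 1: base segment (core persona).
base = segment(lambda c: c["heavy_chewer"])

# Layer 2: exclusions, e.g. unhappy support contacts and compliance opt-outs.
exclusions = [
    segment(lambda c: c["open_ticket"]),
    segment(lambda c: c["opted_out"]),
]

# Layer 3: campaign-specific behavior, here a purchase in the last 30 days.
recent_buyers = segment(lambda c: c["days_since_purchase"] <= 30)

# Overlay: start from the persona, remove every exclusion, keep recent buyers.
audience = (base - set().union(*exclusions)) & recent_buyers
print(sorted(audience))
```

Keeping each layer as its own named set is one way to make personas and exclusions reusable across campaigns, which is the composability the conversation points at.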
Starting point is 00:40:01 All right. And one last question from me, and then I'll give the microphone back to Eric, because we're getting close to the end of this episode. So you mentioned, let's say, a foundational part of the architecture that you are operating on is the data warehouse, right? So from the data warehouses that we have today, like BigQuery, Snowflake, Redshift, etc., what you would like to see in the future to be implemented by them that would make you happy for the stuff that you are doing at Simon Data, like as a CDP that has to work on top of these technologies?
Starting point is 00:40:42 Real-time as well. I think technologies like Kafka and Confluent have had good adoption in certain pockets of large businesses that have massive throughput requirements. SQL is a standard. SQL as a language doesn't really map very well to real-time data. I think as a category, we have real work to do. And it's not because it's an infrastructure problem. It's just a core abstraction problem. For us, when we think about
Starting point is 00:41:12 the world of data, we can route real-time data to the warehouse, but it comes with real problems. Every year, those problems get better. But again, the basic abstraction problems around SQL aren't getting any better. So when I sort of look at what, when I ask into the future, a big question we always ask is, what does cloud-enabled real-time data look like? Can there be another set of players that are equal to scale as a Snowflake or BigQuery, but instead bring a similar set of capabilities to real-time in the cloud. And I think it's going to happen. And it could very well be Rudderstack.
Starting point is 00:41:58 But we're certainly not there today. So while I think the warehouse and the Cloud Data Warehouse represents a generalized and extensible platform for us to operate a lot of core operations on, real-time is still sort of this end-around that we fully support, but
Starting point is 00:42:17 doesn't have nearly the type of elegant solution as I would expect to evolve in the coming years. Yeah, makes a lot of sense. All right, Eric, all yours. Okay, time for one more question, although I often break that rule. Jason, I'm interested to know, we've talked a ton about, you know, Simon Data and all the use cases there, but, you know, you are a recovering machine learning algorithm builder who studied at a PhD level.
Starting point is 00:42:50 If we just step back and look at the data landscape, as someone who has built data teams and worked with data tools, is there anything out there that just excites you in the data space in general, whether or not it's related to Simon or any of the other technologies we talked about? A hundred percent. I mean, look, I'll answer your question indirectly and hopefully when I get to the end, you can tell me whether it's a satisfactory answer. When I look at the problem of machine learning problems, I see two camps.
Starting point is 00:43:21 There are problems where the inputs can be fully describable. Yeah, cell machine translation, computer vision, self-driving cars. Yeah, all the information that a human has, a machine has. There are other problems that are not fully describable. You know, I'm a customer of BarkBox. Like, am I having a good experience? Well, like last night, my dog threw up the toys. BarkBox is never going to figure that out.
Starting point is 00:43:48 And marketing teams are never going to figure that out. And maybe the support team will figure that out. But at the end of the day, there are a lot of clues and context that can be used to understand some of the generalizations and the broader macros and zooming that in as specifically as possible. And when I look at the future of AI and machine learning, it's about taking all the clues that we have, you know, in a depiction of a world that is inherently, you know, and then filling in the gaps. Yeah. So I think, look, you know, mapping that back into, to TransLog, I think the stuff,
Starting point is 00:44:22 I think the way ChatGPT is interactive is interesting. Obviously, ChatGPT has no idea what my intentions are, what my questions are, so it's a back-and-forth interactive context. But I think by and large, what's most exciting to me is anything that has a human interaction element to the machine learning. So in some sense, the problems we're talking about on the show, the feedback loop is around developing a hypothesis, leveraging the data and the AI that might drive it, and then testing in the market and then iterating. Yep. I love it. Yes, indeed. Yeah. We need to do a whole episode on ChatGPT, but that's a whole other subject. If you guys want downloads,
Starting point is 00:45:01 I think that's the way to do it. Yeah, probably. All right. Well, Jason, this has been wonderful. I learned a ton. I know our listeners learned a ton as well. So thank you for joining us. Thanks for having me on, guys. All right, Costas. One of my big takeaways is that I'm so glad to finally hear about a marketing tool that, as Jason described it,
Starting point is 00:45:29 is a good citizen on either end of the data pipeline, both in terms of ingestion and then pushing data back in. Because that's the whole challenge with so many marketing tools is that they're terminal destinations, which has been just a huge pain point for me over the years, specifically in terms of data infrastructure. That was great. And I'm super excited to hear that kind of thinking is being done, you know, even for tools that are built specifically for marketers. Yeah. I'll keep the last part of the conversation that we had about real time and streaming data and that this is, let's say, the next frontier of innovation when it comes to data infrastructure for marketeers.
Starting point is 00:46:14 And in a way also, let's say, the next frontier for the data infrastructure out there, right? Because as he said, the technology is not there yet. Yeah, we can ingest real-time data into the data warehouse, but how we do it, how fast we do it, how hard it is to do it, and what kind of tools we have to work with real-time data still has like a lot of space for improvement. So I'll keep that and I'll be looking around to see how the industry is going to address that stuff. All right. Well, we will keep an eye out and we will catch you on the next one.
Starting point is 00:46:51 Subscribe if you haven't. And of course, tell a friend. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com.
Starting point is 00:47:14 The show is brought to you by Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
