The Data Stack Show - 61: What is Data Design? With Kevin Gervais of Touchless

Episode Date: November 10, 2021

Highlights from this week's conversation include:

- Kevin's interaction with data at an early age (2:35)
- Working with telecom data (5:08)
- Analyzing emojis in customer sentiment (8:44)
- Infrastructure needed for diverse data (12:22)
- Building better interfaces and looking out for human error (24:17)
- Dealing with differences in identities in different layers of the stack (41:21)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by Rudderstack, the CDP for developers. You can learn more at rudderstack.com. Welcome back to the Data Stack Show. Today, we're going to chat with Kevin Gervais. He's done a lot of interesting things with data. He's been working with data for a very long time. And the topic that
Starting point is 00:00:40 we want to get into today with him is data design. And his philosophy is that before you start talking about any of the technology or data flows or infrastructure involved with data, you need to design the data itself. Really fascinating stuff. Kostas, I'm interested to know what led Kevin to this philosophy. You don't arrive at sort of a thesis like data design without having gone through probably a lot of painful experiences with data, which isn't uncommon. So I'm really interested to hear what his background is and where he sort of built the foundations of this theory. Yeah, a hundred percent. And I'd love to get into more detail about what this whole data design thing is.
Starting point is 00:01:29 We keep forgetting that data has a shape and we define this shape and it has some properties where again, we define these properties, but we don't talk about that much, mainly because we have like some other more primitive problems to solve. But when it comes to building like sustainable data infrastructure, it's inevitable to get
Starting point is 00:01:51 into like this kind of design conversations and like how you model the world. And it becomes like a little bit more philosophical, let's say, but it's a very, very, very important aspect of working with data. And probably like the first time that we are going to discuss about that stuff. So I'm very, very excited. Great. Well, let's jump in and chat with Kevin. Let's do it. Kevin, welcome to the DataSec show. We're super excited to chat with you about all things data and specifically data design. Great to be here. It's a good day to talk about data. Always is, right? Every day.
Starting point is 00:02:25 Give us a little background. I mean, even actually working with data since you were a child, literally, which is pretty wild, but just give us a little bit of your story and then tell us what you're doing today. Yeah, absolutely. My life started once I got into data. Now, I've always been really fascinated about organizing things. And I mean, we were chatting earlier about, like even just growing up, we were using Macs that didn't have many games on them.
Starting point is 00:02:53 And so back then, that's where I learned that we could use AppleScript to change the data inside of an app. And by changing a couple of things, you can make a storm trooper replace the icon of what was some boring thing before. Or you could use data to, you change one piece of data and you can change the sound of something. And so just learning that working with data can provide an immediate gratification.
Starting point is 00:03:21 You can actually see the impact of it instantly has always fascinated me and and even just the the progression of it like if you i like someone i think when i was getting into web design when someone was asking us to build sites for organizing like embroidery like shirts that would get embroidered and they handed us a bunch of CDs or handed us a bunch of catalogs and say, go figure it out. And just being able to have, like, it was realizing the satisfaction that you can get out of just organizing stuff has always been a passion. So yeah, I spent about 15 years doing that in the website of things and working on e-commerce and different web projects and then got into the telecom sector the last eight years. And that's where you start to see what does data look like when it's really clean
Starting point is 00:04:12 or when it is standardized. And even then it's not beautiful. But just kind of seeing how have others dealt with some of these things and how do they organize their things has been fascinating. And I've been privileged to have been exposed to so many different situations of how data can be organized, good and bad. So yeah, it's something I love talking about. Yeah. One question on the telecom side of things, what kinds of data were you dealing with in telecom in terms of format? I mean,
Starting point is 00:04:46 is it sort of standard stuff or like, I'm just interested to know in the telecom sector, what are the most sort of common types of data? And then maybe some of the more challenging types of data that you dealt with in telecom. That's a good question. No one's ever asked me that. No. It's so personal. What data did you get exposed to oh my we try to dive deep here on the data so this is a little personal so i'm just gonna get emotional because it's gonna bring back memories i i so so the business we were in was trying to help telecom companies better serve their customers so have a better life cycle with them. So instead of a random person from a call center, talk to someone out of the blue that they've never met. We were working
Starting point is 00:05:31 with a telecom to remember of them to make it. So the person that sold somebody a phone or a tablet was the person that would follow up and that would keep that relationship alive for years and do that over text. So in order to have a great relationship like that, they had to have context. So you're working with transactional records, purchase records, what packages are they on? How long has it been since they were talked to last? Notes, history. And then, so as we got into that, then we had to deal with conversational data. So you'd have to deal with like, how do you determine sentiment when most of the APIs that are out there
Starting point is 00:06:14 are trained on like say email communication or well-formatted sentences, but how do you look at the sentiment of somebody who's replying with acronyms over text or an emoji, right? And so we had to deal with a lot of data that, millions of records of data that you couldn't just apply these standards to. And then we got into POS data too, because the whole idea too, if you're trying to figure out how do you have a good conversation with someone, or is this conversation working, or is this script working? You have to tie it back to transactional data and bringing in not just, even in that scenario, we had to deal with the carrier would have certain data about a customer only the products that they sold but then the store that sold stuff to them would know about accessories and other stuff that the carrier doesn't know and so we had to marry these two things without worry about duplicates and so
Starting point is 00:07:17 it it accidentally ended up that we got into the like we put ourselves in the middle of all of these crazy data problems and and we had to actually solve a lot of them in order for us to do our job and have accurate reporting right like is this campaign working you need to need to deal with all these different things so yeah it was like it was a very interesting experience of having to be exposed to different formats. And also I think the surprising thing out of that is just seeing that even these large companies that spend hundreds of millions of dollars on some of their systems, they don't have the cleanest data either. Right. So everyone seems to maybe, maybe dream of the day that, Oh, we like at some day, I'm going to have everything all perfectly clean. It's not, no, I'm sorry.
Starting point is 00:08:12 Like it's not going to happen. It's just, it's just how much, how much mess are you willing to, to, to have today, but there'll always be a mess. Yeah. Cause I mean, Kostas's words ring in my, ring in my ears all the time. I mean, Kassus's words ring in my ears all the time. I mean, data in general is messy. Customer data tends to be very messy. I have one very click-baity question for you. How did you deal with emojis and sentiment? That's just a really interesting topic that I actually think is probably pretty relevant. Well, emoji is data too, right?
Starting point is 00:08:47 It's all converted into an ASCII code or basically, so just being able to understand which code means what, but knowing also that which ones are inappropriate, which ones are inferring something very negative in some cases. You could have a very positive statement with a series of emojis after and the emojis cancel out the meaning of the words. And so, yeah, it was interesting. In the heart of COVID, when that was happening and a lot of these telecoms shut their stores right away. Since I'd opened them up, right as soon as COVID hit last year, everything shut down. 80% of them just shut their stores. And we were trying to understand what was the pulse of people who were still buying.
Starting point is 00:09:41 Or when carriers were reaching out to customers, what were they saying? So we, in that respect, we came up with a model where we, we detected that certain phrases or series of emojis could dictate whether someone was afraid or joy or, or were they sad? And then we compared that to prior periods to come up with a bit of an index of what,
Starting point is 00:10:08 what is the consumer sentiment during this time of crisis. And we did see a difference. We saw more like whenever the stocks would dive, we saw an increase of fear in the way that people replied. Yeah, so I think what was fascinating, actually, and when we got into helping people do outreach right away we created the this concept of standardized lists and standardized chat starters and so since the beginning with that business we were always able to know like because the chat starters in some cases never changed over the years. And so like for, for a given campaign to a given segment, this is what replies we should expect to see. And you'd be able to know this because we were kind of, it was all kind of standardized right at the beginning. And that, because we did that, that allowed us to come up with these patterns that you wouldn't otherwise get.
Starting point is 00:11:07 Because if you weren't always asking the same question, you wouldn't be able to know is the sentiment changing or not. If you're trying to measure sentiment just based on a random conversation that people can just type, the data is going to be all over the place. So yeah, I guess like the learning was that because we worried about being able to tie it back to specific baselines, right. Like cohorts and scripts right at the beginning, that was an enabler for us to do some of these types of sentiments and sentiment analysis, because we had something to go back to. We knew how people replied to that same question that people would ask like, hey, is the phone you're using working out or how many questions about your phone? We knew how people replied to that
Starting point is 00:11:54 the week before all those things to shut down. And then so when people reacted differently to that same question, it was like, huh. Interesting. That there was interest. Yeah, it was some good learnings from that. So, Kevin, what kind of infrastructure do you need in order to deal with such a diversity in the data
Starting point is 00:12:14 that you are working with? How did you manage to work with all this data in a consistent way, right? It took us a while to figure that out. I know I could tell you how not to do it. Well, I think actually what we had to deal with is what I think a lot of companies do. It's the reality of a lot of folks because our business started out where people would upload CSVs. Everyone knows what a CSV is.
Starting point is 00:12:43 Okay, so upload CSVs. They would give us structured data and we would upload CSVs. Everyone knows the CSVs. Okay. So upload CSVs, they would give us structured data and we would upload it. And so at the time when we first started the company, it was like, oh, this is what we do. Somebody gives us a file that always looks like this. And so we will have columns in the database that are exactly the columns that we received, no problem. And we did that for years. And then once people started giving us new types of files, we were like, oh, okay, I guess we got to jam these into these columns we had before. And then as we got into more and more types of data,
Starting point is 00:13:15 it became messy, right? For us to figure out. I think the main thing that we came away with later is that we shouldn't have been so opinionated at the beginning of having columns for specific type. We shouldn't always assume that there's a column called subscriber ID. Yeah. Because maybe it's not a subscriber ID. Maybe it's an accounting ID or maybe it's a Salesforce ID. And so I think the lesson out of that is we should have structured the data based on what type it was, right?
Starting point is 00:13:52 Was it an identity? Was it a person record? Was it an org record? Or is it an event? Like what we end up moving to with the new architecture is move everything to an event-based cqrs model where okay you're you're uh event sourcing right so you're actually we you're designing the domains of your data yep and then you're we're using axon uh db and a bunch of other stuff to kind of force everything into events and then that creates your your your model but that was a lot of work
Starting point is 00:14:27 and and extremely difficult and i think yeah if we had put the if we had put the data into a more universal format at the beginning like just realize that the names of our columns probably will matter or like let's not always expect everything to be perfect integers in a column we could have saved ourself a lot of pain and and and i and i think that's i think most businesses will like maybe they're not working with the same scale kind of of data but yeah they i think every business uh has a life cycle to the data that they have. The data that they collect at the beginning is different than the types of data that they collect five years down the road. They might change their billing system out.
Starting point is 00:15:18 They might change out their CRM. They might want to change their CRM out in the future. And so I think just designing for agility, right, becomes really important. That's a great point. Actually, and I'd like to hear like your opinion on that because I think that models change not just because of ignorance in the beginning, right?
Starting point is 00:15:38 It's also because at the end, we are building these data models to represent somehow reality the business reality and the business reality changes right like if we think about i don't know like a company like like rather suck like a startup right what rather suck was like a year ago compared to what is today it's a completely different thing and of course this is also represented in the data models that we have could we have done like a much better job back then of course but i think that even if we managed to do the best possible job in like modeling our world back then just
Starting point is 00:16:11 because we didn't know the world that well yet would lead us like at some point like to to change things so that's why i think that's what you said about building all these components with agility in mind and being able to change and adapt your data, I think it's super important. So how do you do that? How do you build agile data models? What principles drive this design? That's a very good question. There's 17. Let me tell you all 17. Too personal?
Starting point is 00:16:52 No, that's good. that's a good one no it it it here's how i think about i think first all the businesses is a bunch is the flow of data everything serves the data in the end i mean meaning like let's talk about like a website for an example we you you put a website up if you if you create a design you put a website up what's the whole point of the site well you want someone to call right or you want someone to text in or you want someone to fill out a form. Okay, once they fill out a form and maybe start an order, what is it now? It's data. So the point of actually a website is to trigger either a data connection to make a phone call or a data connection to start a text or capture some information and grab that as data
Starting point is 00:17:41 and then flow that to somewhere. That's really the sole job of the site, right? And especially if you're trying to, even if you're branding, if your site is just to help provide, make people feel good about the brand, then how do you know if you're doing that? Well, then the job of the site is to collect data
Starting point is 00:18:01 to see if you're accomplishing that goal, which is time on, in that case, time on site. Are they interacting with the cool pieces that you've put in there that are branding elements? Are they watching the videos, et cetera? So really like data is so important and usually it's thought of as an afterthought. So I think just recognizing the fact that the flow, the capture, the transformation, and the flow of data is kind of what drives business, right? And remembering that, I think, is just important because it helps us with the design process.
Starting point is 00:18:38 That where you collect the data is not usually where you want it to end up. And then also just remembering that where it ends up today is not necessarily where you want it to end up tomorrow. Most businesses go through a life cycle or even an evolution in the systems that they use. And so to answer your question like how you go about designing for it i think first you have to know your inputs like first we have to be able to to track all kinds of things right all kinds of events we should be able to identify the types
Starting point is 00:19:18 of things that we're tracking and we should be able to move those things into different systems without a whole bunch of work. And what happens if you do want to switch systems? Because at some point you're going to want to switch systems out and you're in one CRM one day and you want to go to another. So I think just having those as inputs into the design process shows some of the variables that you have to consider. And so then what I've noticed is that you actually can design your data. So in the web world or even application design, there's a thought there of user interface design or user experience design, right?
Starting point is 00:20:03 That's a function where everyone kind of understands, okay, I need to have a person draw up something that someone will interact with. Where should the button go, right? And it's very easy to start there and kind of only focus on that because you get that immediate benefit, right? You can draft it, put it out there and someone interacts with it and you think your job's done. But data needs more design than an interface because data integration, data transformation doesn't happen by accident. Like if you want your data to flow seamlessly between systems and to be future-proof, you should design it as much, and I would argue more than any graphical interface
Starting point is 00:20:51 that you have. And so just like there's standards to user experience design, like don't put your close button in a random place off to the side of the screen that you have to like shake your phone in order to see. That'd be bad design. You should think of that.
Starting point is 00:21:12 There's similar things in the world of data where there's like we know what a person looks like. A person, as an example, has a name. That name can change. They have a birth date date they have a death date and they have probably an identifier attached to that but that's a person i mean it's sad but that's like a person is a name an identifier a birth date and a death date yeah now a person can then have identities attached to them. It can have traits attached to them, but those traits and identities can change over time, right?
Starting point is 00:21:48 Names can change. Addresses can change. Even interests, right? Personalities, gender, those sort of things, people could change that. And so when you actually look at how most CRMs treat that data, if you think that's going to be your perfect data model, if they think of a person as first name, last name, gender, I don't know, like address and phone number, and that's like a contact, it's no wonder you can see why that doesn't fit many situations. You end up with duplicates if somebody belongs to multiple organizations or
Starting point is 00:22:26 et cetera. So I think going back to how do you fix it, it's extracting away what's fixed and what could change. So if we get a person, you'd have a birth name, actually. If I think of what is an actual person, you have a given name, you have maybe a gender at birth that's if i think of what what is an actual person you have a birth a given name you have a maybe it may be a gender at birth right that might be on the record and then you might have you're gonna birth date and a death date and that's it everything else is changeable and then a person can be related to various places and reuse and if we design for things like that i think we would end up with a better understanding of the relationships across our data. Let's say, let's take like a real life situation. You have like an annoying salesperson who decides to go on your sales force and put a flag there just to remind them if they have visited like a contact or not,
Starting point is 00:23:22 and they have like reached out to a contact or not without consulting the data model, without reaching out to the person who is responsible for the data model or whatever. How do you deal with that? And what I mean is like the question is like, how do you deal with the human nature of like taking control of things to achieve what they want at the end, right? Because
Starting point is 00:23:45 the problem that I have seen so far, like with all these things that have to do like with modeling and having like a very crystal clear, let's say, way of like understanding and distilled way of understanding like the world around us is that the biggest enemy of this is the rest of the people involved. They make mistakes or they decide that I need something else, but I need it now. I'm going to change it. How do we deal with that? How do we deal with humans? I think we can build better interfaces.
Starting point is 00:24:21 I think, like like with recent situation a client i'm working with has had messy or had messy you know contact records and messy addresses and they wanted to understand what are the patterns amongst you know the people like or is there a pattern to customers living in a certain area like do they they seem to be getting more people from a certain area and in order to do that we we looked at the data and it was it was human entry error where so many addresses would have like notes in them dashes weird quotes and oh it's the new instead of having like a unit number it was like right in the actual address and and and we recently fixed you know over there's like 50 000 records last weekend
Starting point is 00:25:11 just to kind of you know get to some sense of standardization and then once we did we provided instructions and please make things all on capitals and even still because it's human nature to your point someone even if you do all the cleaning right because this was this is like an extreme example where we actually cleaned everything standardized everything and we gave instructions and even still because the interfaces allowed for it people would go in and just put a they'd skip through it with just putting a period and and or they would type the name of the city wrong it wasn't on purpose it's not because they like wanted to mess with the model it was because the interface let them so i think i think ultimately to like you need to know where you want to to end up but then to actually solve it, don't give people the ability
Starting point is 00:26:06 to mess it up. So I think just being willing to enforce that and build interfaces that check for quality or check for duplicates, that's really the responsibility of a business providing a tool to their staff, it's humane. It's more humane. It's more empathetic for a business to put those filters in place to prevent issues ahead of time. Because when they don't, you're just going to frustrate everybody. You're actually going to like, you're going to get inefficiency. You're going to have a bad reporting. You're going to now try and tell people something that like, you may even get angry at them. Why did you put space there? Did you put a dot? And they actually can't help it because the interface is letting them. So I think first just being willing to fix the interface so you don't have bad data coming in.
Starting point is 00:27:08 And then the other thing is I would call it data management. Like I think the other thing that we're noticing is even if someone were to go through all of the filters somehow and found their way to put bad data in, having a way of going through the warehouse and cleaning it automatically, just like watching for issues. It is something that you can detect as a business and fix and then push those things to the various
Starting point is 00:27:34 sources once you've corrected it. Because knowing that there is probably going to be someone who will find a way around all the controls you put. But don't accept it, right? Like a lot of people throw their hands up. Sorry, go ahead. Yeah. I think I have a good example that is going to resonate very well with Eric. One of the most frustrating things that happen when you build a new product is when your developers they start signing up to test things right so you have to get like into this situation where you want to start tracking signups of course but at the same time you have people who are signing up that you don't want to include in your measurements because they're your developers right and you have to clean this data of course
Starting point is 00:28:25 and that's um one day you come and they're like listen guys we have to fix this problem okay so from now on you are going to be using like a specific format of email that you'll be using so i can go and easily filter it well guess what everyone agrees that, but it's not happening. Yeah. I mean, I mean, that's it. It sounds it did. I mean, it's hilarious because that it sounds like such a simple problem to solve. But there's always an edge case, right? Like to your point, Kevin, like people always figure out a way around it. And that's actually true. It's really interesting because just thinking back to some of my previous experiences, the same is actually true for direct to consumer products, right?
Starting point is 00:29:17 If you think about a business creating a user experience or user interface for their own employees or staff to do their job, someone's always going to find a way to sort of shortcut the process. And the same problem actually applies with, let's just say a consumer mobile app, right? You try to set these guardrails for onboarding and activation, and inevitably someone figures out a way to do something weird that creates a poor experience, both for them as a user, and then also the business who's trying to optimize the experience. And so it is. Well, you just to that point, if you accept it. So first it's like, yeah, except there's like, it's going to happen.
Starting point is 00:29:59 But previously to solve for this, it was really, really hard, right? Like this is something like, that's why I think a lot of people would throw their hands up and like, really, really hard, right? Like this is something like, that's why I think a lot of people would throw their hands up and like, Oh, it happens. Right. But it's almost like accepting a margin of, of error or sort of, and I've actually seen this before. It's just like, okay, well, our reporting is probably just going to be X percent off because there are these sort of edge cases, right? And so fine, like we'll just deal with that.
Starting point is 00:30:29 But accepting that sometimes having those margin of error exceptions is, it really ruins the reporting too. Like even, especially when you, if you're trying to understand like, like adoption patterns in your app and you've got a bunch of employees that slip through the cracks, right? That all of a sudden their interactions
Starting point is 00:30:59 are now being tracked, right? It throws all of your understanding off because maybe those employees are doing things with the app that no other user is doing or maybe they are going in trying to look at one thing and then leave and then so your metrics are like oh no we've got a massive churn problem like it could waste huge amounts of money time and energy because the reporting is a bit off, quote unquote. Totally. Or actually, time to activation is another one. If you have people who are very familiar with an app and they go through and activate very quickly to do a demo or walk
Starting point is 00:31:36 through or test something, but they already know the user flow ahead of time that they're testing or whatever, your activation time can be skewed significantly by people who complete the process really, really fast, right? So then you have a huge derivation that is pulling the average way down. And so you think that people are actually onboarding to the product way faster than they are. Or I see this all the time on web, especially for people who have signup processes, let's say an app, and they'll have a bunch of their users will go to the website.
Starting point is 00:32:16 They might Google, if they've got users that log into their app, right? And maybe it's a B2B SaaS product or even consumer app product. But it's like you go to the website to log in. There's a bunch of those users that are known customers. They're known identities. But yet they often are showing up in Google Analytics reports or things as just regular visitors. So you could be looking at a bunch of reports. And if you're not segmenting bunch of reports and if you're not
Starting point is 00:32:45 segmenting your data properly, if you're not accounting for the fact that this stuff happens and filtering it, then it can throw off all these other metrics. So someone could look at the reporting and go, wow, our campaign's working when really 80% of those are all just people going to the site to log in. Well, you really should actually be removing all of those visitors from your reporting because they're not marketing visitors. They are known visitors. And so if you're marketing, trying to figure out if your campaign's working, maybe it isn't. Maybe most of your visits are just people who will come back anyway. And so figuring out when to flag these things and how to filter them at the point of collection, I think, is really important.
Starting point is 00:33:33 I've seen this actually in a situation where someone is thinking they have a massive churn problem when really it was just a data problem. Like they were measuring churn improperly or they didn't know how to measure it. And so maybe they were going based on number of unique identities in the system. But what they really should be doing is looking at people who were built and go from that as the source of truth.
Starting point is 00:34:02 So sometimes it's just changing your source to power a certain metric or accounting for the fact that you might have duplicates. There is sometimes a data solution to first figure out what your baseline is at. Because it can completely change your decision-making and you might invest in fixing a problem that actually isn't a problem, right?
Starting point is 00:34:34 A quick question. You mentioned a bit earlier that the company can establish, let's say, the right mechanism there to figure out when issues with data and around quality specifically happen. Can you give us a little bit more context around that? What kind of mechanism a company can use to detect that, for example, the addresses problem that you mentioned, right?
Starting point is 00:35:02 Addresses is a big problem. So what I usually start with is, I mean, there's been very, I have more recent theses on this since, but where I started from, which I think is a good baseline, is even if you only service a certain market, right? A certain area of your state, or maybe you're only in US or you're only in Canada. You should store your data in ISO format or in, if you look at Smarty Streets or some of these other APIs that are available, there are these like international APIs that show you what an international address should look like, right? Like don't store things in a way that says, okay, like zip province city. What if it's a rural route?
Starting point is 00:35:51 Like if you ever look at a rural address, sometimes it's like counter road 46, rural route three, intersection of this. Like you can't just kind of assume that everyone can fit into this like address one, address two, city, state, province, or city, state, you know, country. So thinking of things of like, yeah, localities, administrative areas, sub-administrative areas, accounting for the fact that maybe there's not a real address and you have to have latitude and longitude. So if you can just, but you don't have to invent these things. These models already generally exist in certain APIs. Again, Smarty Streets is a good one. Or you could look at ISO standards. And if you stored your stuff in that format and start to create structure to
Starting point is 00:36:39 where the stuff should go, it's like putting it in the right filing cabinet. So at least you can know where to look. And then once you've done that and you have the have the right data model and you don't have to overthink it like i think just like in starting with these well-known international formats um is a good start then other so let's say that you're doing that in postgres as an example, or SQL server or some sort of database. Then you can put things like Sura or Prisma or something on top of that database, which gives you triggers like on update or on insert or on deletion, you could trigger little micro
Starting point is 00:37:22 functions, right? Which could be hosted somewhere. And those micro functions could be things that know that bad data could make its way in accidentally. And at the point of insertion, then start a transformation step that then extracts the unit number from the first part of the address. Like maybe some people put in 200-1 Main Street where 200 is the unit number.
Starting point is 00:37:49 We'll pull that out if you notice there's a dash and convert that to unit 200 and put that in the unit field. So I think like from a tool perspective, previously to do that would have been a lot of work, right? But now because you can basically have your data go into a nice warehouse, you can have an API layer for free to sit on top of that to look for changes. And then that can trigger effectively free functions,
Starting point is 00:38:18 which can clean up these little patterns. You can actually make the data clean itself. Right. And, and hopefully, yeah, force it into, into a standard format and then push that to the various places. So what are the, okay, we talked about addresses and you said that they are like a very common source of issues with data. What other issues you have seen, like more, more commonly, like together with addresses, what else you have seen there? I think person records jump out.
Starting point is 00:38:51 Or if you're using something like Salesforce, where I think a lot of people, where they don't set things up and it causes issues is they don't put unique identifiers. They don't put like unique identifiers they don't similar contacts and so if you have a person that is across multiple accounts like the same contact is in three different accounts as an example yeah in salesforce you should be having a field to store like the unique identifier and that way you can start to tie together in in the future that these contacts are related to each other and then you can basically set up rules to like
Starting point is 00:39:32 sync the three so i think one of the biggest issues i see is just duplicates right and then the second piece is just the quality of what's in a name, right? You'll see a lot of folks put either names, a first name field, or they don't fit. They put the first name and last name in the first name field and keep the last name blank. So like just what gets put in the fields, I think is often an issue. And then even just formatting formatting like i see this all the time from like marketing cloud data but you'll have some contacts that are all caps and some are capitalized and some are all are all small and that would be how it goes out in an email right so you usually be seeing this stuff in the data side and because that actually will reduce your click rate and could
Starting point is 00:40:26 cause more opt-outs if you're saying hey fred and it's all caps so like capitalization putting the right thing in the right field and tracking that these three different contacts might be the same one i would say like the top yeah the top ones actually Actually, it's a very interesting problem, which has to do with identity in general. And especially now that we are using so many different SaaS applications, which each one imposes a data model on their own. When you use Zendesk, they have their own way of representing what a user is. When you are using Salesforce, the same. Your marketing tools,
Starting point is 00:41:07 probably they have a little bit of a difference. And of course, like the people involved that are also different, right? So what's your suggestion on like how to deal with this problem, which is inevitable, right? Like that's how life is. Like we have all these different systems.
Starting point is 00:41:21 Yeah, I think a good, all right, at its most extreme, the best one that I've seen that does a good job of this is the adobe identity i mean it's normally used by very large orgs but i think most orgs can learn from that even if they do a portion of what they do adobe identity says and you can just see all this from their development docs as inspiration but they look at everything's an identity that's attached to a person and the identities can change so there's like you have an identity record and you can set
Starting point is 00:41:52 what is that type of identity and is it a permanent identity is it a is it a ticket like a zendesk identity you can basically come up with your own, like, what is the type of identity that this is? And then you can attach it and detach it from a record, from a contact at any time. It's a little bit overkill for most people. I think if you just were to simplify it, just keeping a record of relationships of, this is a list of identities. And then you have a table that says, okay, this identity is related to this record. Having that somewhere, it can go a long way to at least keep track of these things. Instead of assuming that you'll always be able to correlate them to each other. Just creating this type of relationship mapping is an easy way to
Starting point is 00:42:47 keep track of it. And yes, sorry, go ahead. No, I find very interesting what you're saying. I'm just trying to think of who is managing these identities? Who is responsible at the end? Because what you are doing here is we are trying to solve this problem by adding another level of interaction, let's say. So we say, let's create this concept of identity. And instead of mapping Zendesk to Salesforce, Zendesk to Marketo, Marketo with Salesforce, let's go and do Marketo identity, Salesforce identity, Zendesk identity. And if we do that, of course, then all of them are like mapped, right?
Starting point is 00:43:34 But still, like someone has to manage like this mapping to the reference identity that we're creating on this identity management system. So who does that? Oh, yeah, it's, I don't know if that role exists yet. It's like, I think it's what we'll find. I think we'll find, though, is over time that data quality will become a function of a business. Right? And I think it should.
Starting point is 00:44:00 I mean, it's unfortunate that that's required, but it is a role that is realistically required today to kind of manage the fact that this is going to always occur. And the ones that do invest in managing this are the ones that are going to get way more out of their base because they can infer things that the others can't. And I think just as a quick example, even with phone numbers, I was working with a bunch of records today, a fintech trying to do some outreach to customers over text. And the records provided from the marketing system, some have pluses in front of them, some have brackets, some are just too many digits, some are missing digits. So that affects your ability to reach people, right? If you're expecting the format to always be clean, we would have rejected 30 to 40% of the records. And so you'd be marketing to less people. Once we were able to standardize all that in an international format, now you're reaching 80 or 90 something percent of the records that were provided because we
Starting point is 00:45:09 were able to standardize it. But if you don't put someone in that role of responsibility to ensure quality, you actually could be really hampering your ability to do marketing or to infer things. Sure. Yeah. I I was gonna, jumping back to Adobe, it's such an interesting point. And Kevin, having been a past user of some of the Adobe Marketing Cloud products, I think they get a bad rap for being a huge, expensive monolith
Starting point is 00:45:40 and in many ways they are, but it's really powerful technology. And I think it's a great, I just loved your comment of sort of looking at their developer docs for inspiration. I think the challenge that a lot of companies face is one, it's unattainable from a cost perspective. And then two, the question Costas asked is who manages sort of the central identity? Well, in the Adobe world, it's Adobe, right? And so you're locked in and it creates a huge amount of inflexibility, which I think is very problematic in many ways. Well, I think most people should manage that in their own warehouse.
Starting point is 00:46:23 Now, the question is, is okay what does that look like right what's the schema for that and what's the like what's what's the turnkey way for them to manage their own identity system in their own environment and i think i think there's a lot of folks trying to get solutions in the market to solve that. It might still take some time, right? I don't know that this is something people can just buy an out of the box thing today and it will just work magically to solve all their identity issues and run in their own environment. I think it's only become clear that this is a problem that has to get fixed. So it's going to take some time before anyone, just anyone can do this. I think the, you could start, you can start in a simple way, right? You could basically
Starting point is 00:47:14 have a con, you could have a table that just stores like really hard coded things, like have a column for like, here's an ID, like here's your main ID and you have a column for Zendesk ID, you have a column for Salesforce ID, you have a column for Mercado ID and then kind of just track that. And that might be, it's like a shortcut, right? Where you're not trying to manage a whole identity layer,
Starting point is 00:47:37 but you're at least trying to map the relationship that these three ideas are all tied to the same contact. That's like, it's just a little step up from what someone might do on their own. I mean, if you really want to cheat, you could just have a contact record. When we talk about person record, you could have a person record
Starting point is 00:47:57 with some extra columns in it. And if you don't want to get into the whole relationship mapping piece, just add some columns for these different identities. And that will allow you to eventually tie them together. But just having them stored somewhere is better than it just being all up in the air and hoping that you can always match based on email address. Because that's usually what people do. They'll go and they'll try and just match on that.
Starting point is 00:48:25 But maybe Salesforce has someone's working mail and HubSpot has somebody's Gmail. So you won't really be able to match them if you just think that you're always going to be able to go based on email. So yeah, I think there's shortcuts. To solve this, I don't think you have to jump right ahead to this perfect world identity management thing. I would agree with you that relying on a vendor to hold all of those identities is dangerous because what if you want to move? What if you want to take control of that? You're not going to be able to get that perfect export of all the Adobe
Starting point is 00:49:06 IDs that they've created. Sure. Yeah. They do make it easy, but also like Adobe, the Adobe identity is kind of an overkill solution for most companies that don't have that type of complexity. So yeah, yeah. It's definitely something people should take on themselves. For sure. Well, you answered the question. We're at the buzzer here. Brooks is telling us that we're at the buzzer. So we need to close the show out, but I was going to ask you what's the starting point, but I couldn't agree more that the starting point is actually just beginning to tie together some of the basic pieces of unique identifiers from the various places in your stack to build a foundation for that unified profile in the warehouse. And even if you do the basics, like you said, where you're literally just sort of mapping the
Starting point is 00:49:58 unique ideas across tools is such a useful foundation to build for the future. Kevin, one thing we didn't talk about that we discussed before the show is you have built some unbelievably fast and SEO performant websites, literally just using technology to sort of push pages to the first page of Google. We didn't get a chance to talk about that, but would you come back on the show and can we break down the stack for sort of the latest, greatest SEO performant website stack, especially relative to the data piece? Would love to have you come back on the show if you'd be willing. Yeah, that'd be great. And especially because to be able to do those cool things with fast web you know experiences the data model really is important like you
Starting point is 00:50:48 you need to put your data in a certain format you need to have a certain flow working because that you have to make it so the browser doesn't do any of the work and the reason why things are slow is because sites generally, 99-ish percent, 98% of the web works this way where they put all the work on the browser. Someone has to, they go to the site and then it has to make a whole bunch of stops to get all the information that the user might be asking for. And all those things take time. And there's a whole bunch of calculations and work done by the browser to present it. And so if you want the browser to do no work and just present information instantly in under half a second, the data model needs to be pretty clean on the back end to make that possible. But yeah, once you get there, the benefit is you can do some pretty cool stuff. So yeah, happy to walk through
Starting point is 00:51:49 how someone could go about setting that up. Love it. Well, that was a great preview. We'll have Kevin back on the show. Kevin, awesome discussion. I learned a ton. Thank you so much for giving us some time and we'll talk with you again soon.
Starting point is 00:52:06 Yeah, thanks. It was great much for giving us some time. And we'll talk with you again soon. Yeah, thanks. It was great to chat about all things data. It always is. Fascinating conversation. My big takeaway was when Kevin said, all a business is, is the flow of data. I haven't really chewed on that statement enough to know whether I have a strong conviction about it, but it was very thought provoking. And in many ways, I think makes sense when you sort of break a business down into its component parts,
Starting point is 00:52:38 even the conversation that maybe a salesperson is having with a prospect, the content of that conversation is data. And so that was very thought provoking to me. So I think that's probably what I'll be chewing on this week is that statement. How about you? Yeah, absolutely. I really enjoyed the conversation that we had with him about modeling and abstractions around data. I think what I'll keep from this conversation is that in order to be as correct as possible or be able to have the right mechanisms in place to monitor quality or like reacting issues, you need to have a good abstract model of how your world and how your company and how all the functions and your interactions with the customers are going to be. That's what I'm going to keep. I think it's a very,
Starting point is 00:53:34 it's a piece of wisdom that we took from him. And I think it's a great advice for every engineer out there that before you start implementing, like spending time in designing things and thinking about why things should be organized in a certain way. It's something that's super, super, super important. And it comes with maturity. I mean, it's not a coincidence that he had to mess with so many issues related to data to come to this conclusion at the end. So yeah, that was, I think, a very important part of our conversation. And that's something that I definitely think about and keep.
Starting point is 00:54:18 Absolutely. Well, thanks again for joining us on the Data Stack Show, and we will catch you on the next episode. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, ericdodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers.
Starting point is 00:54:50 Learn how to build a CDP on your data warehouse at rudderstack.com.
