The Data Stack Show - 61: What is Data Design? With Kevin Gervais of Touchless
Episode Date: November 10, 2021
Highlights from this week’s conversation include:
Kevin’s interaction with data at an early age (2:35)
Working with telecom data (5:08)
Analyzing emojis in customer sentiment (8:44)
Infrastructure needed for diverse data (12:22)
Building better interfaces and looking out for human error (24:17)
Dealing with differences in identities in different layers of the stack (41:21)
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are
run at top companies.
The Data Stack Show is brought to you by RudderStack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome back to the Data Stack Show. Today, we're going to chat with Kevin Gervais. He's done a lot
of interesting things with data. He's been working with data for a very long time. And the topic that
we want to get into today with him is data design. And his philosophy is that before you start
talking about any of the technology or data flows or infrastructure involved with data,
you need to design the data itself. Really fascinating stuff. Kostas, I'm interested to know
what led Kevin to this philosophy. You don't arrive at sort of a thesis like data design without having gone through
probably a lot of painful experiences with data, which isn't uncommon. So I'm really interested to
hear what his background is and where he sort of built the foundations of this theory.
Yeah, a hundred percent. And I'd love to get into more detail
about what this whole data design thing is.
We keep forgetting that data has a shape
and we define this shape
and it has some properties
where again, we define these properties,
but we don't talk about that much,
mainly because we have like some other
more primitive problems to solve.
But when it comes to building like sustainable data infrastructure, it's inevitable to get
into like this kind of design conversations and like how you model the world.
And it becomes like a little bit more philosophical, let's say, but it's a very, very, very important
aspect of working with data.
And probably like the first time that we are going
to discuss about that stuff. So I'm very, very excited. Great. Well, let's jump in and chat
with Kevin. Let's do it. Kevin, welcome to the Data Stack Show. We're super excited to chat with
you about all things data and specifically data design. Great to be here. It's a good day to
talk about data. Always is, right? Every day.
Give us a little background.
I mean, you've actually been working with data since you were a child, literally, which is pretty
wild, but just give us a little bit of your story and then tell us what you're doing today.
Yeah, absolutely.
My life started once I got into data.
Now, I've always been really fascinated about organizing things. And I mean, we were chatting earlier about,
like even just growing up,
we were using Macs that didn't have many games on them.
And so back then, that's where I learned
that we could use AppleScript
to change the data inside of an app.
And by changing a couple of things,
you can make a storm trooper
replace the icon of what was some boring thing before.
Or you could use data to, you change one piece of data and you can change the sound of something.
And so just learning that working with data can provide immediate gratification, that you can actually see the impact of it instantly, has always fascinated me.
And even just the progression of it: when I was getting into web design, someone asked us to build sites for organizing embroidery, like shirts that would get embroidered, and they handed us a bunch of CDs and a bunch of catalogs and said, go figure it out. Realizing the satisfaction
that you can get out of just organizing stuff, that has always been a passion. So yeah, I spent about 15
years doing that on the website side of things, working on e-commerce and different web projects,
and then got into the telecom sector for the last eight years.
And that's where you start to see what does data look like when it's really clean
or when it is standardized.
And even then it's not beautiful.
But just kind of seeing how have others dealt with some of these things
and how do they organize their things has been fascinating.
And I've been privileged to have been exposed to so many different situations of how data
can be organized, good and bad. So yeah, it's something I love talking about.
Yeah. One question on the telecom side of things, what kinds of data were you dealing with in
telecom in terms of format? I mean,
is it sort of standard stuff or like, I'm just interested to know in the telecom sector,
what are the most sort of common types of data? And then maybe some of the more challenging types
of data that you dealt with in telecom. That's a good question. No one's ever asked me that.
No. It's so personal. What data did you get exposed to? Oh my. We try
to dive deep here on the data. So this is a little personal. I'm just going to get emotional because
it's going to bring back memories. So the business we were in was trying to help telecom
companies better serve their customers, so they'd have a better life cycle with them. So instead of a random
person from a call center talking to someone out of the blue that they've never met, we were working
with telecoms, a number of them, to make it so the person that sold somebody a phone or a tablet
was the person that would follow up and keep that relationship alive for years, and do that over text.
So in order to have a great relationship like that, they had to have context.
So you're working with transactional records, purchase records, what packages are they on?
How long has it been since they were talked to last?
Notes, history.
And then, so as we got into that, then we had to deal with conversational data. So
you'd have to deal with like, how do you determine sentiment when most of the APIs that are out there
are trained on like say email communication or well-formatted sentences, but how do you
look at the sentiment of somebody who's replying with acronyms over text
or an emoji, right? And so we had to deal with a lot of data that, millions of records of data
that you couldn't just apply these standards to. And then we got into POS data too, because
if you're trying to figure out how to have a good conversation with someone, or whether this conversation is working, or whether this script is working,
you have to tie it back to transactional data. And even in that scenario, the carrier would have certain data about a customer, only the products that they sold,
but then the store that sold stuff to them would know about accessories and other stuff that the
carrier doesn't know. And so we had to marry these two things without worrying about duplicates. So
it accidentally ended up that we put ourselves in the middle of all of these crazy
data problems, and we had to actually solve a lot of them in order for us to do our job and have
accurate reporting, right? Like, is this campaign working? You need to deal with all these
different things. So yeah, it was a very interesting experience of having to be exposed
to different formats. And also I think the surprising thing out of that is just seeing
that even these large companies that spend hundreds of millions of dollars on some of their
systems, they don't have the cleanest data either, right? So everyone seems to maybe dream of the day that, oh, some day
I'm going to have everything all perfectly clean. No, I'm sorry,
it's not going to happen. It's just how much
mess are you willing to have today,
but there'll always be a mess. Yeah.
I mean, Kostas's words ring in my ears all the time. I mean, data in general is
messy. Customer data tends to be very messy. I have one very click-baity question for you.
How did you deal with emojis and sentiment? That's just a really interesting topic that I actually
think is probably pretty relevant.
Well, emoji is data too, right?
It's all converted into an ASCII code or basically, so just being able to understand which code
means what, but knowing also that which ones are inappropriate, which ones are inferring
something very negative in some cases. You could have a very positive statement
with a series of emojis after and the emojis cancel out the meaning of the words. And so,
yeah, it was interesting. In the heart of COVID, a lot of these
telecoms shut their stores right away. They've since opened them up,
but as soon as COVID hit last year, everything shut down. 80% of them just shut their stores.
And we were trying to understand what was the pulse of people who were still buying.
Or when carriers were reaching out to customers, what were they saying? So,
in that respect,
we came up with a model where
we detected that certain phrases or series of emojis could indicate whether
someone was afraid, joyful, or sad.
And then we compared that to prior periods to come up with a bit of an index
of what the consumer sentiment is during this time of crisis.
And we did see a difference.
Whenever the stocks would dive, we saw an increase of fear in the way that people replied.
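As a rough illustration of the idea described here, the sketch below tags emotions in short text replies using a tiny lookup table that covers words, acronyms, and emoji alike. The lexicon entries and function names are hypothetical and are not the model discussed in the conversation. One technical note: emoji are encoded as Unicode code points rather than ASCII, which is what `describe` shows.

```python
import unicodedata

# Hypothetical lexicon: every entry is an illustrative assumption,
# not the model described in the conversation.
EMOTION_LEXICON = {
    "\U0001F600": "joy",      # grinning face
    "\U0001F622": "sadness",  # crying face
    "\U0001F628": "fear",     # fearful face
    "thanks": "joy",
    "worried": "fear",
    "omg": "fear",
}


def describe(char: str) -> str:
    """Show that an emoji is ordinary data: a Unicode code point with a name."""
    return f"U+{ord(char):04X} {unicodedata.name(char, 'UNKNOWN')}"


def tag_emotions(reply: str) -> list[str]:
    """Return the emotions found in a reply, from words and emoji alike."""
    found = []
    for token in reply.lower().split():
        # Check the whole token first (words, acronyms) ...
        if token in EMOTION_LEXICON:
            found.append(EMOTION_LEXICON[token])
            continue
        # ... then each character, so emoji glued together still count.
        found.extend(EMOTION_LEXICON[c] for c in token if c in EMOTION_LEXICON)
    return found
```

A reply such as "thanks" followed by two fearful-face emoji yields one joy tag and two fear tags, which is the kind of case where the emoji outweigh the words.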
Yeah, so I think what was fascinating, actually, is that when we got into helping people do outreach, right away we created this concept of standardized lists and standardized chat starters. And so since the
beginning with that business, because the chat starters in some cases never changed over the years, we were always able to know,
for a given campaign to a given segment, what replies we should expect to see.
It was all kind of standardized right
at the beginning, and because we did that, it allowed us to come up with these patterns that you wouldn't otherwise get.
Because if you weren't always asking the same question, you wouldn't be able to know whether the sentiment is changing or not.
If you're trying to measure sentiment just based on a random conversation that people can just type, the data is going to be all over the place. So yeah, I guess the learning was that because we worried about being able to tie it back to specific
baselines, right, like cohorts and scripts, right at the beginning, that was an enabler for us to
do some of these types of sentiment analysis, because we had something to go back to.
We knew how people replied to that same question that people would ask, like,
hey, is the phone you're using working out,
or do you have any questions about your phone?
We knew how people replied to that
the week before all those things shut down.
And so when people reacted differently
to that same question, it was like, huh.
Interesting.
That there was interest.
Yeah, it was some good learnings from that.
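The baseline comparison described above can be sketched roughly as follows: for one fixed chat starter, compare the share of each emotion label in a baseline period against a later period. The function names and the sample labels in the usage note are made up for illustration.

```python
from collections import Counter


def emotion_shares(labels):
    """Turn a list of emotion labels into a {label: share} mapping."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}


def shift_from_baseline(baseline_labels, current_labels):
    """Change in each emotion's share for the same chat starter across two periods."""
    base = emotion_shares(baseline_labels)
    now = emotion_shares(current_labels)
    labels = sorted(set(base) | set(now))
    return {k: round(now.get(k, 0.0) - base.get(k, 0.0), 3) for k in labels}
```

With invented data, a baseline of three joy replies and one fear reply, followed by a period with one joy and three fear, gives a shift of +0.5 for fear and -0.5 for joy. The comparison only means something because the question asked stayed the same.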
So, Kevin, what kind of infrastructure do you need
in order to deal with such a diversity in the data
that you are working with?
How did you manage to work with all this data
in a consistent way, right?
It took us a while to figure that out.
I know I could tell you how not to do it.
Well, I think actually what we had to deal with is what I think a lot of companies do.
It's the reality of a lot of folks because our business started out where people would upload CSVs.
Everyone knows what a CSV is.
Okay, so upload CSVs.
They would give us structured data and we would upload it. And so at the time when we first started the company, it was like, oh,
this is what we do. Somebody gives us a file that always looks like this. And so we will have
columns in the database that are exactly the columns that we received, no problem. And we did
that for years. And then once people started giving us new types of files,
we were like, oh, okay,
I guess we got to jam these into these columns we had before.
And then as we got into more and more types of data,
it became messy, right?
For us to figure out.
I think the main thing that we came away with later
is that we shouldn't have been so opinionated at the beginning about having columns for specific types.
We shouldn't always assume that there's a column called subscriber ID.
Yeah.
Because maybe it's not a subscriber ID.
Maybe it's an accounting ID or maybe it's a Salesforce ID. And so I think the lesson out of that is we should have structured the data based on what type it was, right?
Was it an identity?
Was it a person record?
Was it an org record?
Or is it an event?
What we ended up moving to with the new architecture is an event-based CQRS model
where you're event sourcing, right? So you're actually designing the domains
of your data. Yep. And then we're using Axon DB and a bunch of other stuff to kind of
force everything into events, and then that creates your model. But that was a lot of work
and extremely difficult. And I think if we had put the data into a more
universal format at the beginning, like just realizing that the names of our columns probably
will matter, or not always expecting everything to be
perfect integers in a column, we could have saved ourselves a lot of pain. And I think
most businesses, maybe they're not working with the same scale
of data, but I think every business has a life cycle to the data that they have.
The data that they collect at the beginning is different than the types of data that they collect five years down the road.
They might change their billing system out.
They might change out their CRM.
They might want to change their CRM out in the future.
And so I think just designing for agility, right,
becomes really important.
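To make the idea of structuring data by what it is (an identity, an event) rather than by fixed columns more concrete, here is a minimal sketch. It is not the Axon API and not the architecture described in the conversation; the class names, event types, and payload keys are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Identity:
    """An identifier tagged with its source system, instead of a fixed column."""
    system: str   # e.g. "carrier", "accounting", "salesforce" (illustrative)
    value: str


@dataclass(frozen=True)
class Event:
    """One immutable fact about a subject; the type names are hypothetical."""
    subject: Identity
    type: str
    payload: dict = field(default_factory=dict)


def project(events):
    """Fold the event log into a current-state view per subject (the read side)."""
    state = {}
    for event in events:
        record = state.setdefault(event.subject, {})
        if event.type == "PlanChanged":
            record["plan"] = event.payload["plan"]
        elif event.type == "AccessoryPurchased":
            record.setdefault("accessories", []).append(event.payload["sku"])
    return state
```

Because the log is the source of truth, a new kind of file or a new source system becomes a new `Identity.system` value or a new event type, rather than another column squeezed into an old table.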
That's a great point.
Actually, and I'd like to hear like your opinion on that
because I think that models change
not just because of ignorance in the beginning, right?
It's also because at the end,
we are building these data models
to represent somehow reality, the business reality,
and the business reality changes, right? Like if we think about, I don't know, a company like
RudderStack, like a startup, right? What RudderStack was a year ago compared to what it is today,
it's a completely different thing. And of course this is also represented in the data models that
we have. Could we have done a much better job back then? Of course. But I think
that even if we had managed to do the best possible job in modeling our world back then, just
because we didn't know the world that well yet, it would lead us at some point to change
things. So that's why I think what you said about building all these components with agility in mind, and being able to change and adapt your data, is super important.
So how do you do that?
How do you build agile data models?
What principles drive this design?
That's a very good question.
There's 17. Let me tell you all 17.
Too personal?
No, that's good. That's a good one. Here's how I think about it. I think, first, all a business is, is the flow of data.
Everything serves the data in the end. Meaning, let's talk about a website
as an example. You create a design, you put a website up.
What's the whole point of the site? Well, you want someone to call, right? Or you want someone to
text in, or you want someone to fill out a form. Okay, once they fill out a form and maybe start an order, what is it now?
It's data.
So the point of actually a website is to trigger either a data connection to make a phone call
or a data connection to start a text or capture some information and grab that as data
and then flow that to somewhere.
That's really the sole job of the site, right?
And especially if you're trying to,
even if you're branding,
if your site is just to help provide,
make people feel good about the brand,
then how do you know if you're doing that?
Well, then the job of the site is to collect data
to see if you're accomplishing that goal,
which is time on, in that case, time on site.
Are they interacting with the cool pieces that you've put in there that are branding
elements?
Are they watching the videos, et cetera?
So really like data is so important and usually it's thought of as an afterthought.
So I think just recognizing the fact that the flow, the capture, the transformation, and the flow of data is kind of what drives business, right?
And remembering that, I think, is just important because it helps us with the design process.
That where you collect the data is not usually where you want it to end up.
And then also just remembering that where it ends up today is not necessarily where
you want it to end up tomorrow.
Most businesses go through a life cycle or even an evolution in the systems that they
use.
And so to answer your question, like how you go about
designing for it, I think first you have to know your inputs. First, we have to be able to
track all kinds of things, right, all kinds of events. We should be able to identify the types
of things that we're tracking, and we should be able to move those things into different systems
without a whole bunch of work.
And what happens if you do want to switch systems? Because at some point you're going to want to
switch systems out and you're in one CRM one day and you want to go to another. So I think
just having those as inputs into the design process shows some of the variables that you
have to consider. And so then what I've noticed is that you actually can design your data.
So in the web world or even application design,
there's a thought there of user interface design or user experience design, right?
That's a function where everyone kind of
understands, okay, I need to have a person draw up something that someone will interact with.
Where should the button go, right? And it's very easy to start there and kind of only focus on
that because you get that immediate benefit, right? You can draft it, put it out there and someone interacts with it
and you think your job's done. But data needs more design than an interface because data integration,
data transformation doesn't happen by accident. Like if you want your data to flow seamlessly
between systems and to be future-proof, you should design it as much,
and I would argue more than any graphical interface
that you have.
And so just like there's standards
to user experience design,
like don't put your close button in a random place
off to the side of the screen
that you have to like shake your phone in order to see.
That'd be bad design.
You should think of that.
There's similar things in the world of data where there's like we know what a person looks like.
A person, as an example, has a name.
That name can change.
They have a birth date, they have a death date,
and they have probably an identifier attached to that. But that's a person.
I mean, it's sad, but a person is a name, an identifier, a birth date, and a death date.
Yeah. Now, a person can then have identities attached to them. They can have traits attached to them, but those traits and identities can change over time,
right?
Names can change.
Addresses can change.
Even interests, right?
Personalities, gender, those sort of things, people could change that.
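One way to picture the split between what is fixed about a person and what can change is sketched below. The class and field names are illustrative assumptions, not a schema from the conversation: the person record carries only an identifier and the two dates, and everything changeable (including the name) is stored as a trait with a validity window.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Trait:
    """A changeable fact (name, address, interest) with a validity window."""
    kind: str
    value: str
    valid_from: date
    valid_to: Optional[date] = None  # None means still current


@dataclass
class Person:
    """Only the fixed core lives on the person; everything else is a Trait."""
    person_id: str
    birth_date: Optional[date] = None
    death_date: Optional[date] = None
    traits: list = field(default_factory=list)

    def set_trait(self, kind, value, on):
        """Close the current value of this kind (if any) and record the new one."""
        for t in self.traits:
            if t.kind == kind and t.valid_to is None:
                t.valid_to = on
        self.traits.append(Trait(kind, value, on))

    def current(self, kind):
        """Return the value currently in effect for a kind, or None."""
        for t in self.traits:
            if t.kind == kind and t.valid_to is None:
                return t.value
        return None
```

A name change then adds a new trait and closes the old one, so the history is kept and no second person record is created.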
And so when you actually look at how most CRMs treat that data, if you think that's going to be your perfect
data model, if they think of a person as first name, last name, gender, I don't know, like
address and phone number, and that's like a contact, it's no wonder you can see why that
doesn't fit many situations. You end up with duplicates if somebody belongs to multiple organizations,
et cetera. So I think, going back to how you fix it, it's separating out what's fixed and
what could change. So if I think of what an actual person is,
you have a given name, you have maybe a gender at birth, right, that might be on the record, and then
a birth date and a death date, and that's it. Everything else is changeable.
And then a person can be related to various places and reuse, and if we design for things like that,
I think we would end up with a better understanding of the relationships across our data.
Let's take a real-life situation. You have an annoying salesperson who decides to go into your Salesforce
and put a flag there just to remind them whether they have visited a contact or not,
or reached out to a contact or not, without consulting the data model,
without reaching out to the person
who is responsible for the data model or whatever.
How do you deal with that?
And what I mean is like the question is like,
how do you deal with the human nature
of like taking control of things
to achieve what they want at the end, right? Because
the problem that I have seen so far, like with all these things that have to do like with modeling
and having like a very crystal clear, let's say, way of like understanding and distilled way of
understanding like the world around us is that the biggest enemy of this is the rest of the people involved. They make mistakes or they decide that I need something else,
but I need it now.
I'm going to change it.
How do we deal with that?
How do we deal with humans?
I think we can build better interfaces.
I think, with a recent situation, a client I'm working with
had messy contact records and messy addresses, and they wanted
to understand what the patterns are amongst the people. Is there a pattern to
customers living in a certain area? Do they seem to be getting more people from a certain
area? And in order to do that, we looked at the data, and it was human entry error,
where so many addresses would have notes in them, dashes, weird quotes, and the unit,
instead of being in a unit number field, was right in the actual
address. And we recently fixed over 50,000 records last weekend,
just to get to some sense of standardization. And then once we did, we provided
instructions: please make things all in capitals. And even still, because it's human nature, to your point,
even if you do all the cleaning, right, because this is like an extreme example where
we actually cleaned everything, standardized everything, and gave instructions, even
still, because the interface allowed for it, people would go in and skip through it
by just putting a period, or they would type the name of the city wrong. It wasn't on purpose. It's not because they
wanted to mess with the model. It was because the interface let them. So I think
ultimately you need to know where you want to end up, but then to actually solve it, don't give people the ability
to mess it up. So I think just being willing to enforce that and build interfaces that check for
quality or check for duplicates, that's really the responsibility of a business providing a tool to their staff, it's humane. It's more humane.
It's more empathetic for a business to put those filters in place to prevent issues ahead of time.
Because when they don't, you're just going to frustrate everybody. You're going to get inefficiency. You're going to have
bad reporting. You may even
get angry at them: why did you put a space there? Why did you put a dot? And they actually can't help
it because the interface is letting them. So I think first just being willing to fix the interface
so you don't have bad data coming in.
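A check like the one described, done at the point of entry, might look roughly like this. It is only a sketch: the placeholder list, the allowed characters, and the idea of validating against a set of known cities are assumptions made for the example.

```python
import re

# Values people type just to get past a required field (illustrative list).
PLACEHOLDERS = {"", ".", "-", "n/a", "na", "none"}


def validate_city(raw, known_cities):
    """Return (cleaned_value, error). Exactly one of the two is None."""
    # Collapse whitespace and standardize on capitals.
    value = " ".join(raw.split()).upper()
    if value.lower() in PLACEHOLDERS:
        return None, "City is required."
    if not re.fullmatch(r"[A-Z][A-Z .'\-]*", value):
        return None, "City contains unexpected characters."
    if value not in known_cities:
        return None, f"Unknown city: {value}"
    return value, None
```

A form built on this would refuse a lone period or a misspelled city and show the error immediately, instead of letting the value reach the database and be cleaned up later.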
And then the other thing is I would call it data management.
Like I think the other thing that we're noticing
is even if someone were to go through all of the filters somehow
and found their way to put bad data in,
having a way of going through the warehouse
and cleaning it automatically,
just like watching for issues.
It is something that you can detect as a business and fix and then push those things to the various
sources once you've corrected it. Because knowing that there is probably going to be someone who
will find a way around all the controls you put. But don't accept
it, right? Like a lot of people throw their hands up. Sorry, go ahead.
Yeah. I think I have a good example that is going to resonate very well with Eric.
One of the most frustrating things that happens when you build a new product is when your developers start signing up to test things,
right? So you get into this situation where you want to start tracking signups, of course,
but at the same time you have people signing up that you don't want to include in
your measurements, because they're your developers, right? And you have to clean this data, of course.
And so one day you come in and you're like, listen guys, we have to fix this problem, okay?
So from now on you are going to be using a specific format of email, so
I can go and easily filter it. Well, guess what? Everyone agrees to that, but it's not happening.
Yeah. I mean, that's it.
I mean, it's hilarious because it sounds like such a simple problem to solve.
But there's always an edge case, right? Like to your point, Kevin, like people always figure out a way
around it. And that's actually true. It's really interesting because just thinking back to some of
my previous experiences, the same is actually true for direct to consumer products, right?
If you think about a business creating a user experience or user interface for their own employees or staff to do their job,
someone's always going to find a way to sort of shortcut the process. And the same problem
actually applies with, let's just say a consumer mobile app, right? You try to set these guardrails
for onboarding and activation, and inevitably someone figures out a way to do something weird that creates a poor experience,
both for them as a user, and then also the business who's trying to optimize the experience.
And so it is.
Well, just to that point, if you accept it.
So first it's like, yeah, accept that it's going to happen.
But previously, to solve for this was really, really hard, right?
That's why I think a lot of people would throw their hands up and say,
oh, it happens. Right.
But it's almost like accepting a margin of,
of error or sort of, and I've actually seen this before. It's just like,
okay, well,
our reporting is probably just going to be X percent off because there are these sort of edge cases, right? And so fine, like we'll just deal with that.
But accepting those margin-of-error exceptions
really ruins the reporting too.
Like even, especially when you,
if you're trying to understand like,
like adoption patterns in your app
and you've got a bunch of employees
that slip through the cracks, right?
That all of a sudden their interactions
are now being tracked, right?
It throws all of your understanding off
because maybe those employees are doing things
with the app that no other user is doing, or maybe they are going in trying to look at one thing and
then leaving. And then your metrics are like, oh no, we've got a massive churn problem. It could
waste huge amounts of money, time, and energy because the reporting is a bit off, quote unquote.
Totally. Or actually, time to activation is another one. If you have people who are very
familiar with an app and they go through and activate very quickly to do a demo or walk
through or test something, but they already know the user flow ahead of time that they're testing
or whatever, your activation time can be
skewed significantly by people who complete the process really, really fast, right?
So then you have a huge deviation that is pulling the average way down.
And so you think that people are actually onboarding to the product way faster than
they are.
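The skew described here is easy to reproduce with a small sketch. It assumes, purely for illustration, that staff sign up with a company email domain (`example.com` below); as noted earlier in the conversation, such a convention only helps if people actually follow it.

```python
from statistics import mean, median

INTERNAL_DOMAINS = {"example.com"}  # assumption: staff sign up with company email


def is_internal(email):
    """True when the address belongs to a domain we treat as staff."""
    return email.rsplit("@", 1)[-1].lower() in INTERNAL_DOMAINS


def activation_stats(signups):
    """signups: iterable of (email, minutes_to_activate). Internal users are dropped."""
    minutes = [m for email, m in signups if not is_internal(email)]
    if not minutes:
        return None
    return {"mean": mean(minutes), "median": median(minutes)}
```

With made-up numbers, two staff accounts activating in 2 and 3 minutes next to two customers at 40 and 60 minutes give an unfiltered mean of 26.25 minutes, while the filtered mean is 50.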
Or I see this all the time on web, especially for people who have signup processes. Let's
say there's an app, and a bunch of their users will go to the website,
they might Google it, just to log into the app, right?
Maybe it's a B2B SaaS product or even a consumer app product, but you go to the website to log in.
There's a bunch of those users that are known customers.
They're known identities.
But yet they often are showing up in Google Analytics reports or things as just regular
visitors.
So you could be looking at a bunch of reports, and if you're not
segmenting your data properly, if you're not accounting for the fact that this stuff happens
and filtering it, then it can throw off all these other metrics. So someone could look at the
reporting and go, wow, our campaign's working when really 80% of those are all just people
going to the site to log in. Well, you really should actually be removing all of those visitors from your reporting
because they're not marketing visitors.
They are known visitors.
And so if you're marketing, trying to figure out if your campaign's working, maybe it isn't.
Maybe most of your visits are just people who will come back anyway. And so figuring out when to flag these things and how to filter them at the point of collection, I think, is really important.
I've seen this actually in a situation where someone is thinking they have a massive churn problem when really it was just a data problem.
Like they were measuring churn improperly
or they didn't know how to measure it.
And so maybe they were going based on
number of unique identities in the system.
But what they really should be doing
is looking at people who were billed
and going from that as the source of truth.
So sometimes it's just changing your source
to power a certain metric
or accounting for the fact that you might have duplicates.
There is sometimes a data solution
to first figure out what your baseline is at.
Because it can completely change your decision-making
and you might invest in fixing a problem
that actually isn't a problem, right?
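A small sketch shows how the choice of source changes the churn number. The function and the sample identifiers in the usage note are hypothetical; the point is only that duplicate identities can look like lost customers.

```python
def churn_rate(previous, current):
    """Share of previous-period members that are absent in the current period."""
    previous, current = set(previous), set(current)
    if not previous:
        return 0.0
    return len(previous - current) / len(previous)
```

Suppose last month's identity table holds four entries, two of which are the same person under different emails, and this month one of those duplicates is gone: churn by identities is 25%. Measured from the three billed accounts, which are unchanged, churn is 0%.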
A quick question.
You mentioned a bit earlier that the company can establish,
let's say, the right mechanism there to figure out
when issues with data
and around quality specifically happen.
Can you give us a little bit more context around that?
What kind of mechanism a company can use to detect that, for example,
the addresses problem that you mentioned, right?
Addresses is a big problem.
So what I usually start with is, I mean, there's been very, I have more recent theses on this since, but where I started from, which I think is a good baseline, is even if you only service
a certain market, right?
A certain area of your state, or maybe you're only in US or you're only
in Canada, you should store your data in ISO format. Or, if you look at SmartyStreets or
some of these other APIs that are available, there are these international APIs that show you
what an international address should look like, right? Don't store things in a way that says, okay, ZIP, province, city.
What if it's a rural route?
If you ever look at a rural address, sometimes it's like County Road 46,
Rural Route 3, intersection of this.
You can't just assume that everyone can fit into this address one, address two, city, state or province, country. So think
in terms of localities, administrative areas, sub-administrative areas, and account for
the fact that maybe there's not a real address and you have to have latitude and longitude.
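A minimal sketch of what such a record could look like follows. The field names are illustrative only and are not taken from any particular API or standard; the one assumption borrowed from a real standard is that `country_code` holds an ISO 3166-1 alpha-2 code. The coordinates in the example are made up.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Address:
    country_code: str  # assumed to be ISO 3166-1 alpha-2, e.g. "CA" or "US"
    address_lines: List[str] = field(default_factory=list)  # free-form lines
    locality: Optional[str] = None                 # city, town, village
    administrative_area: Optional[str] = None      # state, province, region
    sub_administrative_area: Optional[str] = None  # county, district
    postal_code: Optional[str] = None
    latitude: Optional[float] = None   # fallback when no street address exists
    longitude: Optional[float] = None

    def is_locatable(self) -> bool:
        """True if there is either a street-style address or coordinates."""
        has_street = bool(self.address_lines) and self.locality is not None
        has_coords = self.latitude is not None and self.longitude is not None
        return has_street or has_coords


# A rural address that would not fit an address1/address2/city/state form.
rural = Address(
    country_code="CA",
    address_lines=["County Road 46", "Rural Route 3"],
    administrative_area="ON",
    latitude=44.1,     # made-up coordinates for illustration
    longitude=-79.5,
)

print(rural.is_locatable())
```

Every field except the country is optional, so a record with only coordinates, or only free-form lines plus a locality, still has a defined place to live.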
But you don't have to invent these things. These models
already generally exist in certain APIs. Again, SmartyStreets is a good one. Or you could look
at ISO standards. And if you store your stuff in that format and start to create structure for
where the stuff should go, it's like putting it in the right filing cabinet. So at least you can
know where to look. And then once you've done that and you have the right data model,
and you don't have to overthink it, I think just starting with
these well-known international formats is a good start, then let's say that you're
doing that in Postgres as an example, or SQL Server or some sort
of database.
Then you can put things like Hasura or Prisma or something on top of that database, which
gives you triggers, like on update or on insert or on deletion, where you could trigger little micro
functions, right?
Which could be hosted somewhere.
And those micro functions could be things that know that bad data could make its way in accidentally.
And at the point of insertion,
then start a transformation step that then extracts the unit number from the
first part of the address.
Like maybe some people put in 200-1 Main Street
where 200 is the unit number.
We'll pull that out if you notice there's a dash
and convert that to unit 200
and put that in the unit field.
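The cleanup step described here could be sketched roughly as below. The function and column names (`split_unit`, `clean_on_insert`, `address_line_1`, `unit`) are hypothetical, and the rule that a leading "number-dash" is a unit number is the assumption from the conversation; in some regions a dash is part of the street number, so a real function would need to be locale-aware.

```python
import re
from typing import Optional, Tuple

# Leading digits, a dash, then a street number followed by the street name.
_UNIT_PREFIX = re.compile(r"^\s*(\d+)\s*-\s*(\d+\s+.+)$")


def split_unit(address_line: str) -> Tuple[Optional[str], str]:
    """Return (unit, street) if the line looks like 'UNIT-NUMBER Street'."""
    match = _UNIT_PREFIX.match(address_line)
    if match is None:
        return None, address_line.strip()
    return match.group(1), match.group(2).strip()


def clean_on_insert(row: dict) -> dict:
    """Hypothetical handler that an insert trigger could call with the new row."""
    cleaned = dict(row)
    if not cleaned.get("unit"):  # never overwrite a unit that was entered properly
        unit, street = split_unit(cleaned.get("address_line_1", ""))
        if unit is not None:
            cleaned["unit"] = unit
            cleaned["address_line_1"] = street
    return cleaned


print(split_unit("200-1 Main Street"))
print(split_unit("15 Oak Avenue"))
print(clean_on_insert({"address_line_1": "200-1 Main Street"}))
```

The handler returns a new row rather than mutating its input, which keeps it easy to test independently of whatever trigger mechanism calls it.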
So I think like from a tool perspective,
previously to do that would have been a lot of work, right?
But now because you can basically have your data go into a nice warehouse,
you can have an API layer for free to sit on top of that to look for changes.
And then that can trigger effectively free functions,
which can clean up these little patterns.
You can actually make the data clean itself.
Right.
And hopefully, yeah, force it into a standard format and then push that to the various places.
So what are the, okay, we talked about addresses and you said that they are like a very common
source of issues with data.
What other issues have you seen more commonly? Together with addresses, what else have you seen there?
I think person records jump out.
Or if you're using something like Salesforce,
where I think a lot of people don't set things up properly
and it causes issues, is they don't put in unique identifiers
and they don't tie similar contacts together.
So if you have a person that is across multiple accounts, like the same contact is in
three different accounts as an example, in Salesforce you should have a field to store
the unique identifier. That way you can start to tie together in the future
that these contacts are related to each other, and then you can basically set up rules to
sync the three. So I think one of the biggest issues I see is just duplicates, right? And then
the second piece is just the quality of what's in a name, right? You'll see a lot of folks put names
in fields where they don't fit. They put the first name and last name in the first name
field and keep the last name blank. So just what gets put in the fields, I think, is often
an issue. And then even just formatting. I see this all the time with
Marketing Cloud data, where you'll have some contacts that are all caps, some are capitalized,
and some are all lowercase, and that would be how it goes out in an email, right? So you should usually
be seeing this stuff on the data side, because that actually will reduce your click rate and could
cause more opt-outs if you're saying "Hey, Fred" and it's all caps. So capitalization, putting the
right thing in the right field, and tracking that these three different contacts might be the same
one are, I would say, the top ones.
Actually, it's a very interesting problem, which has to do with identity in general.
And especially now that we are using so many different SaaS applications, each of which
imposes a data model of its own.
When you use Zendesk, they have their own way of representing what a user is.
When you are using Salesforce, the same.
Your marketing tools,
probably they have a little bit of a difference.
And of course, like the people involved
that are also different, right?
So what's your suggestion
on like how to deal with this problem,
which is inevitable, right?
Like that's how life is.
Like we have all these different systems.
Yeah, I think, at its most extreme,
the best one that I've seen that does a
good job of this is the Adobe identity. I mean, it's normally used by very large orgs, but I think most
orgs can learn from that even if they do a portion of what they do. Adobe identity says, and you can
just see all this from their development docs as inspiration, that everything's an
identity that's attached to a
person, and the identities can change. So you have an identity record and you can set
what the type of identity is: is it a permanent identity, is it a ticket, like a
Zendesk identity? You can basically come up with your own definition of what type of identity
this is. And then you can attach it and detach it from a record, from a contact, at any
time. It's a little bit overkill for most people. I think if you just were to simplify it, just
keeping a record of relationships of, this is a list of identities. And then you have a table
that says, okay, this identity is related to this record. Having that somewhere, it can go a long
way to at least keep track of these things. Instead of assuming that you'll always be able
to correlate them to each other. Just creating this type of relationship mapping is an easy way to
keep track of it. And yes, sorry, go ahead.
No, I find what you're saying very interesting. I'm just trying to think: who is managing these identities? Who is responsible
at the end? Because what we are doing here is trying to solve this problem by adding another level of indirection, let's say.
So we say, let's create this concept of identity.
And instead of mapping Zendesk to Salesforce,
Zendesk to Marketo, Marketo with Salesforce,
let's go and do Marketo identity, Salesforce identity, Zendesk identity.
And if we do that, of course, then all of them are like mapped, right?
But still, like someone has to manage like this mapping to the reference identity that
we're creating on this identity management system.
So who does that?
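The simplified version of this idea, a list of identities plus a table relating each identity to a person record, might look like the following sketch. It uses an in-memory SQLite database purely for illustration; the table names, column names, and external IDs are invented and are not taken from Adobe's or any other vendor's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    full_name TEXT NOT NULL
);
CREATE TABLE identity (
    identity_id INTEGER PRIMARY KEY,
    source      TEXT NOT NULL,   -- e.g. 'zendesk', 'salesforce', 'marketo'
    external_id TEXT NOT NULL,   -- the ID that tool uses for the contact
    UNIQUE (source, external_id)
);
CREATE TABLE person_identity (   -- the relationship mapping
    person_id   INTEGER NOT NULL REFERENCES person(person_id),
    identity_id INTEGER NOT NULL UNIQUE REFERENCES identity(identity_id)
);
""")


def attach(person_id: int, source: str, external_id: str) -> None:
    """Record an identity and link it to a person."""
    cur = conn.execute(
        "INSERT INTO identity (source, external_id) VALUES (?, ?)",
        (source, external_id),
    )
    conn.execute("INSERT INTO person_identity VALUES (?, ?)", (person_id, cur.lastrowid))


def detach(source: str, external_id: str) -> None:
    """Remove the link but keep the identity row itself."""
    conn.execute(
        """DELETE FROM person_identity
           WHERE identity_id = (SELECT identity_id FROM identity
                                WHERE source = ? AND external_id = ?)""",
        (source, external_id),
    )


def identities_for(person_id: int) -> list:
    """All (source, external_id) pairs currently linked to a person."""
    rows = conn.execute(
        """SELECT i.source, i.external_id
           FROM identity AS i
           JOIN person_identity AS link ON link.identity_id = i.identity_id
           WHERE link.person_id = ?
           ORDER BY i.source""",
        (person_id,),
    )
    return rows.fetchall()


conn.execute("INSERT INTO person (person_id, full_name) VALUES (1, 'Fred Example')")
attach(1, "zendesk", "zd-1001")
attach(1, "salesforce", "sf-003")
attach(1, "marketo", "mk-77")
before = identities_for(1)

detach("zendesk", "zd-1001")
after = identities_for(1)

print(before)
print(after)
```

Each tool maps once to the person record instead of to every other tool, which is the indirection being discussed; the sketch says nothing about who owns the mapping, which is the open question here.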
Oh, yeah, I don't know if that role exists yet.
I think what we'll find, though, is that over time data quality will become a function of a business.
Right?
And I think it should.
I mean, it's unfortunate that that's required, but it is a role that is realistically required today to kind of manage the fact that this is going to always occur.
And the ones that do invest in managing this are the ones that are going to get way more out of their base because they can infer things that the others can't. And I think just
as a quick example, even with phone numbers, I was working with a bunch of records today for
a fintech trying to do some outreach to customers over text. In the records provided from the
marketing system, some have pluses in front of them, some have brackets, some have too many digits, some are missing digits. So that affects your ability to reach
people, right? If we had expected the format to always be clean, we would have rejected 30 to 40%
of the records. And so you'd be marketing to fewer people. Once we were able to standardize all that
in an international format, now you're reaching 80 or 90-something percent of the records that were provided.
But if you don't put someone in that role of responsibility to
ensure quality, you actually could be really hampering your ability to do marketing or to
infer things.
Sure. Yeah. I was going to say, jumping back to Adobe,
it's such an interesting point.
And Kevin, having been a past user
of some of the Adobe Marketing Cloud products,
I think they get a bad rap
for being a huge, expensive monolith
and in many ways they are,
but it's really powerful technology. And I think it's a
great, I just loved your comment of sort of looking at their developer docs for inspiration. I think
the challenge that a lot of companies face is one, it's unattainable from a cost perspective.
And then two, the question Kostas asked is who manages sort of the central identity?
Well, in the Adobe world, it's Adobe, right?
And so you're locked in and it creates a huge amount of inflexibility, which I think is very problematic in many ways.
Well, I think most people should manage that in their own warehouse.
Now, the question is, okay, what does that look like,
right? What's the schema for that, and what's the turnkey way for them to
manage their own identity system in their own environment? And I think there's a lot
of folks trying to get solutions in the market to solve that. It might still take some time, right? I don't know
that this is something people can just buy an out of the box thing today and it will just work
magically to solve all their identity issues and run in their own environment. I think it's only
become clear that this is a problem that has to get fixed. So it's going to take some time before just anyone can
do this. But you can start in a simple way, right? You could basically
have a table that just stores really hard-coded things, like have
a column for your main ID,
and you have a column for Zendesk ID,
you have a column for Salesforce ID,
you have a column for Marketo ID,
and then kind of just track that.
And that might be, it's like a shortcut, right?
Where you're not trying to manage a whole identity layer,
but you're at least trying to map the relationship
that these three IDs are all tied to the same contact.
That's like, it's just a little step up
from what someone might do on their own.
I mean, if you really want to cheat,
you could just have a contact record.
When we talk about person record,
you could have a person record
with some extra columns in it.
And if you don't want to get into
the whole relationship mapping piece,
just add some columns for these different identities.
And that will allow you to eventually tie them together.
But just having them stored somewhere is better than it just being all up in the air and hoping that you can always match based on email address.
Because that's usually what people do.
They'll go and they'll try and just match on that.
But maybe Salesforce has someone's work email and HubSpot has somebody's Gmail.
So you won't really be able to match them if you just think that you're always going
to be able to go based on email.
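The shortcut and the email pitfall can be shown together in a short sketch. Everything here is made up for illustration: the contact records, the ID values, and the `crosswalk` table with one hard-coded column per tool.

```python
from typing import Dict, List, Tuple

# The same person, stored with a work email in one tool and a personal one in another.
salesforce_contacts = [{"salesforce_id": "sf-003", "email": "fred@company.example"}]
hubspot_contacts = [{"hubspot_id": "hs-42", "email": "fred.personal@mail.example"}]

# The "shortcut" table: one row per person, one column per tool's ID.
crosswalk = [
    {"main_id": 1, "salesforce_id": "sf-003", "hubspot_id": "hs-42", "zendesk_id": None},
]


def match_by_email(left: List[Dict], right: List[Dict]) -> List[Tuple[Dict, Dict]]:
    """Pair records whose email addresses are identical (case-insensitive)."""
    right_by_email = {r["email"].lower(): r for r in right}
    return [
        (rec, right_by_email[rec["email"].lower()])
        for rec in left
        if rec["email"].lower() in right_by_email
    ]


def match_by_crosswalk(left: List[Dict], right: List[Dict], mapping: List[Dict]) -> List[Tuple[Dict, Dict]]:
    """Pair records using the stored ID columns instead of email."""
    right_by_id = {r["hubspot_id"]: r for r in right}
    sf_to_hs = {
        row["salesforce_id"]: row["hubspot_id"]
        for row in mapping
        if row["salesforce_id"] and row["hubspot_id"]
    }
    pairs = []
    for rec in left:
        hs_id = sf_to_hs.get(rec["salesforce_id"])
        if hs_id in right_by_id:
            pairs.append((rec, right_by_id[hs_id]))
    return pairs


email_pairs = match_by_email(salesforce_contacts, hubspot_contacts)
crosswalk_pairs = match_by_crosswalk(salesforce_contacts, hubspot_contacts, crosswalk)

print(len(email_pairs), "match(es) by email")
print(len(crosswalk_pairs), "match(es) by stored IDs")
```

The email join finds nothing because the two addresses differ, while the stored ID columns still tie the two records to the same person.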
So yeah, I think there's shortcuts.
To solve this, I don't think you have to jump right ahead to this perfect world identity management thing.
I would agree with you that relying on a vendor to hold all of those identities is dangerous because what if you want to move?
What if you want to take control of that?
You're not going to be able to get that perfect export of all the Adobe
IDs that they've created. Sure. Yeah. They do make it easy, but also like Adobe, the Adobe identity
is kind of an overkill solution for most companies that don't have that type of complexity. So yeah,
yeah. It's definitely something people should take on themselves. For sure. Well, you answered the question. We're at the buzzer here. Brooks is
telling us that we're at the buzzer. So we need to close the show out, but I was going to ask you
what's the starting point, but I couldn't agree more that the starting point is actually
just beginning to tie together some of the basic pieces of unique identifiers from the various
places in your stack to build a foundation for that unified profile in the warehouse.
And even if you do the basics, like you said, where you're literally just sort of mapping the
unique IDs across tools, it's such a useful foundation to build for the future. Kevin, one thing we didn't talk about
that we discussed before the show is you have built some unbelievably fast and SEO performant
websites, literally just using technology to sort of push pages to the first page of Google.
We didn't get a chance to talk about that, but would you come back on the show
and can we break down the stack
for sort of the latest, greatest SEO performant website stack, especially relative to the data
piece? Would love to have you come back on the show if you'd be willing. Yeah, that'd be great.
And especially because, to be able to do those cool things with fast web experiences, the data model really is important. You
need to put your data in a certain format and you need to have a certain flow working, because
you have to make it so the browser doesn't do any of the work. The reason why things are slow is because sites generally, 99-ish percent, 98% of the web, work this way, where they put all the work on the browser.
Someone goes to the site and then it has to make a whole bunch of stops to get all the information that the user might be asking for.
And all those things take time.
And there's a whole bunch of calculations and work done by the browser to present it. And so if you want the browser to do no work and just present
information instantly in under half a second, the data model needs to be pretty clean on the back
end to make that possible. But yeah, once you get there, the benefit is you can do some pretty cool stuff.
So yeah, happy to walk through
how someone could go about setting that up.
Love it.
Well, that was a great preview.
We'll have Kevin back on the show.
Kevin, awesome discussion.
I learned a ton.
Thank you so much for giving us some time
and we'll talk with you again soon.
Yeah, thanks.
It was great to chat about all things data.
It always is.
Fascinating conversation.
My big takeaway was when Kevin said, all a business is, is the flow of data.
I haven't really chewed on that statement enough to know
whether I have a strong conviction about it, but it was very thought provoking. And in many ways,
I think makes sense when you sort of break a business down into its component parts,
even the conversation that maybe a salesperson is having with a prospect,
the content of that conversation
is data. And so that was very thought provoking to me. So I think that's probably what I'll be
chewing on this week is that statement. How about you? Yeah, absolutely. I really enjoyed the
conversation that we had with him about modeling and abstractions around data. I think what I'll keep from this conversation is that in order to be
as correct as possible, or to be able to have the right mechanisms in place to monitor quality or
react to issues, you need to have a good abstract model of how your world and how your company and how all the functions and your
interactions with the customers are going to be. That's what I'm going to keep. I think
it's a piece of wisdom that we took from him. And I think it's great advice for every engineer
out there: before you start implementing, spend time designing things
and thinking about why things should be organized in a certain way.
It's something that's super, super, super important.
And it comes with maturity.
I mean, it's not a coincidence that he had to mess with so many issues related to data
to come to this conclusion at the end. So yeah, that was, I think, a very important part of our
conversation. And that's something that I'll definitely think about and keep.
Absolutely. Well, thanks again for joining us on the Data Stack Show,
and we will catch you on the next episode.
We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe on your favorite podcast app to get notified about new episodes every week.
We'd also love your feedback.
You can email me, Eric Dodds, at eric at datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by Rudderstack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.