The Data Stack Show - Data Council Week (Ep 4) - Using Data Anonymization for Identity Protection With Will Thompson of Privacy Dynamics
Episode Date: April 26, 2023
Highlights from this week’s conversation include:
Will’s background in data (0:28)
Privacy Dynamics and data anonymization (4:18)
Addressing data privacy problems in the space (10:33)
Developer experience with Privacy Dynamics (13:49)
How does Privacy Dynamics work? (21:09)
Update of real-time anonymized data (26:29)
The problem of dates and other complexities in data (31:24)
Being a data engineer in a startup (34:44)
Moving at the speed of a startup (41:01)
Connecting with Will and Privacy Dynamics (43:28)
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by RudderStack, the CDP for developers.
You can learn more at rudderstack.com.
All right, we are here, if you're following along, at Data Council Austin
with a chance to record some shows in person, which is great. Usually we're on Zoom,
but today we have got Will Thompson here at the table with us. He's the head of engineering at
Privacy Dynamics. I'm Brooks. I'm filling in for Eric this week. Again, if you are following along,
he couldn't make it to the conference, so you're stuck
with me, but Kostas is here.
But we are excited to talk with Will Thompson today.
Will, to start us off, could you just share a little bit about your background with us?
Sure.
So, originally my background was very data-oriented, but kind of in a different way.
I worked in, like, a document-centric world. I worked for a legal publisher, and we were building a legal research platform.
And so, you know, you're dealing with text. We were, you know, essentially an XML shop.
And so we had a search engine and note-taking tools and a very specific customer, which is lawyers who are trying to, you know, work on their cases.
And, yeah, and so then that company was bought by Thomson Reuters.
And I joined the startup Privacy Dynamics,
which was a completely different tech stack,
completely different problem.
So that was a huge shift for me.
But yeah, so I've, like, dived into this world of Python and data science and, you know, kind of
enterprise application, B2B-type software, which was a huge change, but I found it super interesting.
Yeah, so cool. And, I mean, such a big shift, both on the data side and, I'm sure, in your kind of day-to-day work life, going from the legal industry and working for a publishing company, building out the kind of digital platform,
and then, you know, straight into the fire of working at a startup.
Yes.
You mentioned before the show that, at the publishing company, you benefited from, you know, building out this digital platform
while the business had this extremely successful publishing business.
So the pressure to move fast that, you know, is just always there in the startup world was not necessarily there in the same way.
So you mentioned, I think, something like, we got to kind of do things the right way.
We had a clear vision of what we needed to do, and we executed.
Totally different story working at a startup.
Could you just talk a little bit about, even maybe from a personal perspective, what that's been like?
Well, yeah, I had never worked at a startup, so I really had no idea what to expect. And yeah, obviously it's totally different.
You know, it's an entirely different set of challenges, but it is definitely more challenges.
So you have to prioritize, and you have to be really careful with how you're allocating your time.
That's the main thing I've learned: you have to stay focused, and you can't get too married to any particular idea.
Because, you know, before, we knew exactly who our customer was, and at an early-stage startup, you're learning who the customer is.
And you think you have your customer figured out, and then it needs to shift. Then you have to change your engineering priorities, but you don't want to
leave a trail of garbage at every turn in this little road. So that's the different challenge.
That's fascinating. We want to talk a little more about Privacy
Dynamics and data anonymization, which you even said yourself is kind of an overloaded term.
I don't love to use it.
Kostas loves to talk definitions.
I'm going to hand it off to him and let y'all dig into kind of defining that from a couple of different perspectives.
Yeah.
So what is anonymization?
Let's start with that.
It depends on who you are, right?
So in our case, we're talking about data anonymization.
And so in a lot of cases, people think of anonymization more as a security problem, which is who has access to what data.
So there'll be encryption, tokenization, that type of thing.
For us, the assumption is that someone needs access to this data,
but we don't want any individuals in the data to be identifiable.
And so anonymization in our context is protecting identities in data that
you need to use, rather than, you know, hiding specific information. Like, we can
still tokenize things if you need it, but generally people need to, you know, have
format consistency or do research on some data, but you don't want to make it possible
for anyone to figure out who is in it.
Okay, that's super interesting.
So let's start with, I guess, the first thing that comes to mind for anyone who has written
some code in their life, which is, okay, I have an email field somewhere, I hash this thing, I get a random string there, and I use that.
I have a feeling that it's something much more than that.
Sure.
Let's talk a little bit about the technical side of things,
and how anonymization is actually built and implemented on top of data, especially
how it works with data that we might not think is important to anonymize. Like, okay,
the email or my social security number are very obvious. But there might be very clever ways to identify a person, right?
Yeah.
So let's talk about that.
Sure.
So these attributes are typically grouped into two categories.
One is direct identifiers.
The other is indirect identifiers. People call
them quasi-identifiers. So direct identifiers are what you just went over: names, addresses,
social security numbers. And so, you know, a lot of people are working on that,
and how you treat those just depends on who the user is. In a lot of cases, tokenization is fine.
In other cases, like in dev/test,
you know, a developer's going to scream bloody murder
if the email address is some random string of characters.
They're like, I need it to be an email address.
Some people need it to be, you know,
like a routable email address.
So, you know, you have all these different concerns for that.
Indirect identifiers are where it gets tricky. And that's what we were focused
on initially, because it's really important to healthcare. And it's also important to
CCPA, some GDPR things as well. But this is where, you know, you can't identify someone
directly with their zip code or their gender or their date of birth.
But if you combine those together, it becomes more unique.
And then you can identify people.
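As a rough illustration of how combining quasi-identifiers makes rows unique, here is a minimal Python sketch. The records and the `unique_fraction` helper are made up for this example and are not taken from Privacy Dynamics.

```python
from collections import Counter

# Hypothetical toy records: (zip_code, gender, date_of_birth).
# No single field identifies a person on its own.
records = [
    ("78701", "F", "1985-03-12"),
    ("78701", "F", "1990-07-01"),
    ("78701", "M", "1985-03-12"),
    ("78702", "F", "1985-03-12"),
    ("78702", "M", "1990-07-01"),
    ("78702", "M", "1990-07-01"),
]

def unique_fraction(rows, columns):
    """Fraction of rows whose values in `columns` appear exactly once."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)

zip_only = unique_fraction(records, [0])        # nobody is unique by zip code alone
zip_gender = unique_fraction(records, [0, 1])   # a third of the rows become unique
all_three = unique_fraction(records, [0, 1, 2]) # two thirds of the rows are unique
```

In this toy data, every added column raises the share of rows that can be singled out, which is the property a linkage attack relies on.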
And so the risk is what's referred to as a linkage attack.
And so you go get some data somewhere that has people in it, and then you do a statistical
attack. Essentially, you try to relink these people based on the sequence of quasi-identifiers. And then you can assign probabilities to, you know, what's the likelihood that this person here who's anonymous is this person, this real person that I know. And, you know, it doesn't have to be a hundred
percent to be a risk, but sometimes you can match with a lot of certainty. And so
for this type of anonymization, we use what's called k-anonymity. The concept is you create groups, and our algorithm belongs to a category of algorithms called microaggregation.
And the idea is, essentially, you create clusters for everybody in the data set.
So, you know, you cluster people based on similarity. And the more anonymity, the more protection you need, the larger the group.
And so we cluster all these people together, and then we find the center of the cluster, and then we make everybody match the center. And so essentially, you know,
maybe we find somebody who lives close to your zip code, who's the same gender,
close in age, and we'll shift you guys to be the same. And then, you know, you will no longer exist in the data set.
Maybe you don't change at all in the data set, but now there's at least one other identity in the data set
that matches your combination of quasi-identifiers exactly.
And so now, at a minimum,
if someone is trying to link across the data set, there's two.
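The cluster-then-match-the-center idea can be sketched in a few lines of Python. This is a deliberately simplified, fixed-size grouping written for illustration, not Privacy Dynamics' actual algorithm: the `microaggregate` function, the sort-based grouping, and the choice of "center" (most common zip code, rounded mean age) are all assumptions.

```python
from collections import Counter, defaultdict

def microaggregate(rows, k):
    """Return rows where every (zip, gender, age) combination occurs >= k times.

    Simplified sketch: partition by gender, sort by (zip, age), cut the sorted
    list into consecutive groups of at least k rows, and overwrite each row
    with its group's center.
    """
    by_gender = defaultdict(list)
    for idx, (zip_code, gender, age) in enumerate(rows):
        by_gender[gender].append((zip_code, age, idx))

    out = [None] * len(rows)
    for gender, members in by_gender.items():
        if len(members) < k:
            raise ValueError(f"fewer than k={k} rows with gender {gender!r}")
        members.sort()
        # Consecutive chunks of size k; a short tail joins the previous chunk.
        chunks = [members[i:i + k] for i in range(0, len(members), k)]
        if len(chunks[-1]) < k:
            tail = chunks.pop()
            chunks[-1].extend(tail)
        for chunk in chunks:
            center_zip = Counter(z for z, _, _ in chunk).most_common(1)[0][0]
            center_age = round(sum(a for _, a, _ in chunk) / len(chunk))
            for _, _, idx in chunk:
                out[idx] = (center_zip, gender, center_age)
    return out

# Hypothetical input rows: (zip_code, gender, age).
rows = [
    ("78701", "F", 34), ("78701", "F", 36), ("78702", "F", 51), ("78702", "F", 55),
    ("78701", "M", 29), ("78703", "M", 31), ("78703", "M", 30),
]
anonymized = microaggregate(rows, k=2)
smallest_group = min(Counter(anonymized).values())
```

Raising `k` makes each group larger, which gives more protection at the cost of moving each row further from its original values.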
And then you can increase it to make it even more difficult.
Okay, that's fascinating, actually. So, I mean, I would
imagine that if I was the data scientist in, let's say, an InsurTech company... And that's not a random example. I chose it on purpose because
we had a conversation at some point in the past on the show with some people from InsurTech,
and we were talking about that, like, privacy, right? They were data scientists, saying, I mean, it isn't an issue in the way that, like, we need to remove
anonymity to do our job in a way, right? Like, to build these models, because
we need this information to go and do, like, risk assessment or
whatever, right? So from what I understand, everything is a matter of making the right trade-offs, right?
How do we do that?
Because, okay, in theory, I get it, what we are saying, but in a real setup, right?
Let's say I'm that data scientist and I'm going to use your platform.
How do I choose the right parameters there? Like, how big the k should be, right? And all that stuff.
Yeah. So, all right, there's a flip side of this. And so we have this kind of privacy dashboard
for everything, and we have a set of tools for two things, right? So whenever you treat data
like this, something is falling on this privacy-utility curve, right? So you increase privacy
up to a point, and if you do 100% privacy, your data is 100% noise, right? And then you slide it all the
way back the other way and there's no privacy. And so, yeah, you want to find that sweet spot.
And so privacy you can measure with this risk assessment.
This is almost pulled directly from healthcare literature on how to...
Essentially, we do these like...
It's like a Monte Carlo simulation attack.
And so we do this simulated linkage attack.
And then we say, here is how approximately linkable we think your data is.
And then we put them into these categories, basically low, medium, high risk.
And so that's the risk analysis.
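A simulated linkage attack of the kind described here might look roughly like the following Python sketch. The attacker model (knows the target's quasi-identifiers exactly and guesses uniformly among matching rows), the function names, and the low/medium/high thresholds are all assumptions made for illustration. The real risk model is not described in the conversation.

```python
import random
from collections import Counter

def simulated_linkage_risk(released, trials=10_000, seed=0):
    """Monte Carlo estimate of how often a simulated attacker re-identifies a row.

    Assumed attacker: knows the target's quasi-identifiers and picks uniformly
    at random among the released rows that share them.
    """
    rng = random.Random(seed)
    group_sizes = Counter(released)
    hits = 0
    for _ in range(trials):
        target = rng.randrange(len(released))      # pick a person to attack
        candidates = group_sizes[released[target]]  # rows the attacker cannot tell apart
        if rng.randrange(candidates) == 0:          # uniform guess among the candidates
            hits += 1
    return hits / trials

def risk_label(risk, low=0.1, high=0.33):
    """Bucket a risk estimate; the cutoffs are hypothetical."""
    if risk < low:
        return "low"
    if risk < high:
        return "medium"
    return "high"

# Every row unique: the attacker always succeeds.
all_unique = simulated_linkage_risk([("78701", "F", 34), ("78702", "M", 51), ("78703", "F", 29)])
# Groups of two identical rows: the attacker succeeds about half the time.
pairs = simulated_linkage_risk([("78701", "F", 35)] * 2 + [("78702", "M", 50)] * 2)
```

Under this simple attacker model, groups of size k cap the per-row risk at roughly 1/k, which is why larger groups move a data set toward the low-risk bucket.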
And then the other one is we have tools for measuring distortion.
And so we'll run the data set through the system and we'll show you: how have your distributions changed?
How have your main top-level statistics
changed? We recently added something that shows relationship distortion. So, like, how has the
relationship between age and some other column that maybe has non-identifiable information in
it changed? And so this way a data scientist
can look and see, you know, where is my privacy according to the risk assessment, and how badly is
it distorting the data? And so ideally what you would do is, you know, dial it to as much privacy
as you can get for the distortion that you can accept, and then set that and let that be your baseline.
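To make the distortion side concrete, here is a small Python sketch that compares top-level statistics and one column relationship before and after anonymization. The metric names, the toy columns, and the use of Pearson correlation for "relationship distortion" are assumptions for illustration. The product's actual measures are not spelled out in the conversation.

```python
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant numeric columns."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def distortion_report(original, anonymized, other_column):
    """Absolute change in mean, spread, and correlation with an untouched column."""
    return {
        "mean_shift": abs(mean(anonymized) - mean(original)),
        "stdev_shift": abs(pstdev(anonymized) - pstdev(original)),
        "relationship_shift": abs(
            pearson(anonymized, other_column) - pearson(original, other_column)
        ),
    }

# Hypothetical columns: ages before and after grouping, plus an untouched column.
original_age = [34, 36, 51, 55]
anonymized_age = [35, 35, 53, 53]
visits_per_year = [2, 3, 6, 7]

report = distortion_report(original_age, anonymized_age, visits_per_year)
```

A data scientist could rerun a report like this at several privacy settings and keep the strongest setting whose shifts stay within what the analysis can tolerate.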
Yeah, yeah, that makes total sense. How much more complexity, let's say, does it add to the life
of a data scientist to do that?
I mean, we would hope not that much. You know,
this is one of those things where we want to iterate on this if anybody gets blocked,
but we try to make it as frictionless as possible.
But ultimately, we want to just give you all the information you need and say,
all right, this is too much distortion, or maybe we should notch this up, dial it, run it again.
Hopefully, you kind of experiment a little bit
until it's what you need,
and then you don't worry about it anymore.
Maybe you come back and check,
maybe set an alert if the risk level changes
more than some percent, something like that.
But essentially our idea, what we want to do, is
let the data
scientists work on another problem. Like, we will handle the anonymization, and then, you know, you
can come check the dashboard, you can integrate it in your system, and then you work on
whatever it is that your company does.
Yeah, yeah, 100%. Okay, let's go back to the other type of anonymization, which is like the social security number and all that stuff.
And it was interesting because you mentioned developers being like, okay, I need something to look like an email.
Obviously, I get that.
If you have somewhere, let's say, a regular expression to match something and you want to test for that, that it actually works.
If you have a random string there, that's a problem.
Tell us a little bit more about that, because that's a part of, okay, we talked about the
data scientists, but there are always, like, also developers and engineers, like, involved.
And they have different needs, right?
And anonymization, let's say, affects their work in a different way.
Tell us a little bit more about that, because it sounds very interesting, and especially around
the developer experience, like working with their tools and how it affects their job.
Yeah, it's an overlapping problem, but yeah, they have these unique concerns. So if you're a data
scientist, if you want to work on anonymized
healthcare data, it's probably just one data set or like a handful of data sets that may or may not
actually be linked together. Whereas a developer, they have a database with tables and those things
have foreign key, primary key relationships, and you need to maintain those relationships.
So these are the features we started adding for developers. They're like, well,
you know, we want to copy all these tables over, and
we don't want to expose the same keys,
but we need to maintain the same key relationships. So you have to, you know,
tokenize those a certain way. Email addresses:
we had to build format-consistent email. And you run into all these little
problems. One of them was, their system was actually sending emails.
And so, you know, it needed to be a valid email, but then it needed to not be routable. And so,
you know, off I go into the, what is it,
like the IETF document on email domain naming.
And it's like, oh, well, yeah,
there are actually a handful of these top-level domains.
And so you build it in your format thing.
And .example, what is it?
I don't know.
But so you actually have these things
that will pass their regex but bounce, you know, if they try to send an email.
So yeah. And then, you know, social security numbers, those are just numbers. But yeah, like names, one of the problems people
have is, you know, yeah, you can generate names, kind of random, normal-looking
names, but they want it to be the same name for this record when it comes through the next time.
And so, you know, we can do that in some cases, but not all cases. It's hard because, you know, you've anonymized this,
but then you need to be able to make sure that... you essentially want, it's like a cryptographic
hash, but with someone's name.
So that like this row comes back again and it gives you the same name.
Yeah.
So yeah, so that's like, these are the kinds of things we're working on now to improve the, you know, developer workflow.
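The developer-facing needs described above (keys that stay joinable, emails that pass validation but cannot be delivered, and names that repeat for the same record) can be sketched with a keyed hash. This is an illustrative approach using Python's standard `hmac` module, not Privacy Dynamics' implementation. The secret, the name lists, and the helper names are hypothetical. The `.example` top-level domain is one of the names reserved by RFC 2606, so addresses under it are syntactically valid but never route to a real mailbox.

```python
import hashlib
import hmac
import re

# Hypothetical secret; it would live outside the anonymized target database.
SECRET_KEY = b"hypothetical-secret"
FIRST_NAMES = ["Alex", "Blake", "Casey", "Devon", "Emery", "Finley"]
LAST_NAMES = ["Rivera", "Shah", "Tanaka", "Usman", "Vargas", "Weiss"]

def _digest(value, purpose):
    """Keyed hash: the same input always yields the same bytes."""
    message = f"{purpose}:{value}".encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).digest()

def tokenize_key(key):
    """Replace a primary or foreign key with a stable token."""
    return _digest(key, "key").hex()[:16]

def fake_name(real_name):
    """Pick a normal-looking name deterministically from the real one."""
    d = _digest(real_name, "name")
    return f"{FIRST_NAMES[d[0] % len(FIRST_NAMES)]} {LAST_NAMES[d[1] % len(LAST_NAMES)]}"

def fake_email(real_email):
    """Build a well-formed address under the reserved .example TLD."""
    local = _digest(real_email.lower(), "email").hex()[:12]
    return f"user_{local}@mail.example"

# Hypothetical source tables with a foreign key from orders to users.
users = [{"id": 1, "name": "Ada Lovelace", "email": "ada@realcompany.com"}]
orders = [{"order_id": 10, "user_id": 1}]

anon_users = [
    {"id": tokenize_key(u["id"]), "name": fake_name(u["name"]), "email": fake_email(u["email"])}
    for u in users
]
anon_orders = [
    {"order_id": tokenize_key(o["order_id"]), "user_id": tokenize_key(o["user_id"])}
    for o in orders
]

# A typical loose email pattern of the kind an application might validate with.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
```

Because the user key is tokenized the same way in both tables, `anon_orders` still joins to `anon_users`. With name lists this small, different people can collide on the same fake name. A larger list or more digest bytes would reduce that.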
Yeah, that's super interesting.
What other types, because, okay, we talked about, like, names, like, foreign keys.
What other types are, like, tricky and challenging and, like, developers care about?
Like, what about timestamps or, like, dates, for example?
Yeah, with timestamps, that's something, it's a rabbit hole.
Like, you're like, oh, timestamps.
Also, you know, talk to a senior developer who's worked with a lot of data
about time zones, and, you know, the color will wash from their face, right?
But it's the same thing with dates, because how many date formats are there?
Right.
So it's just one of those problems. It's a big, messy problem.
There's not a beautiful, simple design that solves it. You just have to build it out kind of as needed.
Luckily, each additional thing is not an enormous challenge. Some things are trickier than others, but
the trickier stuff is more in that
we also try to identify all of these things up front.
We try to, so that you don't have to go
and, like, if you have, you know,
thousands of columns,
you don't have to go through all of them,
and maybe you just go through and check
if you got stuff right.
Because it's probably impossible to get everything
everywhere 100% right.
So people are always just going to have to check this stuff. But our goal is to have it as automated as possible.
But some things are just like trying to find the U.S. Postal Service rules on what is a valid address.
And even then, let's say you get that right, like, I got most of it, but a lot of data is entered
by humans, and they will enter it wrong. And so you have to handle that too. So
those are not fun, because they're just messy and kind of
annoying. And the hardest thing about it is you have to build a system that can withstand
all these additions and kind of bolt-on exceptions and things without making it incomprehensible,
because you're never going to stop adding stuff to it. And then if it just
turns into a pile of spaghetti, it becomes unmaintainable. So that's a totally different challenge.
Yeah.
No, and it's a very interesting problem, to be honest.
All right.
So talking about working with the data,
let's talk a little bit more about the actual,
like, the product experience, right?
Like, let's say I'm a developer.
We have, let's say, a database somewhere.
And I want to take your product, the Privacy Dynamics product, and
use it on my database. How does it work? What does it take? How easy is it? How transparent
is it? And what's the process after that?
I mean, ideally it's super easy. Let's say you have Postgres, BigQuery, whatever.
You sign up.
You create a connector.
You enter the credentials and location for the source database.
And then you create another one for the target database.
And you walk through a wizard.
We introspect the tables and columns, and then,
you know, we'll try to auto-detect there. You kind of check which ones you want to keep and what settings you want, what anonymization you need, what defaults.
You know, unless you have hundreds of tables, it's a pretty quick process. And then you go through and
you set a schedule and it runs. And then, assuming you don't need to make a bunch of
changes to what is included or excluded from the project, all you would need to
do is check the dashboard and see if the data looks like you expect it to, in terms of, like, did the distributions look good?
Did the auto-detection work like you expected?
And then after that, you know, hopefully you don't need to use it that much, except if maybe if you wanted to integrate it with some part of your process.
Okay.
So let's say we set it up. Who inside the engineering org is usually doing the setup and installation?
What type of engineer is usually involved in that?
Is it like a DB admin?
Is it like someone from security, from InfoSec?
Is it someone from...
I don't know.
Yeah, it's usually an admin.
I haven't come across anybody who doesn't have good experience programming.
Usually they're working in infrastructure operations.
I mean, they're the ones who are setting it up.
Because this deals with sensitive data, we have a SaaS product,
but also we did a lot of work to make sure that we can install this on-prem as well.
And so those are much more involved because we work with their ops people,
things like that.
If you use the SaaS, you just sign up, and then all you need is access to the database. So, you know, if it's a small company, you might just have to look up the credentials and
then you're good. Yeah, because we want CSOs to be able to just, you know, click and then they have
the information they need.
Yeah, 100%. And okay, let's say now I'm, I don't
know, like a product engineer, right? I'm building a front end and I'm going to have
access to this production database. Do I have to know about the existence of Privacy Dynamics?
How do I interact with the data?
Right. So it would fit in your pipeline, your ETL. Then there would just be, you know, the way we would recommend setting it up
is, you know, very few people have access to the sensitive database.
And then, you know, you rope that off, and
those credentials are encrypted on our system or in your
infrastructure. And then there's a less
private database where, you know,
more engineers have access to it, that's maybe in the lower environment.
And so then, you know, you give that to the engineers.
Oh, so they don't even... they just know there's a database, and that thing is kept up to
date. We run batches.
Okay.
So it's, you know, on whatever kind of increment you need.
Oh, right.
Okay.
Oh, I get it.
So the anonymization or encryption, the processing of the data, doesn't happen, like,
on the fly when I execute the query.
Right.
It happens like, you create an anonymized replica of the database, and then people go and
access that.
Yeah.
The anonymization process, no matter what, it's somewhat expensive.
And also we have to have a picture of all the data in order to anonymize it.
Yeah.
And also to do the risk assessment, we need to know everything that's in it, you know, because one unique row increases the linkability.
So we have to see everything.
Yeah.
Make,
okay.
Streaming is definitely something we've discussed and want to do, because we'll need to do it for extremely large data sets.
But it would be a very large project. It'd be something really fun to work on, but, you know, I've got to stay focused on what everybody needs.
100%.
No, that makes total sense.
So, okay, from what I understand, we are talking about use cases
that are more like analytical use cases, right?
So someone's going to work with a static data set
that they are going to extract from the database,
like a data scientist who wants to build, let's say, a model.
And not so much use cases where, for example,
you would have, let's say, a real-time application that is attached
to the database and needs to have very consistent and up-to-date
data that is also anonymized.
Is this correct? Do I get it right, or do you also see
more real-time use cases?
I mean, it wouldn't be actually truly real-time, but you can,
depending on the size of the data, we can run it pretty quickly. We can run it, you know, hourly or even every 10 minutes if you needed to, if it wasn't an enormous data set.
So we can keep data pretty up to date.
Okay.
But yeah, also, if you have big data and you can install on-prem, we can outfit you with a really large instance
and it'll go faster.
But yeah.
That's very interesting.
So you mentioned big data,
and one of the most important,
let's say, jobs that a data engineer has
is to make the pipelines incremental, right?
Because when you have billions and billions of rows, going
and processing everything from the beginning every time can be overkill. How can you do that when you need to have access, in a way, to the whole
data set to do the...
Oh, no. Well, we have to reread it.
Okay.
We have to reread it. And so, yeah, incremental is
something that we've sketched out as an idea,
but it's really hard because, essentially,
we're clustering everything.
Right.
And so how do you update?
You know, you create all these clusters and then you add a thousand rows.
How do these clusters change? That's complicated.
And so managing that... like, I think we could handle more of the data side, you know, streaming the data, running our transformations. That's all,
you know, what they call a SMOP, right?
Simple matter of programming.
But updating a clustered data set, that's going to take some tinkering.
Yeah.
Yeah.
But yeah, it's something we want to do.
Outside of tabular data, do you see
other data that is also
part of this, like images,
PDF files?
How do you work with
this type of data?
We don't yet.
The thing we've gotten the most requests for
is more semi-structured data, JSON or, you know, just arrays, things like that.
And yeah, that's something we need to do, but it's also really challenging, for the same reason dates are challenging, but by an order of magnitude.
Right. So it's like you had these weird dates. Well, we were just talking to somebody recently at this conference, and they were talking about this column that was like a JSON blob.
Yeah.
And there's no schema for it.
And so I can't even assume that row to row, it's doable.
It's just, it's a big lift.
So, yeah.
So what we would have to do is like take that data, normalize it, run our anonymization, have a map back to the original data and then, you know, and then do that to maintain that format consistency of just completely arbitrary.
Yeah.
Yeah.
No, I mean, I can feel what you're talking about.
It's very rewarding if you figure out all these, like, little things that can go wrong.
But it's a very challenging problem that you are dealing with.
I'd love to be able to just sink my teeth into some of those problems.
That's fun.
I mean, I don't know.
I think, even if you might not like it, if you solve dates,
your name is going to live in history. Yeah.
I always...
When was it?
I have a problem with dates in databases.
I always forget these stupid languages where you define the format.
I always have to go back to documentation for each database and see what the format is, when do I need it,
why that's capital when it's not capital.
It always looks like a regex, but it's not a regex.
Yeah, exactly.
And I was going through that again. I'm too old for that. And I was like, we live in an age where we have OpenAI that is going to, I don't know, make us all obsolete or whatever they say on Twitter today, but I still cannot give a date and have software tell me this is
the format in this language.
Or at least I'm not aware of this.
If someone is aware of a library that does that, please let me know.
You will make me a much happier person.
So it's like the reason I'm saying that is because, you know, there's always a lot of
hype around what is currently happening, but people don't realize how much real hard engineering,
boring in some ways, needs to happen for all these things to actually work at scale at
the end. From one side you have, like, okay, OpenAI, you ask if there's a god and it's replying to you,
and on the other hand, yeah, you still have to go and struggle with dates, right? And it's not a solved problem. It's still there. So
I can feel you, and I think you should be talking more about that stuff.
Like, I don't know if you have a blog or something, but talk about all these little problems,
like what you said about the email, that we need to test it and make sure that
it's sent and goes through a mail server or something,
even if it bounces. All these
little things that nobody
cares about until they have
to. And they're there.
For 99% of the engineers
out there, that's
why they get grumpy every day, because they have
to deal with these things.
It is.
So it is important to talk about that stuff, I think.
Anyway.
It's not glamorous, so people don't want to talk about it.
Yeah, yeah.
But, I mean, I don't know.
I think we can make it glamorous if we talk about it
and be, like, realists at the end.
All these small things together are what changes the world, you know, at the end. It's not like suddenly one
day you come up with a trained model at OpenAI and it happened out of the
blue. No, there are many people that had to figure out a lot of
wrong dates to train this thing.
Exactly, yeah. The real world is very messy, and
solving problems in the real world requires addressing that messiness.
Yeah, yeah.
A hundred percent.
And we have to embrace it, actually.
That's also important.
Talking about messiness, let's go back to your experience being an engineer,
founding engineer in a startup, right?
Tell us a little bit more about how it feels, what kind of experience it is.
How different it is, because, okay, I think people can probably imagine.
But what do you have to go through as an engineer to make yourself productive in such an environment?
Yeah, it was certainly for a while uncomfortable, right?
Like just the real shift in what my objectives were,
which went from being, you know, we know the customer,
we know exactly what they need, we're going to build this feature
and, you know, it's clear, to going to this
situation where, you know, the ground's moving. You know, I had gotten into a
comfort zone where I was able to keep things neat. Everything's tidy, everything's
just so, it's easy to figure out where everything is. And, you know, it was nice. I was the kid who cleaned his room, right? But then going into this startup world, it's a luxury you can't really have all the time. And that's not to say you have to embrace creating messes. You just have to
prioritize very... someone called it brutal prioritization. Yeah, I think it is. But
it's uncomfortable. You have to say, when do I have to stop on this? And also, you know, what
do you have to set down? And you really have to think hard about, like,
what is the most consequential thing right now?
And, you know, I always have this paranoia. That drives a lot of my design: what is the most likely thing to, like, come up and bite me in the ass?
What is something we're going to forget about, and it's just going to ruin our day someday? Like these little, you know, time bombs.
And so you really want to try to not set those things up.
Otherwise, when you're just running full speed ahead, six months later, you just, like, trip and eat it.
And you're just cursing your former self.
So it's a lot of, like, what's going to hurt the least, you know? And
you do have to, I think, be okay with being a little uncomfortable. And that's kind
of the big change in startup world for me.
Yeah. And dude, you chose to go and work on a problem that's just an infinite number of exceptions
that you can't have in your mind beforehand.
You really set yourself up.
Working on anonymity stuff was different, right?
There's literature.
You can read all the stuff that people are working on in research.
But then it's like, oh, we need to automate the format detection.
Oh, okay.
Well, I've done messy stuff like this in the past.
In the legal platform, this is all human-entered stuff.
Yeah.
And then we had this case database of all these cases.
And some of those are dating back, you know, hundreds of
years. Some of them were probably entered on typewriters and then OCR'd or whatever.
So, you know, this messy stuff. So I was used to kind of working around that. But
that was our messy data. Now it's, like, everyone's messy data. So it wasn't a problem I wasn't familiar with.
It's just a different scale.
Yeah, yeah.
That's so interesting.
I think there's also, especially when you're at pre-product market fit,
because after that, I think things get more normalized, right?
You at least have, let's say, six months ahead of you
that you know what you are going to be developing.
But before that, I think, like, the way that I visualize it,
like, the process, and it's not just for engineering.
I think it's just, like, much more uncomfortable for engineering.
It's for everyone.
Is, like, doing this thing where you're in a sauna. You have to be
in a sauna and get, like, really hot and be like, yeah, we're going to build this,
it's going to be, like, yeah, we are going to own the world. And then suddenly you are doing an
ice bath, because you put this thing out there and suddenly it's like, what the fuck is this?
No, I'm not going to pay you for this shit. And you have to go back and forth and not have a heart attack.
That's emotionally the thing that you have to go through.
And for a salesperson or a marketing person who, let's say,
grew up working in an environment where everything is unexpected,
it might be a little bit easier. But for engineering, okay, at the end we live in a very deterministic world,
right? For us, everything has to be boolean in a way. It works or it doesn't work.
There's no in between there. If it doesn't pass the tests, we don't push to production.
You know, I think from an emotional
point of view at least, it's a very different
experience, and it is brutal, like, 100%.
Yeah, I completely agree. And it's like, it's hard to
accept that, you know, this thing that you built,
the uptake on it isn't what you expected,
but we're still doing really well right now. You know, it's like,
we're still going to need this. It's just, that wasn't
the, like, unlock, right? And so, yeah,
you have to steel yourself a little bit more emotionally, for sure.
Yeah. One thing I've shared with our marketing team when, you know, hey, we need to ship this
project on a super aggressive timeline, is a quote from Mario Andretti, you know, legendary
race car driver. He said, if it feels like you're in control, you're not going fast enough.
And I think that applies to everybody at a startup, right? You have
to go faster than you're comfortable with and just know that you can maintain control. You're just
not going to feel like you have as much as you want. Your room's not as clean as you
want, there are dishes in the sink. You just have to get comfortable with it
not being as tidy as you want.
And just keep moving.
Because I think a lot of times as we move, that's how things get better faster.
Instead of like, I got to get this thing perfect first.
But man, yeah, it's emotionally taxing. Oh, yeah.
And I think a way to think about it is, you mentioned like data out there is
messy, right?
And like at the end, like data is like a very simplified model of the world that we live
in.
So the world is even messier than that.
So you just have to embrace that.
And yeah, okay, it's easier to say than do, but yeah, at the end, that's what you have
to go through.
But it can be fun. So don't be discouraged.
At the end, it can be fun.
Yeah, the fun stuff is definitely very fun, right? Just because once you get something right, it's like looking back on all the work it took to get there.
Yeah, you can kind of impress yourself, and that's really gratifying.
Yeah, yeah. And now I remember someone said, it
was probably a tweet or something many years ago, that having a startup is like having
a newborn. It's, like, 99% of the time crying and full of shit, but this one percent,
when it smiles at you, it's so rewarding.
That's very apropos.
I joined this startup months before we had our first kid.
Oh, wow.
Yeah.
And so, yeah.
So it was two babies at once.
Yeah.
I think that's an apt comparison.
Yeah.
You have quite the mettle if you could handle that. That's amazing. Yeah. We're at the buzzer here, but we
will be on the lookout for your blog about hacking the problem of emails that send
but don't deliver. If folks want to learn more about Privacy Dynamics, though, and check
out what you're doing, what you're building, where can they find that?
Head over to privacydynamics.io.
We have a doc site, and, you know, if you want to learn more about anonymization,
we have detailed literature explaining how we do it, how it all works. We have blogs that show
how to get started, you know, quick starts for all these different
types of setups. So yeah, just head over. There's a lot of good information.
Yeah. And actually, I would say, I know that, okay, our audience is probably more on the technical
side, but I think anonymization
is one of the things that everyone should read about.
And read not just about the legal aspect of that, but just to see the effort that goes
into engineering for these things to happen.
And we should all be at least a little bit aware of what is going on.
Because at the end, it's our data, right?
The medical records belong to me. Like, yeah, someone's storing that, but it is my data. So we should all be more literate around that stuff. And it's amazing that you are building
that kind of knowledge base, so we should spread the word around.
Yes, it's great. And if anybody has
any questions, just reach out to us.
Our emails should be on our website.
So yeah, we're happy to answer questions.
Awesome.
Thank you so much.
Thank you guys.
I had a lovely conversation.
Yeah, I really enjoyed it.
Well, thanks.
Yeah, thanks for joining us.
Listeners, thank you all for joining us as well.
Check out privacydynamics.io and we will catch you on the next episode.
We hope you enjoyed this episode
of the Data Stack Show.
Be sure to subscribe
on your favorite podcast app
to get notified about new episodes every week.
We'd also love your feedback.
You can email me, Eric Dodds,
at eric@datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by Rudderstack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.