The Data Stack Show - Data Council Week (Ep 4) - Using Data Anonymization for Identity Protection With Will Thompson of Privacy Dynamics

Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by Rudderstack, the CDP for developers. You can learn more at rudderstack.com. All right, we are here. If you're following along at Data Council Austin with a chance to record some shows in person, which is great. Usually we're on Zoom, but today we have got Will Thompson here at the table with us. He's the head of engineering at

Starting point is 00:00:38 Privacy Dynamics. I'm Brooks. I'm filling in for Eric this week. Again, if you are following along, he couldn't make it to the conference, so you're stuck with me, but Kostas is here. But we are excited to talk with Will Thompson today. Will, to start us off, could you just share a little bit about your background with us? Sure. So, originally my background was, it was very data-oriented, but kind of in a different way. I was,

Starting point is 00:01:12 I worked in like a document-centric world and I worked for a legal publisher and we were building a legal research platform. And so, you know, you're dealing with text. We were all, you know, essentially an XML shop. And so, you know, we had a search engine and, you know, note-taking tools and, you know, a very specific customer, which is lawyers who are trying to, you know, work on their cases. And, yeah, and so then that company was bought by Thomson Reuters. And I joined the startup Privacy Dynamics, which was a completely different tech stack, completely different problem.

Starting point is 00:01:58 So that was a huge shift for me. But yeah, so I've like dove into this world of Python and data science and, you know, kind of enterprise application, B2B type software, which was a huge change, but I found it super interesting. Yeah, so cool. And I mean, such a big shift, both on the data side and just, I'm sure, your kind of day-to-day work life, going from the legal industry and working for a publishing company, building out the kind of digital platform. And then, you know, straight into the fire of working on a startup. Yes. You mentioned before the show, at the publishing company, you benefited from, you know, building out this digital platform. But the business had this extremely successful publishing business. So the pressure that, you know, in startup world is just always there to move fast was not necessarily there in the same way.

Starting point is 00:03:07 So you mentioned, I think you're like, we got to kind of do things the right way. We had a clear vision of what we needed to do and when it executed. Totally different story working at a startup. Could you just talk a little bit about, even maybe from a personal perspective, like what's that been like? Well, yeah, it was like I had never worked at a startup, so I really had no idea what to expect. And yeah, it's obvious it's totally different. You know, it's an entirely different set of challenges, but it is definitely more challenges. Like, so you have to prioritize and you have to, you know, you have to be really careful with how you're allocating your time.

Starting point is 00:03:52 That's the main thing I've learned is, you know, you have to stay focused and you can't get too married to any particular idea because, you know, before we knew exactly who our customer was and, you know, in an early stage startup, you're learning who the customer is. And as you, you know, you think you have your customer figured out and then it needs to shift. Then you have to change your engineering priorities, but you don't want to leave a trail of garbage, you know, at every turn in this little road. So that's the different challenge. That's fascinating. We want to talk a little more about Privacy

Starting point is 00:04:42 Dynamics and data anonymization, which you even said yourself, that's kind of an overloaded term. I don't love to use it. Kostas loves to talk definitions. I'm going to hand it off to him and let y'all dig into kind of defining that from a couple of different perspectives. Yeah. So what is anonymization? Let's start with that. It depends on who you are, right?

Starting point is 00:05:07 So in our case, we're talking about data anonymization. And so this is like in a lot of cases, people think of anonymization more as a security problem, which is who has access to what data. So there'll be encryption, tokenization, that type of thing. We're more about the assumption that someone needs access to this data, but we don't want to identify any individuals in the data. And so anonymization in our context is protecting identities in data that you need to use, rather than, you know, hiding information specifically. Like, you know, we can

Starting point is 00:05:55 still tokenize things if you need it, but generally people need to, you know, have a format consistency or do research on some data, but you don't want to make it possible for anyone to figure out who is in it. Okay, that's super interesting. So let's start with, I guess, the first thing that comes in mind for anyone who has written some code in their life is like, okay, I have an email field somewhere, I hash this thing, I get a random string there, and I use that. I have a feeling that's something much more than that. Sure.

Starting point is 00:06:36 Let's talk a little bit about the technical side of things, and how anonymization is actually built and implemented on top of data, especially how it relates to how it works with data that we might not think that are important to anonymize. Like, okay, the email or my social security number are very obvious. But there might be very clever ways to identify a person, right? Yeah. So let's talk about that. Sure. So the identification of these attributes are typically categorized into two categories.

Starting point is 00:07:21 One is direct identifiers. The other is indirect identifiers. People call them quasi identifiers. So direct identifiers are what you just went over names, addresses, social security numbers. And so that's, you know, a lot of people are working on that and you know, how you treat those are, it just depends on who the user is. A lot of cases like tokenization is fine. In other cases, like in DevTest, you know, someone, developer's going to scream bloody murder if your email address is some random string of characters.

Starting point is 00:07:54 They're like, I need it to be an email address. Some people need it to be a, you know, like a routable email address. So, you know, you have all these different concerns for that. Indirect identifiers are where it gets tricky. And that's what we were focused on initially, because it's really important to healthcare. And it's also important to CCPA, some GDPR things as well. But this is where, you know, you can't identify someone directly with their zip code or their gender or their date of birth.

Starting point is 00:08:25 But if you combine those together, it becomes more unique. And then you can identify people. And so the risk is what's referred to as a linkage attack. And so you go get some data somewhere that has people in it, and then you do a statistical attack. Essentially, you try to relink these people based on the sequence of quasi-identifiers. And then you can assign probabilities to, you know, what's the likelihood that this person here who's anonymous is this person, this real person that I know. And, you know, it doesn't have to be a hundred percent to be a risk, but sometimes you can match with a lot of certainty. And so for this type of anonymization, we use what's called k-anonymity. The concept is you create groups, and our algorithm is in a category of algorithms called microaggregation. And the idea is, essentially, you create clusters for everybody in the data set.

Starting point is 00:09:40 And then, so, you know, you cluster people based on similarity. And then the more anonymity, the more protection you need, the larger the group. And so, you know, we cluster all these people together and then we find the center of the cluster and then we make everybody match the center. And so essentially, you know, maybe we find somebody who lives close to your zip code, who's the same gender, close age. We'll shift you guys to be the same, and then, you know, you will no longer exist in the data set. Maybe you don't change at all in the data set, but now there's at least one other identity in the data set that matches your combination of quasi-identifiers exactly. And so now, at a minimum, if someone is trying to link across the data set, there's two.
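[Editor's sketch] The grouping idea described above can be illustrated with a minimal Python example. This is not Privacy Dynamics' algorithm, which the conversation does not detail. It assumes purely numeric quasi-identifiers (zip code and age; a categorical field such as gender would need its own handling) and uses a plain sort as a stand-in for real similarity clustering. The names microaggregate and smallest_group are hypothetical.

```python
from collections import Counter
from statistics import mean


def microaggregate(records, k):
    """Replace each record with the centroid of a group of at least k records.

    records: list of equal-length tuples of numbers (quasi-identifiers only).
    Assumption: sorting is used as a crude similarity ordering.
    """
    if k < 1 or len(records) < k:
        raise ValueError("need at least k records")
    order = sorted(range(len(records)), key=lambda i: records[i])
    # Consecutive chunks of size k; a short final chunk is merged into the previous one.
    groups = [order[i:i + k] for i in range(0, len(order), k)]
    if len(groups) > 1 and len(groups[-1]) < k:
        groups[-2].extend(groups.pop())
    width = len(records[0])
    out = [None] * len(records)
    for group in groups:
        centroid = tuple(round(mean(records[i][c] for i in group)) for c in range(width))
        for i in group:
            out[i] = centroid  # everybody in the group now matches the center
    return out


def smallest_group(records):
    """Size of the smallest set of identical records (the k actually achieved)."""
    return min(Counter(records).values())


people = [(78701, 34), (78702, 36), (78745, 61), (78746, 59), (78703, 35)]
treated = microaggregate(people, k=2)
print(treated, smallest_group(treated))
```

Raising k makes each group larger, which is the "more protection, larger group" trade-off mentioned in the conversation.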

Starting point is 00:10:44 And then you can increase it to make it even more difficult. Okay, that's fascinating, actually. So, I mean, I would imagine that like if I was the data scientist in, let's say, an InsurTech company. And that's not a random example. I chose it on purpose because we had a conversation at some point in the past in the show with some people from InsurTech, and we were talking about that, like privacy, right? They were like data scientists, saying, I mean, it isn't an issue in the way that, like, we need to remove anonymity to do our job in a way, right? Like to build these models, because it's like we need this information to go and do, like, risk assessment of like whatever, right? So from what I understand, everything is like a matter of making the right trade-offs, right?

Starting point is 00:11:48 How do we do that? Because, okay, in theory, I get it, what we are saying, but in a real setup, right? Let's say I'm that data scientist and I'm going to use your platform. How do I choose the right parameters there? How big the k should be, right, and all that stuff? Yeah. So, all right, so there's a flip side of this. And so we have this kind of privacy dashboard for everything. And so we have a set of tools with two things, right? So whenever you treat data like this, something is falling on this privacy-utility curve, right? So you increase privacy up to a point, and if you do 100% privacy, your data is 100% noise, right? And then you slide it all the

Starting point is 00:12:34 way the other way back and there's no privacy. And so, yeah, you want to find that sweet spot. And so privacy you can measure, and we do this risk assessment. This is almost pulled directly from healthcare literature on how to... Essentially, we do these like... It's like a Monte Carlo simulation attack. And so we do this simulated linkage attack. And then we say, here is approximately how linkable we think your data is. And then we put them into these categories, basically low, medium, high risk.
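[Editor's sketch] A simulated linkage attack of the kind described above might look roughly like the following. It is a simplification under stated assumptions: the attacker is assumed to know exactly which group of identical quasi-identifiers a target falls into and then guesses uniformly within that group. The function names and the low/medium/high thresholds are hypothetical and are not taken from the healthcare literature mentioned in the conversation.

```python
import random
from collections import defaultdict


def simulate_linkage_risk(released, trials=10_000, seed=0):
    """Estimate how often a random guess within a matching group hits the target."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for idx, row in enumerate(released):
        groups[row].append(idx)  # rows sharing identical quasi-identifiers
    hits = 0
    for _ in range(trials):
        target = rng.randrange(len(released))
        guess = rng.choice(groups[released[target]])
        hits += guess == target
    return hits / trials


def risk_bucket(rate, low=0.05, high=0.20):
    """Map an estimated re-identification rate to a coarse label (thresholds are assumptions)."""
    if rate < low:
        return "low"
    return "medium" if rate < high else "high"


raw = [(78701, 34), (78702, 36), (78745, 61), (78746, 59)]
paired = [(78702, 35), (78702, 35), (78746, 60), (78746, 60)]
print(risk_bucket(simulate_linkage_risk(raw)), simulate_linkage_risk(paired))
```

With every row unique the attacker always succeeds, while groups of two bring the estimated rate to about one half, which matches the "at a minimum, there's two" reasoning earlier in the conversation.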

Starting point is 00:13:09 And so that's the risk analysis. And then the other one is we have tools for measuring distortion. And so we'll run the data set through the system and we'll show you, how have your distributions changed? How have your main top-level statistics changed? We recently added something that shows relationship distortion. So like, how has the relationship between age and some other column that maybe has non-identifiable information changed? And so this way a data scientist can look and see, you know, where is my privacy according to the risk assessment and how bad is

Starting point is 00:13:51 the distortion of the data. And so like ideally what you would do is, you know, dial it to as much privacy as you can get for the, you know, the distortion that you can accept, and then set that, and then, you know, let that be your baseline. Yeah, yeah, that makes total sense. Like how much, let's say, more complexity does it add to the life of a data scientist to do that? I mean, we would hope not that much. We would hope, you know, this is one of those things where we want to iterate on this if anybody gets blocked, but we try to make it as frictionless as possible. But ultimately, we want to just give you all the information you need and say, all right, this is too much distortion, or maybe we should notch this up, dial it, run it again.
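[Editor's sketch] The distortion checks mentioned above (top-level statistics and the relationship between a treated column and an untouched one) could be approximated as follows. The metric choices and the names pearson and distortion_report are assumptions for illustration, not the product's dashboard logic.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation; assumes neither column is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def distortion_report(original, treated, other):
    """Compare a column before and after anonymization.

    other: an untreated column used to check relationship distortion.
    """
    return {
        "mean_shift": abs(mean(treated) - mean(original)),
        "stdev_shift": abs(pstdev(treated) - pstdev(original)),
        "correlation_shift": abs(pearson(treated, other) - pearson(original, other)),
    }


ages = [34, 36, 61, 59, 35]
ages_treated = [35, 35, 52, 52, 52]   # e.g. after grouping similar people together
spend = [120, 135, 310, 290, 150]     # hypothetical non-identifying column
print(distortion_report(ages, ages_treated, spend))
```

A data scientist could rerun a report like this after each change to the privacy setting and stop at the point where the shifts are still acceptable.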

Starting point is 00:14:44 Hopefully, you kind of experiment a little bit until it's what you need, and then you don't worry about it anymore. Maybe you come back and check, maybe set an alert if the risk level changes more than some percent, something like that. But essentially our idea, what we want to do is let the data

Starting point is 00:15:05 scientists work on another problem. Like, we will handle the anonymization, and then, you know, you can come check the dashboard, you can integrate it in your system, and then you work on whatever it is that your company does. Yeah, yeah, yeah, 100%. Okay, let's go back to the other type of anonymization, which is like the social security number and all that stuff. And it was interesting because you mentioned developers being like, okay, I need something to look like an email. Obviously, I get that. If you have somewhere, let's say, a regular expression to match something and you want to test for that, that it actually works. If you have a random string there, that's a problem. Tell us a little bit more about that, because that's a part of, okay, we talked about the

Starting point is 00:15:53 data scientists, but there are always, like, also developers and engineers, like, involved. And they have different needs, right? And anonymization, let's say, affects their work in a different way. Tell us a little bit more about that, because it sounds very interesting, and especially around the developer experience, like working with their tools and how it affects their job. Yeah, it's an overlapping problem, but yeah, they have these unique concerns. So if you're a data scientist, if you want to work on anonymized healthcare data, it's probably just one data set or like a handful of data sets that may or may not

Starting point is 00:16:31 actually be linked together. Whereas a developer, they have a database with tables and those things have foreign key, primary key relationships, and you need to maintain those relationships. So that's something, you know, these are the features we started adding for developers. They're like, well, you know, we want to copy all these tables over and, you know, we don't want to expose the same keys, but we need to maintain the same key relationships. So you have to, you know, tokenize those a certain way. Email addresses,

Starting point is 00:17:05 we had to build format-consistent email. And like, you run into all these like little problems. One of them was like, their system was actually sending emails. And so, you know, it needed to be a valid email, but then it needed to not be routable. And so, you know, off I go into the, what is it, like the IETF document on email domain naming. And it's like, oh, well, yeah, there are actually a handful of these top-level domains. And so you build it in your format thing.

Starting point is 00:17:37 And .example, what is it? I don't know. But so you actually have these things that will pass their regex, but bounce, you know, if they try to send an email. So yeah. And then, you know, social security numbers, those are just numbers. But yeah, like names, one of the problems people have is, you know, yeah, you can generate names, you know, kind of like random normal-looking names, but they want it to be the same name for this record when it comes through the next time. And so, you know, we can do that in some cases, but not all cases. It's hard to, you know, you've anonymized this, but then you need to make it possible.
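[Editor's sketch] The "valid but not routable" email idea described above can be sketched with a reserved domain. The .example top-level domain is reserved for documentation and testing (RFC 2606, RFC 6761), so mail to it will not be delivered. The regex, the user_ prefix, the mail.example host, and the fake_email name are illustrative assumptions, not the product's format.

```python
import hashlib
import hmac
import re

# A typical loose email pattern a developer's test might use (an assumption).
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")


def fake_email(real_email, secret):
    """Return a syntactically valid address on a reserved, non-routable domain.

    The same input and secret always yield the same output, so repeated runs stay consistent.
    """
    normalized = real_email.strip().lower().encode("utf-8")
    digest = hmac.new(secret, normalized, hashlib.sha256).hexdigest()
    return f"user_{digest[:12]}@mail.example"


secret = b"demo-only-secret"  # hypothetical; a real key would come from a secrets store
masked = fake_email("Jane.Doe@corp.com", secret)
print(masked, bool(EMAIL_RE.match(masked)))
```

The output passes a format check like the one above, yet any attempt to send to it should bounce because nothing is registered under .example.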

Starting point is 00:18:29 You need to be able to make sure that it, you essentially want, it's like a cryptographic hash, but with someone's name. So that like this row comes back again and it gives you the same name. Yeah. So yeah, so that's like, these are the kinds of things we're working on now to improve the, you know, developer workflow. Yeah, that's super interesting. What other types, because, okay, we talked about, like, names, like, foreign keys. What other types are, like, tricky and challenging and, like, developers care about?
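[Editor's sketch] The "cryptographic hash, but with someone's name" idea, and the earlier point about hiding keys while keeping key relationships, can both be illustrated with a keyed hash (HMAC). Everything here is an assumption for illustration: the tiny name lists, the function names tokenize_key and pseudonym, and the secret. With lists this small many records would share a name, and the conversation notes that consistency is only achievable in some cases.

```python
import hashlib
import hmac

FIRST_NAMES = ["Alex", "Blake", "Casey", "Devon", "Emery", "Finley"]
LAST_NAMES = ["Hart", "Lane", "Moss", "Reed", "Shaw", "West"]


def _digest(value, secret, purpose):
    # The purpose prefix keeps key tokens and name choices independent of each other.
    return hmac.new(secret, f"{purpose}:{value}".encode("utf-8"), hashlib.sha256).digest()


def tokenize_key(key, secret):
    """Stable opaque token for a primary or foreign key value."""
    return _digest(key, secret, "key").hex()[:16]


def pseudonym(record_id, secret):
    """Same record id in, same realistic-looking name out."""
    d = _digest(record_id, secret, "name")
    return f"{FIRST_NAMES[d[0] % len(FIRST_NAMES)]} {LAST_NAMES[d[1] % len(LAST_NAMES)]}"


secret = b"demo-only-secret"  # hypothetical
customers = [{"id": 101, "name": "Jane Doe"}]
orders = [{"order_id": 9001, "customer_id": 101}]

masked_customers = [
    {"id": tokenize_key(c["id"], secret), "name": pseudonym(c["id"], secret)} for c in customers
]
masked_orders = [
    {"order_id": tokenize_key(o["order_id"], secret),
     "customer_id": tokenize_key(o["customer_id"], secret)}
    for o in orders
]
print(masked_customers, masked_orders)
```

Because the customer id is tokenized the same way in both tables, a join between the masked tables still lines up even though the original key values are no longer exposed.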

Starting point is 00:19:03 Like, what about timestamps or, like, dates, for example? Yeah, with timestamps, like, that's something, it's a rabbit hole. Like, you're like, oh, timestamps. It's also, you know, talk to a developer who's, like, a senior developer who's worked with a lot of data about time zones, and they'll just, you know, the color will wash from their face, right?

Starting point is 00:19:25 Right. But this is, you know, it's the same thing with dates because how many date formats are there? Right. So, you know, you like that is like,

Starting point is 00:19:36 it's just one of those problems. It's a big messy problem. There's not like a beautiful, simple, you know, design that solves it. You know, it's like you just have to build it out kind of as needed. Luckily, you know, each additional thing is not an enormous challenge. Some things are trickier than others, but the trickier stuff is more in that we also try to identify all of these things up front.

Starting point is 00:20:09 We try to, so that you don't have to go and like, if you have, you know, thousands of columns, you don't have to go through and maybe you just go through and check if you got stuff right. Because it's like, it's probably impossible to get everything

Starting point is 00:20:23 everywhere 100% right. So, you know, people are always just going to have to check this stuff. But our goal is to have it as automated as possible. But some things are just like trying to find the U.S. Postal Service rules on what is a valid address. And even then, let's say you get that right, like, I got most of it, but a lot of data is entered by humans and they will enter it wrong. And so you have to handle that too. So those are not fun because they're just messy and kind of annoying. And like, the hardest thing about it is you have to build a system that can withstand

Starting point is 00:21:06 all these additions and kind of bolt-on exceptions and things without making it incomprehensible every time, because you're never going to stop adding stuff to it. Yeah. And then if it just turns into a pile of spaghetti, it becomes unmaintainable. So that's a totally different challenge. Yeah. No, and it's a very interesting problem, to be honest. Like it's... All right. So talking about working like with the data, let's talk a little bit more about the actual, like the product experience, right?

Starting point is 00:21:38 Like let's say I'm a developer. We have, let's say, a database somewhere. And I want to take your product, the Privacy Dynamics product, and use it on my database. Like, how does it work? What does it take? How easy is it? How transparent is it? And like, what's the process after that? I mean, ideally it's super easy. Let's say you have Postgres, BigQuery, whatever. You sign up. You create a connector. You enter the credentials and location for the source database.

Starting point is 00:22:20 And then you create another one for the target database. And you walk through a wizard. We introspect the tables and columns, and then, you know, we'll try to auto detect there. You kind of check which ones you want to keep and what settings you want, what anonymization do you need, what defaults, you know, unless you have, you know, hundreds of tables, it's a pretty quick process. And then you go through and you set a schedule and it runs. And then assuming, you know, you don't need to make a bunch of changes to the, what, what is included or excluded from the project. You know, all you would need to do is check the dashboard, see if everything, you know, if the data looks like you expect it to in terms of like, did the distributions look good?

Starting point is 00:23:13 Did the, you know, the auto detection work like you expected? And then after that, you know, hopefully you don't need to use it that much, except if maybe if you wanted to integrate it with some part of your process. Okay. So let's say we set it up and who's usually like inside the engineering work that is doing the setup and installation. What type of like engineer is usually like involved in that? Is it like a DB admin? Is it like someone from security, from InfoSec? Is it someone from...

Starting point is 00:23:48 I don't know. Yeah, it's usually an admin. I haven't come across anybody who doesn't have good experience programming. Usually they're working in infrastructure operations. I mean, they're the ones who are setting it up.

Starting point is 00:24:08 Because this deals with sensitive data, we have a SaaS product, but also we did a lot of work to make sure that we can install this on-prem as well. And so those are much more involved because we work with their ops people, things like that. If you use the SaaS, you know, you just sign up and then all you need is access to the database. So, you know, if it's a small company, you might just have to look up the credentials and then you're good. Yeah. Because we want CSOs to be able to just, you know, click and then they have the information they need. Yeah, 100%. And okay, let's say now I'm, I don't know, like a product engineer, right? Like I'm building a front end and I'm going to have like

Starting point is 00:24:53 access to this production database. Do I have to know about the existence of Privacy Dynamics? Like, how do I interact with the data? Right. So it would fit in your pipeline, your ETL, and then there would just be, you know, the way we would recommend setting it up is, you know, very few people have access to the sensitive database. And then, you know, you rope that off, and then, you know, those credentials are encrypted on our system or in your infrastructure. And then there's a less private database where, you know, more engineers have access to it, that's maybe in the lower environment. So like

Starting point is 00:25:29 that. And so then, you know, you give that to the engineers. Oh, so they don't even, they just know there's a database and that thing is kept up to date? We run batches. Okay. So it's the thing, you know, on whatever kind of increment you need. Oh, right. Okay. Oh, I get it. So the anonymization or encryption, the processing of the data, doesn't happen like on the fly when I execute the query.

Starting point is 00:25:55 Right. It happens like you create a replica of the database anonymized and then people go and access that. Yeah. The anonymization process, no matter what, it's somewhat expensive. And also we have to have a picture of, of all the data in order to anonymize it. Yeah.

Starting point is 00:26:13 And also to do the risk assessment, we need to know everything that's in it to say, you know, because one unique row increases, increases the linkability. So we have to see everything. Yeah. Make,

Starting point is 00:26:23 okay. Streaming is something that has, like, it's definitely something we've discussed and want to do because we'll need to do it for extremely large data sets, but it's a, it would be a very large project, but it'd be something really fun to work on,

Starting point is 00:26:41 but I have to, you know, got to stay focused on what everybody needs. 100%. No, that makes total sense. So, okay, from what I understand, like, we are talking about use cases that are more like in the analytical use cases, right? Like, so someone's going to work, like, with a static data set

Starting point is 00:27:02 that they are going to extract, like, from the database, like a data scientist who wants to build, let's say, a model. And not that much use cases where, for example, you would have, let's say, a real-time application who is attached to the database and needs

Starting point is 00:27:18 to have very consistent and up-to-date data that are also anonymized. Is this correct? Do I get it right or do you also see more real-time use cases? I mean, it wouldn't be actually truly real-time, but you can, depending on the size of the data, we can run it pretty quickly. We can run it, you know, hourly or even, you know, every 10 minutes if you needed to,

Starting point is 00:27:49 if it wasn't an enormous data set. So we can keep data pretty up to date. Okay. But yeah, it has to, it, well, you know,

Starting point is 00:27:57 also like if you have big data and you can install on-prem, we can outfit you with a really large instance and it'll go faster. But yeah. That's very interesting. So you mentioned big data and one of the most important, let's say, jobs that a data engineer has

Starting point is 00:28:19 is to make the pipelines incremental, right? Because when you have billions and billions of rows, going and processing everything from the beginning every time is, like, overkill. How can you do that when you need to have access, in a way, to the whole data set to do the... Oh, no. Well, we have to reread it. Okay. We have to reread it. And so, yeah, incremental is,

Starting point is 00:28:45 that's something that we've sketched out as an idea, but it's really hard because you have to essentially, we're clustering everything. Right. And so we have to up, how do you update, you know, you create all these clusters and then you add a thousand rows yeah how do you

Starting point is 00:29:08 how do these clusters change that's complicated yeah and so managing that like that's pretty like i think we could handle the more like data side you know streaming streaming, streaming the data, running our like transformations that that's all, you know, what they call a SMOP, right. Simple matter of programming. The, the like updating a cluster, you know, like a cluster data set that's going to take some, that's going to take some tinkering. Yeah. Yeah.

Starting point is 00:29:43 But yeah, it's something we want to do. Yeah. Yeah. It's something we want to do. Yeah. Outside of tabular data, do you see other data that are also part of the images, PDF files? How do you work with this type of data?

Starting point is 00:29:59 We don't yet. The thing we've gotten the most requests for is more semi-structured data, JSON or, you know, just arrays, things like that. And like, yeah, that's something we need to do, but it's also really challenging, for the same reason, you know, dates are challenging, except it's like by an order of magnitude. Right. So it's like you had these weird dates. Well, we were just talking to somebody recently at this conference, and they were talking about this column that was like a JSON blob. Yeah. And there's no schema for it. And so I can't even assume that row to row. It's doable.

Starting point is 00:30:44 It's just, it's a big lift. So, yeah. So what we would have to do is like take that data, normalize it, run our anonymization, have a map back to the original data and then, you know, and then do that to maintain that format consistency of just completely arbitrary. Yeah. Yeah. No, it's, I mean, I can feel what you're talking about. Very rewarding if you figure out all these, like, little, like, things that can go wrong. Like, but it's a very challenging, like, problem that you are dealing with.

Starting point is 00:31:20 I'd love to be able to just sink my teeth into some of those problems. Those... They are... That's fun. I mean, I don't know. I think even if you might not like to solve dates, your name is going to live in history. Yeah. I always... When was it?

Starting point is 00:31:40 I have a problem with dates in databases. I always forget these stupid languages where you define the format. I always have to go back to documentation for each database and see what the format is, when do I need it, why that's capital when it's not capital. It always looks like a regex, but it's not a regex. Yeah, exactly. And I was going through that again. I'm too old for that. And I was like, we live in an age where we have OpenAI that is going to, I don't know, make us all obsolete or whatever they say on Twitter today, but I still cannot give a date and software tells me this is

Starting point is 00:32:30 the format in this language. Or at least I'm not aware of this. If someone is aware of a library that does that, please let me know. You will make me a much happier person. So it's like the reason I'm saying that is because, you know, there's always a lot of hype around what is currently happening, but people don't realize how much real hard engineering, boring in some ways, needs to happen for all these things to actually work at scale at the end. From one side you have like, okay, OpenAI, ask if there's a god and he's replying to you,

Starting point is 00:33:18 and on the other hand, yeah, you have to go and still struggle with dates, right? And it's not a solved problem. Like it's still there. So I can feel you, and it's like, I think you should be talking more about that stuff. Like, I don't know if you have like a blog or something, like talk about all these like little problems, like what you said about the email, like we need to test it and make sure that, like, it's sent, you know, and goes through a mail server or something, even if it bounces. All these little things that nobody

Starting point is 00:33:51 cares about until they have to. And they're there. 99% of the engineers out there, that's why they get grumpy every day because they have to deal with these things. It is. Right? Yeah.

Starting point is 00:34:06 So it is important to talk about that stuff, I think. Anyway. It's not glamorous, so people don't want to talk about it. Yeah, yeah. But, I mean, I don't know. I think, like, we can make it glamorous, like, if we talk about it and be, like, realists at the end. Like, it's not just, like, all these small things together

Starting point is 00:34:24 is what changes the world, you know, like, at the end. It's not just like suddenly one day you come up with a trained model on OpenAI and it happened like out of the blue. No, there are many people that had to figure out a lot of like wrong dates to train this thing. Exactly, yeah. The real world is very messy and solving problems in the real world requires addressing that messiness. Yeah, yeah. A hundred percent. And we have to embrace it, actually.

Starting point is 00:34:51 That's also important. Talking about messiness, let's go back to your experience being an engineer, founding engineer in a startup, right? Tell us a little bit more about how it feels, what kind of experience it is. It looks like how different it is because, okay, I think people can imagine probably. But what do you have to go through as an engineer to make yourself productive in such an environment? Yeah, it was certainly for a while uncomfortable, right? Like just the real shift in what my objectives were,

Starting point is 00:35:35 which went from being, you know, we know the customer, we know exactly what they need, we're going to build this feature, and, you know, it's clear, to going to this situation where, you know, the ground's moving. You know, I had gotten into a comfort zone where I was able to keep things neat. Everything's tidy, everything's just so. You know, it's easy to figure out where everything is. And, you know, it was nice. I was the kid who cleaned his room, right? But then going into this startup world, it's like, that's a luxury. You can't really have it all the time. And that's not to say you have to embrace creating messes. You just have to prioritize. Someone called it brutal prioritization. Yeah, I think it is. But it's uncomfortable. You have to say, when do I have to stop on this? And also, like, you know, what

Starting point is 00:36:41 do you have to set down? And yeah, you really have to think hard about, like, what is the most consequential thing right now? And, you know, I always have this paranoia. Like, that's how I,

Starting point is 00:36:54 that's what drives a lot of my design, is like, what is the most likely thing to come up and bite me in the ass? Like, what is something we're going to forget about, and it's just going to ruin our day someday? Like these little, you know,

Starting point is 00:37:10 time bombs. And so you really want to try to not set those things up. So when you're just, like, running full speed ahead, six months later you just, like, trip and eat it. And you're just, like, cursing your former self. So it's a lot of, like, what's gonna hurt the least? Yeah. You know, and so it's just

Starting point is 00:37:32 you do have to, I think, be okay with being a little uncomfortable. And yeah, that's kind of the big change in startup world for me. Yeah. Yeah. And dude, you chose to go and work in a problem that's just an infinite number of exceptions that you can't have in your mind beforehand. You really set yourself up. Working on anonymity stuff was different, right? There's literature. You can read all the stuff that people are working on in research. But then it's like, oh, we need to automate the format detection.

Starting point is 00:38:09 Oh, okay. Well, I've done messy stuff like this in the past. In the legal platform, this is all human-entered stuff. Yeah. And then we had this case database of all these cases. And some of those are dating back, you know, hundreds of years. Some of them were probably entered on typewriters and then OCR'd or whatever. So, you know, this messy stuff. So I was used to kind of working around that. But this is

Starting point is 00:38:36 like, that was our messy data. Yeah. Now it's like everyone's messy data. So it wasn't a problem I wasn't familiar with. It's just a different scale. Yeah, yeah. That's so interesting. I think there's also, especially when you're at pre-product market fit, because after that, I think things get more normalized, right? You at least have, let's say, six months ahead of you that you know what you are going to be developing.

Starting point is 00:39:11 But before that, I think, like, the way that I visualize it, like, the process, and it's not just for engineering. I think it's just, like, much more uncomfortable for engineering. It's for everyone. Is, like, doing this thing where you're in a sauna. You have to be in a sauna and get really hot and be like, yeah, what we're going to build, it's going to be, like, yeah, we are going to own the world. And then suddenly you are doing an ice bath, because you put this thing out there and suddenly it's like, what the fuck is this? Like,

Starting point is 00:39:39 no, I'm not going to pay you for this shit. And you have to go back and forth and not have a heart attack. That's emotionally the thing that you have to go through. And for a salesperson or a marketing person, let's say, who grew up working in an environment where everything is unexpected, it might be a little bit easier. But for engineering, where, okay, at the end we live in a very deterministic world, right? Like, for us everything has to be boolean in a way. Like, it works or it doesn't work. There's no in between there. Like, if it doesn't pass the tests, we don't push in production, you know. Like, that's like a very

Starting point is 00:40:25 I think, like, from an emotional standpoint, at least, it's a very different experience, and it is brutal. Like, 100%. Yeah, I completely agree. And it's hard to accept that, you know, this thing that you built, the uptake on it isn't what you expected, but we're still doing really well right now. You know, it's like, we're still going to need this. It's just, that wasn't the, like, unlock. Right. And so, yeah, you know, you have to steel yourself a little bit more emotionally, for sure. Yeah.

Starting point is 00:41:07 One thing I've shared with our marketing team, when we're just, you know, hey, we need to ship this project on a super aggressive timeline, is a quote from Mario Andretti, you know, legendary race car driver. He said, if it feels like you're in control, you're not going fast enough. And I think that's, I mean, it applies to everybody at a startup, right? It's like you have to go faster than you're comfortable with and just know that you can maintain control. You're just not going to feel like, you know, you have as much as you want. Like, your room's not as clean as you want. There are dishes in the sink. You know, you just have to get comfortable with it not being as tidy as you want.

Starting point is 00:41:47 And just keep moving. Because I think a lot of times as we move, that's how things get better faster. Instead of like, I got to get this thing perfect first. But man, yeah, it's emotionally taxing. Oh, yeah. And I think a way to think about it is, you mentioned like data out there is messy, right? And like at the end, like data is like a very simplified model of the world that we live in.

Starting point is 00:42:13 So the world is even messier than that. So you just have to embrace that. And yeah, okay, it's easier to say than do, but yeah, at the end, that's what you have to go through. But it can be fun. So don't be discouraged. At the end, it can be fun.

Starting point is 00:42:33 Yeah, the fun stuff is definitely very fun, right? Just because once you get something right, it's like looking back on all the work it took to get there. Yeah, you can kind of impress yourself, and that's really gratifying. Yeah, yeah. And now I remember, like, someone said, it probably was a tweet or something, like, many years ago, that having a startup is like having a newborn. It's, like, 99% of the time crying and full of shit, but this 1%, like, when it smiles at you, it's so rewarding. That's very apropos. I joined this startup months before we had our first kid. Oh, wow.

Starting point is 00:43:16 Yeah. And so, yeah. So it was two babies at once. Yeah. I think that's an apt comparison. Yeah. You have quite the mettle if you could handle that. That's amazing. Yeah. We're at the buzzer here, but we'll be on the lookout, one, for your blog about hacking the problem of emails that send,

Starting point is 00:43:40 but don't deliver. If folks want to learn more about Privacy Dynamics, though, and check out what you're doing, what you're building, where can they find that? Head over to privacydynamics.io. We have a doc site, and, you know, if you want to learn more about anonymization, we have detailed, you know, literature explaining how we do it, how it all works. We have blogs that show how to get started, you know, quick starts for all these different types of setups. So yeah, just head over. There's a lot of good information. Yeah, and actually I would say, I know that, okay, our audience is probably more on the technical side, but I think anonymization

Starting point is 00:44:25 is one of the things that everyone should read about. And read not just about the legal aspect of that, but just to see the effort that goes into engineering for these things to happen. And we should all be at least a little bit aware of what is going on. Because at the end, it's our data, right? The medical records belong to me. Like, yeah, someone's storing that, but it is my data. So we should all be more literate around that stuff. And it's amazing that you are building that kind of knowledge base, so we should spread the word around. Yes, it's great. And if anybody has any questions, just reach out to us.

Starting point is 00:45:07 Our emails should be on our website. So yeah, we're happy to answer questions. Awesome. Thank you so much. Thank you guys. I had a lovely conversation. Yeah, I really enjoyed it. Well, thanks.

Starting point is 00:45:15 Yeah, thanks for joining us. Listeners, thank you all for joining us as well. Check out privacydynamics.io and we will catch you on the next episode. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week.

Starting point is 00:45:32 We'd also love your feedback. You can email me, Eric Dodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
