The Data Stack Show - 19: Defining Data Governance with Stephen Bailey from Immuta
Episode Date: January 6, 2021

This week on The Data Stack Show, Kostas and Eric are joined by Stephen Bailey, Director of Applied Data Science at Immuta. Immuta is a startup that focuses on enabling data teams to have really fast, efficient, and understandable access controls on their data.

Highlights from this week's episode include:
The problem that Immuta solves (2:04)
Stephen's background researching how the brain works (4:56)
Immuta's stack (15:09)
Leveraging metadata (18:02)
The main use case for Immuta is simplifying the access control layer (20:06)
Unifying data (31:52)
Defining the quality of data (34:04)
Learning to trust the numbers (39:42)
What's next for Immuta (46:15)

The Data Stack Show is a weekly podcast powered by RudderStack. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.
Transcript
Welcome back to the Data Stack Show.
We have another fascinating guest for you, Stephen Bailey of Immuta.
He works on data governance inside of a company that has a product that does data governance.
So it's going to be really interesting to hear about potentially his own usage of the product in his work, but he also has a fascinating background in studying the human brain, which I hope we can talk with him about as well. Kostas, you are doing some data governance work in our own product right now. What questions do you have for Stephen that you're interested to ask about?
yeah absolutely i mean data governance in general is a very hot topic lately.
There are many things associated with it, from access control to the data, to data quality, data catalogs, metadata management, all that stuff that sounds a little bit too enterprise-y many times. But actually, the more we work with data, the more of a necessity they become.
And all of these are problems that we haven't solved yet.
So it's very interesting to have a company that is trying to solve this problem.
So yeah, there are plenty of questions around how they do it, why they do it, what the use
cases are, and how they approach, in general, the actual definition of what data governance
is.
So I think we are going to have a very interesting discussion and a very useful one for anyone
that works with data today.
I agree.
Well, let's dive in and talk with Stephen.
Today we have Stephen Bailey from Immuta.
Stephen, thank you so much for joining us.
Thank you all.
I'm excited to be here and chat through some interesting data governance and privacy topics.
Well, that's a subject that we love. But before we get going, you have such an interesting background with a variety of different experiences. We'd love to get a quick overview of your background and what led you to Immuta, and then also just give us an overview of what Immuta does.
What problem are you solving?
Sure, I'd be happy to.
So I have always been interested in a wide variety of things. In college, I did a chemistry and philosophy major and really enjoyed digging into history and literature and intellectual ideas and bandying those about. But when it came time to get a job, I actually started in education, working in business operations for an education nonprofit. Then, through a series of events, I went and got my PhD in cognitive neuroscience and investigated how kids learn to read and how the brain changes as kids grow from four to five to 15 to 55. What I found throughout that journey was that I just really loved working with data. I loved asking questions. I loved figuring out what is valuable and what is not. And even the process of managing data itself: there are endless opportunities to optimize and change and improve things. And I just really fell in love with it.
So as I was finishing up my PhD, I started looking for data science jobs, found Immuta,
and it was just a perfect fit.
Immuta is a startup that focuses on enabling data teams to have really fast, efficient, and understandable
access controls on their data. And we use the word governance in most of our marketing materials,
but really it's all about enabling more efficient and more responsible access control. So technically, the way we work is we either sit in front of your database and mediate access to data, enforcing fine-grained access controls like masking and row-level security directly, or we have plugins, essentially, that sit on the database systems themselves and can enforce access controls natively in the system. So these are for technologies like Databricks and Snowflake, the cloud-native technologies. What's really exciting to me, as someone on the team who works on and leads our internal analytics efforts, is that access controls, data quality, data governance: this is really the place where data engineering meets data science meets the business requirements. And all these people have to come to the same place. And it's very much a not-solved problem. I think there are as many ways to define data governance, and to define what good data governance looks like, as there are companies that are using data. And so it's just a really, really rich place to innovate in.
Well, I know we have tons of thoughts and questions around data governance, and we'd love to even discuss the different definitions for that word, because as you said, data control, data governance, data access: there are overlapping components in those definitions. But before we get into that, I just have to ask this question, because I know from researching you that you have young kids, and you did a PhD in understanding how kids learn to read. So I would love to know about your experience studying that at a doctorate level and then seeing your own kids learn to read and being part of that process. Is there anything interesting you can share from that experience of studying it from an academic, data-driven perspective, and then your own experience actually doing that with your own kids?
Oh man, that is such a good question. And the reason I love it is because it really
showed me that the experience of studying cognitive neuroscience, and specifically how the brain rewires itself when you're learning to read... The brain takes visual circuitry and auditory circuitry and semantic association circuitry and makes a super efficient connection between those different systems in order to enable you to read rapidly and automatically. And that happens through practice, practice, practice, practice, practice. And you can actually observe this happening in the brain. That's what my lab was focused heavily on doing, using functional MRI. But, you know, I spent five years learning the techniques to manipulate medical images and do these group analyses and clean all of the data and all this stuff. And then it really teaches you nothing about actually teaching kids how to read.
We're in the middle of it. I shouldn't say it teaches you nothing, but it doesn't prepare you for the experience of actually teaching a child to read. So I think there are some principles that you can get. Rewiring the brain takes practice, practice, practice. It takes attention. So it's not just about the amount of hours; you've got to have good hours. The kids have to be focused, right?
Like quality versus quantity. It's not just brute force.
Yep. And you've got to scaffold what you're learning. So you're learning a bunch of skills that can kind of be learned independently, and then you've got to learn to associate them together, and then you've got to practice. And then the other piece is the emotional piece. The more kids like to read and enjoy reading with you, the more open they'll be to additional practice, which leads to more neural refinement. So you can reduce the equation, so to speak, to some very dry variables from a scientific perspective. But when it comes to actually raising a kid who loves to read, you have to embrace the human elements of creating an environment where they enjoy it, and finding books that they like. And all of these pieces are super important. So there's both the scientific question, and then there's the human
question that you have to take into account in practice.
Fascinating. And I would argue, and I'm monopolizing here, so I want Kostas to jump in, because I know he's honestly dealt with a huge number of data governance issues, but it's interesting: in many ways, I would say the same principles apply even to data within an organization, where having clean data and focusing on a process is one thing, but you have real teams using real data, which is messy. And when the rubber meets the road in a fast-moving company, it's a little bit of a different game.
Yeah. But actually, I have a question similar to your question, Eric, before we move on. So Stephen, you said that you studied how kids learn, but you also tried to figure out how this happens in later stages of a kid's growth. You mentioned some stuff earlier about emotion and attention. Are these things that, how to say it, are still important in later stages of our lives? For example, how important are these for a person at my age or your age? Because we keep learning, right? It's not like we stop learning at some point. Maybe not as rapidly or as efficiently as a kid can, but learning is something that continues through our life. So how do these things change as you grow up and get older?
That's another good question. And I'm really loving going
back to this brain stuff, because I haven't talked about this since I graduated, so this is a breath of fresh air for me. This is awesome. In developmental neuroscience, there's what they call critical periods, where children or adolescents are particularly disposed to gain new skills. They can really just soak it up. If you see a child learning language when they're between two and six, they just ambiently pull it all in, and it just kind of takes shape. What happens during those critical periods that doesn't happen when you get older is that your brain is particularly plastic. It is actually going through and quickly disposing of connections that aren't as useful. So you have what's called pruning that happens. And as you get older, you sort of settle into a very efficient pattern.
So I would say like the general model that you can think of is when you're young, you're
very disposed to create new connections very quickly.
But as you get older, your brain basically figures out what are the most efficient paths
for what I need to do.
And it becomes more efficient and automatic
at doing those things. Now, what's cool about the brain and why everyone loves studying it is you
can change that as you get older. Right now, for example, I'm learning guitar, and I'm going from zero to trying to be able to play at least one song, right? And it's very challenging. It would be very challenging if I were eight years old.
But as an adult, I have a lot more awareness
and I know how to structure my practice in an effective way.
So I'm not worried about like not being able to learn that thing.
It's just that it's probably going to take me a little more time,
focus, and practice, and some structure and discipline around the way I'm doing it, to really be super effective at that.
Yeah, and probably you're also much better at controlling your emotions, something that kids need someone external to take care of. Actually, I found it very interesting that you added the concept of emotion to the learning process. That's very, very fascinating. But I think we need to arrange another recording just to discuss that stuff.
I know, I could go all day, because this is so fascinating. Emotion, just one last thing, and this will be a bridge to some data stuff. Anyone who studies the brain hopefully gets a little offended when people link neural networks, AI, directly to the brain. There's so much to what the body does that supports brain functioning that is just totally not even part of the conversation when many people talk about that relationship between neural networks and the brain. Hormones, cortisol, attention, emotion, even sensations from your body: all of these things are super important for brain functioning and brain processing, and there's just no real analog for them in data, computer science, neural networks.
Yeah, absolutely.
And I totally understand.
That was something that I was thinking about while you were talking about attention and emotion, because, for example, one big thing right now in all the neural network research that's going on is how to use attention, as it's called. Of course, attention in this context is much different from what attention probably is in the human brain. We keep trying to find some kind of parallels between how the human brain works and how these computational models work. So when you talked about emotions, that was the point where I couldn't help but say: okay, is this the next thing after attention? Are we going to try to put emotions in the neural networks too? But anyway, these are things that I think we need a lot of time to chat about, and we should probably arrange another call to do that. So yeah, let's move forward with talking a little bit more about your role at Immuta right now. What I wanted to ask you, and what I find quite interesting in your case, is that you have a data-related role inside a company that also builds a product around data, right? And I assume, and this is something that I would really like to find out during our conversation, that data governance is something that affects the lives and the work of data scientists and data analysts too.
So how do you use that internally?
What's BI and data analytics for Immuta, first of all?
How do you use it?
Is it for product?
Is it for business decisions?
And also, how are the principles and the concepts of Immuta used internally? Can you give us a little bit more information around that?
Sure, let me break this into two responses. The first is that we can talk a little bit about the technical responsibilities and stack, and then maybe about the organizational piece, because I think both are very, very relevant. We're heavy believers in dogfooding our own product.
And so one of the first things I did when we started building out our internal infrastructure for analytics was to get our product between our database and our analytics tool of choice. Our current stack should be pretty familiar to anyone who's heard of the modern data stack, as it seems to be called now: it's basically Stitch to Snowflake to Immuta to Looker. That forms the core. We also use Argo, which is a Kubernetes-native container orchestrator, for orchestrating jobs.
But it's a pretty standard setup for a small company.
So Immuta's role, which I think is really the interesting piece here, is as an arbiter of access control, but also as a place to land and focus our metadata management.
So we have job information coming in, and raw data coming in from Stitch and from some custom taps that we run. We have metadata about dbt and the models that we're building in dbt.
We have metadata about DBT and the models that we're building in DBT.
We have usage data from Snowflake.
And what we want to use Immuta for internally is to aggregate especially governance-related
data, such as where personal information is stored, who should have access to data, identity management concerns,
and to have Immuta push that to our consuming services, whether data scientists are accessing
data in Snowflake or in Looker. We're basically trying to build out a centralized governance
or access control capability there.
So from what I understand, with Immuta right now you have two main components and two main functions. One is the aggregation and management of metadata. And the other one is access control, which probably also needs the metadata in order to be implemented. Is this correct? Do I understand it correctly?
Yeah, that's correct.
So how is this metadata defined? Say you, as a data scientist, have to start implementing a new pipeline for your data. You have a new project. What is this metadata? How does it come into existence? And how, in the end, do you use Immuta to store this metadata and to use it outside of access management as well?
That's another good question. So the metadata
that we leverage in Immuta is all built around enforcement policies. So it tends to be much
simpler than the massive amounts of metadata you could associate with an individual
data set or pipeline. In particular, we want to define a minimal set of tags that are
related to any actions that are going to drive a decision about who has access to what data
for what reason. And so it basically boils down to three things, user attributes, data attributes,
and contextual attributes, like accessing data for a certain purpose.
You know, these are all elements of attribute-based access control, which a lot of companies implement. But what we've found in working with companies and employing Immuta internally is that you really have to take a step back at the beginning of building out your data warehouse and define your hard requirements: what data needs to be tagged, who should have access to what data, and for what reasons. And so at Immuta, we have a pretty transparent
organization around data, but we still have heavy requirements around making sure that any data that
comes in, we identify whether it has personal information in it, whether it has privileged information in it,
such as, you know, like employee salaries, for example, and making sure we're tracking
that as it propagates along the data modeling layer.
And then enforcing access control in our database system.
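To make that concrete, here is a minimal sketch of an attribute-based access decision in Python. The class names, tags, and purposes are hypothetical illustrations, not Immuta's actual API; the point is just that one decision combines user attributes, data attributes, and contextual attributes like purpose.

```python
from dataclasses import dataclass, field

@dataclass
class User:
    name: str
    attributes: set = field(default_factory=set)   # user attributes, e.g. {"department:hr"}

@dataclass
class Dataset:
    name: str
    tags: set = field(default_factory=set)         # data attributes, e.g. {"pii", "salary"}

def can_access(user: User, dataset: Dataset, purpose: str) -> bool:
    """Combine user, data, and contextual attributes into one decision."""
    if "salary" in dataset.tags:
        # Privileged data: HR only, and only for an approved purpose.
        return "department:hr" in user.attributes and purpose == "compensation_review"
    if "pii" in dataset.tags:
        # Personal data: any user, but only for an approved purpose.
        return purpose in {"fraud_investigation", "customer_support"}
    return True  # untagged data is open by default in this sketch

analyst = User("ada", {"department:analytics"})
salaries = Dataset("employee_salaries", {"pii", "salary"})
print(can_access(analyst, salaries, "curiosity"))             # False
print(can_access(User("hal", {"department:hr"}), salaries,
                 "compensation_review"))                      # True
```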
So we were discussing just now how you are using Immuta internally, and we used that to, let's say, describe a very important use case for how the product is used. Is this the main use case that you see, or have you seen people deploying Immuta in different ways, trying to address other problems beyond the things that you mentioned already?
The main use case for Immuta is simplifying that access control layer and uniting different systems with the same identity and access control.
In particular, one of the core innovations in our product, I think, is a global policy builder that's quite human-comprehensible. So if you're familiar with AWS IAM policies, you know how hard those can be to comprehend. Immuta makes it very easy to create a policy that a compliance person or a data access person or a data engineer can understand, and then apply it across any data set that's tagged in a certain way. It was one of our core bets when the product was originally built: to do data governance better, we have to have better communication channels around our data, and understand, if I'm a data scientist and I can't get access to data, why? And what attributes do I need to get access to it? If I'm a compliance person, what is actually being implemented in Snowflake, and who has access to it?
So that's definitely the main use case. And what is great about attribute-based access control, and particularly policy-based access control that's a little more human-understandable, is that it can take a ton of policies that might be in effect on a database down to a single policy in some cases, or a couple of policies in many cases.
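For a sense of why this collapses policy sprawl, here is a hedged sketch of what a tag-driven global policy could look like as plain data. The structure and field names are invented for illustration and are not Immuta's policy language; the idea is that one human-readable rule fans out to every dataset carrying a tag.

```python
# One human-readable rule applies to every dataset carrying a tag,
# collapsing many per-table grants into a single global policy.
GLOBAL_POLICIES = [
    {"name": "Mask personal information",
     "where_tag": "pii",
     "action": "mask_columns",
     "unless_user_has": "role:privacy_officer"},
    {"name": "Salary data is HR-only",
     "where_tag": "salary",
     "action": "deny",
     "unless_user_has": "department:hr"},
]

def policies_for(dataset_tags):
    """Return every global policy triggered by a dataset's tags."""
    return [p for p in GLOBAL_POLICIES if p["where_tag"] in dataset_tags]

# Any new dataset tagged "pii" picks up the masking rule automatically:
for policy in policies_for({"pii", "salary"}):
    print(policy["name"])
```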
Oh, okay.
That's great.
Eric, sorry.
Well, actually, I think you answered part of my question.
I was going to ask in what ways, and I know it varies with the complexity of the stack and the size of the organization, and even probably the industry and type of data, but you mentioned AWS IAM policies. Is that the primary way that people are solving this if they're not using Immuta or a similar tool? Or, I guess, what are the ways that people are experiencing the pain that you solve, and how are they trying to solve it outside of Immuta?
I think to answer that question, you really have to ask: who are you talking about, and where in the pipeline are you talking about? Because take even a very simple pipeline like ours. We have to manage data access in Stitch. We have to manage it in the raw tables in the database. We need to manage it in the Immuta-sanctioned part of the database. We need to manage any consuming application. So if you expose it in Looker, are you using a system user that has global access to the Snowflake data? If a data scientist comes in and wants to stand up some infrastructure of their own, how are you managing access to that? So I think there are two real issues.
One is there's just a huge proliferation of where data can be within an
organization.
And then the second issue is that no one knows the answer to any question of who should have what data. That's really problematic.
I think a lot of times, well, I won't say a lot of times, I've been in organizations where there are some documents that exist somewhere, on someone's computer or in some shared drive, about what a data policy is. But then, in effect, no one who's on the front lines knows what that policy really is. And so if someone asks for data, they just get data, or they might ask for data and no one knows how to get them the data. So I think having clarity around how data should be used, and also, of course, knowing where it is: those are the two biggest pain points that companies are facing.
Yeah, absolutely.
No, I think it's very, very interesting to think about various levels of access at various points in the pipeline and sort of the points where you do need some sort of governance around access.
One more specific question,
and then I'll hand it back over to Kostas.
But so in your pipeline,
you said that you go from Stitch
and some other sources into Snowflake
to Immuta to Looker.
So is Immuta actually sort of sitting
between Snowflake and Looker?
I ask because we leverage Looker
on top of Snowflake as well.
And just as a user of that particular piece of the stack,
I'm interested in what it's like
to insert Immuta into that equation
and what it's like to interact with Looker
running on Immuta,
if that's actually what you meant by how it works.
Yeah, so our Snowflake integration
and our Databricks integration
are what we call native workspaces,
which means Immuta sits behind the scenes
and actually creates views or secure views
of your data within Snowflake
so that your Looker would still be pointing to Snowflake.
And so what we have internally,
which is really actually a pretty neat experience,
is Google single sign-on to Immuta, to Snowflake, to Looker.
And so there's one identity.
People don't have to know any passwords
except for their Google password.
And Immuta is enforcing access controls,
whether they're row-level security or column-level masking
or just subscription-level access on the Snowflake account, without anybody ever even having to log into Immuta or change where they're pointing Looker. Now, in other cases, for example, we started out on Redshift. In that case, Immuta does act as a proxy, and so you'd be accessing your Redshift data through Immuta, and Looker would actually be pointing to Immuta's Postgres proxy engine. But the Snowflake integration is very cool, because you can create different warehouses and everyone accesses the data through the public role, but they have individualized access controls applied. So it really eliminates some role management issues that you might have if you're trying to do dynamic access controls in Looker.
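As a rough illustration of the mechanism described here (a sketch using invented table and entitlement names, not Immuta's actual implementation), a Snowflake secure view can filter rows based on the querying user, so the BI tool keeps pointing at Snowflake while the access decision happens underneath:

```python
import snowflake.connector  # pip install snowflake-connector-python

# Illustrative only: the view, table, and entitlement names are made up,
# and real Immuta-generated views will differ. The mechanism is a secure
# view that filters rows per user, so Looker keeps pointing at Snowflake.
SECURE_VIEW_SQL = """
CREATE OR REPLACE SECURE VIEW analytics.protected.orders AS
SELECT o.order_id, o.region, o.amount
FROM analytics.raw.orders o
WHERE EXISTS (
    -- row-level security: keep only rows this user is entitled to see
    SELECT 1
    FROM analytics.governance.user_entitlements e
    WHERE e.user_name = CURRENT_USER()
      AND e.region = o.region
);
"""

conn = snowflake.connector.connect(
    user="PLATFORM_ADMIN",      # placeholders; substitute real credentials
    password="...",
    account="your_account",
)
conn.cursor().execute(SECURE_VIEW_SQL)
```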
That's very cool. Very, very cool.
Yeah, that's amazing, especially when we are talking about managing access to many different products and tools. We already mentioned at least two, right? We have the database itself, and then we have the various different BI tools that are used there. So that's super cool, what you are doing there, Stephen.
Who is responsible for these policies? Who usually has the role of creating these policies in Immuta? Who is the user of Immuta?
This is a question where the answer varies depending on who you're talking to. And I think it also varies heavily with the size of the organization. At a small startup, speaking from experience, what I found is that the person who owns the data platform is the one who knows the most about the data. He or she knows where the data is most sensitive. And they're also the ones actually enforcing the policies for real, right? So if there is no centralized policy defined, then whatever the database policies are, that's the actual policy being implemented for that company.
But in larger organizations, you might have compliance organizations that have standards, and there's someone whose job it is to make sure that warehouses or data assets are up to that standard. What's challenging in that scenario is that data changes so fast. I mean, it changes all the time.
And so if the person who's owning the data platform
and actually releasing the data to people
isn't the person who's most on top of the policies
and maybe even defining the policies,
then it gets out of date.
You know, whatever that downstream organization has gets out of date, or it takes additional time to release a data product.
Whereas if you have the data platform team owning it, they're making sure that the data is up to snuff before they release it. It's almost like a CI/CD process for releasing data, or for data governance. And that's in some ways where I think the future is. It's sort of how it works now for Immuta users: when you create a pull request against your data warehouse, as long as you have the right metadata attributes on it, and you've put those metadata attributes in Immuta, then as soon as that data is released to end users, the correct policies will be applied. And you've already defined those policies up front in the first place. So it makes it easy for you to have one big initiative to define all your policies, and then just be confident that those policies are applied as you add new data sets.
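To picture the "CI/CD for data governance" idea, here is a minimal sketch of a pull-request check, assuming a dbt-style schema.yml with a meta block per model. The file path and the required keys are hypothetical:

```python
import sys
import yaml  # pip install pyyaml

# Hypothetical file layout and required keys: the idea is a pull-request
# check that fails the build when a model is missing the governance
# metadata that drives downstream policies.
REQUIRED_META = {"contains_pii", "owner"}

def check_models(schema_file: str) -> list:
    """Return one error per model whose meta block lacks a required key."""
    with open(schema_file) as f:
        schema = yaml.safe_load(f)
    errors = []
    for model in schema.get("models", []):
        missing = REQUIRED_META - set(model.get("meta", {}))
        if missing:
            errors.append(f"{model['name']}: missing governance keys {sorted(missing)}")
    return errors

if __name__ == "__main__":
    problems = check_models("models/schema.yml")
    print("\n".join(problems) or "all models carry governance metadata")
    sys.exit(1 if problems else 0)  # a non-zero exit blocks the merge
```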
That's very cool.
All right.
I think the product itself has monopolized a little bit of our discussion, which, okay, makes sense, because it's pretty interesting. And the kind of approach that you have is very interesting too, like what you said about the CI/CD.
But let's talk also a little bit more about your role inside Immuta.
So what is your team doing?
And what are the products that you are delivering?
Great question.
So I talked a little bit about my background.
When I started at Immuta, I was a data scientist.
I came in focused on doing some ad hoc data science projects, looking at performance considerations or doing, you know, maybe customer segmentation and things like that. As I pivoted more towards managing infrastructure and building a data platform for the organization, for downstream users, we went through that data maturity cycle of starting from, hey, let's just get some basic counts that everyone can agree on: count of customers, count of opportunities, count of these basic things, and getting consensus around that.
So that's where we started.
And then, where we've been going as we've started growing, is that we've been building all of this great analytics expertise and operational expertise within all of our different departments: within sales, within marketing, within product. And so now our data team is focused really heavily on enablement and on the development of new interdisciplinary data products. So finding ways to unite sales data and marketing data and product telemetry data into, for example, a single unified user activity stream, so that we can understand what the customer journey looks like. That's an example of something we're working on right now. And that's been great, because it's positioned us both as partners with the different stakeholders in each team, and also as independent experts who are creating custom data products that can accelerate the business's impact.
That's super interesting. So can you give us a little bit more color around how you unify the data? What kind of sources do you have? What are the challenges of unifying them? And where do you stand in terms of that? How mature do you think this product that you're describing is right now inside the company?
Yeah, I think one of the biggest challenges is building sustainability across the whole data supply chain. So, from the original source system, for example, Salesforce,
making sure that that data is really high quality. And then you've got the technical infrastructure that extracts it, loads it, transforms it into a custom data product that
we expose in Looker. That's a technical challenge. And then you have to train people on what that
new product looks like. So you've got to have the high quality source data
or the downstream product doesn't work
or isn't valuable.
And then you have to start repeating that process
across different domains.
And each time you do that, well, you guys have worked in data, so you know: you get excited, you build a proof of concept in two weeks, and then it's six months of ironing out the kinks and realizing, oh, this doesn't mean that, or there's this weird data quality thing here. So it's really about building out that supply chain. And then there's a really big element of team building and education as well. That is both exciting, and I really enjoy that aspect, but it's easy to forget about.
Yeah, absolutely, I totally agree with you. I mean,
we tend to forget how important the human factor is, because in the end, all these numbers and all this data are going to be interpreted by a human being, right? They have to make sense to the humans that are involved. And of course, you also have to train them. That's a very interesting topic, actually, and we, the people who work in technology, tend to forget about it. But that's another very, very interesting topic, which has to do with quality. You said that the first thing you have to do is ensure the quality of the data supply chain, and you mentioned Salesforce, so I think it's a very good example that we can discuss a little bit. What is quality? I mean, when you talk about the quality of the data, how do you define it? And how do you solve the problem of data quality in the pipelines and the systems that you're building?
That's a great question. I think of data quality in sort of the same way that I think of access controls, actually.
So access controls are basically agreements between people about who should get access to what kind of data. And I think data quality is in a similar state, where it's an agreement between the person providing the data and the person using the data, and maybe even further upstream, the person originally producing the data, about what certain things mean and what the expectations should be across that data product. So, you know,
we've recently embarked on a data quality project, so we've been thinking a lot about it, in fact. You could take one approach of adding data quality and schema tests to every single column when you build out the original data model, but that quickly leads to noise, and it becomes impossible to maintain, because things are firing all the time. What we're trying to do currently is define critical fields. So we define the key metrics that we want to back as a data science org, and then work backwards from there to identify what guarantees we need to make as an organization to make sure that that final product, that final number, is quality. And then we build visibility into that pipeline, so that people like my team can maintain it and identify quickly when something goes down, but also so other people can look in and understand whether the number they're seeing is actually correct or whether there are some known issues around it. But it all comes back to taking the time to identify what the most critical components are, what supports those components, and then what agreement we need to make with the business.
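As a small sketch of this critical-fields approach (with hypothetical field names, not Immuta's tooling), a handful of guarantees backing one key metric might look like this:

```python
from datetime import datetime, timedelta, timezone

# In-memory stand-ins for rows in a warehouse table backing one key metric.
rows = [
    {"customer_id": 1, "created_at": datetime.now(timezone.utc)},
    {"customer_id": 2, "created_at": datetime.now(timezone.utc) - timedelta(hours=3)},
]

def check_guarantees(rows):
    """A few guarantees on critical fields, rather than tests on every column."""
    failures = []
    ids = [r["customer_id"] for r in rows]
    if any(i is None for i in ids):
        failures.append("customer_id has nulls")                 # completeness
    if len(ids) != len(set(ids)):
        failures.append("customer_id has duplicates")            # uniqueness
    newest = max(r["created_at"] for r in rows)
    if datetime.now(timezone.utc) - newest > timedelta(days=1):
        failures.append("table looks stale")                     # freshness
    return failures

# Surfacing failures loudly is the visibility piece: the data team sees
# breaks quickly, and consumers can check whether a number is trustworthy.
print(check_guarantees(rows) or "all guarantees hold")
```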
That's a fascinating point, around agreement. In my past, we referred to that as sort of the end-all, be-all definition. And one example that keeps coming back is
I worked with a company who said, well, we need to track active users, right? And that sounds like a simple metric. It's just one metric, right? But when you started to ask people around the organization, what is the definition of an active user? You would get wildly different responses, for what seems, on face value, easy: let's just track active users, right? And then, okay, you start getting into it, and there are all sorts of edge cases, and it can cross different user actions that are difficult to track. There are all sorts of complications in there. And so "agreement" really resonated with me when you said it, because, unrelated to the pipeline or the actual data science work itself, the fundamental challenge of getting agreement is actually pretty formidable in a lot of organizations. Not because anyone's necessarily territorial, but because you just have to do a lot of work across teams to get to a shared definition.
Yep. And I've found that investment from executives and leadership is so key there, right? We couldn't be an effective data team without that investment, because it forces the question of: what does this number mean? What are we going to accept that it means? And also, what do we accept is not known, or not knowable, from this number? And that's hard. I think that is one of the things that people find very hard, because they look at the active users and it's like, well, I want to know all of the information about the active users. But as soon as you define it, you're also defining it in the negative: it's not this.
Sure, sure.
So some questions become off bounds.
One quick question, and this is just very practical. I'm just thinking about our own experience: I would say over the last two quarters, we went through a similar effort of, hey, let's just make sure that the numbers in marketing and sales are the same, right? And that we can agree upon all these numbers. A lot of our listeners are data engineers, or working in or related to data engineering. What was that effort like for you? Going through it ourselves, and having done it before, part of you wonders, man, does every other company have this sorted out? It seems like it's taking us forever to do this. And in reality, it's something that every company struggles with. So we'd just love some practical thoughts on your experience with that.
Yeah, I would definitely offer encouragement to anyone who's feeling discouraged by efforts like this. It has been a two-year rapid-growth experience for me. The amount of conversations, and, like you said, the time it takes to implement these things, is much longer than I expected. And I think a large part of that is trust. You have to build trust among people, and people have to trust the numbers. And if it's something new, there's an intrinsic skepticism. So for a couple of our more effective projects, I think the first thing we did was get a graph and start shopping it around to different stakeholders. You find a format that people are going to see over and over and over again, and you start shopping it around, and that starts to build familiarity. And then over time, as that graph is shared in meetings and such, that's when you start to build trust in it. And then you can start getting derived information from it. But as a data scientist, I know my instinct a lot of times is to put the big profiling dashboard together, with the 50 different ways we can slice this data model. And it's too much too soon in many cases. I think it's better to have one graph, then slice it along a couple of different facets, get people to build trust in that, and then kind of roll out new stuff. At least that's been my experience.
Yeah, it's really interesting to think about that. And we don't have time to get to it today, but as a consumer of the type of data that you're talking about (in our organization, I would be one of your internal customers), when I hear you describe it that way, what comes to mind to me,
and I don't know if I would have articulated it this way if you hadn't sort of given that
explanation, but I'm making decisions with the data, right? And so it really does take time for
me to sort of take, understand, and have enough confidence in a chart or a data set and make a decision on it and then
sort of, you know, get feedback on are the decisions I'm making based on this data better?
Are they helping, you know, the company? Are they helping my team? Are we progressing as a result of
this? And that really does take time to build trust, not necessarily because I don't trust you,
but, you know, because there's a lot on the line as I'm making decisions
with this data. And so I want to see that it will actually prove out to be producing results as I
use it in my job day to day.
So I had this experience during my PhD, where we analyzed brain images. These can be three-dimensional image volumes,
or they can be four-dimensional
or even five-dimensional volumes.
And coming in, I didn't have any experience with this type of data. It was totally brand new data to me.
It took me four years of working with brain data
day in, day out, running experiments and running processing on it, to really gain a lot of trust in that data and understand, at a sort of intuitive, deep level, what I was working with. When I saw a blob in this part of the brain, a statistically significant result in one part of the brain, I was like, oh, I trust that. I know what it means. And that's a situation where I could have a hundred percent trust that the data I was getting was correct, and where I had a hundred percent control over the data provenance. But becoming a user of that data was all about building trust and building intuition and building knowledge. And that process just takes so much time. I share that just because it's one of the few times where I had a totally unfamiliar data set and just had to build that intuition from the ground up. And, you know, it just takes a long time to trust.
That's actually super interesting. I mean, I was observing the conversation that the two of you had over these past few minutes, and in the end, I think it comes back to data governance again. Because if you think about it, working with data can be distilled, in the end, into one thing, and this is trust. It's trust in the data and trust among the people, right? And the understanding that people have around the data. And I think this is, let's say, the broader problem that data governance is trying to solve: how people can work with the data and trust the data, and also trust and communicate and come to an agreement among themselves about what this data is. I mean, I know that we said that Immuta is more around the access control around the data, but this is, I think, a very foundational part of building trust, both in your data and in the processes and the people that you have inside the company. And then on top of that, you can build other layers. You were talking about the definition of a KPI and how we understand it.
I don't know what plans Immuta has around the product; that's my next question. But from my experience, at least, a big part of data governance is also about how we can have a data catalog where we agree upon the definitions of the data that we track and the KPIs that we measure. And it's interesting, because these are problems that the large enterprises have been trying to solve for quite a while. But I think as more and more of the whole industry becomes data-driven, everyone will have to deal with these problems.
So in the end, we discussed so many different things, but I think all the stuff we were discussing was around data governance in the end. And having said that, my last question for you, Stephen: what's next for Immuta? I mean, you have solved, from what it seems, a very core problem around data governance, a very important one, and in a very elegant way. So what's next?
So we've had a lot of conversations around this. I think one of the cool things that I've gotten to experience at Immuta is that when we started two years ago, I didn't really see any other access controls, entitlements, and security startups that I would say were direct competitors. And we've started to see more of a movement in this space. And it's been really
exciting, because I think there's an acknowledgement that governance has to be part of the data development lifecycle. And so we are starting to look into some of the adjacent governance responsibilities. I think there's a really good article by Andreessen Horowitz's group on modern data architectures, and they define data governance in sort of four buckets. There's metadata management, which would be your enterprise data catalogs; entitlements and security, which would be what Immuta is currently doing; data quality; and then observability. And so data quality and observability are of high interest to us: really creating a centralized place for data engineers to understand what's going on in their data pipelines, and then exposing that to end users. That's an area of intense, I'll say, research interest right now, because I think it's a big gap.
As a data platform owner at Immuta, I'm managing a lot. I've got my GitHub repos with Terraform in them.
I've got a couple of AWS Lambda functions.
I've got Snowflake.
I've got an orchestrator, Stitch.
We've used a little bit of Fivetran.
I've got Looker.
I've got Immuta.
I have a bunch of tools. Each of the tools does what it does really well and makes my life better. But now I have to manage all of these different tools, and all of these tools create dependencies for the golden data products that I want to give to end users. So thinking about how we extend beyond data sharing agreements and go more into maybe data quality agreements, or adjacent spaces, that's really where our mind is at. And then, of course, improving the core experience of making access control simple, easy, and communicable. There's so much to do there.
Absolutely. I mean, it's a very foundational problem, as we said.
And so, of course, there's still a lot of space for improvement.
I'm really interested to see what's going to happen.
Stephen, it's very, very interesting what you described. And I'm also personally very interested in anything that has to do with data quality and observability for data. My feeling is that many things we take for granted as engineers when we develop code are missing right now when someone is working with data.
So I think
there's going to be very interesting times ahead of us
and very interesting products are going to
come into existence and
I'm very excited about it.
Thank you so much. It was a great time, and we really enjoyed the conversation with you. I'm looking forward to connecting in the future, seeing how things are going with Immuta and with you, and discussing more about data and the human brain again.
This was really great, guys. One quote to end on that I think is really relevant: I was talking to a colleague who
runs a data team and he said, when it comes to data governance, it just feels like there
are tons of wrong ways to do things, but not a really clear right way to do things right
now.
And that has stuck with me. And as a community, I'm just really excited to see how we grow in terms of sharing best practices, and also technologies that help us build sustainable pipelines that we can be really confident in.
Absolutely. Well, again, thank you for spending time with us. Thank you for teaching us both
about data governance and your work at Immuta, and also a little bit about how we can deal with
kids learning to read, which I know is very relevant for me right now. So appreciate that
from your background as well. And we'll catch up with you soon.
Awesome. Thanks, guys.
Well, that was fascinating, not least because I'm teaching my four-year-old son to read, working on letters and recognizing words. So it was really interesting to hear Stephen's take on that. But I think one of the things that I found most interesting,
and this is somewhat of a theme we've seen on the show, is that the technical problems with data are
absolutely fascinating, but they really sort of are secondary to getting alignment within an organization around data.
And that's a sort of a particular skill and particular endeavor on its own that, you know,
doesn't even necessarily in its early stages relate to the technology.
And I just found it really fascinating the way that Stephen talked about that dynamic
within Immuta and within organizations in general.
What stuck out to you, Kostas?
Absolutely, I totally agree with you. Working with data is not just a technological problem; it's an institutional problem that every company has to solve. I'm pretty sure that our listeners will notice how many times we used the word trust, right? And trust is a human characteristic. We need to trust our data. We need to trust our technology. And above all, we need to trust the teams that work with the data, and trust that we have a common understanding of how we interpret the data.
So I think this is a big part of what data governance is trying to solve. It's a very interesting problem. And as Stephen said, we are still at a stage where, for all the problems we're trying to solve around it, there are many bad ways to solve them, but we haven't figured out the good ways to solve them yet.
So it's very fascinating.
It's very exciting.
And I think over the next couple of months, or a year or so, we will see more and more companies and people trying to come up with interesting solutions to these problems. And of course, we'll see what Immuta is going to do. I mean, they started with access control to the data, and from what it seems, they do excellent work product-wise to solve this problem. But I'm pretty sure that they are also going to attack other problems around data governance.
So I'm very excited to see what's going to happen in the future.
Me too. Well, thanks again for joining us on The Data Stack Show.
As with many of our guests,
we'll check back in with Stephen and Immuta, maybe in six months' time or so,
and get updates on where they are
with the product
and what his team is up to.
We'll catch you next time.