The Data Stack Show - 19: Defining Data Governance with Stephen Bailey from Immuta
Episode Date: January 6, 2021

This week on The Data Stack Show, Kostas and Eric are joined by Stephen Bailey, Director of Applied Data Science at Immuta. Immuta is a startup that focuses on enabling data teams to have really fast, efficient, and understandable access controls on their data.

Highlights from this week's episode include:
The problem that Immuta solves (2:04)
Stephen's background researching how the brain works (4:56)
Immuta's stack (15:09)
Leveraging metadata (18:02)
The main use case for Immuta is simplifying the access control layer (20:06)
Unifying data (31:52)
Defining the quality of data (34:04)
Learning to trust the numbers (39:42)
What's next for Immuta (46:15)

The Data Stack Show is a weekly podcast powered by RudderStack. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.
Transcript
Welcome back to the Data Stack Show.
We have another fascinating guest for you, Stephen Bailey of Immuta.
He works on data governance inside of a company that has a product that does data governance.
So it's going to be really interesting to hear about potentially his own usage of the product in his work, but he also has a fascinating background in studying the human brain, which I hope we can talk with him about as well. Kostas, you are doing some data governance work in our own product right now. What questions do you have for Stephen that you're interested to ask about?
yeah absolutely i mean data governance in general is a very hot topic lately.
There are many things associated with it, from access control to the data, to data quality, data catalogs, metadata management, all that stuff that sounds a little bit too enterprise-y many times. But actually, the more we work with data, the more of a necessity they become.
And all of these are problems that we haven't solved yet.
So it's very interesting to have a company that is trying to solve this problem.
So yeah, there are plenty of questions around how they do it, why they do it, what the use
cases are, and how they approach, in general, the actual definition of what data governance
is.
So I think we are going to have a very interesting discussion and a very useful one for anyone
that works with data today.
I agree.
Well, let's dive in and talk with Stephen.
Today we have Stephen Bailey from Immuta.
Stephen, thank you so much for joining us.
Thank you all.
I'm excited to be here and chat through some interesting data governance and privacy topics.
Well, that's a subject that we love. But before we get going, you have such an interesting background with a variety of different experiences. We'd love to get a quick overview of your background and what led you to Immuta, and then also just give us an overview of what Immuta does.
What problem are you solving?
Sure, I'd be happy to.
So I have always been interested in a wide variety of things. In college, I did a chemistry and philosophy major and really enjoyed digging into history and literature and intellectual ideas and bandying those about. But when it came time to get a job, I actually started in education, working in business operations for an education nonprofit. Then, through a series of events, I went and got my PhD in cognitive neuroscience and investigated how kids learn to read and how the brain changes as kids grow from four to five to 15 to 55. What I found throughout that journey was that I just really loved working with data. I loved asking questions. I loved figuring out what is valuable and what is not. And even the process of managing data itself: there are endless opportunities to optimize and change and improve things. And I just really fell in love with it.
So as I was finishing up my PhD, I started looking for data science jobs, found Immuta,
and it was just a perfect fit.
Immuta is a startup that focuses on enabling data teams to have really fast, efficient, and understandable
access controls on their data. And we use the word governance in most of our marketing materials,
but really it's all about enabling more efficient and more responsible access control. So technically, the way we work is we either sit in front of your database and mediate access to data, enforcing fine-grained access controls like masking and row-level security directly, or we have plugins, essentially, that sit on the database systems themselves and can enforce access controls natively in the system. So these are for technologies like Databricks and Snowflake, the cloud-native technologies. What's really exciting to me, as someone on the team who works on and leads our internal analytics efforts, is that access controls, data quality, data governance: this is really the place where data engineering meets data science meets the business requirements. And all these people have to come to the same place. And it's very much a not-solved problem. I think there are as many ways to define data governance, and to define what good data governance looks like, as there are companies that are using data. And so it's just a really, really rich place to innovate in.
Well, I know we have tons of thoughts and questions around data governance, and we'd love to even discuss the different definitions for that word, because as you said, data control, data governance, data access: there are overlapping components in those definitions. But before we get into that, I just have to ask this question, because I know from researching you that you have young kids, and you did a PhD in understanding how kids learn to read. So I would love to know about your experience studying that at a doctorate level and then seeing your own kids learn to read and being part of that process. Is there anything interesting you can share from that experience of studying it from an academic, data-driven perspective, and then your own experience actually doing that with your own kids?
Oh man, that is such a good question. And the reason I love it is because it really
showed me that the experience of studying cognitive neuroscience, and specifically how the brain rewires itself when you're learning to read... The brain takes visual circuitry and auditory circuitry and semantic association circuitry and makes a super efficient connection between those different systems in order to enable you to read rapidly and automatically. And that happens through practice, practice, practice, practice, practice. And you can actually observe this happening in the brain. That's what my lab was focused heavily on doing, using functional MRI. But, you know, I spent five years learning the techniques to manipulate medical images and do these group analyses and clean all of the data and all this stuff. And then it really teaches you nothing about actually teaching kids how to read.
We're in the middle of it. I shouldn't say it teaches you nothing, but it doesn't prepare you for the experience of actually teaching a child to read. So I think there are some principles that you can get. Rewiring the brain takes practice, practice, practice. It takes attention. So it's not just about the amount of hours; you've got to have good hours. The kids have to be focused, right?
Like quality versus quantity. It's not just brute force.
Yep. And you've got to scaffold what you're learning. So you're learning a bunch of skills that can kind of be learned independently, and then you've got to learn to associate them together, and then you've got to practice. And then the other piece is the emotional piece. The more kids like to read and enjoy reading with you, the more open they'll be to additional practice, which leads to more neural refinement. So you can reduce the equation, so to speak, to some very dry variables from a scientific perspective. But when it comes to actually raising a kid who loves to read, you have to embrace the human elements of creating an environment where they enjoy it, and finding books that they like. And all of these pieces are super important. So there's both the scientific question, and then there's the human
question that you have to take into account in practice.
Fascinating. And I would argue, and I'm monopolizing here, so I want Kostas to jump in, because I know he's honestly dealt with a huge number of data governance issues, but it's interesting: in many ways, I would say the same principles apply even to data within an organization, where having clean data and focusing on a process is one thing, but you have real teams using real data, which is messy. And when the rubber meets the road in a fast-moving company, it's a little bit of a different game.
Yeah. But actually, I have a question similar to your question, Eric, before we move on. So Stephen, you said that you studied how kids learn, but you also tried to figure out how this happens in later stages of a kid's growth. You mentioned some stuff earlier about emotion and attention. Are these things that, how to say it, are still important in later stages of our lives? For example, how important are these for a person at my age or your age? Because we keep learning, right? It's not like we stop learning at some point. Maybe not as rapidly or as efficiently as a kid can, but learning is something that continues through our life. So how do these things change as you grow up and get older?
That's another good question. And I'm really loving going
back to this brain stuff, because I haven't talked about this since I graduated, so this is a breath of fresh air for me. This is awesome. In developmental neuroscience, there's what they call critical periods, where children or adolescents are particularly disposed to gain new skills. They can really just soak it up. If you see a child learning language when they're between two and six, they just ambiently pull it all in, and it just kind of takes shape. What happens during those critical periods that doesn't happen when you get older is that your brain is particularly plastic. It is actually going through and quickly disposing of connections that aren't as useful. So you have what's called pruning that happens. And as you get older, you sort of settle into a very efficient pattern.
So I would say like the general model that you can think of is when you're young, you're
very disposed to create new connections very quickly.
But as you get older, your brain basically figures out what are the most efficient paths
for what I need to do.
And it becomes more efficient and automatic
at doing those things. Now, what's cool about the brain and why everyone loves studying it is you
can change that as you get older. Right now, for example, I'm learning guitar, and I'm going from zero to trying to be able to play at least one song, right? And it's very challenging. It would be very challenging if I were eight years old.
But as an adult, I have a lot more awareness
and I know how to structure my practice in an effective way.
So I'm not worried about like not being able to learn that thing.
It's just that it's probably going to take me a little more time,
focus, and practice, and some structure and discipline around the way I'm doing it, to really be super effective at that.
Yeah, and probably you're also much better at controlling your emotions, something that kids need someone external to take care of. Actually, I found it very interesting that you added the concept of emotion to the learning process. That's very, very fascinating. But I think we need to arrange another recording just to discuss that stuff.
I know, I could go all day, because this is so fascinating. Emotion, just one last thing, and this will be a bridge to some data stuff. Anyone who studies the brain hopefully gets a little offended when people link neural networks, AI, directly to the brain. There's so much to what the body does that supports brain functioning that is just totally not even part of the conversation when many people talk about that relationship between neural networks and the brain. Hormones, cortisol, attention, emotion, even sensations from your body: all of these things are super important for brain functioning and brain processing, and there's just no real analog for them in data, computer science, neural networks.
Yeah, absolutely.
And I totally understand.
That was something that I was thinking about while you were talking about attention and emotion, because, for example, one big thing right now in all the neural network research that's going on is how to use attention, as it's called. Of course, attention in this context is much different from what attention probably is in the human brain. We keep trying to find some kind of parallels between how the human brain works and how these computational models work. So when you talked about emotions, that was the point where I couldn't help but say: okay, is this the next thing after attention? Are we going to try to put emotions in the neural networks too? But anyway, these are things that I think we need a lot of time to chat about, and we should probably arrange another call to do that. So yeah, let's move forward with talking a little bit more about your role at Immuta right now. What I wanted to ask you, and what I find quite interesting in your case, is that you have a data-related role inside a company that also builds a product around data, right? And I assume, and this is something that I would really like to find out during our conversation, that data governance is something that affects the lives and the work of data scientists and data analysts too.
So how do you use that internally?
What's BI and data analytics for Immuta, first of all?
How do you use it?
Is it for product?
Is it for business decisions?
And also, how are the principles and the concepts of Immuta used internally? Can you give us a little bit more information around that?
Sure, let me break this into two responses. The first is that we can talk a little bit about the technical responsibilities and stack, and then maybe about the organizational piece, because I think both are very, very relevant. We're heavy believers in dogfooding our own product.
And so one of the first things I did when we started building out our internal infrastructure for analytics was to get our product between our database and our analytics tool of choice. Our current stack should be pretty familiar to anyone who's heard of the modern data stack, as it seems to be called now: it's basically Stitch to Snowflake to Immuta to Looker. That forms the core. We also use Argo, which is a Kubernetes-native container orchestrator, for orchestrating jobs.
But it's a pretty standard setup for a small company.
So Immuta's role, which I think is really the interesting piece here, is as an arbiter of access control, but also as a place to land and focus our metadata management.
So we have job information coming in, and raw data coming in from Stitch and from some custom taps that we run. We have metadata about dbt and the models that we're building in dbt.
We have metadata about DBT and the models that we're building in DBT.
We have usage data from Snowflake.
And what we want to use Immuta for internally is to aggregate especially governance-related
data, such as where personal information is stored, who should have access to data, identity management concerns,
and to have Immuta push that to our consuming services, whether data scientists are accessing
data in Snowflake or in Looker. We're basically trying to build out a centralized governance
or access control capability there.
So from what I understand, with Immuta right now you have two main components and two main functions. One is the aggregation and management of metadata. And the other one is access control, which probably also needs the metadata in order to be implemented. Is this correct? Do I understand it correctly?
Yeah, that's correct.
So how is this metadata defined? Say you, as a data scientist, have to start implementing a new pipeline for your data. You have a new project. What is this metadata? How does it come into existence? And how, in the end, do you use Immuta to store this metadata and to use it outside of access management as well?
That's another good question. So the metadata
that we leverage in Immuta is all built around enforcement policies. So it tends to be much
simpler than the massive amounts of metadata you could associate with an individual
data set or pipeline. In particular, we want to define a minimal set of tags that are
related to any actions that are going to drive a decision about who has access to what data
for what reason. And so it basically boils down to three things, user attributes, data attributes,
and contextual attributes, like accessing data for a certain purpose.
You know, these are all elements of attribute-based access control, which a lot of companies implement. But what we've found in working with companies and employing Immuta internally is that you really have to take a step back at the beginning of building out your data warehouse and define your hard requirements: what data needs to be tagged, who should have access to what data, and for what reasons. And so at Immuta, we have a pretty transparent
organization around data, but we still have heavy requirements around making sure that any data that
comes in, we identify whether it has personal information in it, whether it has privileged information in it,
such as, you know, like employee salaries, for example, and making sure we're tracking
that as it propagates along the data modeling layer.
And then enforcing access control in our database system.
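To make that concrete, here is a minimal sketch of an attribute-based access decision in Python. The class names, tags, and purposes are hypothetical illustrations, not Immuta's actual API; the point is just that one decision combines user attributes, data attributes, and contextual attributes like purpose.

```python
from dataclasses import dataclass, field

@dataclass
class User:
    name: str
    attributes: set = field(default_factory=set)   # user attributes, e.g. {"department:hr"}

@dataclass
class Dataset:
    name: str
    tags: set = field(default_factory=set)         # data attributes, e.g. {"pii", "salary"}

def can_access(user: User, dataset: Dataset, purpose: str) -> bool:
    """Combine user, data, and contextual attributes into one decision."""
    if "salary" in dataset.tags:
        # Privileged data: HR only, and only for an approved purpose.
        return "department:hr" in user.attributes and purpose == "compensation_review"
    if "pii" in dataset.tags:
        # Personal data: any user, but only for an approved purpose.
        return purpose in {"fraud_investigation", "customer_support"}
    return True  # untagged data is open by default in this sketch

analyst = User("ada", {"department:analytics"})
salaries = Dataset("employee_salaries", {"pii", "salary"})
print(can_access(analyst, salaries, "curiosity"))             # False
print(can_access(User("hal", {"department:hr"}), salaries,
                 "compensation_review"))                      # True
```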
So we were discussing just now how you are using Immuta internally, and we used that to, let's say, describe a very important use case for how the product is used. Is this the main use case that you see, or have you seen people deploying Immuta in different ways, trying to address other problems beyond the things that you mentioned already?
The main use case for Immuta is simplifying that access control layer and uniting different systems with the same identity and access control.
In particular, one of the core innovations in our product, I think, is a global policy builder that's quite human-comprehensible. So if you're familiar with AWS IAM policies, you know how hard those can be to comprehend. Immuta makes it very easy to create a policy that a compliance person or a data access person or a data engineer can understand, and then apply it across any data set that's tagged in a certain way. It was one of our core bets when the product was originally built: to do data governance better, we have to have better communication channels around our data, and understand, if I'm a data scientist and I can't get access to data, why? And what attributes do I need to get access to it? If I'm a compliance person, what is actually being implemented in Snowflake, and who has access to it?
So that's definitely the main use case. And what is great about attribute-based access control, and particularly policy-based access control that's a little more human-understandable, is that it can take a ton of policies that might be in effect on a database down to a single policy in some cases, or a couple of policies in many cases.
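For a sense of why this collapses policy sprawl, here is a hedged sketch of what a tag-driven global policy could look like as plain data. The structure and field names are invented for illustration and are not Immuta's policy language; the idea is that one human-readable rule fans out to every dataset carrying a tag.

```python
# One human-readable rule applies to every dataset carrying a tag,
# collapsing many per-table grants into a single global policy.
GLOBAL_POLICIES = [
    {"name": "Mask personal information",
     "where_tag": "pii",
     "action": "mask_columns",
     "unless_user_has": "role:privacy_officer"},
    {"name": "Salary data is HR-only",
     "where_tag": "salary",
     "action": "deny",
     "unless_user_has": "department:hr"},
]

def policies_for(dataset_tags):
    """Return every global policy triggered by a dataset's tags."""
    return [p for p in GLOBAL_POLICIES if p["where_tag"] in dataset_tags]

# Any new dataset tagged "pii" picks up the masking rule automatically:
for policy in policies_for({"pii", "salary"}):
    print(policy["name"])
```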
Oh, okay.
That's great.
Eric, sorry.
Well, actually, I think you answered part of my question.
I was going to ask in what ways, and I know it varies with the complexity of the stack and the size of the organization, and even probably the industry and type of data, but you mentioned AWS IAM policies. Is that the primary way that people are solving this if they're not using Immuta or a similar tool? Or, I guess, what are the ways that people are experiencing the pain that you solve, and how are they trying to solve it outside of Immuta?
I think to answer that question, you really have to ask: who are you talking about, and where in the pipeline are you talking about? Because take even a very simple pipeline like ours. We have to manage data access in Stitch. We have to manage it in the raw tables in the database. We need to manage it in the Immuta-sanctioned part of the database. We need to manage any consuming application. So if you expose it in Looker, are you using a system user that has global access to the Snowflake data? If a data scientist comes in and wants to stand up some infrastructure of their own, how are you managing access to that? So I think there are two real issues.
One is there's just a huge proliferation of where data can be within an
organization.
And then the second issue is that no one knows the answer to any question of who should have what data. That's really problematic.
I think a lot of times, well, I won't say a lot of times, I've been in organizations where there are some documents that exist somewhere, on someone's computer or in some shared drive, about what a data policy is. But then, in effect, no one who's on the front lines knows what that policy really is. And so if someone asks for data, they just get data, or they might ask for data and no one knows how to get them the data. So I think having clarity around how data should be used, and also, of course, knowing where it is: those are the two biggest pain points that companies are facing.
Yeah, absolutely.
No, I think it's very, very interesting to think about various levels of access at various points in the pipeline and sort of the points where you do need some sort of governance around access.
One more specific question,
and then I'll hand it back over to Kostas.
But so in your pipeline,
you said that you go from Stitch
and some other sources into Snowflake
to Immuta to Looker.
So is Immuta actually sort of sitting
between Snowflake and Looker?
I ask because we leverage Looker
on top of Snowflake as well.
And just as a user of that particular piece of the stack,
I'm interested in what it's like
to insert Immuta into that equation
and what it's like to interact with Looker
running on Immuta,
if that's actually what you meant by how it works.
Yeah, so our Snowflake integration
and our Databricks integration
are what we call native workspaces,
which means Immuta sits behind the scenes
and actually creates views or secure views
of your data within Snowflake
so that your Looker would still be pointing to Snowflake.
And so what we have internally,
which is really actually a pretty neat experience,
is Google single sign-on to Immuta, to Snowflake, to Looker.
And so there's one identity.
People don't have to know any passwords
except for their Google password.
And Immuta is enforcing access controls,
whether they're row-level security or column-level masking
or just subscription-level access on the Snowflake account, without anybody ever even having to log into Immuta or change where they're pointing Looker. Now, in other cases, for example, we started out on Redshift. In that case, Immuta does act as a proxy, and so you'd be accessing your Redshift data through Immuta, and Looker would actually be pointing to Immuta's Postgres proxy engine. But the Snowflake integration is very cool, because you can create different warehouses and everyone accesses the data through the public role, but they have individualized access controls applied. So it really eliminates some role management issues that you might have if you're trying to do dynamic access controls in Looker.
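As a rough illustration of the mechanism described here (a sketch using invented table and entitlement names, not Immuta's actual implementation), a Snowflake secure view can filter rows based on the querying user, so the BI tool keeps pointing at Snowflake while the access decision happens underneath:

```python
import snowflake.connector  # pip install snowflake-connector-python

# Illustrative only: the view, table, and entitlement names are made up,
# and real Immuta-generated views will differ. The mechanism is a secure
# view that filters rows per user, so Looker keeps pointing at Snowflake.
SECURE_VIEW_SQL = """
CREATE OR REPLACE SECURE VIEW analytics.protected.orders AS
SELECT o.order_id, o.region, o.amount
FROM analytics.raw.orders o
WHERE EXISTS (
    -- row-level security: keep only rows this user is entitled to see
    SELECT 1
    FROM analytics.governance.user_entitlements e
    WHERE e.user_name = CURRENT_USER()
      AND e.region = o.region
);
"""

conn = snowflake.connector.connect(
    user="PLATFORM_ADMIN",      # placeholders; substitute real credentials
    password="...",
    account="your_account",
)
conn.cursor().execute(SECURE_VIEW_SQL)
```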
That's very cool. Very, very cool.
Yeah, that's amazing, especially when we are talking about managing access to many different products and tools. We already mentioned at least two, right? We have the database itself, and then we have the various different BI tools that are used there. So that's super cool, what you are doing there, Stephen.
Who is responsible for these policies? Who usually has the role of creating these policies in Immuta? Who is the user of Immuta?
This is a question where the answer varies depending on who you're talking to. And I think it also varies heavily with the size of the organization. At a small startup, speaking from experience, what I found is that the person who owns the data platform is the one who knows the most about the data. He or she knows where the data is most sensitive. And they're also the ones actually enforcing the policies for real, right? So if there is no centralized policy defined, then whatever the database policies are, that's the actual policy being implemented for that company.
But in larger organizations, you might have compliance organizations that have standards, and there's someone whose job it is to make sure that warehouses or data assets are up to that standard. What's challenging in that scenario is that data changes so fast. I mean, it changes all the time.
And so if the person who's owning the data platform
and actually releasing the data to people
isn't the person who's most on top of the policies
and maybe even defining the policies,
then it gets out of date.
You know, whatever that downstream organization has gets out of date, or it takes additional time to release a data product.
Whereas if you have the data platform team owning it, they're making sure that the data is up to snuff before they release it. It's almost like a CI/CD process for releasing data, or for data governance. And that's in some ways where I think the future is. It's sort of how it works now for Immuta users: when you create a pull request against your data warehouse, as long as you have the right metadata attributes on it, and you've put those metadata attributes in Immuta, then as soon as that data is released to end users, the correct policies will be applied. And you've already defined those policies up front in the first place. So it makes it easy for you to have one big initiative to define all your policies, and then just be confident that those policies are applied as you add new data sets.
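To picture the "CI/CD for data governance" idea, here is a minimal sketch of a pull-request check, assuming a dbt-style schema.yml with a meta block per model. The file path and the required keys are hypothetical:

```python
import sys
import yaml  # pip install pyyaml

# Hypothetical file layout and required keys: the idea is a pull-request
# check that fails the build when a model is missing the governance
# metadata that drives downstream policies.
REQUIRED_META = {"contains_pii", "owner"}

def check_models(schema_file: str) -> list:
    """Return one error per model whose meta block lacks a required key."""
    with open(schema_file) as f:
        schema = yaml.safe_load(f)
    errors = []
    for model in schema.get("models", []):
        missing = REQUIRED_META - set(model.get("meta", {}))
        if missing:
            errors.append(f"{model['name']}: missing governance keys {sorted(missing)}")
    return errors

if __name__ == "__main__":
    problems = check_models("models/schema.yml")
    print("\n".join(problems) or "all models carry governance metadata")
    sys.exit(1 if problems else 0)  # a non-zero exit blocks the merge
```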
That's very cool.
All right.
I think the product itself has monopolized a little bit of our discussion, which, okay, makes sense, because it's pretty interesting. And the kind of approach that you have is very interesting too, like what you said about the CI/CD.
But let's talk also a little bit more about your role inside Immuta.
So what is your team doing?
And what are the products that you are delivering?
Great question.
So I talked a little bit about my background.
When I started at Immuta, I was a data scientist.
I came in focused on doing some ad hoc data science projects, looking at performance considerations or doing, you know, maybe customer segmentation and things like that. As I pivoted more towards managing infrastructure and building a data platform for the organization, for downstream users, we went through that data maturity cycle of starting from, hey, let's just get some basic counts that everyone can agree on: count of customers, count of opportunities, count of these basic things, and getting consensus around that.
So that's where we started.
And then, where we've been going as we've started growing, is that we've been building all of this great analytics expertise and operational expertise within all of our different departments: within sales, within marketing, within product. And so now our data team is focused really heavily on enablement and on the development of new interdisciplinary data products. So finding ways to unite sales data and marketing data and product telemetry data into, for example, a single unified user activity stream, so that we can understand what the customer journey looks like. That's an example of something we're working on right now. And that's been great, because it's positioned us both as partners with the different stakeholders in each team, and also as independent experts who are creating custom data products that can accelerate the business's impact.
That's super interesting. So can you give us a little bit more color around how you unify the data? What kind of sources do you have? What are the challenges of unifying them? And where do you stand in terms of that? How mature do you think this product that you're describing is right now inside the company?
Yeah, I think one of the biggest challenges is building sustainability across the whole data supply chain. So, from the original source system, for example, Salesforce,
making sure that that data is really high quality. And then you've got the technical infrastructure that extracts it, loads it, transforms it into a custom data product that
we expose in Looker. That's a technical challenge. And then you have to train people on what that
new product looks like. So you've got to have the high quality source data
or the downstream product doesn't work
or isn't valuable.
And then you have to start repeating that process
across different domains.
And each time you do that, well, you guys have worked in data, so you know: you get excited, you build a proof of concept in two weeks, and then it's six months of ironing out the kinks and realizing, oh, this doesn't mean that, or there's this weird data quality thing here. So it's really about building out that supply chain. And then there's a really big element of team building and education as well. That is both exciting, and I really enjoy that aspect, but it's easy to forget about.
Yeah, absolutely, I totally agree with you. I mean,
we tend to forget how important the human factor is, because in the end, all these numbers and all this data are going to be interpreted by a human being, right? They have to make sense to the humans that are involved. And of course, you also have to train them. That's a very interesting topic, actually, and we, the people who work in technology, tend to forget about it. But that's another very, very interesting topic, which has to do with quality. You said that the first thing you have to do is ensure the quality of the data supply chain, and you mentioned Salesforce, so I think it's a very good example that we can discuss a little bit. What is quality? I mean, when you talk about the quality of the data, how do you define it? And how do you solve the problem of data quality in the pipelines and the systems that you're building?
That's a great question. I think of data quality in sort of the same way that I think of access controls, actually.
So access controls are basically agreements between people about who should get access to what kind of data. And I think data quality is in a similar state, where it's an agreement between the person providing the data and the person using the data, and maybe even further upstream, the person originally producing the data, about what certain things mean and what the expectations should be across that data product. So, you know,
we've recently embarked on a data quality project, so we've been thinking a lot about it, in fact. You could take one approach of adding data quality and schema tests to every single column when you build out the original data model, but that quickly leads to noise, and it becomes impossible to maintain, because things are firing all the time. What we're trying to do currently is define critical fields. So we define the key metrics that we want to back as a data science org, and then work backwards from there to identify what guarantees we need to make as an organization to make sure that that final product, that final number, is quality. And then we build visibility into that pipeline, so that people like my team can maintain it and identify quickly when something goes down, but also so other people can look in and understand whether the number they're seeing is actually correct or whether there are some known issues around it. But it all comes back to taking the time to identify what the most critical components are, what supports those components, and then what agreement we need to make with the business.
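As a small sketch of this critical-fields approach (with hypothetical field names, not Immuta's tooling), a handful of guarantees backing one key metric might look like this:

```python
from datetime import datetime, timedelta, timezone

# In-memory stand-ins for rows in a warehouse table backing one key metric.
rows = [
    {"customer_id": 1, "created_at": datetime.now(timezone.utc)},
    {"customer_id": 2, "created_at": datetime.now(timezone.utc) - timedelta(hours=3)},
]

def check_guarantees(rows):
    """A few guarantees on critical fields, rather than tests on every column."""
    failures = []
    ids = [r["customer_id"] for r in rows]
    if any(i is None for i in ids):
        failures.append("customer_id has nulls")                 # completeness
    if len(ids) != len(set(ids)):
        failures.append("customer_id has duplicates")            # uniqueness
    newest = max(r["created_at"] for r in rows)
    if datetime.now(timezone.utc) - newest > timedelta(days=1):
        failures.append("table looks stale")                     # freshness
    return failures

# Surfacing failures loudly is the visibility piece: the data team sees
# breaks quickly, and consumers can check whether a number is trustworthy.
print(check_guarantees(rows) or "all guarantees hold")
```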
That's a fascinating point, around agreement. In my past, we referred to that as sort of the end-all, be-all definition. And one example that keeps coming back is
I worked with a company who said, well, we need to track active users, right? And that sounds like a simple metric. It's just one metric, right? But when you started to ask people around the organization, what is the definition of an active user? You would get wildly different responses, for what seems, on face value, easy: let's just track active users, right? And then, okay, you start getting into it, and there are all sorts of edge cases, and it can cross different user actions that are difficult to track. There are all sorts of complications in there. And so "agreement" really resonated with me when you said it, because, unrelated to the pipeline or the actual data science work itself, the fundamental challenge of getting agreement is actually pretty formidable in a lot of organizations. Not because anyone's necessarily territorial, but because you just have to do a lot of work across teams to get to a shared definition.
Yep. And I've found that investment from executives and leadership is so key there, right? We couldn't be an effective data team without that investment, because it forces the question of: what does this number mean? What are we going to accept that it means? And also, what do we accept is not known, or not knowable, from this number? And that's hard. I think that is one of the things that people find very hard, because they look at the active users and it's like, well, I want to know all of the information about the active users. But as soon as you define it, you're also defining it in the negative: it's not this.
Sure, sure.
So some questions become off bounds.
One quick question, and this is just very practical. I'm just thinking about our own experience: I would say over the last two quarters, we went through a similar effort of, hey, let's just make sure that the numbers in marketing and sales are the same, right? And that we can agree upon all these numbers. A lot of our listeners are data engineers, or working in or related to data engineering. What was that effort like for you? Going through it ourselves, and having done it before, part of you wonders, man, does every other company have this sorted out? It seems like it's taking us forever to do this. And in reality, it's something that every company struggles with. So we'd just love some practical thoughts on your experience with that.
Yeah, I would definitely offer encouragement to anyone who's feeling discouraged by efforts like this. It has been a two-year rapid-growth experience for me. The amount of conversations, and, like you said, the time it takes to implement these things, is much longer than I expected. And I think a large part of that is trust. You have to build trust among people, and people have to trust the numbers. And if it's something new, there's an intrinsic skepticism. So for a couple of our more effective projects, I think the first thing we did was get a graph and start shopping it around to different stakeholders. You find a format that people are going to see over and over and over again, and you start shopping it around, and that starts to build familiarity. And then over time, as that graph is shared in meetings and such, that's when you start to build trust in it. And then you can start getting derived information from it. But as a data scientist, I know my instinct a lot of times is to put the big profiling dashboard together, with the 50 different ways we can slice this data model. And it's too much too soon in many cases. I think it's better to have one graph, then slice it along a couple of different facets, get people to build trust in that, and then kind of roll out new stuff. At least that's been my experience.
Yeah, it's really interesting to think about that. And we don't have time to get to it today, but as a consumer of the type of data that you're talking about (in our organization, I would be one of your internal customers), when I hear you describe it that way, what comes to mind to me,
and I don't know if I would have articulated it this way if you hadn't sort of given that
explanation, but I'm making decisions with the data, right? And so it really does take time for
me to sort of take, understand, and have enough confidence in a chart or a data set and make a decision on it and then
sort of, you know, get feedback on are the decisions I'm making based on this data better?
Are they helping, you know, the company? Are they helping my team? Are we progressing as a result of
this? And that really does take time to build trust, not necessarily because I don't trust you,
but, you know, because there's a lot on the line as I'm making decisions
with this data. And so I want to see that it will actually prove out to be producing results as I
use it in my job day to day.
So I had this experience during my PhD, where we analyzed brain images. These can be three-dimensional image volumes,
or they can be four-dimensional
or even five-dimensional volumes.
And coming in, I didn't have any experience with this type of data. It was totally brand new data to me.
It took me four years of working with brain data
day in, day out, running experiments and running processing on it, to really gain a lot of trust in that data and understand, at a sort of intuitive, deep level, what I was working with. When I saw a blob in this part of the brain, a statistically significant result in one part of the brain, I was like, oh, I trust that. I know what it means. And that's a situation where I could have a hundred percent trust that the data I was getting was correct, and where I had a hundred percent control over the data provenance. But becoming a user of that data was all about building trust and building intuition and building knowledge. And that process just takes so much time. I share that just because it's one of the few times where I had a totally unfamiliar data set and just had to build that intuition from the ground up. And, you know, it just takes a long time to trust.
That's actually super interesting. I mean, I was observing the conversation that the two of you had over these past few minutes, and in the end, I think it comes back to data governance again. Because if you think about it, working with data can be distilled, in the end, into one thing, and this is trust. It's trust in the data and trust among the people, right? And the understanding that people have around the data. And I think this is, let's say, the broader problem that data governance is trying to solve: how people can work with the data and trust the data, and also trust and communicate and come to an agreement among themselves about what this data is. I mean, I know that we said that Immuta is more around the access control around the data, but this is, I think, a very foundational part of building trust, both in your data and in the processes and the people that you have inside the company. And then on top of that, you can build other layers. You were talking about the definition of a KPI and how we understand it.
I don't know what plans Immuta has around the product; that's my next question. But from my experience, at least, a big part of data governance is also about how we can have a data catalog where we agree upon the definitions of the data that we track and the KPIs that we measure. And it's interesting, because these are problems that the large enterprises have been trying to solve for quite a while. But I think as more and more of the whole industry becomes data-driven, everyone will have to deal with these problems.
So in the end, we discussed so many different things, but I think all the stuff we were discussing was around data governance in the end. And having said that, my last question for you, Stephen: what's next for Immuta? I mean, you have solved, from what it seems, a very core problem around data governance, a very important one, and in a very elegant way. So what's next?
So we've had a lot of conversations around this. I think one of the cool things that I've gotten to experience at Immuta is that when we started two years ago, I didn't really see any other access controls, entitlements, and security startups that I would say were direct competitors. And we've started to see more of a movement in this space. And it's been really
exciting, because I think there's an acknowledgement that governance has to be part of the data development lifecycle. And so we are starting to look into some of the adjacent governance responsibilities. I think there's a really good article by Andreessen Horowitz's group on modern data architectures, and they define data governance in sort of four buckets. There's metadata management, which would be your enterprise data catalogs; entitlements and security, which would be what Immuta is currently doing; data quality; and then observability. And so data quality and observability are of high interest to us: really creating a centralized place for data engineers to understand what's going on in their data pipelines, and then exposing that to end users. That's an area of intense, I'll say, research interest right now, because I think it's a big gap.
As a data platform owner at Immuta, I'm managing a lot. I've got my GitHub repos with Terraform in them.
I've got a couple of AWS Lambda functions.
I've got Snowflake.
I've got an orchestrator, Stitch.
We've used a little bit of Fivetran.
I've got Looker.
I've got Immuta.
I have a bunch of tools. Each of the tools does what it does really well and makes my life better. But now I have to manage all of these different tools, and all of these tools create dependencies for the golden data products that I want to give to end users. So thinking about how we extend beyond data sharing agreements and go more into maybe data quality agreements, or adjacent spaces, that's really where our mind is at. And then, of course, improving the core experience of making access control simple, easy, and communicable. There's so much to do there.
Absolutely. I mean, it's a very foundational problem, as we said.
And so, of course, there's still a lot of space for improvement.
I'm really interested to see what's going to happen.
Stephen, it's very, very interesting what you described. And I'm also personally very interested in anything that has to do with data quality and observability for data. My feeling is that many things we take for granted as engineers when we develop code are missing right now when someone is working with data.
So I think
there's going to be very interesting times ahead of us
and very interesting products are going to
come into existence and
I'm very excited about it.
Thank you so much. It was a great time, and we really enjoyed the conversation with you. I'm looking forward to connecting in the future, seeing how things are going with Immuta and with you, and discussing more about data and the human brain again.
This was really great, guys. One quote to end on that I think is really relevant: I was talking to a colleague who
runs a data team and he said, when it comes to data governance, it just feels like there
are tons of wrong ways to do things, but not a really clear right way to do things right
now.
And that has stuck with me. And as a community, I'm just really excited to see how we grow in terms of sharing best practices, and also technologies that help us build sustainable pipelines that we can be really confident in.
Absolutely. Well, again, thank you for spending time with us. Thank you for teaching us both
about data governance and your work at Immuta, and also a little bit about how we can deal with
kids learning to read, which I know is very relevant for me right now. So appreciate that
from your background as well. And we'll catch up with you soon.
Awesome. Thanks, guys.
Well, that was fascinating, not least because I'm teaching my four-year-old son to read, working on letters and recognizing words. So it was really interesting to hear Stephen's take on that. But I think one of the things that I found most interesting,
and this is somewhat of a theme we've seen on the show, is that the technical problems with data are
absolutely fascinating, but they really sort of are secondary to getting alignment within an organization around data.
And that's a sort of a particular skill and particular endeavor on its own that, you know,
doesn't even necessarily in its early stages relate to the technology.
And I just found it really fascinating the way that Stephen talked about that dynamic
within Immuta and within organizations in general.
What stuck out to you, Kostas?
Absolutely, I totally agree with you. Working with data is not just a technological problem; it's an institutional problem that every company has to solve. I'm pretty sure that our listeners will notice how many times we used the word trust, right? And trust is a human characteristic. We need to trust our data. We need to trust our technology. And above all, we need to trust the teams that work with the data, and trust that we have a common understanding of how we interpret the data.
So I think this is a big part of what data governance is trying to solve. It's a very interesting problem. And as Stephen said, we are still at a stage where, for all the problems we're trying to solve around it, there are many bad ways to solve them, but we haven't figured out the good ways to solve them yet.
So it's very fascinating.
It's very exciting.
And I think over the next couple of months, or a year or so, we will see more and more companies and people trying to come up with interesting solutions to these problems. And of course, we'll see what Immuta is going to do. I mean, they started with access control to the data, and from what it seems, they do excellent work product-wise to solve this problem. But I'm pretty sure that they are also going to attack other problems around data governance.
So I'm very excited to see what's going to happen in the future.
Me too. Well, thanks again for joining us on The Data Stack Show.
As with many of our guests,
we'll check back in with Stephen and Immuta, maybe in six months' time or so,
and get updates on where they are
with the product
and what his team is up to.
We'll catch you next time.