The Data Stack Show - 81: Digging into Data Ops with Prukalpa Sankar of Atlan
Episode Date: March 30, 2022

Highlights from this week's conversation include:
- Prukalpa's background and career journey (3:16)
- Applying a data-driven mindset to poverty (7:21)
- What Atlan does (11:53)
- The makeup of a realistically functioning data team (15:25)
- How to create a company's first data team (18:13)
- Defining "agile data" (22:01)
- The necessity of data ops (26:36)
- The minimum data stack needed (29:16)
- Data team size (31:58)
- Where to start when you need to make adjustments (34:51)
- Collaborate with different parts of the data stack (41:27)
- Defining the metadata plane (44:29)
- Lessons from facing crazy data problems (48:31)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome back to the Data Stack Show. We have
another data term to dissect in today's show, and that's the term data ops. We've talked a ton
about ops on the show and how ops is being adopted into the data space using a lot of the principles
of software engineering. We're going to talk with Prukalpa from a company called Atlan. Super interesting.
Kostas, she comes from a background where she's solved massive worldwide data problems
that have focused on things like poverty or access to clean fuel and water. And I am so
excited to hear what that was like, because those tend to be different in so many ways from a lot of the things that those of us who work in data companies in B2B SaaS, sort of in the venture-backed world, face. And I'm sure there are probably some similarities. So that's what I'm going to ask about. How about you?

Yeah, I'm very, very interested in chatting with her about metadata. I know that to build a platform like the one they have, you have to build some kind of, let's say, metadata layer there. And I really want to see, first of all, how mature the technologies are for collecting all of this, and also, what do you do with the metadata?
And the reason that I'm so interested with metadata is because, you know, you have the
data, you have to work with the metadata, and then you can go to the semantics.
Always the semantics.
Yeah, I guess, though, it's going to get complicated now with the metasphere or the
metaverse and talking about metadata.
What's that going to mean?
Yeah, I think that's going to be a very hot topic next year in all of data.
Joking aside, metadata is an important aspect of working with data.
And it's good that we start hearing more and more about metadata, because it means that the foundations of the technology are starting to solidify.
So we can start working on the next iteration of how we can deliver value.
When we're talking about metadata from a business perspective, I think it's a very good indication of the maturity of the space.
So that's good.
I agree. Super exciting. Well, let's dive in and learn more. Yep. Let's do it.

Prukalpa, welcome to the Data Stack Show. We're so
excited to chat with you. Thanks for having me. Okay. Let's start where we always do. We'd love
to hear about your background and I'm excited because you've done data work for some really
interesting internationally, you know, sort of
internationally known organizations. So can you just tell us about your background and what led
you to creating Atlan? Sure. Yeah. So I've been a data practitioner my whole life. Prior to this,
my co-founder, Varun, and I founded a company called SocialCops, mainly with the mission of
saying, hey, you
know, large scale problems in the world, like national health care and poverty alleviation,
they don't seem to be using data, and it really feels like they should be using data.
So let's do something about that.
And our model very quickly turned into us becoming the data team for our customers,
because we were typically working with folks like the United Nations or the World Bank or the Gates Foundation or several large governments who did not have data
teams or technology teams for that matter. So we sort of just became that data team, which is really
where I learned everything that I learned about building and running data teams and how complex
and chaotic they can get. So because of the kind of work we were doing, we were sort of lucky to be exposed to a wide variety and scale of data.
At one point, we were processing data
for 500 million Indian citizens
and billions of pixels of satellite imagery,
which all sounds like they're really cool projects,
but they were not really cool on a daily basis.
The day-to-day was a nightmare.
You know, I feel like as a data leader,
I have seen it all.
I've had cabinet ministers call me at eight in the morning
and say the nightmare that no data leader
wants to be woken up with,
which is the number on this dashboard doesn't look right.
And then I've done that wild goose chase of calling my project manager, who called my analyst, who said, hey, it looks like the pipeline's broken. Then you call my engineer, and he pulls up our logs and says, no, nothing looks wrong. And it takes us four people and eight hours to figure out what went wrong.

I have sat on top of our terrace and cried for three hours because an analyst quit on me exactly a week before a major project was due, and he was the only one who knew everything about our data. There was no way I could deliver this project without that analyst.

These kinds of things just brought us to a breaking point. Our team was spending 50 to 60 percent of our time dealing with this chaos: which data set should I use for this analysis? What does this column name mean? How do we measure annual recurring revenue? The number on this dashboard is broken, stuff like that. And we realized we couldn't scale like that.
And so we actually started building
like this internal project
that we call the assembly line.
And the goal was basically to say,
our team is super diverse
and we want to find a way
to make our team work together effectively.
Long story short, we tried to buy a solution. We failed at buying a solution. We were forced to build a solution. So Atlan was never born to be sold as a product to anybody else.
We actually built it ourselves
to make our team more agile and effective.
Over two years, we ran 200 data projects
on the tooling that we built at that time.
And in that time, we made our team over six times more agile.
And we realized that we'd built tools that were more powerful than we had originally intended, right?
Our team went on to, we did things like we built India's national data platform,
which the prime minister himself uses.
It's one of the largest public sector data lakes of its kind.
What was really cool about that project was that it was built by an eight-member team in 12 months; it's also one of the fastest of its kind. So we sort of realized that these tools could help data teams around the world hopefully be a little bit more agile and effective. And that's when, you know, Atlan was born. We said, can we use these tools to help every data team in the world?

Sure. Okay. I have to ask, this is so interesting, because
we love hearing about really diverse experiences of data. And when we think about subjects as big
as, you know, fighting poverty and then apply sort of a data-driven mindset to that,
could you just give us a little bit of insight into maybe like what's a specific
poverty related project that you worked on and what data were they not using? What data were
you able to introduce and how did that change the project? That's just so fascinating.
Sure. Yeah. So in some ways, I actually think social problems are some of the most complicated data problems that can exist, more so than in business, because in business the outcomes are a lot clearer, right? You want to improve revenue and you want to reduce costs, versus when you want to improve the quality of life of a human being, which is a much harder problem to model, right? And we saw this. Maybe I'll give you one example, with a project that's super close to my heart.
We partnered with the national government, which was rolling out clean cooking fuel to about 80 million below-poverty-line women across India. Just to give you context on the problem: women in rural India, below the poverty line, typically use a natural cooking fuel in their houses, firewood basically, which is equivalent to smoking something like 400 cigarettes an hour, some crazy number like that. It's crazy.
And so obviously the government wanted to solve this.
They were rolling out like cooking fuel programs.
So these were gas cylinders that were free
that were going to these below poverty line women.
And, you know, we rolled out the program
and, you know, there was like initial operational monitoring
and, you know, we put in place data systems for that.
The program rolled out really fast and really well.
And then we started hitting this challenge,
which is that while the penetration of gas cylinders
was increasing significantly,
cylinders need to be refilled, right?
And typically, the stations for gas cylinders were only in urban India, because there had been no penetration or demand elsewhere, right?
And the government was creating this like very rapid demand because of what they'd done.
Now, this was a super interesting problem because the person who runs a gas station
is actually an entrepreneur.
So it's a decentralized model and it's privatized.
Now, the entrepreneur obviously cares about this being profitable, which makes sense.
On the other hand, the government wanted to create access. So the problem statement they gave us, or what the minister at that point told us, was: I would like a gas cylinder station to be within 10 kilometers of every single Indian's home. And so now you have this really unique problem
where you're balancing accessibility with profitability.
And so how do you do that the right way in some ways, right?
And so, for example, what we ended up having to do,
it took us a bunch of iterations to do this.
Like, do you do top-down allocation?
Do you do bottom-up allocation?
You're talking about 640,000 villages.
So what we ended up doing was we actually turned it
into a geospatial modeling problem,
brought together data from about 640,000 villages,
got about 600 data sets in, so population, affluence,
like, you know, a bunch of those parameters.
We layered market data on top of that.
Where are the existing gas stations and cylinders?
Like, where is there already access? That basically fed into our clustering algorithm, and then the rest was modeling what people are going to be willing to pay. And so every cluster was actually a different size in some ways, in terms of the distance that it was covering.
And then use that to basically figure out
where you should go open these next 10,000 gas stations
across the country to actually solve for both profitability and accessibility.
Right. And so those are just some examples of the kinds of modeling challenges that we had to deal with.
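As a rough illustration of the kind of geospatial allocation she describes, here is a sketch that balances accessibility (a station within 10 km) against a profitability proxy (minimum population served). Everything here, the data, the distances, the thresholds, is synthetic and invented for illustration; the real project layered hundreds of datasets over 640,000 villages.

```python
import math
import random

random.seed(7)

# Synthetic stand-ins for the real inputs: villages as (x_km, y_km, population)
# and candidate gas-station sites as (x_km, y_km).
villages = [(random.uniform(0, 100), random.uniform(0, 100),
             random.randint(200, 5000)) for _ in range(500)]
stations = [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(30)]

MAX_DIST_KM = 10      # accessibility target: a station within 10 km of home
MIN_DEMAND = 20000    # profitability proxy: minimum population per station

def dist(a, b):
    """Euclidean distance in km (a real model would use geodesic distance)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Assign every village to its nearest candidate station, tracking both the
# demand each station would serve and how many villages meet the 10 km target.
demand = {i: 0 for i in range(len(stations))}
covered = 0
for x, y, pop in villages:
    i, d = min(((j, dist((x, y), s)) for j, s in enumerate(stations)),
               key=lambda t: t[1])
    demand[i] += pop
    if d <= MAX_DIST_KM:
        covered += 1

viable = [i for i, pop in demand.items() if pop >= MIN_DEMAND]
print(f"{covered}/{len(villages)} villages within {MAX_DIST_KM} km of a site")
print(f"{len(viable)}/{len(stations)} sites clear the demand threshold")
```

In practice you would iterate on the candidate sites themselves, which is where the clustering she mentions comes in, rather than score a fixed set.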
Yeah, no, super fascinating. That's really helpful.
It is wild to think about that because I mean, just off the top of my head, I mean, you mentioned geospatial, but, you know, economic modeling, the demographic component of it, the socioeconomic
component of it, which is, you know, pretty wildly different data sets. Interesting. Okay. So
you're dealing with issues like that. Let's talk about what Atlan actually does. You talked about, you know, getting the call from someone who says the dashboard doesn't look right.
But what does it look like for a team to use Atlan and how does that make them more efficient?
Sure.
Yeah.
So let's jump in on some of those problems I talked about, which are pretty commonplace in most data teams around the world.
And if you think about these problems very deeply, you realize that the place it stems from is actually this fundamental reality of data teams, which is diversity, right? Data teams are diverse. To make a data project
successful, you need an analyst, an engineer, a scientist, a business user, machine learning
researcher, analytics engineer. All these people are very different. They have their own persona
types. They have their own DNA in the way that they work. They have their own tooling preferences, and they also have their own limitations.
And while this diversity in some ways is our biggest strength, it's also our biggest weakness
because a ton of the challenges that I talked about, like come from the fact that all these
people need to sort of come together and collaborate, but they all have different
contexts that they're operating in within the ecosystem. And so at Atlan, we sort of see ourselves as a collaboration layer for the modern data team. Every function inside an organization has its hub, right? Engineering teams have GitHub, sales teams have Salesforce. What does it take to create that true collaborative hub for a modern data team, knowing that the only constant reality in a data team is diversity? So the place we operate
in is if you think about the fundamental modern data stack in some ways, which is your data
ingestion and warehousing and transformation and BI, that's what I think of as the data stack.
Atlan sits on the metadata plane, or the control plane layer, of the data stack. We bring in metadata from all of your different tools
in your ecosystem.
We bring that together, put it together
to essentially start creating intelligence and signals,
make it super easy to discover data assets
and so on and so forth.
But most importantly, we actually use this
to start driving back better context
into the tools that you're working in daily, right?
So for example, when I am in a dashboard or a BI tool,
I want to know, can I trust this dashboard?
But the truth about whether you can trust this dashboard
is actually in the ETL tool.
And it's in: did the pipeline get updated today or not, right? Or did the quality check run, and did it pass? That's the metadata that Atlan brings together. We make sense of it. We construct lineage automatically. We basically make sense of your entire
data map in some ways and create that single source of truth. But then we take that back
into tools like BI tools, into Slack, into collaboration hubs, into GitHub,
into tooling like that to actually make the day-to-day workflows of teams significantly more simple.
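To make the idea concrete, here is a minimal sketch of turning pipeline metadata into a trust signal shown alongside a dashboard. The record shape, names, and thresholds are hypothetical, not Atlan's actual API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical metadata records, as a catalog might collect them
# from an ETL tool and a data-quality tool.
@dataclass
class PipelineRun:
    asset: str
    finished_at: datetime
    quality_passed: bool

def trust_badge(asset: str, runs: list[PipelineRun],
                max_age: timedelta = timedelta(hours=24)) -> str:
    """Summarize upstream metadata into a badge shown on the dashboard."""
    latest = max((r for r in runs if r.asset == asset),
                 key=lambda r: r.finished_at, default=None)
    if latest is None:
        return "unknown: no pipeline runs recorded"
    age = datetime.now(timezone.utc) - latest.finished_at
    if age > max_age:
        return "stale: last refresh over 24h ago"
    if not latest.quality_passed:
        return "warning: quality check failed on last run"
    return "fresh: refreshed and quality checks passed"

runs = [PipelineRun("revenue_dashboard",
                    datetime.now(timezone.utc) - timedelta(hours=2), True)]
print(trust_badge("revenue_dashboard", runs))
```

In a real catalog these records would be harvested from the ETL and quality tools rather than constructed by hand, and the badge would be pushed into the BI tool or Slack, as she describes.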
Prukalpa, I have a few questions, because you have mentioned some very exciting topics.
I'd like to start from the people.
You mentioned quite a few times about the diversity and the complexity of the data teams, right? Now, us coming, let's say, from the more
like technical side of things and the data engineering, when we talk about data teams,
we keep on forgetting like all the different stakeholders that are part of these teams,
right? We focus a lot on the engineering persona, talking about data engineers and maybe sometimes also analysts. So can you give us, based on your experience, a description of how a functioning data team, a realistic one, usually looks? And what are the personas involved there?

Wow, that's a loaded question.
I wish there was a way a typical data team functions, right?
And I think that's the reality that, you know, every team is diverse.
Like every team is unique.
And teams also evolve over time, right?
And so I think this is classic: we've seen everything from fully centralized data teams to fully decentralized data teams, to all kinds of hybrid structures in the middle, right? We're increasingly starting to see, for example, functions like data platform and enablement, which in my mind is a new form of governance. There are centralized functions, and then there are decentralized functions, which is pod structures with analytics engineers and analysts. And I think what I've realized over time is that there are four or five
different ways that you can structure your data team. I'm also a very big fan of not fitting people to JDs or fitting people to structures, and instead actually building a structure that works for your team. Because the reality is that there's a lot of overlap,
right? If you think about the fundamental skill sets, from an analyst to an analytics engineer to a data engineer to a machine learning engineer, you're actually talking about overlapping skill sets. It's not black and white. And in a lot of ways, it has to do with the person. I've never met a perfect data scientist; I don't think that exists. And so I'm actually a very big fan of this methodology of starting at the fundamental skills and building roles around people. And then, in some ways, the structure of your data
team gets structured on the basis of your leaders, right? How do your leaders interplay with each other, and what are their skill sets? I wish more people would adopt that, because I think that's really the only reality in a data team.

Yeah, that's a great point. So, you know, companies usually do not start with a data team, right? When you incorporate and you start a new project or a new company, you don't really have the resources or even the need for a data team.
There is a certain point in the lifecycle of the company that you will start needing
that.
Based again on your experience, because you mentioned having a core set of skills and
then building on top of that.
What is this core skill set that is required for the people to create
this first data team
in a company?
Yeah.
So I believe that
the way to think about this,
and I think every startup founder,
like, in fact,
I actually have a blog on this,
which is, you know,
how do you go about prioritizing this?
Because I actually get
a ton of questions
from like startup founders
who are like,
oh, we want to invest in a data team.
Where do we start?
And what I typically ask them to do is actually say,
okay, I think you should think about this from a strategic perspective
in terms of what do you want your data team to achieve in the first place?
And so to give you an example,
I think this needs to start at: what is the biggest strategic priority of the company? Because let's say I am starting a hyperlocal delivery startup, or something like an Uber equivalent, for example, right? Maybe the most important thing when I'm starting on day zero is just operational analytics. I just need to know how many rides we are serving, things like that. But right after that, or even at that point, probably the most important thing for the business ends up actually being the matching algorithm, which is a pretty complicated data science problem, right? So on day zero, you're not just starting with analytics, you're also probably starting with data science and investing in data science, so that you can actually solve data science as the fundamental part of your product and your business. And so on day zero, when you're investing in your team, you're probably going to try and find a leader, or an initial team. You'll probably start with an analyst and a data scientist who can stretch, and then you'll build out those two teams like that.
On the other hand, let's say you're a software startup and you're selling SaaS, for example.
Now, when you're selling SaaS, operational analytics is almost all you need to work really well, up until you get to a relatively mid-sized company in some ways, right? You want to invest in product analytics, you want to invest in sales analytics and sales ops. And so in that case, for example, you probably just want to invest in a really strong analytics leader, maybe someone who comes with domain expertise in SaaS, because SaaS is complicated in the way the domain itself works. And you don't need data science at all, up until maybe much later in the company, when you decide to build a product using all the data that you've collected in your SaaS product, for example. And so I think that is the nuance in building a data team and a structure: you need to start from first principles, what you're trying to optimize for as a company, and from there figure out what skill sets you need your data team to have on day zero.

Yeah, I think you gave some super valuable advice here, and it's a very interesting perspective on building teams. I don't know how many times I've seen SaaS companies at an early stage be like, okay, we are struggling with attribution, for example, let's find a data scientist to do some magic. Of course it fails in the end, but that's the topic of another episode. That was great; I really appreciate you sharing this information with us. So you mentioned at some
point that using, let's say, the platform that Atlan is today, you became more agile, right? And agile as a term in software engineering has a very specific meaning. Usually the easiest way to explain what agile is, is to give the counterexample of waterfall, right? But that's in software. What is agile in data? What does it mean to become more agile in working with data? Is it the same thing as in software, or is it
different?

Yeah. So at a high level, as we thought about how you measure agility, we sort of thought about it as velocity in some ways: how can we get stuff done, but also at what level of quality can you get it done? How can you reduce the iterations that you need in your work when something changes? Change requests are a really important part of a data team's job, right? When someone tells you, oh yeah, the dashboard looks great, that metric looks great, but can you just make this one change to it and pull this one additional number into it, only your data person knows how difficult it is to go and get that one number into that dashboard. And so how can you build your entire pipeline in a way that gives you that kind of reusability and reproducibility, to be able to manage change requests? So I think all of those are components that go into agility.

To answer your question on whether agile is the same as in software engineering: absolutely not, right? Software engineering is a very different practice. Actually, the one fundamental that's different between software and data is that in software, humans create the code. That fundamentally changes the equation, because in data, we can't control the data that we are working with in most cases, right? And I think that itself is a fundamental paradigm shift between software and data. Second, in software you often already know what you're broadly going to build and what you have to do. It's much easier to measure execution, right? And quality of execution.
Versus in data,
many problems are exploratory in nature, right?
Let's say it's, why is our ARR number dropping? That's an exploratory analytics project. How do you even know? It's really difficult to scope a problem like that on day zero, right?
And so I think those are things
that are fundamentally different
between software and data. And I think that's why it becomes very difficult to just say, let me pick agile as a framework, it works in software engineering, and I'm just going to bring it into data. So a few things that were useful for us: we basically tried to take best practices, but not just best practices from software engineering. We also took best practices from, you know, lean manufacturing and DevOps. Data itself is such an interdisciplinary team, so in some ways you can take learnings from a bunch of product teams, for example. Something I'm really, really bullish about is this idea of going from almost a data service team, where you're just servicing requests, to a data product team. A product team is building for your end users; your success is measured on whether your users at the end of the day actually use the product. In the same way, can you actually think about your data products, right? And can you measure yourself on success rather than just closing out a service request? So I think all of those components are things that we should learn from as a data community, and build what our practice of agile, or what people call data ops, should look like in the ecosystem.
Great, great.
That's super interesting.
And again, another very good definition.
And it's good to make the differences clear, let's say, because many people, especially data engineers, come from a software engineering background, right? And they have been exposed to very specific semantics around what each thing means, like, for example, agile. So understanding the differences between what it means to be agile when you work with data and what it means when you work with software is really important if we want to increase, let's say, the quality of the work that we manage to do in the end. I'll keep with the same approach of trying to redefine terms.
You mentioned DataOps, right? Again, Ops is not something new as a term. We have DevOps,
we have SREs, we have RevOps, we have everything else.
BizOps.
BizOps, exactly.
MarketingOps, yeah. Exactly. So why do we need it?

Yeah, so at a high level, the way I think about DataOps is that it's really a principle, almost a way of doing things. It's a buzzword now and it's gotten a lot of attention, and there are a lot of products that claim to be a DataOps platform or a DataOps product and all these other things. But I actually don't think that that's what DataOps is, right?
DataOps is fundamentally about saying, how do we take the principles of agile and DevOps and lean manufacturing and all of this, and bring it into a fundamentally collaborative practice that helps data teams work together effectively? It's built on the foundations of collaboration and reproducibility: how do you ensure that your data assets are reusable and reproducible? It's built on foundations like self-service, right? How do you create something where you're reducing the dependencies on the core data team?
I think those are some of the elements of what DataOps means and can create.
For example, in our case, we actually created something that we call the DataOps Culture Code,
which is about, you know,
what does implementing a DataOps culture
truly mean inside organizations?
And I think that's the way we need to think
about these concepts.
I think, you know, be it DataOps
or be it the Data Mesh, for example,
these are all design principles.
These are ways of doing things.
These are not technologies; technology is just a part, or an enabler, in solving these problems.
But it's a broader principle that we're working towards.
All right.
So I think enough with terminology.
Let's get into the technology now.
So, all right.
We have figured out what DataOps is, why we need it, how we build such a platform.
What do we need in order to...
Actually, no.
Before we go to this question, I have another question.
Sorry.
Which I think is going to help us with this question.
And this question is about the data stack.
We keep talking a lot lately about the modern data stack.
We have a panel here trying to define what this thing is,
why it is modern, when it stops being modern
and it's not modern anymore,
what's going to happen in the future.
No, no, post-modern.
Exactly.
So let's, I mean, I'll try to avoid the controversial conversations
around it, but we need a stack, right? In order
to work with data, there are some architecture that needs to be in place, some minimal kind of
pieces of technology that we need to work and operate. So based on your experience,
two parts of this question. First, what's the minimum set of data stack that a company needs
to have in place? And second, what is the minimum, let's say, data stack that you as Atlan need in order to go and
operate and deploy your data ops platform?
Sure, absolutely. Yeah. So as I think about the data plane, or the data stack itself, I broadly think of it as a few building blocks, right? The first is around collecting your data in the first place. This is where you have data ingestion, you have CDPs: essentially, what does it take to actually bring your data together and collect the data that you need? I think the cornerstone of every data platform, in some ways, is the storage and processing
layer.
And there are a bunch of different architectures that you can use: your cloud data warehouse, your cloud data lake, your lakehouse, whichever of those architectures you're picking inside the org. But that, I think, is the cornerstone in some ways.
Then there's transformation.
How do you go from raw to, you know, bronze, silver, gold, and so on? So that's the third layer, I'd say.
And then the final is what I call the application layer.
That's where I would say the BI tools sit.
And then depending on whether you're
a data science organization,
maybe some data science tooling,
like Jupyter, for example, sits. I'd say that's, in my mind, what forms the core data
stack.
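The raw-to-bronze-silver-gold progression in the transformation layer can be sketched as a toy example. This is plain Python rather than any particular warehouse or tool; the field names, source tag, and cleaning rules are illustrative assumptions:

```python
# Toy sketch of a medallion-style transformation: raw -> bronze -> silver -> gold.
# Field names and the "_source" tag are hypothetical, not any product's schema.

raw_events = [
    {"user": " Alice ", "amount": "10.5", "ts": "2022-03-01"},
    {"user": "bob", "amount": "oops", "ts": "2022-03-01"},  # malformed amount
    {"user": "alice", "amount": "4.5", "ts": "2022-03-02"},
]

# Bronze: land the raw data as-is, just tagging provenance.
bronze = [dict(row, _source="ingest_api") for row in raw_events]

# Silver: clean and conform -- normalize names, drop rows that fail parsing.
def to_silver(rows):
    out = []
    for r in rows:
        try:
            out.append({"user": r["user"].strip().lower(),
                        "amount": float(r["amount"]),
                        "ts": r["ts"]})
        except ValueError:
            pass  # in practice you'd quarantine bad rows, not silently drop them
    return out

silver = to_silver(bronze)

# Gold: a business-ready aggregate, e.g. revenue per user for the BI layer.
gold = {}
for r in silver:
    gold[r["user"]] = gold.get(r["user"], 0.0) + r["amount"]

print(gold)  # {'alice': 15.0}
```

In a real stack each arrow here would be a job in a transformation tool, but the shape of the pipeline is the same.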
Once you have that basic data stack, which is, say, these three or four tools (there's a bunch of others that I'm not mentioning, but this is like a minimum viable data platform), I think it's at that point that tools like Atlan start becoming helpful,
where we say, hey,
we're building that like metadata governance plane
in some ways for your data stack.
So a great, for us, for example,
a typical customer who brings us in
has implemented something like a Snowflake
or a Databricks or a, you know,
AWS data platform in the last,
say, you know, 12 or 18 months. They already have set
up their initial BI.
They've solved some of those initial problems with data.
And that's when collaboration chaos becomes a reality.
That's when they start realizing, hey, we hired the first few sets of analysts, but
my new analysts are not productive at all because they don't know what data they should be using, and things like that.
And those problems start becoming real.
Is there a minimum size of a team that you have observed
that usually exists when Atlan becomes relevant?
So we typically see that somewhere around the point when your data team is at that 10-member size is where the problem starts becoming a real pain. That's when, you know, a really sizable chunk of your data team's time, probably over 50 percent, is actually being spent on issues like this.
Interestingly, we also see a bunch of data leaders, which I think is interesting now, because you actually have people who worked on larger teams who are now going in and setting up teams at early-stage startups. And we have teams that are starting out with Atlan much earlier, because we've started seeing data leaders say, hey, we've gone through the chaos of not implementing this and then having to figure it out at a later stage, and we know how painful it is, so we just want to get it right from day zero. We don't want to have to fix our problems when we grow.
And so we do definitely see earlier-stage teams starting to adopt a lot of the practices that we recommend. For example, we talk about things like: how do you think about your data assets as data products, and what does that mean? How do you create shipping standards on day zero? How do you create a documentation culture on day zero? These are all things that we think about as practices inside the team. And we're starting to see people actually adopt this almost at day zero, rather than necessarily wait till the problem becomes a real pain. Yeah.
I have a question on that, Prukalpa,
because in an ideal world,
all of us working in data would love if companies were constantly looking six months ahead
and were implementing processes and tools
that would make their future data stack and data team operate more easily.
But in the real world for most companies, especially as you're scaling and dealing with data and putting out fires and adding that one number to the dashboard, it's really hard to anticipate what things are going to be like in the future. So
I'd love to hear you speak to someone who says, okay, I'm already experiencing that pain, right? Like, we have a pretty robust data team and stack; we have a data science and machine learning practice, or are starting that journey. So if you do have to go back and sort of solve the pain after things have reached a tipping point, where do you start? Which discipline do you start with within your definition of data ops? Because, I mean, there are so many things, right? Do we start with governance, or do you need to solve cataloging before that? Or, you know, lineage?
There are multiple components of this that Atlan solves, but what's the starting point?
So I think the best way to think about this in some ways is what I think of as the journey that our data team actually will take, right?
And people bring Atlan at different points in their journey.
That depends on how they think about agility and how forward thinking they are
and how much they think for them in advance. And that's different, right?
Different teams like operate differently. But for example,
the way I think about it is, you know, when you've just started your data team. Let's say your data team is pretty early, a pretty small team.
The first set of problems that you're probably going to start solving are
things like pretty simple things.
So it's going to be things like, do we all agree on the same metric definitions?
And how do we measure the metrics?
It's going to start there.
And then you're mainly focused on, when you're that early stage data team at that startup,
you're mainly focused on saying: how do I help my business users or my business stakeholders start to trust the data, start to trust me, start to trust that they should make data-driven decisions? You know, those kinds of things. That's where you're starting.
Very quickly, what starts happening is that people actually start relying on the data team and start sending, I think of these as service requests, to the data team, right? So you start out with maybe helping out in the monthly business review and the quarterly business review, then a bunch of ad hoc requests start coming to you, and the early data team says, okay, we can't handle this anymore. We need to hire new people.
This is when your data team starts growing. And at that point,
I think the biggest challenge that data teams have is productivity. It's really hard to get new analysts up to speed.
The typical time that an analyst stays in an organization today is like 18 months, and you're spending six months onboarding a person in that time, right? And so
I think the biggest challenge you start facing is analyst productivity, or, you know, this is also true for data scientist productivity, basically any data consumer productivity in some ways.
And that's where you want to start solving these problems.
So that's where things like data discovery,
data lineage, you know,
context or tribal knowledge around your data,
data documentation,
these start becoming a reality, and investing in that becomes super important.
Now, there's a point where, even if you improve the productivity of your data team and hopefully your data team is doing much better, the reality is that the demand is going to be much, much more than you can scale your data team to handle, no matter how hard you try. Because the reality is that you can only scale your data team linearly, and it's likely that you're going to start getting exponential requests.
So that's sort of the time where we see, you know, data teams go through almost this mindset change, from building data services to almost a data product mindset in some ways. If you think about the difference between services and product: with services, you're servicing a single request; with product, you're basically building something scalable that everybody, or a good chunk of users, can use. And so it takes a little bit of upfront investment on day zero, but as you go along the way, over time you're actually reducing a ton of the repetitive requests that your team is getting, which is saving you a ton of time so that you can actually build new things.
At that time, the priorities start becoming a little different for data teams, right? And so that's where we start seeing people say, how do I start looking at insights as an asset, queries as an asset or a product? In Atlan, there are two ways you can use that: one is the Atlan UI and interface itself, but Atlan also has a ton of APIs and apps that you can build on top of Atlan, which you can connect into your CI/CD pipelines and into your downstream tools, which could be your BI tools and so on and so forth. And so that's where we start seeing people
leverage a bunch of those kinds of capabilities. And then the final layer is starting to truly
create that self-service environment, right? Like the holy grail that every data person wants is that, you know, we are just able
to like truly enable self-service in our organization.
And at that point, you're actually starting to expose a bunch of your data products to
your end users or your business users directly.
And at that point is where things like governance start becoming a reality.
I always think about this: democratization, as much as it's a buzzword,
democratization and governance are,
you know, two sides of the same coin, right?
The more people are getting access to the data,
the more you're starting to think,
who's accessing my data?
Are the right people accessing my data?
PII, like those kinds of things start becoming a reality.
And so I sort of see this as a journey.
And the question really sort of comes down to where you are in this journey.
So, for example, for teams that adopt us much later in their cycle, when they're a much larger team, governance is a priority on day zero itself, because
of just where they are in their journey.
Versus, you know, if you're a much earlier-stage team, you're like five people, and you're thinking about access control and security, that's super unlikely, right?
And so I think that's the way we think about it.
Yeah, it makes total sense.
Sorry, Eric, but we need to,
I have a question that I pretty much have like
from the beginning,
but I think now's the right time to ask that.
Go for it.
So we are talking a lot about enabling collaboration
and healthy collaboration between people,
blah, blah, blah, like all these things.
We're talking about the data stack.
You gave a very good description of the complexity of a data stack,
even the minimum viable data stack, right?
It has many moving parts there.
So I wonder: in order to build a platform like Atlan, you also need, on a technology level, to collaborate with all these different parts of the data stack, right? You need somehow to interact with them, pull some metadata, and I'd like to talk more about that a little bit later. How do you do that, considering that each vendor only cares about their own problems, right? I don't think the first thing they think about is how they're going to expose metadata or APIs to tools like yours. So how does this work, and how much of a challenge is it today?
Yeah.
No, absolutely. I think we're actually doing a pretty decent job as a community today of making it possible for you to get metadata out of the tools.
So this is not true for the fringe tools
in the ecosystem
where the use case
is not as elaborate, right?
But for the main tools
in the ecosystem,
it's actually okay.
The thing is,
you might have to do some work
on top of that
to make that metadata useful.
That's a different thing.
And that's what like products
like us focus on, right?
In some ways.
I think the true challenge
is not the integration point as much as it's the diversity of the
integration points.
The truth in the data ecosystem is that the data stack is also evolving.
So if the data stack was just, these are the 100 tools in the data stack, and it's going
to be these 100 tools for the next five years, that would be awesome.
And a relatively, you know,
simpler problem to solve.
But you know what, like, I never thought
I would be hearing about Firebolt even like a year ago.
But, you know, now you hear about it, right?
And I think the data stack
is changing so often, with new tools
getting added to the ecosystem,
and that is going to continue to happen.
In fact, I think after diversity, the only reality in data is change.
And so then, for us as a platform, for example, we need to be truly agile to be able to actually support these integration points. Because if you want to enable true collaboration, the only way we can do it is by supporting these integration points. So we turned that into a feature rather than a bug. The way we thought about it is, Atlan is built behind the scenes on what we call an open marketplace, which basically means that customers can actually build apps on top of Atlan, which allow you to build integration points not just into, you know, the tools that we're pulling in metadata from, but also
integration points into collaboration workflows and downstream tools that you want to integrate
into, right?
So, for example, if a team has a specific workflow that they use on Jira and they want to build a metadata orchestration workflow off it, they're able to do that on Atlan as well. And that's the way we think about the role we play in the stack, in some ways.
Yeah, okay. And I know we don't have that much time left, which is a good thing; it means that we need to arrange another episode at some point to keep chatting about that. But before we reach the end of our episode today,
let's talk about the metadata plane.
Usually, I mean, the two main terms that we hear
are the control plane and the data plane.
And suddenly we introduce a new term,
which is the metadata plane.
So what is this metadata plane?
And what is a piece of metadata?
Like if you could give us like an example
from like a BI tool or something like a data
warehouse, that would be amazing.
Yeah, sure. So let's start with what metadata itself is, right? And I think the simplest way of describing it is data about data, in some ways. What that means is, every one of your tools is generating data assets.
And there is context that is created about each of these data assets.
So let's pick, for example, in your BI tools, you have context about usage, right?
Which of these BI tools are getting used the most?
Which of these dashboards that you're building are getting used the most?
At what time? By which users? That's metadata. Which data source or which table in Snowflake is connected to this dashboard? That's metadata. In your data warehouse tool, you can use your query logs to actually figure out, in some ways, lineage, how different tables are connected to each other. That's metadata. In your pipeline or your orchestration engine, you have metadata about what time the pipeline was updated, right? That's metadata.
So the way I think about it is, metadata could be technical, metadata could be social, it could also be about usage, who's using what, things like that. And the more you're able to bring in,
right from the standard forms of metadata, all the technical stuff, and marry it with more and more types of metadata, that's really where you're able to create, I think about this as almost a single plane for all your metadata in the ecosystem. It's the same thing that happened with the data lake, actually, right? There was a time, in the big data world back in the day, when we were bringing data from a bunch of different places to dump it into the data lake, in some ways to say, hey, you know what, we don't know what the countless use cases of this are going to look like, but we know that this is valuable.
And we can talk about, you know,
of course the implementation hurdles
and the issues that had happened,
but if you think about it from
the fundamental concept level,
metadata also has a ton of different use cases.
I think we've just scratched the surface of what those use cases could look like today.
Today in an ecosystem, we are talking about data discovery or data lineage or data observability.
These are just one or two or three use cases of what metadata can do.
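As a concrete illustration of the query-log-derived lineage use case described above, here is a rough sketch. Real catalogs use proper SQL parsers against actual warehouse logs; the log entries, table names, and the regex here are simplified assumptions:

```python
import re

# Rough sketch: deriving table-level lineage from warehouse query logs.
# The log entries and table names below are made up for illustration.

query_log = [
    "CREATE TABLE analytics.daily_revenue AS SELECT * FROM raw.orders",
    "INSERT INTO analytics.dashboard_feed SELECT user_id FROM analytics.daily_revenue",
]

# Match the written-to table and the table it reads from.
pattern = re.compile(
    r"(?:CREATE TABLE|INSERT INTO)\s+(\S+).*?FROM\s+(\S+)",
    re.IGNORECASE,
)

lineage = []  # (upstream, downstream) edges
for q in query_log:
    m = pattern.search(q)
    if m:
        target, source = m.group(1), m.group(2)
        lineage.append((source, target))

print(lineage)
# [('raw.orders', 'analytics.daily_revenue'),
#  ('analytics.daily_revenue', 'analytics.dashboard_feed')]
```

Chaining those edges gives you the table-to-dashboard connectivity that powers lineage views in a metadata platform.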
In the future, you could be using metadata to auto-tune your data pipelines. You could be using metadata to actually cost-optimize your entire data management ecosystem. There's a ton of different use cases for what metadata can do. And so the way I think about the metadata plane is, I think it's the foundation of the control plane, to be honest: you bring in all of your metadata and then you're using it to drive these use cases. Governance and security and, you know, catalogs and discovery are some of them, but then there's a ton of other newer, intelligent, operational kinds of metadata use cases that are still remaining to be discovered, in many ways.
So interesting. Well, we are close to time here,
but I have one more question for you.
And we like to get advice from our guests.
And I think one really interesting experience
that you've had is tackling these massive data problems
with multiple different types of data.
So going back to the beginning of our conversation
where we talked about clean gas and how that included geospatial data and economic data,
what are maybe one or two of the lessons you learned when trying to face a big,
sort of crazy data problem like that, that our listeners could learn from?
That's a great question. So I'd say a couple of things. And this comes back to, for me, maybe this is the same battle, this is probably why I'm building Atlan today, right? But to me, I think it really just comes down to the team and the culture.
I think that is the most important thing in being able to crack
the most difficult data problems.
Like, for example, in that team that I was telling you about,
the clean cooking gas, you know, I honestly think we were probably the only ones; I have not heard of a problem like that being cracked that way.
It took us multiple iterations, three months actually, to get there. And the reason I think we were able to do it: even fundamentally, think about how you think about accessibility versus profitability. It sounds simple today when you hear it, but when you're really trying to figure out how to do this for the first time in the world, it was not. We had a development economist in the room. We had a data engineer in the room. We had a project manager who came from a political background in the room. We had all these very, very diverse people in the room, and I think that enabled us to actually rethink the problem
from first principles in a way that a standard team that would just have had maybe analysts or
just had a single kind of persona would not have been able to think about that problem. So I think that diversity is very, very important. For example, again, I go back to that example: we actually had a solution
that had been signed off
by our client
where it was not
the ideal solution,
but it was like, you know,
it was a top-down
way of allocating.
There are multiple ways
to solve a data science problem, right?
It was a top-down way
of allocating
where these gas centers,
which districts
they should go get opened in.
And we still felt like it wasn't solving the access problem.
We felt it solved the profitability problem,
but it wasn't solving the access problem.
And so literally three days before the final presentation to the cabinet
minister,
I remember my co-founder and I were in the room and my co-founder basically
like listens to the problem. And then he's like, Hey, like, so wait,
this is not a profitability problem. This is actually an accessibility problem, a distance problem.
So why are we not thinking about it from a geospatial perspective?
Why are we thinking about it?
And so we actually flipped the entire solution in like two or three days.
And that wouldn't have happened if we didn't have the diversity in the room.
And so I think that to me is the most important thing. And so leaders should really
strive to find a way to build diverse teams and have them work together. I think the second aspect
of that is trust. The problem with diversity is it's really hard to build trust in teams.
When a number on a dashboard, going back to the number on the dashboard breaking, and I know we
laugh about it a lot in the data space. But the reality is that at that moment, when the cabinet minister
called me and said, the number on the dashboard is broken, I couldn't answer his question as to
why the number on the dashboard was broken. At some level, the hard-earned trust that I had
built with him broke. Same time when I called my data engineer, and he said, I'm gonna pull audit
logs and check. At some level, I didn't know if the problem was that the pipeline really broke,
or if my data engineer was messing up. And it broke again. And this creates such a deficit in diverse teams. In most teams, a sales leader started out as a sales rep. Everybody
does the same job in the team. Everybody has clarity. That's not the case in a data team. And so the second most important thing to build in a data team to make
it successful is how do you build an ecosystem of trust? How do you help people trust each other?
How do you help people trust the data that they're working with? I think that's the second
most important thing that I would invest in as a data leader.
Incredibly wise advice, and we thank you so much for that, Prukalpa. And thank you for your time today.
It was a great conversation.
Thank you so much for having me.
This was a lot of fun.
My big takeaway: we covered a bunch of topics, but I appreciate that Prukalpa returned to a theme
that we've heard on the show multiple times.
And it was so great to see her kind of think through all of her experiences with data,
building a data ops platform, and what she went back to as the most important thing in
solving data problems as a team.
And I really appreciated how she said diversity is so important to have on a team that's solving a data problem, but it also makes the trust component difficult because you have that diversity, right?
And people are coming from different backgrounds and skill sets and have different responsibilities as stakeholders in the project. So that was just a really, that's one of those things where like,
we, I think kind of have all heard
and known the back of our mind,
but to hear it articulated like that
is always a great reminder.
Yeah, a hundred percent.
Like if you think about it,
like think about that,
like when you build a company
and like you build the product,
you build the product for a very specific persona.
You have only one persona to keep in your mind.
And even that is super hard,
figuring out how to satisfy this one persona.
Now, if you put yourself in the shoes of a data professional,
like analyst, data engineer, whatever,
whoever is a member of this data team,
these teams have as customers
all the different departments and functions
that the company has, right?
So they have to satisfy by delivering services
or products, all these different personas.
And that's exponentially harder to do.
And of course you need trust.
Like without trust,
like you can't build anything, right?
So yeah, I think that was probably
like one of the most important topics
that we touched during this conversation.
And we don't usually talk that much about that
when we talk about data
and the technologies around it,
but we should spend more time on it.
I agree. I agree.
Well, thanks again for joining us
and we will catch you on the next episode.
We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe on your favorite podcast app
to get notified about new episodes every week.
We'd also love your feedback.
You can email me, Eric Dodds, at eric at datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by Rudderstack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com. Thank you.