The Data Stack Show - 81: Digging into Data Ops with Prukalpa Sankar of Atlan

Episode Date: March 30, 2022

Highlights from this week’s conversation include:
- Prukalpa’s background and career journey (3:16)
- Applying a data-driven mindset to poverty (7:21)
- What Atlan does (11:53)
- The makeup of a realistically functioning data team (15:25)
- How to create a company’s first data team (18:13)
- Defining “agile data” (22:01)
- The necessity of data ops (26:36)
- The minimum data stack needed (29:16)
- Data team size (31:58)
- Where to start when you need to make adjustments (34:51)
- Collaborate with different parts of the data stack (41:27)
- Defining the metadata plane (44:29)
- Lessons from facing crazy data problems (48:31)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by Rudderstack, the CDP for developers. You can learn more at rudderstack.com. Welcome back to the Data Stack Show. We have another data term to dissect in today's show, and that's the term data ops. We've talked a ton about ops on the show and how ops is being adopted into the data space using a lot of the principles
Starting point is 00:00:39 of software engineering. We're going to talk with Prukalpa from a company called Atlan. Super interesting. Kostas, she comes from a background where she's solved massive worldwide data problems that have focused on things like poverty or access to clean fuel and water. And I am so excited to hear what that was like, because those tend to be, you know, different in so many ways from a lot of the things that those of us who work in data companies in B2B SaaS, you know, sort of in the venture-backed world, face. And I'm sure there's probably some similarities. So that's what I'm going to ask about. How about you? Yeah, I'm very, very interested in chatting with her about metadata. I know that to build a platform like the one they have, you have to
Starting point is 00:01:32 build some kind of, let's say, metadata layer there. And I really want to see, first of all, how mature the technologies are in order to collect all these? And also, what do you do with the metadata? And the reason that I'm so interested with metadata is because, you know, you have the data, you have to work with the metadata, and then you can go to the semantics. Always the semantics. Yeah, I guess, though, it's going to get complicated now with the metasphere or the metaverse and talking about metadata. What's that going to mean?
Starting point is 00:02:07 Yeah, I think that's going to be a very hot topic next year with all data. Outside of jogging, it is an important aspect of working with data. And it's good that we start hearing more and more about metadata because it means that the foundations of the technology is starting to be solidified. So we can start working on the next iteration of how we can deliver value. So we're talking about metadata from a business perspective, I think it's a very good indication of the maturity of the space. So that's good. I agree. Super exciting. Well, let's dive in and learn more. Yep. Let's do it. Prakapa, welcome to the DataSec show. We're so excited to chat with you. Thanks for having me. Okay. Let's start where we always do. We'd love
Starting point is 00:02:57 to hear about your background and I'm excited because you've done data work for some really interesting internationally, you know, sort of internationally known organizations. So can you just tell us about your background and what led you to creating Atlan? Sure. Yeah. So I've been a data practitioner my whole life. Prior to this, my co-founder, Warren and I, we founded a company called Social Cops, mainly with the mission of saying, hey, you know, large scale problems in the world, like national health care and poverty alleviation, they don't seem to be using data, and it really feels like they should be using data.
Starting point is 00:03:34 So let's do something about that. And our model very quickly turned into that we became the data team for our customers, because we were typically working with folks like the United Nations or the World Bank or the Gates Foundation or several large governments who did not have data teams or technology teams for that matter. So we sort of just became that data team, which is really where I learned everything that I learned about building and running data teams and how complex and chaotic they can get. So because of the kind of work we were doing, we were sort of lucky to be exposed to a wide variety and scale of data. we were doing, we were sort of lucky to be exposed to a wide variety and scale of data.
Starting point is 00:04:07 At one point, we were processing data for 500 million Indian citizens and billions of pixels of satellite imagery, which all sounds like they're really cool projects, but they were not really cool on a daily basis. The day-to-day was a nightmare. You know, I feel like as a data leader, I have seen it all.
Starting point is 00:04:26 I've had cabinet ministers call me at eight in the morning and say the nightmare that no data leader wants to be woken up with, which is the number on this dashboard doesn't look right. And then I've, you know, done that like wild goose chase of calling my project manager who called my analyst who said, hey, it looks like the pipeline's broken. And, you you know then call my engineer and he pulls out our locks and says no nothing looks wrong and you know it takes us like four people and eight you know eight hours to figure out what
Starting point is 00:04:54 went wrong i have sat in the top of our terrace this one time and cried for three hours because an analyst quit on me this one time exactly a week before a major project was due and he was the only one who knew everything about our data like project was due and he was the only one who knew everything about our data like he like and there was no way i could deliver this project without this analyst and that's sort of just these kinds of things just brought us to this breaking point right if you like our team was spending 50 60 percent of our time dealing with this chaos of which data set should i use for this analysis what does this column name mean how do we measure annual recurring revenue?
Starting point is 00:05:25 You know, and your number on this dashboard is broken, like stuff like that. And we realized we couldn't scale like that. And so we actually started building like this internal project that we call the assembly line. And the goal was basically to say, our team is super diverse
Starting point is 00:05:39 and we want to find a way to make our team work together effectively. We actually tried to, long story short, like we tried to buy a solution. We failed at buying a effectively. We actually tried to, long story short, we tried to buy a solution. We failed at buying a solution. We were forced to build a solution. So we actually, Atlin was never born to build, to sell as a product to anybody else.
Starting point is 00:05:54 We actually built it ourselves to make our team more agile and effective. Over two years, we ran 200 data projects on the tooling that we built at that time. And in that time, we made our team over six times more agile. And we realized that we'd build tools that were more powerful than we had earlier intended, right? Our team went on to, we did things like we built India's national data platform, which the prime minister himself uses.
Starting point is 00:06:18 It's one of the largest public sector data lakes of its kind. What was really cool about that project was it was built by an eight-member team in 12 months. It's also one of the fastest of its kind. So sort of realized that these tools could help data teams around the world hopefully be a little bit more agile and effective. And that's when, you know, Atlin was born. We said, you know, can we use these tools to help every data team in the world? Sure. Okay. I have to ask, this is so interesting because we love hearing about really diverse experiences of data. And when we think about subjects as big as, you know, fighting poverty and then apply sort of a data-driven mindset to that, could you just give us a little bit of insight into maybe like what's a specific
Starting point is 00:07:05 poverty related project that you worked on and what data were they not using? What data were you able to introduce and how did that change the project? That's just so fascinating. Sure. Yeah. So in some ways, actually, I think social problems are some of the most complicated data problems that can exist, you know, actually in business, because the outcomes are a lot clearer, right? It's you want to improve revenue and you want to reduce costs, versus like, you know, when you want to improve the quality of life of a human being, you know, like it's a much harder, like, you know, just like problem to model, right? And we saw this, maybe I'll give you one example with a project that's super close to my heart. We partnered with the national government, which was basically rolling out clean cooking fuel to about 80 million below poverty line women across India. India and this was actually so just to give you context on the problem people basically in or women in India and in rural areas and below poverty line they actually use sort of this natural cooking fuel in their in their house firewood basically which which is equivalent to smoking like 400 cigarettes an hour or some crazy number like that. It's crazy. And so obviously the government wanted to solve this.
Starting point is 00:08:28 They were rolling out like cooking fuel programs. So these were gas cylinders that were free that were going to these below poverty line women. And, you know, we rolled out the program and, you know, there was like initial operational monitoring and, you know, we put in place data systems for that. The program rolled out really fast and really well. And then we started hitting this challenge,
Starting point is 00:08:47 which is that while the penetration of gas cylinders was increasing significantly, cylinders need to be refilled, right? So, and typical, the stations for gas cylinders were only in urban India because there was no penetration or demand, right? And the government was creating this like very rapid demand because of what they'd done. Now, this was a super interesting problem because the person who runs a gas station
Starting point is 00:09:15 is actually an entrepreneur. So it's a decentralized model and it's privatized. Now, the entrepreneur obviously cares about this being profitable, which makes sense. On the other hand, the government wanted to create access. So what the problem statement that they gave us was, or the minister at that point told us was, I would like a gas cylinder station to be within 10 kilometers of every single Indian's home. And so now you have this like really unique problem where you're balancing accessibility with profitability. And so how do you do that the right way in some ways, right?
Starting point is 00:09:54 And so, for example, what we ended up having to do, it took us a bunch of iterations to do this. Like, do you do top-down allocation? Do you do bottom-up allocation? You're talking about 640,000 villages. So what we ended up doing was we actually turned it into a geospatial modeling problem, brought together data from about 640,000 villages,
Starting point is 00:10:13 got about 600 data sets in, so population, affluence, like, you know, a bunch of those parameters. We layered market data on top of that. Where are the existing gas stations and cylinders? Like, where is there already access? And that basically? And that basically got out of our clustering algorithm. And then the rest of the think people are going to be willing to pay. And so every cluster was actually a different size in some ways in terms of like the distance that it was covering. And then use that to basically figure out where you should go open these next 10,000 gas stations
Starting point is 00:10:56 across the country to actually solve for both profitability and accessibility. Right. And so those are just some examples of the kind of yeah, and modeling kind of challenges that we have to deal with. Yeah, no, super fascinating. That's really helpful. It is wild to think about that because I mean, just off the top of my head, I mean, you mentioned geospatial, but, you know, economic modeling, the demographic component of it, the socioeconomic component of it, which is, you know, pretty wildly different data sets. Interesting. Okay. So you're dealing with issues like that. Let's talk about what does Atlan actually do? So like what were some of the, you, you talked about, you know, okay, you get the call from someone who says the
Starting point is 00:11:43 dashboard doesn't look right. But what does it look like for a team to use Atlan and how does that make them more efficient? Sure. Yeah. So let's jump in on some of those problems I talked about, which are pretty commonplace in most data teams around the world. And if you think about these problems very deeply, you realize that the place it stems from is actually this fundamental reality of data teams, which is diversity, right? Data teams are diverse. To make a data project successful, you need an analyst, an engineer, a scientist, a business user, machine learning researcher, analytics engineer. All these people are very different. They have their own persona
Starting point is 00:12:22 types. They have their own DNA in the way that they work. They have their own tooling preferences, and they also have their own limitations. And while this diversity in some ways is our biggest strength, it's also our biggest weakness because a ton of the challenges that I talked about, like come from the fact that all these people need to sort of come together and collaborate, but they all have different contexts that they're operating in the ecosystem. And so at Aspen, we sort of see ourselves as a collaboration layer for the modern data team. Every time there is a function inside an organization, right? Engineering teams have a GitHub, sales teams have a Salesforce. What does it take to create that true collaborative hub for a modern data team? Knowing that the only reality in the data team is diversity. So the place we operate
Starting point is 00:13:07 in is if you think about the fundamental modern data stack in some ways, which is your data ingestion and warehousing and transformation and BI, that's what I think of as the data stack. Atlant sits on the metadata plane or the control plane layer of the data stack. We bring in metadata from all of your different tools in your ecosystem. We bring that together, put it together to essentially start creating intelligence and signals, make it super easy to discover data assets and so on and so forth.
Starting point is 00:13:37 But most importantly, we actually use this to start driving back better context into the tools that you're working in daily, right? So for example, when I am in a dashboard or a BI tool, I want to know, can I trust this dashboard? But the truth about whether you can trust this dashboard is actually in the ETL tool. And it's in like, did the pipeline get updated today or not, right?
Starting point is 00:14:00 And or did the quality check run and did it pass? That's the metadata that Atom brings together. We make sense of it. We construct auto lineage. We basically make sense of your entire data map in some ways and create that single source of truth. But then we take that back into tools like BI tools, into Slack, into collaboration hubs, into GitHub, into tooling like that to actually make the day-to-day workflows of teams significantly more simple. Prokalpa, I have a few questions because you have mentioned some very exciting topics. I'd like to start from the people.
Starting point is 00:14:37 You mentioned quite a few times about the diversity and the complexity of the data teams, right? Now, us coming, let's say, from the more like technical side of things and the data engineering, when we talk about data teams, we keep on forgetting like all the different stakeholders that are part of these teams, right? We focus a lot on the engineering persona, talking about data engineers and maybe sometimes also analysts. So can you give us a bit of, based on your experience, a description of how a data team, a functioning data team, in the realistic one, usually looks like? And what are the personas involved there? Wow, that's a loaded question. I wish there was a way a typical data team functions, right? And I think that's the reality that, you know, every team is diverse.
Starting point is 00:15:33 Like every team is unique. And teams also evolve over time, right? And so I think this is a classic, like we've seen right from, you know, fully centralized data teams to fully decentralized data teams, to all kinds of hybrid structures in the middle, right? We're seeing, you know, you know, we're seeing, we're increasingly starting to see like, sort of, for example, some functions like data platform and an enablement, which in my mind is a new form of governance, right? Like there's, there's centralized functions, and then there are decentralized functions, which is, you know, pod structures with analytics engineers and analysts, you know, and I think what I've realized over time is that there are, you know, four or five
Starting point is 00:16:15 different ways that you can structure your data team. I also am a very big fan of not fitting people to JVs or fitting people to structures, instead actually building a structure that works for your team. Because the reality is that there's a lot of overlap, right? If you think about like the skill sets, like the skill sets, like the fundamental skill sets from an analyst to an analytics engineer, to a data engineer, to a machine learning engineer, that you're actually talking about overlapping skill sets. It's not, you know, it's not black and white. And in a lot of ways, it has to do with the person in some ways, like I've never met a perfect data scientist, like, you know, I've never met a public, I don't think that exists. And so I'm actually
Starting point is 00:17:00 a very big fan of this, this method methodology of actually starting at the fundamental skills and building roles around people. And then, you know, in some ways, the structure of your data team gets get structured on the basis of your leaders, right? And how does that how do your how do your leaders interplay with with each other? And what are their skill sets? That's, that's, I think, you you know i wish more people would adopt it because i think that's really the only reality you know in in in a data team yeah that's that's a great point so you know like companies usually do not start with a data team right like when you incorporate and you start like a new project or a new company you don't really have the resources or even the need for a data team. There is a certain point in the lifecycle of the company that you will start needing
Starting point is 00:17:52 that. Based again on your experience, because you mentioned having a core set of skills and then building on top of that. What is this core skill set that is required for the people to create this first data team in a company? Yeah. So I believe that
Starting point is 00:18:14 the way to think about this, and I think every startup founder, like, in fact, I actually have a blog on this, which is, you know, how do you go about prioritizing this? Because I actually get a ton of questions
Starting point is 00:18:22 from like startup founders who are like, oh, we want to invest in a data team. Where do we start? And what I typically ask them to do is actually say, okay, I think you should think about this from a strategic perspective in terms of what do you want your data team to achieve in the first place? And so to give you an example,
Starting point is 00:18:42 and I think this needs to start at like, what is this biggest strategic priority of the company? Because let's say I am starting a hyperlocal delivery startup, or, you know, something like, you know, car, like an Uber equivalent, for example, right? Maybe what's what I thought with maybe the most important thing when I'm starting on day zero is just operational analytics, I just need to know, you know, how many rides are we serving and, you know, things like that. But right after that, probably, or even, you know, at that point, probably the most important thing that for the business to end up actually becoming the matching algorithm, which is actually a pretty
Starting point is 00:19:19 complicated data science problem, right? So on day zero, you're not just starting at analytics, you're also probably starting with data science and investing in data science problem, right? So on day zero, you're not just starting at analytics, you're also probably starting with data science and investing in data science, so that you can actually solve like data science, the fundamental part of your product in your business. And so on day zero, when you're investing in your team, you're probably going to try and find a leader, or an initial team, you'll probably start with like an analyst and a data scientist who can stretch, and then you'll build out those two an analyst and a data scientist who can stretch. And then you'll build out those two teams like that. On the other hand, let's say you're a software startup and you're selling SaaS, for example.
Starting point is 00:20:01 Now, when you're selling SaaS, operational analytics is almost what you need to work really well. Up until you get to like a relatively mid-sized company in some ways, right? Like you want to invest in product analytics, you want to invest in sales analytics and sales ops. And so then in that case, for example, you probably just want to invest in a really strong analytics leader, maybe someone who comes with domain expertise in SaaS, because SaaS is complicated in the way that, you know, the domain the domain itself works. And you don't need data science at all, up until maybe much later in the company when you decide to build a product using all the data that you've collected in your SaaS product, for example. And so I think that is the nuance in some ways, which building a data team and a structure, I think you need to start
Starting point is 00:20:44 at the first principles of what you're trying to optimize for as a company and then from there figure out what are the skill sets that you need your data team to have on day zero yeah yeah i think you gave some like super super valuable advice here and it's a very interesting perspective on how building teams like i don't know how many times i've seen you know sas companies at an early stage and be like okay we are struggling with attribution for example let's find a data scientist to do like some magic of course it fails at the end but anyway that's that's the topic of another episode that was great like i really i really appreciate like
Starting point is 00:21:21 sharing this information with us. So you mentioned at some point that using, let's say, the platform that Atlant is today, you became more agile, right? And agile as a term in software engineering has a very specific meaning. And usually, the easiest way to explain what agile is like you give like the counter example of waterfall right uh but that's like in software what is agile in data what does it mean i become more agile with working with data is it the same thing as software or is it different yeah so i mean i think at a high level i think we thought about you know as we thought about like how do you measure agility in some ways like we we sort of thought about this as as velocity and
Starting point is 00:22:11 in some ways and you know how you know how can we get stuff done but also at what level of quality can you get done at what level of how can you reduce the iterations that you need in your work when something changes change requests are a really important part of like a data team's job right and when someone tells you oh yeah the dashboard looks great that metric looks great but can you just like make this one change to it and add this pull this one number additionally to it you know only your data person knows how difficult it is to like go and get that one number to pull into that dashboard, right? And so how do you, how can you build your entire pipeline in a way in some ways that can give you that kind of reusability and reproducibility to
Starting point is 00:22:56 be able to like manage change requests in some ways, right? So I think all of those are components that are going to agility. To answer your question on is agile the same as software engineering? Absolutely not, right? Software engineering is a very different practice with the fundamental, actually the one fundamental that's different between software and data is that in software, humans create code. So that fundamentally changes the equation because in data, we can't control the data that we are working with in most cases right and i think that itself is like a fundamental paradigm shift between software and data second in software often you already know what you're broadly going to build and you know what you have to do like it's it's much more execute it's much easier to measure execution, right? And quality of execution.
Starting point is 00:23:47 Versus in data, many problems are exploratory in nature, right? Like let's say it's an exploratory, like why is our ARR number dropping? Like that's an exploratory analytics project. Like how do you know? And solving that, like it's really difficult to scope a problem like that on day zero, right?
Starting point is 00:24:02 And so I think those are things that are fundamentally different between software and data. And I think that's why it becomes very difficult to just say let me pick agile as a framework it works in software engineering and i'm just going to bring it bring it into into data and so i think a few things that for us were were useful were we we basically tried to take best case practices but not just best case practices from software engineering right we also took best case practices from you know like like lean manufacturing and devops and you know like there are so many like data itself is such an
Starting point is 00:24:36 interdisciplinary team so in some ways you can take like learnings from a bunch of product teams for example like something i'm really really bullish about is this idea of going from like almost like a data service team where you're just like servicing requests to a data product team where you're you know a product team for example is building for your end users the your success is measured on whether your users at the end of the day use the product the same way like can you actually think about your data products, right? And can you measure yourself on success rather than just like closing out, you know, a service request? So I think all of those components are things that we should learn from as a data community
Starting point is 00:25:14 in some ways and build what our practice of agile or, you know, people call it data ops should look like in the ecosystem. Great, great. That's super interesting. And again, another very good definition. And it's good to make, let's say, clear the differences because especially like, you know, like many people, especially like data engineers,
Starting point is 00:25:37 they come from software engineer background, right? And they have been exposed in like very specific semantics around what each thing means, right? Like, for example, agile. So understanding the differences between what it means to be agile when you work with data and what it means when you work with software, I think it's really important if we want to increase, let's say, the quality of the work that we manage to do at the end. I'll keep in the same approach of trying to redefine terms.
Starting point is 00:26:09 You mentioned DataOps, right? Again, Ops is not something new as a term. We have DevOps, we have SREs, we have RevOps, we have everything else. BizOps. BizOps, exactly. MarketingOps, yeah. exactly so right why why do we and why do we need it yeah so i think at a higher level i think the way i think about data ops is it's it's a it's really a principle or a almost like a way of doing things i know it's caught in like a way of doing things. I know it's caught in like a lot of, it's a buzzword now and it's gotten a lot of attention.
Starting point is 00:26:50 And there's a lot of products that claim to be a DataOps platform and a DataOps product and like all these other things. But I actually don't think that that's what DataOps is, right? Like DataOps is fundamentally about saying, how do we take you know the principles of agile and devops and lean manufacturing and and all of this and bring it into a fundamentally collaborative practice that helps data teams work together effectively it's built on the foundations of collaboration reproducibility you know how do you ensure that your your data assets are reusable and reproducible? It's built on, you know, foundations like, you know, self-service and self-serving, right?
Starting point is 00:27:35 How do you create something that is that where you're reducing the dependencies on the core data team? I think those are some of the elements of what DataOps means and can create. For example, in our case, like we actually created like something that we call DataOps Culture Code, which is about, you know, what does implementing a DataOps culture truly mean inside organizations? And I think that's the way we need to think about these concepts.
Starting point is 00:27:54 I think, you know, be it DataOps or be it the Data Mesh, for example, these are all design principles. These are ways of doing things. These are not, technology is just a part or an enabler in solving these problems. But it's a broader principle that we're working towards. All right. So I think enough with terminology.
Starting point is 00:28:15 Let's get into the technology now. So, all right. We have figured out what DataOps is, why we need it, how we build such a platform. What do we need in order to... Actually, no. Before we go to this question, I have another question. Sorry. Which I think is going to help us with this question.
Starting point is 00:28:37 And this question is about the data stack. We keep talking a lot lately about the modern data stack. We have a panel here trying to define what this thing is, why it is modern, when it stops being modern and it's not modern anymore, what's going to happen in the future. No, no, post-modern. Exactly.
Starting point is 00:28:58 So let's, I mean, I'll try to avoid the controversial conversations around it, but we need a stack, right? In order to work with data, there are some architecture that needs to be in place, some minimal kind of pieces of technology that we need to work and operate. So based on your experience, two parts of this question. First, what's the minimum set of data stack that a company needs to have in place? And second, what is the minimum, let's say, data stack that you as Atlan need in order to go and operate and deploy your data ops platform? Sure. Absolutely. Yeah. So I think the way I think about it is broadly a bunch of original,
Starting point is 00:29:40 like as I think about the data plane or the data stack itself i broadly think of it as a few building blocks right the first around just first collecting your data in the first place right this is where you know your you have data ingestion you have you know cdps and you have essentially what does it take to actually even bring your data together in the first place and collect the data that you need i think the center stone of every data platform in some ways is the storage and the processing layer. And there's a bunch of different architectures that you can use, but it could be your cloud data warehouse or your cloud data lake or your lake house or, you know, whichever of
Starting point is 00:30:19 those architectures you're picking inside the org. But that I think is the center stone in some ways. Then there's transformation. How do you take, you know, how do you go from rod to like, you know, bronze, silver, gold, and so on. So that's the third layer that I'd say. And then the final is what I call the application layer.
Starting point is 00:30:36 That's where I would say the BI tools sit. And then depending on whether you're a data science organization, maybe some data science tooling, like Jupyter, for example, sits. I'd say that's, in my mind, what forms the core data stack. It's at that point that I think once you have the data stack or the basic data stack, which is like, say, these three or four tools, there's a bunch of others that I'm not mentioning,
Starting point is 00:31:00 but this is like, you know, the minimum viable data platform. I think it's at that point that tools like Atlan start becoming helpful, where we say, hey, we're building that metadata governance plane, in some ways, for your data stack. For us, for example, a typical customer who brings us in
Starting point is 00:31:17 has implemented something like a Snowflake or a Databricks or an AWS data platform in the last, say, 12 or 18 months. They already have set up their initial BI. They've solved some of those initial problems with data. And that's when collaboration chaos becomes a reality. That's when they start realizing, hey, we hired the first few sets of analysts, but
Starting point is 00:31:41 hey, my new analysts are not productive at all, because they don't know what data they should be using, and things like that. And those problems start becoming real. Is there a minimum size of team that you have observed that usually exists when Atlan becomes relevant? So we typically see that somewhere around that 10-member data team size
Starting point is 00:32:06 is where the problem starts becoming a real pain. That's when you're dealing with a really sizable chunk of your data team, like over 50 percent of your time, actually probably being spent on issues like this. Interestingly, we also see a bunch of data leaders, which is interesting now because you actually have people who worked in larger teams who are now going in and setting up teams at early stage startups. And we have teams that are starting out with Atlan much earlier, because we've started seeing data leaders say, hey, we've gone through the chaos of not implementing this and then having to figure it out at a later stage, and we know how painful it is, so we just want to get it right from day zero. We don't want to have to fix our problems when we grow. And so we do definitely see earlier stage teams starting to adopt a lot of the practices that we
Starting point is 00:33:00 recommend. For example, we talk about things like, how do you think about your data assets as data products, and what does that mean? How do you create shipping standards on day zero? How do you create a documentation culture on day zero? These are all things that we think about as practices inside the team. And we're starting to see people actually adopt this almost at day zero, rather than necessarily wait till the problem becomes a real pain. Yeah. I have a question on that, Prukalpa, because in an ideal world, all of us working in data would love it if companies were constantly looking six months ahead
Starting point is 00:33:39 and were implementing processes and tools that would make their future data stack and data team operate more easily. But in the real world, for most companies, especially as you're scaling and dealing with data and putting out fires and adding that one number to the dashboard, it's really hard to anticipate what things are going to be like in the future. So I'd love to hear you speak to someone who says, okay, I'm already experiencing that pain, right? Like, we have a pretty robust data team and stack, we have a data science and machine learning practice, or we're starting that journey. So if you do have to go back and sort of solve the pain after things have reached a tipping point, where do you start? Like, which
Starting point is 00:34:33 discipline do you start with, within your definition of data ops? Right. Because, I mean, there are so many things, right? It's like, okay, do we start with governance, or do you need to solve cataloging before that? Or, you know, lineage? There are multiple components of this that Atlan solves, but what's the starting point? So I think the best way to think about this, in some ways, is what I think of as the journey that a data team will take, right? And people bring Atlan in at different points in that journey. That depends on how they think about agility, how forward thinking they are, and how much they plan in advance. And that's different, right?
Starting point is 00:35:09 Different teams operate differently. But, for example, the way I think about it is, when you've just started your data team, let's say your data team is pretty early, it's a pretty small team, the first set of problems that you're probably going to start solving are pretty simple things. So it's going to be things like, do we all agree on the same metric definitions? And how do we measure the metrics?
Starting point is 00:35:34 It's going to start there. And when you're that early stage data team at that startup, you're mainly focused on saying, how do I help my business users or my business stakeholders start to trust the data, start to trust me, start to trust that they should make data-driven decisions? Those kinds of things. That's where you're starting. Very quickly, what starts happening is that people actually start relying on the data team and start sending what I think of as service requests to the data team, right? So you start out with maybe helping out in the monthly business review and the quarterly business review,
Starting point is 00:36:12 a bunch of requests start coming to you now, a bunch of ad hoc requests, and the early data team says, okay, we can't handle this anymore. We need to hire new people. This is when your data team starts growing. And at that point, I think the biggest challenge that data teams have is productivity. It's really hard to get new analysts up to speed. The typical average time that an analyst stays in an organization today is 18 months, and you're spending six months onboarding a person, in some ways, in that time, right? And so I think the biggest challenge
Starting point is 00:36:45 you start facing is analyst productivity, or, this is also true for data scientists, basically any data consumer's productivity, in some ways. And that's where you want to start solving these problems. So that's where things like data discovery, data lineage,
Starting point is 00:37:00 context or tribal knowledge around your data, and data documentation start becoming a reality, and investing in that becomes super important. Now, there's a point where, even if you improve the productivity of your data team, and hopefully your data team is doing much better, the reality is that the requests your data team gets are always going to be much, much more.
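The gap Prukalpa describes, where requests to the data team always outgrow the team itself, can be made concrete with a toy model. All the numbers below (hiring rate, request growth, per-analyst capacity) are illustrative assumptions, not figures from the episode:

```python
# Toy model: team capacity grows linearly (one hire per quarter), while
# inbound requests compound (20% per quarter). All rates are illustrative.
REQUESTS_PER_ANALYST = 30  # assumed ad hoc requests one analyst handles per quarter

def quarters_until_overwhelmed(team_size=5, requests=100.0,
                               hires_per_quarter=1, growth=1.2):
    """Return the first quarter in which demand exceeds team capacity."""
    for quarter in range(1, 41):
        team_size += hires_per_quarter  # linear head-count growth
        requests *= growth              # compounding demand growth
        if requests > team_size * REQUESTS_PER_ANALYST:
            return quarter
    return None

print(quarters_until_overwhelmed())  # the linear team falls behind in quarter 8
```

However the constants are tuned, the compounding term eventually wins, which is the argument for shifting from servicing individual requests to building reusable data products.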
Starting point is 00:37:31 The demand is going to be much, much more, no matter how hard you try to scale your data team size. Because the reality is that you can only scale your data team linearly, and it's likely that you're going to start getting exponential requests. So that's sort of the time where we see data teams go through a mindset change, from building data services to almost a data product mindset, in some ways. If you think about the difference between services and products: with services, you're servicing a single request; with a product, you're basically building something scalable that everybody, or a good chunk of users, can use. And so it takes a little bit of upfront investment on day zero, but as you go along the way, over time, you're actually reducing a ton of the repetitive requests that your team is getting, which saves you a ton of time so that you can actually build new things. At that time, the priorities start becoming a little different for data teams, right? And so that's where we start seeing people say, how do I start looking at insights as an asset, or queries as an asset or a product? In Atlan, for example, there are two ways
Starting point is 00:38:35 you can use that: one is the Atlan UI and interface itself, but Atlan also has a ton of APIs and apps that you can build on top of Atlan, which you can connect into your CI/CD pipelines, and which you can connect into your downstream tools, which could be your BI tools and so on and so forth. And so that's where we start seeing people leverage a bunch of those kinds of capabilities. And then the final layer is starting to truly create that self-service environment, right? The holy grail that every data person wants is to truly enable self-service in the organization. And at that point, you're actually starting to expose a bunch of your data products to
Starting point is 00:39:16 your end users or your business users directly. And that point is where things like governance start becoming a reality. I always think about it this way: democratization, as much as it's a buzzword, and governance are two sides of the same coin, right? The more people are getting access to the data, the more you're starting to think, who's accessing my data?
Starting point is 00:39:35 Are the right people accessing my data? PII, those kinds of things start becoming a reality. And so I sort of see this as a journey, and the question really comes down to where you are in this journey. So, for example, for teams that adopt us much later in their cycle, when they're a much larger team, governance is a priority on day zero itself, just because of where they are in their journey. Versus if you're a much earlier stage team, you're like five people,
Starting point is 00:40:05 and you're thinking about access control and security? That's super unlikely, right? And so I think that's the way we think about it. Yeah, it makes total sense. Sorry, Eric, but I have a question that I've pretty much had from the beginning, and I think now is the right time to ask it. Go for it. Yeah. So we are talking a lot about enabling collaboration and healthy collaboration
Starting point is 00:40:21 between people, and all these things. We're talking about the data stack. You gave a very good description of the complexity of a data stack, even the minimum viable data stack, right? It has many moving parts. So I wonder: in order to build a platform like Atlan, you need to be able, on a technology level, to collaborate with all these different parts of the data stack, right? You need somehow to interact with them, pull some metadata, and I'd like to talk more about that a little bit later. How do you do that, considering that obviously each vendor only cares about their own problems, right? I don't think the first thing they think about is how they are
Starting point is 00:41:17 going to expose metadata, or APIs, or whatever, to tools like yours. So how does this work, and how much of a challenge is it today? Yeah. No, absolutely. I think we're actually doing a decent job as a community today of making it possible for you to get metadata out of the tools. Now, this is not true for the fringe tools in the ecosystem, where the use case is not as elaborate, right?
Starting point is 00:41:17 going to expose metadata or like APIs or whatever like to tools like yours. So how does this work and how much of a challenge it is today? Yeah. No, absolutely. I think we're actually doing a decent job as a community today in exposing tools to a pretty decent job of making it possible for you to get metadata out of the tools. So this is not true for the fringe tools in the ecosystem where the use case is not as elaborate, right?
Starting point is 00:41:49 But for the main tools in the ecosystem, it's actually okay. The thing is, you might have to do some work on top of that to make that metadata useful. That's a different thing.
Starting point is 00:41:56 And that's what like products like us focus on, right? In some ways. I think the true challenge is not the integration point as much as it's the diversity of the integration points. The truth in the data ecosystem is that the data stack is also evolving. So if the data stack was just, these are the 100 tools in the data stack, and it's going
Starting point is 00:42:19 to be these 100 tools for the next five years, that would be awesome, and a relatively simpler problem to solve. But you know what, I never thought I would be hearing about Firebolt even a year ago. But now you hear about it, right? And I think the data stack is changing so often, with new tools
Starting point is 00:42:41 getting added to the ecosystem, and that is going to continue to happen. In fact, I think after diversity, the only reality in data is change. And so, for us as a platform, we need to be truly agile to be able to actually support these integration points. Because if you want to be the true collaboration layer, the only way to do it is by supporting these integration points. So we turned that into a feature rather than a bug. The way we thought about it is, Atlan is actually built, behind the scenes, on what we call an open marketplace, which basically means that customers
Starting point is 00:43:18 can actually build these apps on top of Atlan, which allow you to build integration points, not just into the tools that we're pulling metadata from, but also into collaboration workflows and downstream tools that you want to integrate into, right? So, for example, if a team has a specific workflow that they use on Jira, and they want to build a metadata orchestration workflow off it, they're able to do that on Atlan as well. And that's the way we think about the role we play in the stack, in some ways. Yeah, okay. And I know we don't have that much time left, which is a good thing; it means that we need to arrange another episode at some point to keep chatting about that. But before we reach the end of our episode today,
Starting point is 00:44:06 let's talk about the metadata plane. Usually the two main terms that we hear are the control plane and the data plane, and suddenly we introduce a new term, which is the metadata plane. So what is this metadata plane? And what is a piece of metadata? Like, if you could give us an example
Starting point is 00:44:22 from a BI tool, or something like a data warehouse, that would be amazing. Yeah, sure. So let's start with what metadata itself is, right? I think the simplest way of describing it is data about data, in some ways. And what that means is, every one of your tools is generating data assets, and there is context that is created about each of these data assets. So let's pick, for example, your BI tools: you have context about usage, right? Which of these BI tools are getting used the most? Which of these dashboards that you're building are getting used the most?
Starting point is 00:45:11 At what time? By which users? That's metadata. Which data source, or which table in Snowflake, is connected to this dashboard? That's metadata. In your data warehouse, you can use your query logs to actually figure out, in some ways, lineage, and how different tables are connected to each other. That's metadata. In your pipeline, or your orchestration engine, you have metadata about what time the pipeline was updated. That's metadata. So the way I think about it is, metadata could be technical, metadata could be social, and it could also be about usage, who's using what, things like that. The standard forms of metadata are all the technical stuff, and the more you're able to bring that in and marry it with more and more types of metadata, that's really
Starting point is 00:45:59 where you're able to create what I think of as almost a single plane for all your metadata in the ecosystem. It's the same thing that happened with the data lake, actually, right? There was a time, in the big data world back in the day, when we were bringing data from a bunch of different places to dump it into the data lake, in some ways, to say, hey, you know what, we don't know what the countless use cases of this are going to look like,
Starting point is 00:46:31 but we know that this is valuable. And of course, we can talk about the implementation hurdles and the issues that happened, but if you think about it from the fundamental concept level, metadata also has a ton of different use cases. I think we've just scratched the surface of what those use cases could look like today.
Starting point is 00:46:50 Today in the ecosystem, we are talking about data discovery, or data lineage, or data observability. These are just one or two or three use cases of what metadata can do. In the future, you could be using metadata to auto-tune your data pipelines. You could be using metadata to actually cost-optimize your entire data management ecosystem. There's a ton of different use cases for what metadata can do. And so the way I think about the metadata plane is that it's the foundation, the control plane, to be honest, right? You bring in all of your metadata, and then you use it to drive these use cases. Governance and security and catalogs and discovery are some of them, but then there's a ton of other newer,
Starting point is 00:47:36 intelligent, operational kinds of metadata use cases that are still remaining to be discovered in many ways. So interesting. Well, we are close to time here, but I have one more question for you. And we like to get advice from our guests. And I think one really interesting experience that you've had is tackling these massive data problems with multiple different types of data. So going back to the beginning of our conversation
Starting point is 00:48:05 where we talked about clean gas and how that included geospatial data and economic data, what are maybe one or two of the lessons you learned when trying to face a big, sort of crazy data problem like that, that our listeners could learn from? That's a great question. So I'd say a couple of things. The first, and this comes back to, for me, probably why I'm building Atlan today, right? To me, it really just comes down to the team and the culture. I think that is the most important thing in being able to crack the most difficult data problems.
Starting point is 00:48:53 For example, in that team that I was telling you about, the clean cooking gas one: I honestly think we were probably the only ones, and I have not heard of a problem like that being cracked that way. It took us multiple iterations, three months actually, to get there. And the reason I think we were able to do it: even fundamentally, how do you think about accessibility versus profitability? It sounds simple today when you hear it, but it was not when you were
Starting point is 00:49:21 really trying to figure out how to do this for the first time in the world. And we had a development economist in the room. We had a data engineer in the room. We had a project manager who came from a political background in the room. We had all these very, very diverse people in the room, and I think that enabled us to rethink the problem from first principles in a way that a standard team, one that would just have had maybe analysts, a single kind of persona, would not have been able to. So that diversity is very, very important. And again, I go back to that example: we actually had a solution that had been signed off by our client, where it was not the ideal solution,
Starting point is 00:50:11 but it was like, you know, it was a top-down way of allocating. There are multiple ways to solve a data science problem, right? It was a top-down way of allocating where these gas centers,
Starting point is 00:50:20 which districts they should get opened in. And we still felt like it wasn't solving the access problem. We felt it solved the profitability problem, but it wasn't solving the access problem. And so, literally three days before the final presentation to the cabinet minister, I remember my co-founder and I were in the room, and my co-founder basically
Starting point is 00:50:38 listens to the problem, and then he's like, hey, so wait, this is not a profitability problem. This is actually an accessibility problem, a distance problem. So why are we not thinking about it from a geospatial perspective? And so we actually flipped the entire solution in two or three days. And that wouldn't have happened if we didn't have the diversity in the room. So I think that, to me, is the most important thing, and leaders should really strive to find a way to build diverse teams and have them work together. I think the second aspect
Starting point is 00:51:12 of that is trust. The problem with diversity is that it's really hard to build trust in diverse teams. Going back to the number on the dashboard breaking: I know we laugh about it a lot in the data space, but the reality is that at that moment, when the cabinet minister called me and said the number on the dashboard is broken, I couldn't answer his question as to why. At some level, the hard-earned trust that I had built with him broke. At the same time, when I called my data engineer, he said, I'm going to pull audit logs and check. At some level, I didn't know if the problem was that the pipeline really broke, or if my data engineer was messing up. And trust broke again. This creates such a deficit
Starting point is 00:51:55 in diverse teams. In most teams, say a sales team, the sales leader started out as a sales rep. Everybody does the same job in the team. Everybody has clarity. That's not the case in a data team. And so the second most important thing to build in a data team, to make it successful, is an ecosystem of trust. How do you help people trust each other? How do you help people trust the data that they're working with? That's the second thing I would invest in as a data leader. Incredibly wise advice. And we thank you so much for that, Prukalpa. And thank you for your time today. It was a great conversation.
Starting point is 00:52:30 Thank you so much for having me. This was a lot of fun. My big takeaway: we covered a bunch of topics, but I appreciate that Prukalpa returned to a theme that we've heard on the show multiple times. And it was so great to see her think through all of her experiences with data, building a data ops platform, and what she went back to as the most important thing in
Starting point is 00:52:57 solving data problems as a team. And I really appreciated how she said diversity is so important to have on a team that's solving a data problem, but it also makes the trust component difficult, because you have that diversity, right? People are coming from different backgrounds and skill sets, and have different responsibilities as stakeholders in the project. That's one of those things that I think we have all heard and known in the back of our minds, but to hear it articulated like that is always a great reminder. Yeah, a hundred percent.
Starting point is 00:53:36 If you think about it: when you build a company and you build a product, you build the product for a very specific persona. You have only one persona to keep in your mind. And even that is super hard, figuring out how to satisfy this one persona.
Starting point is 00:53:55 Now, if you put yourself in the shoes of a data professional, an analyst, a data engineer, whoever is a member of this data team: these teams have as customers all the different departments and functions that the company has, right? So they have to satisfy, by delivering services or products, all these different personas.
Starting point is 00:54:18 And that's exponentially harder to do. And of course you need trust. Without trust, you can't build anything, right? So yeah, I think that was probably one of the most important topics that we touched on during this conversation. And we don't usually talk that much about that
Starting point is 00:54:39 when we talk about data and the technologies around it, but we should spend more time. I agree. I agree. Well, thanks again for joining us and we will catch you on the next episode. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app
Starting point is 00:54:57 to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com. Thank you.
