The Data Stack Show - 50: From Data Infrastructure to Data Management with Ananth Packkildurai

Episode Date: August 25, 2021

Highlights from this week's episode:

- Ananth's background (2:51)
- The evolution of Slack (4:54)
- Kafka and Presto: two of the most reliable and flexible tools for Ananth (9:43)
- How Snowflake gained an... advantage over Presto (13:24)
- Opinions about data lakes (17:23)
- Core features of data infrastructure (23:22)
- The tools define the process, and not the other way around (31:30)
- Defining a data mesh (36:44)
- Data is inherently social in nature (40:31)
- Lessons learned from writing Data Engineering Weekly (49:14)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. Welcome back to the Data Stack Show. Today, we're talking with Ananth, who publishes the Data Engineering Weekly newsletter.
Starting point is 00:00:35 And my guess is going to be that a lot of you out there listening are already subscribed. If you haven't, definitely subscribe. Kostas and I both are avid readers, and it's a tool we use to keep up to date on the industry. And Ananth is a fascinating, fascinating data engineer who worked at Slack sort of in hypergrowth mode, which is really cool. I'm just going to try to steal out of the gate and get this question in before the conversation gets going: what was it like to be a data engineer at Slack in 2016 versus 2020? Because that period of growth for them is really mind-boggling in many ways. I don't have the numbers on hand, but I'm so fascinated to hear about that. So after I ask that question, what are you going to ask? Okay. First of all, I have to say that you are quite predictable, but it's a very good question.
Starting point is 00:01:36 So yeah, I mean, being at Slack at that time, both in the lifecycle of Slack, but also in terms of how the industry was back then in terms of data technologies and all that stuff, I think it makes a lot of sense to ask him that. And I think we are going to hear some very interesting things about that. For me, I'd love to get a little bit more technical and abstract at the same time with him, chat a little bit about architectures. What are the differences? Some products that he has seen that are quite important, and that we have used since he started working.
Starting point is 00:02:13 And how the data stack has changed. How predictable. All right, let's jump in and talk with Ananth. Yeah, let's do it. Ananth, we are really excited to talk to you today. I feel like we never have enough time to talk to any of our guests, but the list of things available to us to talk to you about is really, really long. We'll get right to it, but first of all, thanks for joining the show.
Starting point is 00:02:38 Great. Thank you for having me. Okay. We start every episode the same way, by just asking you to give a little bit of your background and how you ended up where you are today. So tell us about your background and what you're up to. Yeah, totally. I'm working as a principal data engineer for Zendesk, essentially overseeing customer-facing analytics and what we do in the CRM space in terms of analytics. Previously, I used to work for Slack, where I was kind of building the data infrastructure, the orchestration engine, all sorts of things.
Starting point is 00:03:17 So I have an engineering background and have been working in the industry for almost 15 years now. I'm pretty passionate about data engineering, happy to be talking about that. And you mentioned this in the intro, but you also publish the Data Engineering Weekly newsletter, which has grown significantly. And if anyone in our audience doesn't subscribe, you should absolutely subscribe to the Data Engineering Weekly newsletter. I get lots of my news personally about the industry and the space from it. And so congrats on that being a really successful newsletter. Yeah, thank you. We just crossed the 50th edition of the newsletter. So looking forward
Starting point is 00:03:50 to building more on top of it. Great. Well, we'll talk a little bit more about that just because, as content producers ourselves, Kostas and I have lots of questions to ask you; you're so productive with that. But let's start with Slack. Slack is a fascinating company, just an unbelievable acquisition by Salesforce, kind of mind-boggling. And you started there in 2016. And as I was thinking about this before the show, I actually remember, gosh, I want to say it was 2015 or 2016, when the company I was at switched from HipChat to Slack. And Slack was kind of, at least for us in our circles, it was kind of a new thing. And we were like, oh, this is awesome. And then you were there for four or five years. So you
Starting point is 00:04:40 got to see this unbelievable, unbelievable growth at Slack. Can you just describe to us what it was like when you started there, and then what it was like towards the end of your time, from sort of a data engineering perspective? Yeah, I think that's a good question. So what it meant in 2016 was that we had actually barely started building some of the base foundation for data infrastructure at the time. The good thing about Slack, as you mentioned, is that it grows exponentially. Whatever assumption we were making, in six months it would be invalidated. The scale at which it grows, and the ability for us to multiply the data platform at that pace so the business could keep innovating on top of it.
Starting point is 00:05:29 I think that was the big challenge that we ran through. I'm glad to have been there. A lot of learning, a lot of pragmatic decisions to scale the systems over the period of time. And in 2016, when you started, you said you were just barely building out the data infrastructure. What was the componentry there, and what were the initial needs you were trying to address? I know that the exponential growth meant that a lot of those things needed to be updated in a short amount of time. But what were the initial problems in 2016? Mostly, we had this analytics team, right? So now we have to
Starting point is 00:06:07 enable and empower the analytics team to build the reporting dashboards the executives need to understand our product usage, to understand our customer experience, that sort of thing. So how do we empower the analytics team or the data science team to be much more productive? When we started in 2016, you know, there was no Snowflake, there was no other mature database for us to kind of go and approach, and not much maturity on the ingestion frameworks either. You had to build everything: either you adopt an open source system or you build it yourself. And even if you adopt an open source system, it may not scale at some point in time, so you have to really, really dig in and do
Starting point is 00:06:51 certain things. So, certain components that we had at the time: we were running one 12-node Kafka cluster, and we were barely starting to use Airflow. We ran a single-instance Airflow for some time, and an EMR cluster. It was a pretty standard structure to begin with for any data infrastructure. Got it. Yeah. It's wild to think about Snowflake not being sufficient enough back in 2016 for Slack. It's just, it's crazy to hear that. And then just a quick fast forward,
Starting point is 00:07:32 what was the high-level architecture in 2020, towards the end of your time there? Yeah, I mean, the foundation remained the same, but it's more the scalability of the system that kept changing, right? The scalability: we could no longer run Airflow in a single box. At some point in time, we had crossed well past 1,000-plus DAGs and almost 30K tasks running per day. So we adopted a more distributed approach on the Airflow side of it.
Starting point is 00:08:05 And we were also increasingly adopting other vendors like Snowflake for certain business use cases. And we grew our EMR cluster to larger instances, and Kafka, from one 12-node cluster, had grown to three or four Kafka clusters with 80-plus nodes. So the system components fundamentally remained the same, but there were more and more security and compliance requirements that we had to satisfy, so we added additional layers on top to fulfill those. So it's largely scaling while the foundations remain the same. Because of the rate of business innovation happening at the time, and the speed at which we were moving, it would have been very, very expensive to change any foundational thing.
Starting point is 00:08:54 So we kept iterating, kept scaling these things. For example, we were running Airflow on a single box for a very long time. And one of the scalability problems with Airflow is that when there are sensors waiting for files to come or a partition to become available, it'll just spin off a process that keeps waiting, and it waits doing nothing. So we had a choice: either go to a distributed mode, or fix it and get a little more bandwidth. So we implemented a retriable sensor. At a later point, I think the Airflow community also adopted a similar pattern. So we always have to kind of understand what is really going on internally and then fix it to our need, to make sure that engine keeps running.
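(Editorial note: the pattern described above, a sensor that gives up its slot instead of blocking a worker while it waits, is roughly what the Airflow community later shipped as the sensor "reschedule" mode. A minimal sketch, assuming Airflow 2.x; the DAG id, file path, and intervals are hypothetical, not Slack's actual configuration.)

```python
# Hypothetical sketch of a non-blocking partition sensor in Airflow 2.x.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.sensors.filesystem import FileSensor

with DAG(
    dag_id="events_daily",                # hypothetical DAG
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    wait_for_partition = FileSensor(
        task_id="wait_for_events_partition",
        filepath="/data/events/{{ ds }}/_SUCCESS",  # hypothetical partition marker
        mode="reschedule",          # free the worker slot between checks
        poke_interval=300,          # re-check every 5 minutes
        timeout=6 * 60 * 60,        # give up after 6 hours of waiting
        retries=2,                  # the "retriable sensor" spirit: retry after a timeout
        retry_delay=timedelta(minutes=10),
    )
```

In the default "poke" mode, a sensor occupies its worker slot for the entire wait, which is exactly the idle process described above; "reschedule" releases the slot and lets the scheduler re-queue the check.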
Starting point is 00:09:43 Ananth, I have a question for you. You've been in this space for quite a while, and you've been through a very pivotal phase of the space that has to do with all the technology around data. If you had to choose, let's say, one technology, which you consider the biggest thing that happened these past, let's say, six to seven years, what would you choose? You mentioned a few different things.
Starting point is 00:10:10 You mentioned Kafka, EMR, Airflow. But if you had to choose just one, what would that be? That's a tough choice. Thinking back, I'd say over the last eight years, there are two technologies that have defined data infrastructure. One is Kafka; the other I would say is Presto. These two, with the reliability that they provide and the flexibility these two tools can provide,
Starting point is 00:10:39 I would choose either one of them; I would rate both equally. Okay. That's very interesting. It's actually one of the first times that someone has mentioned Presto in our conversations. Why each one of them? Can you give us a reason? Why Kafka and why Presto? Why these two? Yeah. So if you take Kafka, the guarantee that Kafka as a system provides is pretty simple. A very, very simple guarantee: you send the event, and we store it sequentially and give access to it sequentially. And you can scale that whatever way you want to, and it's much more reliable. So it eliminates a big need for us. Because everything in data infrastructure
Starting point is 00:11:26 that we deal with comes out of events. And when you have a solid system like Kafka that can scale, store, and process that information in a much more reliable way, I think that's a big win for the data infrastructure. The second thing I would say is Presto. Presto is kind of a federated query engine. In a typical data infrastructure world, there's two things.
Starting point is 00:11:52 One is the clear separation of storage and compute, the way Presto adopted from the beginning: it's just focusing on the compute, not necessarily on the storage part, and relying on all the other
Starting point is 00:12:06 storage engines and data formats to do its job. So the same principle again goes for Presto: they simplified the guarantee and the constraint of the system, and they do that very well. So having those clear boundaries established by those two systems, and having them do what they really do well, I think that's what makes those systems very powerful. I think that's the critical thing for any other system that gets adopted: stick to this boundary and do it very well. Yeah, absolutely. Actually, I found very interesting what you mentioned about Presto and the separation of computing and storage, and I want to ask you, because you also mentioned Snowflake, and Snowflake built the whole story around delivering exactly that, right? Like, "we
Starting point is 00:12:55 are the first data warehouse that separates storage from computing, and this has these and those benefits," and all that stuff. And that's also my belief, to be honest. It's not a new concept, right? Presto, in a way, was doing something similar. Why do you think that, although we had Presto, today we are talking about Snowflake and not Presto? What made Snowflake win, let's say, in this market at the end? I would say I would give a little more credit to Redshift in this case. So I think in 2014, when Redshift was coming along, it changed the way people started to think about this. Hadoop had kind of disturbed the way, introducing the concept of whole big data processing: you can throw anything in and then process anything. And then quickly
Starting point is 00:13:45 people understood there was a mistake: you cannot just throw anything in without a structure, or your processing will not be as efficient as one would like. So then the shift happened: okay, let's store in Parquet or ORC format. And if you want to store that way, then you have to access it in a very structured way, and that's SQL, which Hive and other systems kind of popularized. And then we slowly went back to the traditional data warehousing style: why can't we do that with the database?
Starting point is 00:14:20 I think Redshift was the first system, in my opinion, trying to crack that, and maybe successfully: separating storage and computing, with the elasticity it provides to scale each independent of the other, learning from the mistakes of HDFS and the Hadoop world. And they built that very well. I think that is why the traction moved more towards managed services that adopted a similar strategy of separate scalability for computing and storage. As famously people say, right,
Starting point is 00:15:12 life is too short to scale both computing and storage at the same time. So I think they got in at the right moment, the right time, with the right challenge, the right business problem at the time in the industry. Yeah. And that's my feeling as I discuss this more and more with people. I think in a recent conversation that we had with another guest, he said about Snowflake that Snowflake is like the Apple of databases, or of data technologies. And like with Apple, you get the iPhone. That's exactly what he was saying, and I found it very interesting: I get the iPhone, and I keep getting and buying the iPhone. I don't know exactly why, but I like it. Okay. And I think that what Snowflake did very well compared to other technologies was also the product experience itself. It wasn't just, okay,
Starting point is 00:16:02 we separate storage from compute. Yeah, that's something that we have done. In a way, Hadoop was also doing that, right? You had HDFS, and then you could run Spark on top of that. So it's not like this is a new concept, but it's the actual product experience that Snowflake provides, especially if you compare it with Redshift, right? Because back then we had Redshift, the dominant data warehouse, and then Snowflake came. And Redshift, compared to Snowflake, was too much manual labor, right? Vacuums, deep copies; to rescale your cluster,
Starting point is 00:16:39 you had to have downtime. And suddenly you went to Snowflake and you didn't have to do any of these things. So I think the product experience, at least, that they offered to the customers was amazing, especially for this product category, right? That's, I think, something that they did very well. You mentioned something about the concept of taking all the data, regardless of the structure, trying to process it, and how this generates a mess. What's your opinion about data lakes and the data lake as an architecture?
Starting point is 00:17:08 Because it kind of has, let's say, this approach, right? The whole idea is: we have a file system, let's throw everything there, then create a first layer of some structure, then another layer, and then start processing. What's your opinion about data lakes? Yeah, that's a good question. So, data lakes: I would say we should not just implement a data lake for the sake of implementing a data lake.
Starting point is 00:17:35 It all depends upon the nature of the data source that you are dealing with. So I would like to categorize data sources in two different ways: a managed data source and an unmanaged data source. What I mean by that is, I would say, a controlled data source and a non-controlled data source. A controlled data set, in the sense of, take the example of a company like Slack: the user-generated events, the interaction events, are generated within the product experience, whatever we're capturing. It's completely a controlled environment
Starting point is 00:18:11 because it's within the scope of the company's systems, so they can define the structure upfront. If you have controlled data production capabilities, I don't see any reason why you should not adopt a structured approach to dealing with the data. But that cannot always be true
Starting point is 00:18:32 where there are scenarios in which the data source producing the data may not be within your control. At that time it makes sense to take the data,
Starting point is 00:18:44 like satellite imagery or any other information that you are getting from a third party, put it into a data lake, and then apply a structured approach on top. So it's not one-size-fits-all. I'm sure you're aware there's a lakehouse concept coming along, the Delta Lake and Apache Hudi and Iceberg kind of systems. I feel like these two approaches will go hand in hand,
Starting point is 00:19:00 depending upon the nature of the source and the nature of the data that we are capturing. Yeah, I completely agree with you. And actually, it's a space that I'm really paying attention to. There are so many new technologies out there. You mentioned Hudi, you mentioned Iceberg.
Starting point is 00:19:19 I'm pretty sure that all of them have received funding, or they are going to receive it if they haven't already. So we are also going to see quite a few new companies in this space. So I think it's going to be very interesting to see what kind of products will come out of this, let's say, merging of the data lake concept and the data warehouse concept, which is something that we see also with Snowflake, right?
Starting point is 00:19:45 Like, Snowflake started as a data warehouse. At some point, they changed the narrative a little bit: they started saying how you can also build a data lake on top of Snowflake. And now they're talking about the data cloud, which is an even more expanded concept than the previous one. So I think this whole category is still under definition
Starting point is 00:20:02 and it's going to be very interesting to see where it goes and what will happen. One of the things that I'm very passionate about, or interested to watch what is going to happen with in that space, is something Snowflake announced called data sharing: you can share certain, maybe publicly available, data
Starting point is 00:20:23 or any other data across different companies. We see more and more companies adopting SaaS products. If I'm a retailer and I want to open a shop, I can use Shopify, or I can accept payments from Square or Stripe. If I want to run a business, I can literally use a mesh of SaaS technologies to build my business over a period of time.
Starting point is 00:20:50 Now, again, the question will come: how do I get the intelligence out of it? Each and every system is very good on its own, but how do we get the integrated view out of it? I think that's why data sharing, and how far it's going to emerge, and obviously CDP platforms like RudderStack and how much they're going to disrupt this space, is something that's very interesting
Starting point is 00:21:09 to watch. Yeah, absolutely. That's very interesting. And that's a very good point. Actually, I remember when I was reading the S1 filing from Snowflake, one of the first things that they were talking about was these data sharing capabilities and these data marketplaces, all these, let's say, interactions around data sharing, and how important they are for their vision, especially because, based on what
Starting point is 00:21:38 they are saying, at least, they generate very strong network effects. So from a business perspective, we are talking about a completely different game if you manage to implement something like this and you actually manage to get people to use these kinds of capabilities on top of your data warehouse, data cloud, or whatever you want to call it, right? So yeah, absolutely.
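(Editorial note: for readers curious what Snowflake's data sharing primitives look like in practice, here is a hedged sketch driven from Python. The share, database, table, and account names are hypothetical; the statements follow Snowflake's documented CREATE SHARE / GRANT / ALTER SHARE flow.)

```python
# Hypothetical sketch: publishing a table to another Snowflake account via a share.
import snowflake.connector  # assumes snowflake-connector-python is installed

conn = snowflake.connector.connect(
    account="my_account",    # hypothetical account identifier
    user="me",
    password="...",          # placeholder credentials
    role="ACCOUNTADMIN",
)
cur = conn.cursor()

for stmt in [
    "CREATE SHARE metrics_share",
    "GRANT USAGE ON DATABASE analytics TO SHARE metrics_share",
    "GRANT USAGE ON SCHEMA analytics.public TO SHARE metrics_share",
    "GRANT SELECT ON TABLE analytics.public.daily_metrics TO SHARE metrics_share",
    # the consumer account can then create a read-only database from this share
    "ALTER SHARE metrics_share ADD ACCOUNTS = partner_account",
]:
    cur.execute(stmt)  # each statement runs on the provider account
```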
Starting point is 00:21:59 I really want to see what's going to happen with data sharing. That's super interesting. Cool. So, another technical question. Actually, not exactly technical, more of an architectural question. You mentioned, when you were talking with Eric, that what has changed since the beginning of Slack until today is not so much the architecture itself; it's more about the
Starting point is 00:22:23 scale of the architecture, right? And from what I understand, at least, there are some standard components out there that are exactly that: they are standard. Now, you might implement them differently depending on your scale or your needs, but in the end the data infrastructure has some kind of standard structure. Can you give us your perspective on that? What does this architecture look like? What are these components? And then we can discuss a little bit more about each one of these components.
Starting point is 00:22:49 So, I mean, it's again been like one year, so whatever I'm telling you right now might be invalidated; this is where I used to be, so just take that into consideration. Yeah, sorry for interrupting you. I'm not talking that much about Slack specifically, but in general, from your experience as a data engineer, so it doesn't have to be something specific to Slack itself. But I guess that there is some kind of, let's say, more generic architecture, some patterns at least, that we find in every company, right? Yeah, totally. So, one thing that I have found: there are two parts to that. One is the data infrastructure perspective, and the other is
Starting point is 00:23:30 the data management perspective. The standard component I've found in most companies is Kafka: obviously, taking all the events and streaming those events to Kafka. In order to stream the data, there's a common approach across most companies: they're using an Avro or Protobuf or Thrift structure, and some kind of agent running on the individual machines to capture that data and send it back. That approach I've seen very predominantly.
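(Editorial note: a minimal sketch of the agent-to-Kafka pattern just described, assuming the confluent-kafka Python client. JSON stands in for Avro/Protobuf/Thrift to keep the example dependency-light; the broker address, topic, and event shape are hypothetical.)

```python
# Hypothetical sketch: serialize a product event and append it to a Kafka topic.
import json
import time

from confluent_kafka import Producer  # assumes confluent-kafka is installed

producer = Producer({"bootstrap.servers": "kafka-1:9092"})  # hypothetical broker

event = {
    "event_type": "message_sent",        # hypothetical product event
    "user_id": "U123",
    "ts_ms": int(time.time() * 1000),
}

# Keying by user_id keeps one user's events ordered within a partition --
# the store-sequentially, read-sequentially guarantee discussed earlier.
producer.produce("product-events", key=event["user_id"], value=json.dumps(event))
producer.flush()  # block until the broker acknowledges the write
```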
Starting point is 00:24:03 And the other part is ingesting the third-party data: marketing data, your Stripe, HubSpot, and whatnot. There I usually see people increasingly adopting SaaS solutions, maybe RudderStack or Airbyte or any other ingestion framework; they started adopting those over a period of time. In terms of where the data gets stored: S3, predominantly. In most of the ecosystems I see on AWS, data is getting stored in S3 for long-term retention, maybe in Parquet format, in columnar storage.
Starting point is 00:24:49 These are all just to get the data into the system. How do they access the data? From there, people may be using Snowflake or Redshift or any other storage database, depending upon the scale and the need. Presto and Spark SQL are predominantly used to do a lot of the SQL query processing. And that's mostly the infrastructure perspective. I think the orchestration engine is one of the core features of data infrastructure. I've seen very widespread adoption of Airflow, and obviously Prefect and Dagster are kind of gaining adoption. So the standard structure I see here is that there is an ingestion layer that brings the data either from third parties or from the local, internal systems.
Starting point is 00:25:50 And there is a vast store for long-term retention, and also a more efficient store after the metrics are computed over a period of time. The nature of data engineering, or the data pipeline, is that as we go downstream in the pipeline, the volume of the data narrows down, because as we go down the pipeline we tend to store more and more aggregated information, and aggregated information loses granularity over time. So I see more of the aggregated information getting stored in Snowflake,
Starting point is 00:26:18 and people adopting a hybrid strategy where raw data gets into S3 and aggregations slowly flow into Snowflake or Redshift. So yeah, that's pretty much the standard. That's interesting. And how
Starting point is 00:26:34 do you go from the raw data on S3 to the aggregations that are going to be stored inside Snowflake? Usually, what tools are used for that? So usually Snowflake has an S3
Starting point is 00:26:49 ingestor, and then Airflow and all the engines already support a Snowflake operator and the other aspects of it. So, some kind of tooling that can do a bulk insert to either Redshift or Snowflake. Because the more and more aggregation
Starting point is 00:27:05 happens in the pipeline, the more it narrows down towards the business logic, I would say a business domain, right? So a marketing team will want to have a set of aggregations that can then be easily accessed, to build their dashboarding and reporting and the other aspects on top of that.
Starting point is 00:27:24 So different teams are trying to understand the data at a different aggregation level, and I think that is where tools like Snowflake are very, very helpful: you know, the concept of cloning, virtual tables, and all sorts of things is very powerful, in that you can easily clone certain data sets to share across different domains. Makes sense. And okay, it seems that the data infrastructure, at least, is something quite well defined so far. What about data management and data governance? You mentioned that this is another important part. Do you think that data management is...
Starting point is 00:28:05 What's the state of data management today? Do you see technologies missing, products missing, or is everything in place? What about best practices? What's your feeling about that? So, I think data management usually comes as an afterthought. That's what I've found in most of the companies when I'm talking to them.
Starting point is 00:28:25 No one implements, straight away, "this is my data governance tool, let me bring the data in." There will be some kind of ecosystem already in place, and people are running some analytics, and at some point in time, either they go public
Starting point is 00:28:39 or some kind of GDPR compliance comes up, or they want to expand the market, and then they need data governance at that time. So it will always be an afterthought. That makes the adoption of a data management platform very, very complicated, because you have to find a solution that somehow integrates with your existing infrastructure. And that infrastructure might have multiple technical debts or multiple integration problems. I think that is the significant challenge of data management. Even with lineage
Starting point is 00:29:20 and data quality, I always think they are an afterthought in most data pipelines when they get started. That is a big, significant challenge. I don't think anyone has solved this problem; I feel like it will remain the challenge moving forward, and it will remain for some time, because it's hard to expect a business to think through that aspect when they're bootstrapping a company. The company will essentially look to solve the business problem, not necessarily data governance and other aspects of it. And between when they realize they need data governance and the tooling that is available in the market,
Starting point is 00:29:59 there will always be a mismatch between the system they build and the governance that tooling offers. I can give a simple example. Let's take a look at the existing open source tools available to implement data governance, like Apache Ranger, for example. It heavily relies on either a Kerberos or LDAP system being in place to define the role-based access and all those things. If I have to implement RBAC access, I have to manage Kerberos or LDAP and other access there, which is a pretty expensive thing to do in my system, and which the rest of the organization may or may not need at all. It's like a custom control. That is where the gap is, and how we bridge that
Starting point is 00:30:44 is going to be a bigger challenge. Ananth, one thing on data governance that I'd love your thoughts on, especially having seen it at scale at companies like Slack and Zendesk: the tooling is certainly one part of it, and I think it's just a really good point you made around the mismatch between sort of the needs of the company, especially in the early stages, and the tooling that's available, and how much time you invest in it. But we've also heard repeatedly on the show that the other side of data governance is really cultural inside the company, right? So you can have great tooling, but you really have to have a culture around it and sort of align teams around it. What's
Starting point is 00:31:25 your experience been on that front, not considering the tooling aspect of data governance? So I have a reverse thought on it. In my opinion, the tools essentially define the process, not the other way around. So if the tool is sufficient, you know, fulfilling enough to do that, I think that is a good enough process to put forward to implement data governance. I think of it the way I think of architecture: it's never the people's fault; it's always the system's fault when a system is failing. So in this case, I would say the tools are still not matching. That's why people are not able to follow.
Starting point is 00:32:05 Yeah, super interesting. I'd agree with that. I think the prime example of that is just how many companies use Google Sheets as their primary store for managing data governance, or at least alignment across teams, which sounds really crazy, but I think that's a symptom of what you're talking about, where the tooling really hasn't come around at an enterprise scale that allows people to do it. And an efficient tool always enables a workflow to be much more efficient. So let's say I want to implement data governance, and this is the workflow the company is following, this is how I generate the data. Let's say I use
Starting point is 00:32:50 Protobuf to generate the data sets. Then there should be a systematic way to say: hey, whenever you're adding a Protobuf message, add this tag that says PII, and we will take care of the rest. Which means it doesn't change any of my developer workflow or the way people access the system, and the tool is behind the scenes taking care of it. If you don't have such a tool,
Starting point is 00:33:16 one that meets the developer experience or the user experience where it is, it won't work. And that goes back to a point that we discussed about how Snowflake became successful. I think that's the bigger lesson that we need to learn: is the tool meeting
Starting point is 00:33:30 the user need? A tool that meets the user need and gives that user experience has a higher chance of being successful than one that just has you checking off tick marks for a period of time. Okay. Yeah.
Starting point is 00:33:41 "There's encrypting PII," tick mark, and all that; that doesn't scale or fit well in this case. Yeah. And one thing that's interesting about data governance is that you sort of have two opposing forces, which I would think makes building the tool really difficult. I'm not an expert, of course, but data by nature, especially when you think about things like customer data, is very dynamic; it's constantly changing. There's always some sort of mess with data. And governance is all about sort of standardization and enforcement, especially when the needs vary by business model, team structure, and technologies used. So to have a sort of pervasive set of tools that solves those problems really elegantly is challenging because of those opposing forces.
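(Editorial note: Ananth's tag-it-once idea from a few exchanges back, where a schema field is marked PII and tooling handles the rest, can be sketched in Python, with dataclass field metadata standing in for a Protobuf field option. The event shape and the "pii" key are hypothetical.)

```python
# Hypothetical sketch: declare PII once at the schema, let tooling find it.
from dataclasses import dataclass, field, fields

@dataclass
class MessageSentEvent:
    team_id: str
    user_id: str = field(metadata={"pii": True})       # tagged once, at the schema
    message_text: str = field(metadata={"pii": True})  # governance tooling does the rest
    client_ts_ms: int = 0

def pii_fields(event_cls):
    """Return field names a downstream governance tool should mask or encrypt."""
    return [f.name for f in fields(event_cls) if f.metadata.get("pii")]

print(pii_fields(MessageSentEvent))  # ['user_id', 'message_text']
```

The point of the design is the one made above: the developer workflow doesn't change, and enforcement can live behind the scenes.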
Starting point is 00:34:34 Yeah, totally. I think this is similar to a conversation I had some time back about master data management systems: where does the MDM system stand at the larger scale? In a system like MDM, it's very easy, or much more efficient, to implement data governance. The fundamental difference in the world we are living in is that modern data technologies mostly adopted the schema-on-read approach. I no longer necessarily need to know the schema upfront; storage and compute are separated completely, and whenever I'm going to read the data, I know what the schema is inside
Starting point is 00:35:11 it. Whereas if you want to enforce data governance with a master-data-management kind of system, you need to have a schema-on-write system: before I write, I need to know what I'm writing. And so there's a conflict of philosophical approach between the modern data pipeline and data engineering approach and the data governance and MDM technologies over a period of time. So it's an interesting challenge to solve. Yeah, it really is. And it'll be cool to see.
Starting point is 00:35:43 There's companies out there trying to solve it. It'll be cool to see what happens. Okay, let's talk. Before the show, we brought up a subject that I'm actually really surprised, in almost 50 episodes now, that we have not covered. And correct me if I'm wrong here, Brooks or Kostas, but data mesh has not come up on the show as a topic that we've discussed at length, at least to my knowledge, though my memory is somewhat faulty. And you talked about sort of working in the context of data mesh. It's certainly, I don't know if buzzword is the exact correct term, but it's a really interesting subject. It gets a lot of buzz online. But why don't we just start out with,
Starting point is 00:36:24 and I think this is probably a question that a lot of our listeners have: some of the definitions of data mesh can become pretty complex, but you have some experience working in that context on the ground. Can you just give us your definition of data mesh, at a 101, basic level? Yeah, totally. This is my take on data mesh, the whole discussion as I observed and understood it from my perspective. Again, whatever I'm telling you, it doesn't mean that this is data mesh, but this is what I understood and what I pursued in organizations. The founding principle of data mesh is what I really liked about it: treating data as a product.
Starting point is 00:37:06 Culturally, what happened is that the feature teams were very busy developing the feature products. Take Slack as a messaging product: they want to develop all the features. If I type a message, does it reach all the recipients? Did whoever is receiving the information receive that data? That's all they care about. Now, there's a whole other world where we need to capture that business logic and understand and define the user behavior
Starting point is 00:37:36 over a period of time. There is a lot of context missing when going from the feature team to an analytics team. So there's a lot of back-and-forth manual synchronization and knowledge sharing, and often it turns into, let's say, an expert-knowledge-sharing way of building analytics. I think what data mesh brings in is saying: treating data as a product, the feature team that produces the data should have ownership of the data also, so they add more and more context:
Starting point is 00:38:12 what this particular field means, what the business context behind it is, so it will be easy to collaborate and build data products on top of it. I think that is the concept that I really liked. But at the same time, as we discussed with data governance also, it is a good concept, a good philosophical approach to designing the system, but there's still a lack of tooling to support the theory.
Starting point is 00:38:48 And if you want to introduce a certain process to the system, there has to be tooling to back it. If you don't have tooling to back it, you will end up creating a non-deterministic process, and that is a chaotic environment. So that is the confusion, in my opinion: the concept and advice are really good, but it still needs some tooling and maturity around it.
Starting point is 00:39:14 Super interesting. Okay. Kostas, I need your take on this, because offline we've discussed data mesh. Does that definition line up with your understanding? I'd just love your thoughts on Ananth's definition.
Starting point is 00:39:39 Oh, yeah. I mean, I agree with the definition, to be honest. Data mesh as an architecture, or as, I don't know, a best practice, a paradigm, whatever you want to call it, I think is something that's still under definition. And as Ananth said, there's tooling missing. And as people start building tooling around it, the definition will also change, because the tools are also going to, let's say, change the way that we implement things and the way that we do things.
Starting point is 00:40:02 I think it's still early. There's definitely a reason why it exists. And I think it's not just a technology concept, let's say; it's also an organizational concept that tries to create a kind of framework for how the organization, and the people inside the organization, interact with the data. We'll see. I think it's still early, but I think it's going
Starting point is 00:40:25 to be interesting to see how the definition will change and how companies are going to adopt and implement it. Yeah. I think one challenge in that case is fully adopting decentralization. It'll be a challenging factor, especially for data infrastructure, right? Data, as you mentioned, is inherently social in nature. What I mean by that: in the domain-driven approach that we adopted in microservices, there is a user domain, there is maybe a customer domain, and each can work standalone in the microservices world.
Starting point is 00:41:06 I think people are trying to copy a similar concept into the data mesh principle and say, oh, this is a customer domain; all the analytics within the customer domain are owned by this customer domain team. The challenge with that approach is that data is inherently social in nature. A standalone domain will not add any value on its own. If you have customer information, you also need to correlate that data with maybe their activity, maybe some other information happening across domains.
Starting point is 00:41:39 That adds more insight and more value. So how is cross-domain communication going to happen? The model that we define in one domain has to be consistent with the other domain, right? And if it isn't, it's a
Starting point is 00:41:58 very common thing, even in a very controlled environment, that something like a user ID can be represented in multiple ways. You can find a my_user table; you can have a user_123 table. It's pretty common to find in a modern data warehouse. And it will exponentially increase the silos, and then how is that cross-domain communication going to happen?
Starting point is 00:42:21 Where is the standardization happening? And that brings back all the challenges that we face in MDM and in implementing those data governance systems, right? So that is where the system is going to have to balance out, where the philosophy falls, where it is still being figured out. Yeah. Two thoughts on that. I think that's just a very astute observation on some of the practical challenges on the ground, because of course the idea, I agree, is great, but you have this challenge where there's value and
Starting point is 00:43:01 velocity that come from decentralization, but you almost need some level of centralization in order to make the decentralization work. Like you said, the schema has to match sort of across domains. And then the other component that comes to mind, just thinking about what this looks like inside of an organization, is that the skill sets and expertise across domains are not necessarily equal, right? And so you have a data team, maybe you have multiple data teams, and I'm sure there are ways to make that work in the organization, but different types of data lend themselves to different skill sets, sort of different processes. And so again, you kind of run
Starting point is 00:43:45 into this issue where varying skill sets or emphases across domains will tend to produce unstandardized results in a decentralized system. But in order to get the most value, you need to have some level of standardization. So it'll be really interesting to see how it plays out.
Starting point is 00:44:15 Yeah, totally. I think it's not only the skill set; it's different priorities, conflicting priorities, because the whole purpose of the domain team is to fulfill the business need and satisfy the users there. And now they have the additional responsibility to produce the events, and make sure those events are synchronized and standardized, and all sorts of things. So when it goes to project management, when you're prioritizing, oh, this is my quarterly plan, that work always gets pushed down the priority lane. So that would be a hard fight.
Starting point is 00:44:34 And especially, a product manager's incentive is towards delivering the feature, the customer-facing feature; there's less incentive to do the other work, right? So it all plays around: all the human factors play in. Where does the incentive go? How do you measure the success
Starting point is 00:44:53 of a domain, and of domain modeling? So there are various practical limitations around there. Yeah, absolutely. One question, and I want to switch gears here just a little bit. Unless, Kostas, do you have any other hot takes on data mesh in our first extended conversation on this subject? I have a question for Ananth. You mentioned the cross-domain issue, right? You cannot have data that is going to be isolated in only one domain. And I found this extremely, extremely interesting.
Starting point is 00:45:27 And before that, you mentioned when we were talking about data management, about lineage, data lineage. Do you think that data lineage is one of the ways that we can control and understand how data goes and moves in a cross-domain fashion? And do you think there's some value there? Or is there something else that is missing in order to effectively use this data across all the different domains in the company? Yes, totally.
Starting point is 00:45:53 I think that's a very good, very fair point to make. I think the adoption and maturity of data lineage can potentially minimize the risk factor of producing that inconsistency over a period of time. That said, the current envisioning of data lineage right now is after the fact, right? We just say, oh, there is a new data set that has been created, now let me contribute it back to the data lineage, and then we kind of visualize what is happening end to end. There is no practical application yet built on top of the lineage structure that we
Starting point is 00:46:30 build. I think Marquez, the lineage tool, kind of started to put out the first application that I know of: triggering backfilling jobs based on the lineage. So I think the maturity of lineage infrastructure, and how those systems systematically enable building applications on top of lineage, is going to play a significant role in the success of data mesh. Instead of being reactive, how can it be an active system
Starting point is 00:47:00 capturing the model and reacting to the model when we are generating the data itself? Right now it's a reactive engine, not an active engine, the current way we are doing data lineage. That's super interesting. I mean, that's something that I always found a very fascinating topic, mainly because it's something that the industry has been trying
Starting point is 00:47:26 to implement for quite a while. I mean, it's not something new, right? Especially in enterprise data management systems. But you put it very well: it's very reactive, and it still feels like there's something missing there to actually take data lineage and extract the value that we can from it. And probably data mesh is the environment in which data lineage is going to find its place to deliver that value, right?
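(Editorial note: the "active lineage" idea, using the lineage graph to drive work such as backfills, the way Ananth credits Marquez with starting, can be sketched as a simple graph walk. The dataset names and edges here are hypothetical.)

```python
# Hypothetical sketch: given dataset-level lineage edges, compute which
# downstream tables need a backfill when an upstream table is restated.
from collections import deque

# upstream -> downstream edges, as a lineage tool might expose them
lineage = {
    "raw.events": ["agg.daily_active_users", "agg.message_counts"],
    "agg.daily_active_users": ["report.executive_dashboard"],
    "agg.message_counts": ["report.executive_dashboard"],
}

def backfill_targets(dataset):
    """Breadth-first walk of the lineage graph; for this simple DAG the
    visit order is also a valid backfill order."""
    seen, order, queue = set(), [], deque([dataset])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order

print(backfill_targets("raw.events"))
# ['agg.daily_active_users', 'agg.message_counts', 'report.executive_dashboard']
```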
Starting point is 00:47:55 The Data Stack Show hot take on data mesh. No, it is a super interesting conversation. And we're getting close to time here; we have a couple minutes left. I wanted to ask you: you've published 50 editions of the Data Engineering Weekly newsletter, so let's call that a year's worth of content. And you're doing sort of 8 to 12-ish, if my back-of-the-napkin math is right, summaries of pieces of content every week in the newsletter. And so that gives you just an incredible purview over
Starting point is 00:48:34 what's happening in the landscape of data and data tooling and data thinking, because you're really studying it and sort of curating it in an editorial way, which is fascinating. I'd love to know: what are the things that you've learned, or that have really stuck out to you, as you've spent a year curating thousands of pieces of content down into what you actually put in the newsletter? Is it landing mostly on the technology aspect of it, or the industry aspect of it? Actually, I think both are really interesting questions. So I think both perspectives would be great. Yeah, I think what I really learned: first of all, the
Starting point is 00:49:17 writing of the data engineering newsletter is giving me a very structured approach to learning, right? That is what I really enjoy about writing the newsletter: it gives me very dedicated time to learn and then curate certain information. So that's the biggest benefit I'm getting out of it. And what I've learned is that when you take a cumulative view across different articles,
Starting point is 00:49:49 you can always extract some patterns out of them: how those companies are solving those problems, or what current problems they are working on. That is a good enough indication of what the existing challenges still are. So from the last one year, what I've looked at is that most of the companies are working on some kind of data management system, some kind of data discovery system and data lineage system, and of course on how to do data democratization and make a more data-driven company. I think that is a challenge most companies, from small to large, are still struggling with, and most of them are writing about it.
Starting point is 00:50:32 I think data mesh is a little bit popular basically because of that very pain: how can we introduce those systems? Also, over the last four years, at the beginning stage of Slack, when I started to read these things, most of the blog posts coming out were: how do I scale, how do I cluster, how do I scale X or Y system? That has kind of reduced now, and I feel like data infrastructure and data processing have largely moved towards cloud-based solutions.
Starting point is 00:51:00 So people are no longer talking about how do I scale this and how do I do that; rather, more and more focus is going to data management and then the data knowledge phase, the data literacy phase, enabling a cultural change in the company. I think that's a very fascinating development. Yeah, absolutely. Super interesting. I mean, and just out of curiosity, if you're willing to share, what kind of time investment is it? I have to believe that it's a huge time investment to curate all of that. I mean, certainly probably educational, helpful for your career, but no small amount of work. Yeah, totally. I think roughly three to four hours per week, not much. Yeah. Wow. That means you're really efficient. You must be a very fast reader. Yeah. So usually in the morning, I just kind of go through
Starting point is 00:51:51 different articles and flag the ones, you know, I want to read at a later point in time. And usually I'll have, you know, 15 to 20 articles, minimum 15. So I'll just read through them and then filter some articles out, focusing on maybe, as you mentioned, eight to ten articles. So there are also a lot of articles that I read that I don't include.
Starting point is 00:52:29 Slack. You're doing some amazing things at Zendesk, which we didn't even get to talk about. But if you could just share maybe one or two pieces of advice for someone who aspires to work in a data role at a company like Slack or Zendesk, what would you tell them? That's a good question. So if anyone's starting a new career, I think like the data engineering is kind of a vast field, but you don't need to learn everything to start with. It is a continuous learning process. So I would say like if you're starting a simple SQL and Python knowledge, it's sufficient enough for you to get started in the data engineering. And keep an open mind and start learning more as you progress over the period of
Starting point is 00:53:18 Slack. You're doing some amazing things at Zendesk, which we didn't even get to talk about. But if you could just share maybe one or two pieces of advice for someone who aspires to work in a data role at a company like Slack or Zendesk, what would you tell them? That's a good question. So, if anyone's starting a new career: I think data engineering is kind of a vast field, but you don't need to learn everything to start with. It is a continuous learning process. So I would say, if you're starting, simple SQL and Python knowledge is sufficient for you to get started in data engineering. Keep an open mind and keep learning more as you progress over the period of
Starting point is 00:53:50 time. Don't spend too much time trying to learn everything from taking courses or anything; simple tooling is more than sufficient for you to get in, and you can learn over time, I think. That would be my biggest advice. The second thing is: this is a very fast-moving field. It's a very new field, and a lot of things change over time. The infrastructure I used to work on four years ago is no longer relevant. We no longer want to maintain an expensive EMR cluster; we want to move towards cloud-based databases and outsource the computing and all those things. But what is important is to focus on the founding principles, how the system works. If you build this foundation, an understanding of distributed computing, an understanding of the basic principles of data engineering, I think you can go a long way. So: simple tooling, and focus on the foundations.
Starting point is 00:54:20 I think that's such good advice. And I think, like many, many disciplines, if you understand the foundations, you can learn the new tooling. And it's really important to get a good foundational understanding. So, wonderful advice, both for our audience and really for me and Kostas as well. So thank you for that.
Starting point is 00:54:38 Ananth, this has been a great show. So interesting to learn about your experience. Congrats again on 50 editions of the Data Engineering Weekly newsletter. And again, those of you listening, if you haven't subscribed, please subscribe. It's a great newsletter and it will keep you up to date on everything happening in the data engineering world. Thanks again, Ananth.
Starting point is 00:54:59 Thanks so much. Appreciate it. Pretty amazing experience that Ananth has. And it's interesting: we talk a lot about tooling and stuff, but it's really cool to talk with someone like this. I think one of my big takeaways was around architecting a system from the ground up using what I would call the core componentry, right? And we've talked with a couple other guests who were in contexts where off-the-shelf SaaS products just weren't sufficient. And it was really crazy to hear Ananth say about Snowflake, well, back in 2016, Snowflake wasn't going to work for us.
Starting point is 00:55:37 And just to hear that is really interesting. And so I just really enjoyed hearing him talk about the way that they architected this from the ground up. My other big takeaway is actually for you, Kostas. I really wanted a spicier take on data mesh, because your philosophical tendencies are really good for contested topics like that. So I'll bring it up again in another episode and you can give us a more opinionated response. Yeah, please do. Please do. I'll be more than happy. What was your big takeaway?
Starting point is 00:56:29 I think I'll be predictable and say that what I found really interesting is this concept, and I think it's not the first time we've heard it, that when we are talking about the architecture of a data stack, the components are pretty much the same regardless of the type of company or the size of the company. What really changes is scale and control, right? Those are the two main things, and I think Ananth put it very well and described it very well. The other thing that I found very interesting is that when I asked him to mention the one technology he thinks is the most important that's happened in all the years he's worked in this space, he actually mentioned two: one of them is Kafka and the other one is Presto. And I was surprised with Presto, to be honest. But that part of the conversation that we had about Presto, the separation of storage
Starting point is 00:57:04 and processing, and why Presto didn't make it. I mean, okay, Presto wasn't a company, but there were companies that were trying to monetize Presto, right? And something very interesting: one of the first companies that tried to do that, to give Presto as a service, was Treasure Data, which, by the way, ended up being a CDP at the end. So yeah, that's right. And then they were acquired by Arm. But part of the conversation,
Starting point is 00:57:32 I think it was really fascinating. That really was. I mean, Kafka makes sense, but I was not expecting him to say Presto either. So, something unpredictable in a sea of predictability from you and me. Our questions are completely deterministic. After that many episodes, probably yes. All right. Well, thanks again for joining us on the show. Make sure to subscribe, if you haven't, on your favorite podcast network, and we'll catch you on the next one. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback.
Starting point is 00:58:16 You can email me, Eric Dodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
