The Data Stack Show - 52: Discussing Data Warehouses, Lakes, and Meshes with James Serra of EY

Episode Date: September 8, 2021

Highlights from this week’s conversation include:
James’ background at Microsoft and current work with EY’s data fabric (2:22)
The external and internal facing components of EY’s data fabric (6:...39)
The importance of the data lineage (11:29)
The most important requirements for data quality (15:32)
Looking at the data capabilities of Microsoft (21:30)
The data warehouse, explained (29:00)
Using a data warehouse or a data lake (34:33)
Defining the buzzword data mesh (51:13)
The problem with data mesh (59:31)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. Welcome back to the show. Today, we're going to talk to James Serra
Starting point is 00:00:32 and lots of interesting things to discuss. He has a great blog. We read it consistently. And I'm excited to ask him, this came up a couple of episodes ago, but it's a buzzword that is kind of all over the data space. And James has written a lot about it, but data mesh. And I have been forming my own opinions on data mesh as a concept in the data space.
Starting point is 00:01:01 And James has some strong opinions about it as well. So that's what I want to ask him about it. And I may even let some of my opinions, I know there are some strong opinions in the show, but I may let some of my nascent opinions on data mesh come out. Costas, what do you want to ask James about? Yeah, I'm very interested to ask him about the industry as a whole, to be honest. I mean, he's working at Ernst & Young and he's probably involved in pretty big projects and in projects with companies that we don't probably hear that much about here in Silicon Valley. So yeah, I'd love to hear how's the experience working with the rest of the industry out there trying to become data-driven, what kind of technologies they are using, if there are any differences compared to the technologies that we see here
Starting point is 00:01:49 and all that stuff. So yeah, that's what I find very super fascinating. And I'm happy that I will have the opportunity to discuss with him about that. Great. Well, let's jump in and talk to James. Let's do it. James, welcome to the show. We're really excited to dig into a number of topics with you. So thanks for giving us the time. Yeah, thanks for inviting me. Glad to be
Starting point is 00:02:13 here. All right. Well, just give us a brief background. You have a long history working with data. So tell us where you've been and what you do today. Sure. I currently am a data platform architect lead at EY, Ernst & Young. I've been here about five months. My main focus here is to build this product internally. It's a data fabric, and the idea is you want to collect tons of data, which could be third-party data, EY internal data, or client data, into this data fabric and make it available for other products that EY sells to customers, as well as use it for understanding our own internal metrics. So it's a very large project.
Starting point is 00:03:07 It's about 200 people. And it's very interesting because we work closely with Microsoft. We're building on the Azure stack. And it's unique in that something this large in scale has not been done much. And so with Microsoft's help, we hope to have this built out within the next few months. And before EY, I was at Microsoft for seven years in various different roles, the last being at the Microsoft Technology Center in New York City, where I spent every day engaged with different customers, whiteboarding data platform type solutions. It could be that they come in and they want to, say, as an example,
Starting point is 00:03:51 learn more about a modern data warehouse and what that looks like. And through discovery and asking a lot of questions, I would come up with a high-level architecture with products that would fit their particular use case. Because it was always very challenging. There could be many, many products that do the same thing at Microsoft. And so I wanted to help narrow them down
Starting point is 00:04:12 and make sure they make the right decisions. They don't know what they don't know. So it was very much an educational session for each of the customers in various different industries, various different sized customers. And I was always in pre-sales technical roles at Microsoft. And so this role at EY is a great experience because I'm on the build side of things.
Starting point is 00:04:33 Before I came to EY, I spent many years in Microsoft databases, data warehouses. I had experience with architecting and developing solutions where the main goal was just to collect data and make better business decisions for customers, companies out there. And through the years I was also a DBA for many years, and I started back with SQL Server 1.0 on OS/2 back in 1989, I think it was. I have a long history with working on the data platform stack. Super interesting.
Starting point is 00:05:13 One question before we dig in. We want to talk about a lot of warehouse stuff because you produce some great material on your blog, which we'll put a link to in the show notes. But one question on the project, if you can talk about it. One thing that's really interesting when you described the project that stuck out to me was that there are multiple vectors of both internal and external facing parts of the project, it sounds like. And just to be specific on that, there is both sort of first party data and then also third party data, which isn't necessarily uncommon, but usually the most common use case we see around that is you have first party data and you want to augment it in some way with some set of third party data. But then also, it sounds like the project itself will serve both the business,
Starting point is 00:06:13 but also be included in sort of products or customer facing products that you sell, right? So sort of an internal data use case, and then also an external data use case. Can you talk any more about that? And I think the main question that comes to mind is that seems fairly complex dealing with those multiple vectors and multiple types of data and multiple audiences for the data. Yeah, it is complex. And then you add in the security that is needed at an extreme level to deal with data that is client data. And in a regulated company like EY, there's various rules and regulations you have to
Starting point is 00:06:56 follow. And then of course, each customer's data that you collect, they don't want other people seeing it. They should not. So there was a really high level of security and a lot of challenges with that. But the main idea is let's aggregate all this data together and make it available to the product. So as an example, it could be EY has
Starting point is 00:07:19 many products they sell, and a product that a customer may be interested in could take data that the customer has, it could take third-party data, to your point, and they could aggregate them together and make better, could be machine learning models, it could be reports or dashboards that that company could use to maybe find out more about their supply chain, where they could increase profits. They could use that data to find fraud or money laundering that's going on if they're a bank. They could use that data to find competitors
Starting point is 00:07:55 that are gaining on them in the industry. So there's many dozens of use cases. Well, all those products need data, and you don't want a situation where a new product comes along, it creates its own ingestion platform, ingests its own third-party data and client data, while it's already been done with many other products. So it's unifying that experience and having one ingestion platform that'll collect this third-party data. In addition, think of the data savings, the licensing savings, you know, on third-party data.
Starting point is 00:08:28 A company like EY has tens of millions of dollars it spends on third-party data sets. And there's likely a lot of repeat data sets where people didn't know that these other data sets already existed in EY. So we'll have one place where we collect all this data. Then we have a data explorer slash marketplace type environment where anybody can go and search the data we have and they'll say, oh, look, we have this data.
Starting point is 00:08:55 And here's the hooks into that data. So what happens is it's a great product accelerator. If somebody comes up with a new idea for a product and they say we need 10 different data sets and client data, they can go and find out that's already existing in this data fabric and they can quickly ingest that data
Starting point is 00:09:13 and use it and get insights out of that and build their product and go to market a lot quicker. So that's a big idea in this data fabric we're building, because think of the challenge of ingesting thousands of files from many different customers. And you have to clean this data and join it and aggregate it and secure it. You don't want everybody kind of reinventing the wheel and doing their own thing.
Starting point is 00:09:36 So this is built for multiple different products and also for internal use. Maybe somebody wants to look at all this data we've collected in various engagements that EY has had and say, well, let's see where we can optimize things. Let's collect these metrics and maybe build some machine learning models on that. And well, we need the data. So let's have it in one unified place. And that's what the data fabric gives them. So it's quite challenging because of all these various different data sets, and client data has much different security requirements than third-party data sets.
Starting point is 00:10:12 So we're going through all those challenges now and it's been a great experience and working closely with Microsoft to see the various products that they have and where the gaps are when you're dealing with other products outside of Microsoft. James, I have a question for you.
Starting point is 00:10:29 I'm listening, like describing all this quite complex architecture like so far. And I'm wondering, I mean, one thing is like, okay, we want to ingest data. We have data coming from many various places. We are going to store this data in one place, and this is going to solve problems around creating data silos and giving access to all the data to the whole organization and built on top of that first-class security so we can ensure security and privacy around the data.
Starting point is 00:11:00 I was wondering, as we enable more and more use cases around data and more and more people and organizations at the company are able to access this data and process it, how important is it to keep track of what's going on with this data? And more specifically, what I'm referring to is data lineage. So, first of all, how important do you think this is, when does it become a problem, and how do you deal with it? Yeah, I would say the biggest gap that I've seen customers have, especially when I was at Microsoft, with building data warehouses, is they didn't give enough time to data governance. And they really need to spend a lot of time thinking through the data governance piece, which includes data lineage, as you asked, data quality, data security, data access, all
Starting point is 00:11:54 these things can be quite complex. And frequently customers just did not put enough time in the project plan for all those different areas on there. And data lineage is a big one because when the end user gets that report, they may go, this particular number, I'm not quite sure it's accurate. Where did it come from? And you want to be able to respond to them and show them the various stages this went through. And so data lineage is a big part of what we're implementing.
Starting point is 00:12:27 And there are various products that track data lineage, and I'll throw out that the one we're using is Azure Purview on there, which is not GA'd yet. And there's many other great products outside the Microsoft realm for this that'll track that this data came from this particular data source. It went and was transformed and cleaned via this procedure and then landed in, say, a data lake. It then was moved into, say, a relational database before it was then moved into something like Power BI where it became a data set that was used for a report
Starting point is 00:13:01 that was used for a dashboard. And if you can't get that answer quickly to the end user, you're going to lose their confidence in what you're giving them. So the challenge is that data lineage is not, it's not like you can just press a button and scan all these data sources and come up with a lineage on there. There's a lot of work that has to be done behind the scenes. And you may have to send this information to a data lineage tool if, say, you're changing
Starting point is 00:13:27 data inside a stored procedure, because it's too much for some product to scan a stored procedure and tell you everything it's doing. So we have to set up guidelines. And if you're transforming the data, you have to call these APIs in, say, Purview to tell it what you're doing in there. And so this becomes a lot of oversight governance in there. So it's coming up with these particular frameworks and guidelines in there, but then someone's got to oversee it. Maybe you have a center of excellence where anything that's submitted has to follow these rules. And part of one of those rules is that
Starting point is 00:14:03 it's got to send a lineage over to the particular product. So this gives you a nice clean way of seeing everything. And also that helps in making sure as you're building this along, you're not missing steps or not properly cleaning something or avoiding duplication of data in these individual source systems that come in there.
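The lineage flow described here, where each transformation step reports what it read and what it wrote so a report number can be traced back to its source, can be sketched roughly as follows. This is only a minimal illustration: the `LineageLog` class, its methods, and the asset names are hypothetical and are not the Azure Purview API.

```python
from dataclasses import dataclass, field


@dataclass
class LineageLog:
    """Hypothetical in-memory lineage store (an assumption, not a real product API)."""
    edges: list = field(default_factory=list)

    def record(self, source: str, target: str, process: str) -> None:
        # Each transformation step reports what it read, what it wrote, and how.
        self.edges.append((source, target, process))

    def trace(self, asset: str) -> list:
        """Walk upstream from an asset back to its original source."""
        path = []
        current = asset
        while True:
            upstream = [edge for edge in self.edges if edge[1] == current]
            if not upstream:
                break
            # Sketch assumption: every asset has at most one upstream source.
            source, target, process = upstream[0]
            path.append(f"{source} -> {target} via {process}")
            current = source
        return path


log = LineageLog()
log.record("crm.orders", "lake/raw/orders", "ingest_job")
log.record("lake/raw/orders", "warehouse.orders", "clean_orders_proc")
log.record("warehouse.orders", "powerbi.sales_dataset", "refresh")

# Answer the end user's question: where did this report's data come from?
lineage_path = log.trace("powerbi.sales_dataset")
```

In this sketch the stored procedure itself would call `record`, which mirrors the guideline that code transforming data has to report its own lineage rather than rely on a scanner to infer it.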
Starting point is 00:14:25 Because in most cases, the data that you're pulling into this data warehouse, you could have dozens and dozens of different sources on this. It's really important to have that lineage to track where it starts and where it ends. Yeah, makes total sense. It's a very interesting topic. And I'm glad that you broke down data governance into different pieces, because I started thinking, you mentioned data quality, and data quality, at least in the companies in Silicon Valley is a pretty hot thing lately.
Starting point is 00:15:02 You see companies, I think, on the catalog just raised another $100 million or something. You have companies like Bigeye raising money from Sequoia. And in general, everyone is looking into the data quality problem and trying to solve it. From your experience, if you had to describe the two or three most important requirements around data quality that a new product should address, what would that be based on your experience so far? Yeah, when I look at data quality, the first challenge that comes up that a customer has to answer is who owns this data? And I've been in rooms where there were almost fistfights that resulted from trying to answer that question because they're the ones responsible for the data quality.
Starting point is 00:15:49 As to collecting this data into a data warehouse, I can't tell you how many times the customer said, oh, our data is perfectly clean. And I would say, I'll bet you a hundred bucks I'm going to find some problems with it. And sure enough, as soon as that data comes in, you find that, oh, well, in the order entry system, in order to get past the field you had to enter a birth date, so people were putting in people who were born in the future or people who were 200 years old. And so you have to get this data and then clean it. So this is part of the data quality. Now you can plug those holes and you can revert back to the source systems, but the damage has already been done. So somebody has got to clean it all. So there's a lot of questions that kind of go back to the source system,
Starting point is 00:16:29 the owners of that, and ask them, what do I do in these situations? If the birth date is not valid in there, should I put an alternate? What's it going to be? There's going to be a lack of conformity if you're pulling in data from different source systems and they have customers in there. One of those systems could use abbreviations for a state and others could use the full name. If you're generating reports, you have to have one common standard. That's a big part of that. Somebody then is going to define the standard.
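The two cleaning problems mentioned here, implausible birth dates and inconsistent state names across source systems, can be illustrated with a small sketch. The 120-year age limit, the function names, and the three-entry state lookup are assumptions made for the example; in practice the source system owners would define these rules.

```python
from datetime import date

# Hypothetical lookup; a real table would cover every state and known variant.
STATE_ABBREVIATIONS = {"california": "CA", "new york": "NY", "texas": "TX"}


def is_plausible_birth_date(birth_date: date, today: date, max_age_years: int = 120) -> bool:
    """Reject birth dates in the future or implausibly far in the past."""
    if birth_date > today:
        return False
    return today.year - birth_date.year <= max_age_years


def normalize_state(raw: str) -> str:
    """Map full state names and abbreviations onto one two-letter standard."""
    cleaned = raw.strip()
    return STATE_ABBREVIATIONS.get(cleaned.lower(), cleaned.upper())


today = date(2021, 9, 8)
checks = [
    is_plausible_birth_date(date(1985, 3, 14), today),  # ordinary value
    is_plausible_birth_date(date(2035, 1, 1), today),   # born in the future
    is_plausible_birth_date(date(1821, 1, 1), today),   # 200 years old
]
states = [normalize_state("New York"), normalize_state(" ny "), normalize_state("TX")]
```

What to do with a rejected value (null it, substitute a default, or send it back to the source) is deliberately left out, since that is the decision the data owners have to make.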
Starting point is 00:17:00 Usually, you may have a center of excellence team that goes through and says, okay, you need to conform everything and this is what we're going to do. Now add the complexity of master data management. That's going to be part of the data governance in there. I'm collecting those customer data. The last thing you want to happen is you create a report, and the end user looks at it and goes, well, wait a minute. Why is this person in here twice? Their names are misspelled, but they're really the same people.
Starting point is 00:17:23 Now you've lost their trust, and it's going to be hard to gain that back. So you have to think through this data governance. That's why I say you spend a lot of time in it. So mastering the data has got to be another important part of the data governance in there. And even the data quality, well, how do you know the data is bad? Is it a null or is it a zero? What does that mean in there? So a lot of investigation has got to be done with this.
Starting point is 00:17:48 And this is where you want to work closely with your end users. Get them involved in the process early. Ask them, how do I know this data is valid? How do I clean it? Do these numbers look right? So they're not left at the end of it going, well, here's the report. And they go, well, I have no input into this. So they don't feel like they were part of it.
Starting point is 00:18:07 And it's always, I say, always get those customers and end users involved early on in there. Because then they'll be rooting for you if they're part of the process, as opposed to having this almost negative reaction to things that are just handed to them. Because it involves a lot of change from what they may have been doing previously. And it's hard for people to embrace the change if you haven't made them part of the process in there. And then I will say the last thing when people build out data warehouses is you want to have this one version of the truth.
Starting point is 00:18:36 And I've had situations where I've found people creating reports that were not accurate because they were, in some ways, changing the numbers to make themselves look better. And once you centralize the data in a data warehouse and come up with, say, one formula for all these various metrics and KPIs in there, you're going to have possibly a lot of disputes on what those metrics or KPIs should be. So again, you get these rooms and you have these arguments in there, but in the end, you will have this one version of the truth. So people can
Starting point is 00:19:10 be confident that they're getting the same answers to the questions they're asking and not having different answers to the one question in there. So it all revolves around data governance. I wish I could say there's this magic button that can go through your data and clean it up, but there's not. There's no shortcuts to this. It's a lot of time and effort to get the data quality right. But in the end, it's going to be worth it. But you have to put that in your project plan
Starting point is 00:19:37 and spend a lot of time on data governance. Yeah, yeah. I think those are some great points around data quality. And I would also add that all these things, there is a reason that we have all these different parts that they are under data governance. And the reason that we have them under the umbrella of data governance is because, for example, data quality and data lineage, it's important to have both together, right? Like one supplements the other in terms of like the end goal. Same with data access and all that stuff. So yeah, that's a super interesting topic and a super hard topic also.
Starting point is 00:20:14 I think the industry now is trying to figure out the right ways to implement all these methodologies and functionalities at a large scale. And I think we are going to see a lot of interesting new companies trying to tackle these problems in the future. But talking about new companies, I want to ask you something about a company that, at least in Silicon Valley, we keep forgetting when we are talking about data, and this is Microsoft.
Starting point is 00:20:48 So you have a lot of experience with Microsoft and their products. And actually, it's interesting because for the database systems, at least, Microsoft is supposed to have probably one of the most complete database systems, which is MS SQL. I mean, it might be a pain to manage, but in terms of the capabilities that the system has and the functionality that it has, it's probably the most advanced database in the market right now. But can you give us a little bit more information around the products Microsoft has around data, like data warehouse, for example? What's the data warehouse with Microsoft if someone wants to go to Azure today and what other tools and products they offer for all that stuff that we discussed so far? Sure. And I had this discussion many times with customers because, again, they were confused why there's so many products that Microsoft has.
Starting point is 00:21:39 And, okay, what's your use case? And I'll narrow down that product list for you to then go and do research on there. If we look at the OLTP side, you have your SQL server, you have your SQL database, you have those relational databases that have been around for forever, especially meaning SQL server,
Starting point is 00:22:01 and then SQL Database, which has many different flavors of that, is a PaaS solution instead of an IaaS solution that you get with SQL Server in a VM. But those are mostly for OLTP, which sometimes you can get away with a data warehouse in there if it's small, let's say under four terabytes. And that only applied to customers who were very small customers
Starting point is 00:22:25 who didn't see a lot of growth in the data they're collecting. Once you get over four terabytes or around there, you want to start looking at a data warehouse solution. And that's where in the Microsoft realm, you get into Azure Synapse Analytics. That is the tool of choice, I will say, in Microsoft for large amounts of data for that data warehouse. My history at Microsoft, when I first started seven years ago, was on the Parallel Data Warehouse.
Starting point is 00:23:02 That was Microsoft's on-prem data warehousing solution on there. It's like a Teradata and a Netezza, with MPP, massively parallel processing technology. So that technology gives you an advantage over the traditional SMP technology in that it can handle massive amounts of data. It distributes the data, distributes the queries. It could be a long conversation just in itself on how that works. But this opened the door for queries to go anywhere from 20 to 100 times faster than a traditional SQL Server query on there. Well, that product eventually migrated into SQL Data Warehouse in Azure. And that has been around for a number of years.
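The idea behind MPP mentioned here, distributing the data across nodes and then distributing the query so each node works on its own slice, can be sketched in a few lines. This is a toy illustration only: the node count, the CRC32 hash, the thread pool standing in for compute nodes, and the sample rows are all assumptions, and it does not describe how Parallel Data Warehouse or Synapse is actually implemented.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_NODES = 4  # hypothetical number of compute nodes


def node_for(key: str) -> int:
    """Hash-distribute a row to a node based on its distribution key."""
    return zlib.crc32(key.encode("utf-8")) % NUM_NODES


def distribute(rows):
    """Place each (customer_id, amount) row on the node its key hashes to."""
    nodes = defaultdict(list)
    for customer_id, amount in rows:
        nodes[node_for(customer_id)].append((customer_id, amount))
    return nodes


def local_sum(node_rows):
    """The part of the query each node runs against only its own slice."""
    return sum(amount for _, amount in node_rows)


rows = [("c1", 10), ("c2", 20), ("c3", 30), ("c1", 5)]
nodes = distribute(rows)

# Run the partial aggregation on every node in parallel, then combine.
with ThreadPoolExecutor(max_workers=NUM_NODES) as pool:
    partial_sums = list(pool.map(local_sum, nodes.values()))
total = sum(partial_sums)
```

Because rows with the same key always land on the same node, per-key work can stay local, and only the small partial results need to be combined at the end.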
Starting point is 00:23:44 And that product then morphed into Azure Synapse. And that technology is still in Synapse under a relational pool that they have, a dedicated pool in there. But that product also added a bunch of other features.
Starting point is 00:24:01 It has a serverless pool. It has Spark clusters in there. It's got Data Factory built in. So it's a great tool if you're going to build out a data warehouse. Everything is on a single pane of glass. And that's where Synapse has a tremendous value for customers to enhance their time to market or time to build a solution in Synapse because of that integration of all those products in that single pane of glass.
Starting point is 00:24:32 And they still have that MPP technology in that dedicated pool. And so that's the go-to with customers. And within there, you can even make the argument that we get into the serverless option. So instead of having a dedicated pool that can be very costly, and maybe I don't want to use it for data warehouses that are small, you could make the argument, well, I can use the serverless option and only pay per query. And so maybe I can even then open up this to smaller databases,
Starting point is 00:25:04 the data set size in there. A lot has to do with customers and what your current skill set is. Are you SQL Server developers? And that's going to make the transition pretty easy into Synapse in there. And so during discovery, I asked a lot of questions of customers
Starting point is 00:25:20 and then see if it would be a good fit. And usually it doesn't take more than a couple of days for anybody who's used to SQL Server, SQL Database to move to something like Synapse. And so that product is what I would say is the go-to, and in most, I would say almost 90% of cases with customers, Synapse was the solution. It is really fun.
Starting point is 00:25:43 We talked with Costas. I don't know if you remember, but we talked to the startup company who is building a product in the medical space, actually. And they were building on a Microsoft stack. And it was great to hear. That was an early episode, I think. But it was really fun to hear about that
Starting point is 00:26:04 because you hear, like you said, it's really easy to forget about, especially in the world of data, where it's all these new fancy tools and new fancy startups, that Microsoft has some really awesome technology. So James, thanks for giving us some detail there and reminding us of that. Yeah, sure. It's always interesting with customers. They don't know what they don't know. So they come in and they think, well, we should do everything in SQL Server. And wait a minute, we have these PaaS solutions like SQL Database, which has flavors: it has serverless, it has Managed Instance, it has Hyperscale, so we can handle databases in OLTP that can be extremely large. And the challenge is the technology is changing so quickly that even though my full-time job at Microsoft was keeping up with the data platform, I can barely do that.
Starting point is 00:26:51 And so you can't expect customers to keep up with it all. So they would come to Microsoft, and they had cloud solution architects and MTC architects like myself that would educate them, or they'd go to partners and help educate them. Because the reason data warehouses fail in most cases is just customers use the wrong technologies for their use case. And I would see customers who use a certain product and I would go, why don't you use this other product? And they'd go, well, we didn't even know that.
Starting point is 00:27:19 Well, okay. That's the reason. And so it's really important up front to be aware of all the products and their use cases. So choose it early and don't run into a mistake where you're a few months in, many months in, many millions of dollars spent, and you realize, ah, this is not the right product. And you go back to the beginning. Wise words. Well, let's switch gears a little bit. As we were prepping for the show, we talked about,
Starting point is 00:27:51 which I love, that we started out talking about sort of an extremely complex project with all different types of data and all different types of users. And then talked about the complexity of data governance and data lineage at scale. Let's step back a bit because something that you've written about a good bit is actually just the fundamentals of the data warehouse. And you have a great post on your blog and a great video on YouTube that I think is just called Data Warehouse Explained. And I'd love for you to
Starting point is 00:28:28 just give us an overview of that. And as I was saying before the show, I think we get exposed to so many new interesting technologies in the data space that it's easy to sort of assume that we know the fundamentals of a tool that we use every day. And so I think zooming out and getting context for that is helpful no matter where you're at in terms of working with data. So James, give us a high level overview of the data warehouse explained. Yeah, sure. And it's in particular true of the smaller companies who are just beginning their journey of trying to get better insights and make better business decisions through data on there. And it could be that they have some source system. It could be a homegrown thing that's OLTP that
Starting point is 00:29:20 they collect all this data maybe about customers. They could be using some CRM or ERP system like an SAP and they say, well, we want to generate some additional reports and we may want to combine what we have with multiple source systems. Could be even, hey, why are our sales slow in certain areas of the country? Well, maybe it's something weather related. So we need to combine our data with weather data or competitive data on there. Well, okay. So we want to generate better reports.
Starting point is 00:29:54 Well, what you don't want to do is try to cram that data into, say, SAP or your homegrown application and just hammer it with reporting on there, because you're going to make the end user very angry. And that's the first problem I see with customers trying to do reporting on live production systems: they spike the CPU. People start getting angry at IT. What's going on here?
Starting point is 00:30:19 Somebody wrote a query that was malformed. Man, if I had a dollar for every time I ran the kill command as a DBA, I'd be rich right now. And so you need to offload the data from a production system. Now, you can replicate that data, and there's various ways of doing it in SQL Server,
Starting point is 00:30:37 but a better way is to take that data and copy it into some location where you can make it better optimized for queries in there. So I can put different indexes on it. I can lay it out in a certain way. I can position it in a certain way.
Starting point is 00:30:56 I could also change the field names and the table names to make it easier for people to understand. And if it comes from some ERP system, you may have some really cryptic names. Because the idea is you want to have self-service BI. You want to create a warehouse that has tables in it that are very easy for an end user to go to a tool
Starting point is 00:31:16 and just click and drag those fields onto a report and build it out without having to get IT involved. So you need to make it more presentable by copying out of that source system into that data warehouse in there. Also, you can have a lot more compute on top of that data in the data warehouse in there. You can ingest many different sources of data. You can do the cleaning of the data in there. You can master the data in there.
Starting point is 00:31:43 And that gives you protection against, say, a source system upgrading. Because if you're running reports against a source system and they upgrade to a new version, reports may break. Well, if you copy that data into a data warehouse, well, the ETL into the data warehouse may break with the upgrade, but at least the data in there is okay, and
Starting point is 00:32:01 you're not going to have this huge problem of having to go back and rewrite all these various reports with your queries on there. And it also allows you to clean the data and find things that may result in holes in the source system that you can go back to the source system and say, you need to plug this hole in there because the data is not clean. And by having that data in that data warehouse,
Starting point is 00:32:26 you have one version of the truth. And that can be used as the basis to create all the reports and dashboards. And you can put that data in the data warehouse in a third normal form that has many relational tables that can be joined together
Starting point is 00:32:42 to produce those queries. But a lot of times customers will go one step further and they will create a star schema, which is taking that data and those multiple tables and joining them together, so you have these fact and dimension tables. So you have a lot less complexity because somebody's done the work to create those joins in there. And so again,
Starting point is 00:33:01 that end user can very quickly and easily generate reports off that. Now there's other steps you can take. You can aggregate it. You can put it into a product like Azure Analysis Services, where it's a cube and it aggregates that data. So it's also for performance reasons. And you can quickly get answers to queries that may take quite a long time. You can put hierarchies in there. And so there's all these additional steps you can take. Now, you may be saying, well, this is additional cost and complexity. Well, it is, but there's a reason for that: you are making it very easy for that end user to have reports that are not only easy to create, but very performant.
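To make the star schema idea above concrete, here is a minimal sketch using Python's standard-library sqlite3 module. It is only an illustration, not something described on the show: the table and column names (fact_sales, dim_region, dim_product) and the sample rows are hypothetical, and SQLite stands in for whatever relational warehouse is actually in use.

```python
import sqlite3


def build_star_schema(conn):
    """Create one fact table and two dimension tables (hypothetical names)."""
    conn.executescript(
        """
        CREATE TABLE dim_region (region_id INTEGER PRIMARY KEY, region_name TEXT);
        CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);
        CREATE TABLE fact_sales (
            region_id INTEGER REFERENCES dim_region(region_id),
            product_id INTEGER REFERENCES dim_product(product_id),
            amount REAL
        );
        """
    )
    conn.executemany("INSERT INTO dim_region VALUES (?, ?)",
                     [(1, "North"), (2, "South")])
    conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                     [(10, "Widget"), (20, "Gadget")])
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                     [(1, 10, 100.0), (1, 20, 50.0), (2, 10, 25.0)])
    conn.commit()


def sales_by_region(conn):
    """A typical report query: one join from the fact table to a dimension."""
    return conn.execute(
        """
        SELECT r.region_name, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_region AS r ON r.region_id = f.region_id
        GROUP BY r.region_name
        ORDER BY r.region_name
        """
    ).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    build_star_schema(connection)
    print(sales_by_region(connection))
```

The point of the layout is that every report follows the same shape: start at the fact table, join out to whichever dimensions the report needs, and aggregate.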
Starting point is 00:33:43 So there's that trade-off of cost and complexity, but it's worth it because you will have the speed and the simplicity for your end users on there. And so this is a lot of what I explain to customers as that data moves through this modern data warehouse and lands on all these things and copies all these things. The end result is going to be worth it. But you have to do the work up front. James, you gave a very good description of what a data warehouse is. But I would also like to ask you about the concept of the data lake,
Starting point is 00:34:19 something that you hear more and more lately. And can you tell us a few things about what a data lake is? What are the differences compared to a data warehouse? And when someone should consider one or the other? Sure. And that's a very hot topic. If we go over what I just mentioned and putting everything in a relational database, that was the way it was for many years. But there were problems arising on that.
Starting point is 00:34:44 The first is you have to have this maintenance window where you have to knock users off the system, because if I'm loading all this data and I need to clean it and master it and do all these other things, that's a lot of CPU, a lot of processing that's going to be done. And many times I see maintenance windows over three hours, four hours, even going to eight hours in there. And what happens if you have somebody who wants access to data 24-7? What happens if you kick them off, but then there's a problem and you run over the maintenance window? Maybe they tell them, you can't get on the system until it finishes fixing this bug or whatnot in there. So along came the data lake to help with some of those problems in there. You can think of the data lake, and there's many reasons why
Starting point is 00:35:23 you'd want the data lake, but one of them could be I want to offload all those transformations of data, that staging area that you have in a relational database, and put that into a data lake. So the data is copied into that data lake instead, and I put compute
Starting point is 00:35:40 on top of that data lake and I do all those transformations without affecting the data warehouse, the relational data warehouse in there. And then that maintenance window essentially goes away or just maybe a few minutes where you load the data after it's been cleaned in the data lake. So the data lake becomes that staging area. And so that's one huge reason right there for a data lake. Others are, I can hoard data in a data lake because if you look at the data lakes, the cost can be very, very cheap, especially compared to putting in a relational database.
Starting point is 00:36:13 And so I could, as opposed to a relational database where it's very costly and I have to delete data that's older or only keep data in there if I'm absolutely sure I need it, I can just dump all this data in a data lake and down the road can see if I need it, or I can keep a complete history of that. And because the data lake is schema on read, meaning I can put data in there without any upfront work. It's like a glorified file folder on your laptop. Create folders, put the data in there, as opposed to a relational database
Starting point is 00:36:44 where it's schema on write, meaning I have to go in there, create a database, create a table, create a field in there, write the ETL that lands it in there. And so it's a lot of extra work in there. So I can put that data in the data lake very quickly. And then somebody who has a skillset
Starting point is 00:36:58 to read that data in a data lake can go in there and look at the data and investigate it and see if it's even valuable before you go through the work of putting in a relational database, which I spent many times as a DBA doing all this work for an end user to put data in the database. And then they go and tell me, oh, it turns out we don't need that data, or it's not relevant, or it doesn't give us the value we thought. Wow, that's just weeks out of my life that are gone now. I can instead just dump that into the data lake,
Starting point is 00:37:28 and if they have that skill set, they can query that data and see if it's important before I do all that work to it. Or maybe they just need a one-time report, or maybe they're data scientists and they need to build a machine learning model. So now they have that data lake to do it in there. So it's kind of the best of both worlds by having that quick access to it. However, you still, in most cases, want to have a relational database for a few reasons. One of them is in the data lake, the metadata is separate from the data.
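The schema-on-read versus schema-on-write contrast described above can be sketched roughly as follows. This is an illustrative assumption, not anything shown in the episode: a local folder stands in for the data lake, SQLite stands in for the relational warehouse, and the function names and the customer fields are made up for the example.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def land_in_lake(lake_dir, name, records):
    """Schema on read: dump raw records into a file with no upfront modeling."""
    path = Path(lake_dir) / name
    path.write_text("\n".join(json.dumps(record) for record in records))
    return path


def read_from_lake(path, fields):
    """The schema is applied only now, by the reader picking the fields it wants."""
    rows = []
    for line in path.read_text().splitlines():
        record = json.loads(line)
        rows.append(tuple(record.get(field) for field in fields))
    return rows


def load_into_warehouse(conn, rows):
    """Schema on write: the table has to be defined before any row is loaded."""
    conn.execute("CREATE TABLE customer (id INTEGER NOT NULL, name TEXT NOT NULL)")
    conn.executemany("INSERT INTO customer VALUES (?, ?)", rows)
    return conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]


if __name__ == "__main__":
    raw = [{"id": 1, "name": "Ada", "extra": "x"}, {"id": 2, "name": "Bo"}]
    with tempfile.TemporaryDirectory() as lake:
        landed = land_in_lake(lake, "customers.jsonl", raw)
        picked = read_from_lake(landed, ["id", "name"])
    print(picked)
    print(load_into_warehouse(sqlite3.connect(":memory:"), picked))
```

Note that the two raw records do not even share the same fields, and the lake accepts them anyway. The modeling cost is only paid if someone decides the data is worth loading into the warehouse.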
Starting point is 00:37:55 So it can be quite confusing and challenging for end users to make sense of the data if the metadata is not along with it. Now, this is changing, and products like Synapse have ways of making it easier to make sense of that data. But in the end, it's just files sitting in a folder system. And so that could be too challenging for end users in there. It could also have less security on there. If I'm dealing with a file folder structure in there, I could put security on a file. But what happens if that file needs access by many different users who only should see certain rows in there? Maybe it's separated
Starting point is 00:38:36 by department. Well, you can't do that in a data lake. There's no row-level security like there is in a relational database. There's no column-level security and all this additional security that has been part of relational databases for many years. Yeah, there's certain workarounds in the data lake that give you some of that, but it's very challenging, a lot of complexity, a lot of extra costs. So a lot of customers said, I'm going to use the data warehouse as that security layer and that presentation layer. And I will use the data lake for the cleaning and transforming of the data and for use by power users. So in most cases, and I've argued for many years, you should just the data lake.
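As a rough illustration of what row-level and column-level security mean, the sketch below filters a small in-memory table by the caller's department and hides selected columns. It is a toy model written for this transcript, not how any particular database implements its policies: the User class, the ROWS data, and the rows_visible_to function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    department: str


# Hypothetical shared table that several departments need to read.
ROWS = [
    {"department": "HR", "employee": "Ann", "salary": 50},
    {"department": "Sales", "employee": "Raj", "salary": 60},
    {"department": "HR", "employee": "Lee", "salary": 55},
]


def rows_visible_to(user, rows, hidden_columns=()):
    """Return only the rows for the user's department, minus hidden columns."""
    visible = []
    for row in rows:
        if row["department"] != user.department:
            continue  # row-level rule: other departments' rows are filtered out
        # column-level rule: drop any column the caller is not allowed to see
        visible.append({key: value for key, value in row.items()
                        if key not in hidden_columns})
    return visible


if __name__ == "__main__":
    print(rows_visible_to(User("pat", "HR"), ROWS, hidden_columns=("salary",)))
```

A relational engine applies rules like these inside the database for every query. With plain file permissions, access is granted or denied for the whole file, so this kind of filtering has to be built and maintained separately.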
Starting point is 00:39:27 For example, you can use T-SQL in Synapse on data sitting in the data lake. And that was the big problem before: a customer said, well, I want to use data in the data lake, and you're telling me I have to use something like Hive SQL or Spark SQL. I just want to use regular T-SQL. And as much as that SQL could have been similar, it still wasn't enough. And products that Microsoft had, like U-SQL, failed because it just was too different. And so it gives you the benefit of using T-SQL. So what you can actually do is create a view on top of a file and then you have the
Starting point is 00:40:03 metadata in that view and you can use regular SQL. And then that made it a lot easier to open up the door for customers to say, well, maybe I'll just keep everything in a data lake, because you also have this serverless component that scales up and down and that's only for the query, so I can save money that way. But the bottom line is it still can be very confusing to have data in a data lake. If you're dealing with many sources, many files, many folders, still in a large majority of cases, you want to have a relational database with it. But I can see a little bit of movement into getting away with just a data lake, especially when you look at things that Databricks has brought into play with their data lakehouse and their Delta Lake, which I bet you can talk more about too.
Starting point is 00:40:51 But understand that the data lake is not what people thought of when it first came out, this land of rainbows and unicorns where you just dump data in there and the magic comes out and it's all cleaned and governed in there. It's more work to use a data lake in there, but you'll get a lot more benefits out of your solution if you have a data lake and a data warehouse. But realize it doesn't speed up the process of data governance in there. It adds more to it, but in return, you can get a lot more value out of your data. It's interesting hearing you give these explanations. The term, hearing you describe
Starting point is 00:41:34 all of the practical uses and value you can get from a data warehouse, it almost feels like data warehouse is a strange term. When you think about a warehouse, at least the initial thing that comes to my mind is you're just storing a bunch of stuff in a warehouse, right? And almost every part of the description you gave was actually really active, right? You can do this, you can do this. It makes this process easier. There's these sort of levels of security, which is really interesting. I guess maybe it's more akin to maybe an Amazon warehouse where you have all these robots driving extreme efficiency on the floor of the warehouse as opposed to just storing stuff.
Starting point is 00:42:19 One question, and I want to, yeah, we have plenty of time to cover the last topic that we wanted to cover. But one question before we leave the data warehouse, data lake discussion. At scale, it certainly makes sense to have a data lake and a data warehouse. We probably don't have time to get into the details of the data lake house and some of the new architectures that we're seeing. But one thing we've talked about on the show that I think is helpful is in the life of an organization, you go through phases where you hit breakpoints on needing to implement new technology or sort of scale or business reasons where you may want to implement new technology. And we've talked about how, okay, two guys in a garage as a startup, they're just querying their production database because they don't have enough data for it to be worth it to add additional infrastructure. And then at the extreme scale, you have
Starting point is 00:43:22 companies with multiple data warehouses, multiple data lakes, data marts, complex orchestration, etc. In terms of a warehouse and a data lake, would love your perspective on which one comes first? And when does it make sense to augment with the additional tool? And I know that's a little bit of a loaded question because there's a lot of dependencies, but we just love your high level thoughts on that. Yeah, sure. Most of the customers I saw that they have been down the road for a number of years and they're having pain points. Maybe they had just a relational database and they're going, well, my queries are taking forever. I have this maintenance window. I need to load more data.
Starting point is 00:44:05 The DBA is saying we have no more space, no more compute to do all that. And now the reports start suffering. You can't augment with additional data. So you have all these challenges in there. And that's the case of a traditional data warehouse: you have these limits, especially if you're on-prem. And then the modern data warehouse came out.
Starting point is 00:44:27 You can think of it as I'm migrating to the cloud, because in the cloud I have unlimited compute and storage and can also then use some additional tools that make life easier, like SQL Database or Synapse, which give you the solution as platform as a service on there. And then you can start using additional tools to master the data, to clean the data. And in the end, a modern data warehouse has five steps. You ingest the data, you store it, you transform it, you model it into a form that's easier to use in a relational database, and then you visualize it
Starting point is 00:45:11 on there. And then along the way, there may be machine learning you're using on there. So the idea is I need to collect all this data. And for a lot of customers, that's their first challenge. And I have these four stages of maturity. The first one is I have this data that's sitting everywhere. It's structured, but it's locally managed and you have spreadmarts and Excel spreadsheets. So stage two, where most customers are at, is you need to essentially co-locate the data. And it's always surprising how many customers are not through stage two yet. And that could be creating a modern data warehouse, putting all the data in one central location, and then starting reporting off of that.
Starting point is 00:45:52 And that's great. And it's sort of a rear view mirror approach. I can use that data to see where I've been and see trends. But the next stage, stage three, is predictive analytics. I want to take all that data I've captured and I want to do predictive analytics on there. Maybe I want to use that to predict customer churn and take actions beforehand. Instead of being reactive, I can be proactive. Maybe I want to see when a part's going to fail and change that part, through machine learning telling me that it's going to fail before it fails in there. And then the next stage after that is transformative,
Starting point is 00:46:28 where you want to take data no matter what the size, the speed, or the type of data and collect it all at a very large scale. And this is where we get into showing customers the art of the possible. If you ask an end user what they would like in addition to what they have now, what would make it better, and they're using Excel, they're just going to tell you that they want additional features in Excel. They may not be aware of a product like Power BI or some of the machine learning. And I always say, show them the icing on the cake up front.
Starting point is 00:47:06 Give them the art of the possible. They're going to look at those Power BI reports and dashboards and those machine learning tools to model them, and they're going to go, oh, holy, I'm completely shocked. You can see light bulbs going off in their head. Sometimes you can physically see them because they get all these ideas. They had no idea you could get all the value out of that. And they start going, well, I see so many ways I can save money with my company.
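The five steps listed above (ingest, store, transform, model, visualize) can be sketched as one tiny pipeline. This is a deliberately simplified illustration under stated assumptions: a dictionary stands in for the data lake, a text report stands in for a dashboard, and the sample CSV and function names are invented for the example.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw extract with inconsistent casing and stray whitespace.
RAW_CSV = "region,amount\nnorth,100\nNorth ,50\nsouth,25\n"


def ingest(text):
    """Step 1: read raw records from a source system export."""
    return list(csv.DictReader(io.StringIO(text)))


def store(lake, name, rows):
    """Step 2: land the untouched records in the lake (a dict here)."""
    lake[name] = rows
    return lake[name]


def transform(rows):
    """Step 3: clean the data (trim, normalize casing, convert types)."""
    return [{"region": row["region"].strip().title(),
             "amount": float(row["amount"])} for row in rows]


def model(rows):
    """Step 4: shape the data for reporting, here a total per region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


def visualize(totals):
    """Step 5: present the result, here as a plain-text report."""
    return "\n".join(f"{region}: {total:.2f}"
                     for region, total in sorted(totals.items()))


def run_pipeline(text):
    lake = {}
    raw = store(lake, "raw_sales", ingest(text))
    return visualize(model(transform(raw)))


if __name__ == "__main__":
    print(run_pipeline(RAW_CSV))
```

Keeping the raw copy in the store step separate from the cleaned copy mirrors the staging role of the data lake described earlier: the transformation can be rerun or changed without going back to the source system.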
Starting point is 00:47:28 I can see so many ways I can take shortcuts into generating reports. All this machine learning stuff is awesome. You start showing the industry models that they can create and they just go crazy because you're making their life easier in there. But then you have to tell them, okay, well, to do that, you have to get to stage two at least and collect all that data. And it's a lot of work on there, but you're now getting buy-in from the end users. You're getting buy-in from the business units that may unlock some budget. And so I saw this trend of talking more with end users that would come to me than IT because IT saw everything as just additional work. And they may not be so excited about building this modern data warehouse.
Starting point is 00:48:07 The end users, they see the value. They don't care about the technical details that have been passed on to IT, but they now see what they can get out of this. And especially if you prototype things, use something like Power BI that makes it easy, they can quickly see and touch and feel those reports. And then they can say, this is awesome, this is what we want. And so that gives me a level set to say, this is where customers, if they
Starting point is 00:48:35 start out new, they're going to use a data lake in almost every case. They're going to use a relational data warehouse in almost every case. If they've come from a traditional one where they just use a data warehouse, they usually want to incorporate a data lake. And there's ways of incorporating it where it's not everything going to the data lake at first, and maybe just new data sets that they haven't been able to ingest. And that goes to the data lake first. And so it's a little bit of variation until they eventually get to the ultimate solution of having everything in a data lake and then some of the data going into a relational database in there. easy to think about the sort of technological or data scale triggers that might necessitate augmenting your stack, but that doesn't take into consideration trust, which has been a really big thing on the show really since we began, of people who are going to consume data products that whatever your architecture is
Starting point is 00:49:46 produces. And I think the reporting example is great where it's okay, can we actually deliver real value with this component of the stack to sort of an end user consumer within the business? And then that of course justifies augmenting the stack for more complex use cases in the future. And that's just really helpful. I think that was, I think it was a really helpful way to think through it. We're closing in on the end here. And one of the subjects that, that we wanted to get your thoughts on is what we'll call sort of a data buzzword. And it came up on an episode, maybe two or three episodes ago. And it was a term that I was really surprised we hadn't covered yet on the show. And you've written a lot about it. So the term is data mesh. And I'll say the same thing I said as we were prepping for the show, data mesh is one of
Starting point is 00:50:45 those things that it sounds cool. We all think it probably is pretty cool. But if you ask the average person to define it, could you just define data mesh for me in a couple sentences? It's actually kind of hard. And there are parts of it that are still sort of ambiguous on a practical level. So can you give us your take on data mesh? And then we'll dig into a couple questions from there. Yeah, I was unaware of data mesh, that buzzword, until maybe eight or, well, maybe a year ago was when it first came around. It was very confusing, to your point. And this is one of the challenges with the data mesh: how can you have a new way of building a solution if nobody can agree on what the term means?
Starting point is 00:51:36 And I think it's got a way to go because I'm seeing people call everything a data mesh now. And the bottom line of a data mesh is really focused on organizational change, not a technical change. The idea of a data mesh is a mind shift where you go from a centralized storing of the data to decentralized.
Starting point is 00:52:06 So everything I've been talking about has been copying all the data into a central location, a data lake. Well, why not, and this is the data mesh theory, why not have all these various organizations in your company have data as a product, have a data domain, where instead of, say, HR and payroll and a homegrown application that could be something maybe dealing with customer orders, instead of copying all that into a central location, you keep it decentralized and you have each of those teams in those orgs who know the data best keep the data in their organization. And you as IT give them the rules and sort of like a contract that they have to follow to govern the data, to clean the data, and master data.
Starting point is 00:53:03 But the data is kept distributed. So you're reducing the amount of ETL to copy the data to a central location. You're allowing the people who know the data best to create the reports and dashboards. And you're reducing the bottleneck of IT having to do everything. The idea being we can scale better now because we're not limited to IT being the bottleneck. We can have all these organizations
Starting point is 00:53:32 where now you embed IT-like people in these organizations and they're all off doing their own thing. And so it becomes decentralized ownership instead of centralized ownership. You have fewer pipelines going to a central location and more local pipelines in there. You think of data as a product by each of these organizations, and you now have cross-functional domain teams instead of one siloed data engineering team. And that is the definition that I would say most people agree on,
Starting point is 00:54:18 but there's many, many different exceptions that people make to it, which is why we see a lot of issues with the confusion around it. And then while all that sounds great in theory, to implement that with technology can be very challenging. I don't think the technology is even there yet. And then the reason why I have a lot of concern about the data mesh is because while it sounds great in theory to give each of these different domains their own responsibility, imagine you're a large company and you have dozens of these domains. And now you're going to tell all of them to control their own data, to give them extra work. And you have to give them the benefits of why they're
Starting point is 00:55:06 going to do that. And they're going to be thinking in their own terms of, I'm just going to collect what data satisfies my own needs. They're not thinking enterprise-wide. HR may not be thinking of how to combine their data with all these other pieces of domains in there. And so somebody's got to have that enterprise view and somebody's got to collect all that data. And that's where it gets extremely challenging in there. So while I like the idea of a data mesh, I see it only used for maybe 1% of customers because there is so much upfront work
Starting point is 00:55:42 to make that organizational change that for many companies it's not going to work. And you also have to be at a size where you have this complexity and these challenges of scale, which again, 99% of companies don't have that problem. Many of the current solutions scale very well. They will continue to scale very well. I've seen Microsoft have many petabytes of data and make it work. So sometimes the argument in data mesh is things are not scaling, but they are scaling. And sometimes I feel like they're creating a panic point where there's not one on there. So that's where it, and I put in my latest blog, a lot of the challenges I see
Starting point is 00:56:26 with the data mesh, but I'm hopeful that for certain customers, it's going to be worth it, that extra development time. And they're going to wind up getting a lot more benefit out of their data if they build this and it works correctly. Yeah. Super interesting topic. And it's been interesting to consider it and have a couple conversations on the show. And I think you hit the nail on the head when you said, conceptually, if you just say, decentralize your data, and it sort of has these effects of a sort of democratized access and all these different components, it actually creates a lot of complexity practically in the stack for most companies, at least as it seems to me. And one of the concepts actually that's come up on the show
Starting point is 00:57:26 a lot over the episodes is that many times, especially when you're dealing with sort of particularly critical or high scale data concerns, it's like simpler is often better. And a refrain that we've heard a couple of times is, yeah, the way we do that is kind of boring, but guess what? Like it works and it's reliable and it's going to deliver on the mission critical things for our customers or internal stakeholders, et cetera. And so you see that, and also the tooling around centralization is getting better and better and actually making things a lot simpler, right? We didn't get into sort of what Databricks and Snowflake are doing around combining functionalities, but things that were once harder are becoming easier in the context of centralization. So
Starting point is 00:58:23 it is interesting. It kind of reminds me, I don't know if you remember, and I'm far from an expert on organizational design, but I remember maybe five years or five years back, maybe there was a really big push for this organizational design called holacracy. And I remember it's kind of like data mesh where on the outset, it was like, yeah, that sounds really great. And I happened to be really close to a really large scale company that was implementing this. And on the ground, practically all the employees just said, this is way too complicated. Can I just go talk to my manager? And so it kind of feels the same way, but at the same time, time will tell. And there are certainly things that we said 10 years ago, because technologies didn't exist that do today,
Starting point is 00:59:12 and they changed the way that we thought about things. So we will certainly see where things land. But I will say one thing that is neat to hear you point out is that you've actually seen it happen on the ground at a real company, which we haven't talked to someone who's seen that before. Yeah, you really hit the nail on the head. It's a lot of change. It's a lot of complexity. The problem I have with data mesh is sometimes it's presented to be almost an easy button. And as customers get into it, they realize it's more work. And if you look at some of the use cases that people, and there's not a lot that I've seen yet of implemented data mesh. They were spending, in some cases, years building a data mesh, even before data mesh was a word.
Starting point is 01:00:02 Because of the complexity and difficulty of getting all the domains within a company, and sometimes there's dozens of those domains, to buy into the data mesh. And the problem is if you just have one that says, I'm not going to do a data mesh, you're telling me I've got to do a little extra work, you tell me it's going to take a year or two and I've got to get work done now. So forget the data mesh. Well, now you have a data silo. Now how do you deal with that? Yeah.
Starting point is 01:00:28 And if everybody's going off doing their own thing, even though they said they'd be part of the data mesh, and somebody's using SQL Server, somebody's using Oracle, you have everybody just coming up with their own technology solutions in there, and you're the person that's got to collect all this data and make sense out of it in one place, now you're opening up a lot of extra work in there. And then there's even the skill set challenge, because those domains now have to have their own IT-like people to go and build these solutions in there. And we're seeing at EY that finding the talent that can do that is so difficult, and now you're asking to find even more talent in there.
Starting point is 01:01:07 And they may not be as skilled and have the expertise as somebody in IT. So now they may build something that's suboptimal. Somebody had a great analogy. It's like telling all these cities to go and build their own roads. Well, I can kind of think I can build a road. I can dig a hole. But the end result was I may have some city, some roads that are not built well, and I may do it a completely different way than the other cities. And so you have this huge mess. And now you have to say all your cities have to combine all your roads together from one city to
Starting point is 01:01:41 the other. Well, who's going to do that? They're all going to say, it's not my responsibility. Well, then IT's got to go and do it. And they've got to combine all the roads together. So it could take a lot of extra time, a lot of extra buy-in. Again, it could be worth it, but you have to know these things up front. That's why I try to put in my blog all the concerns you have to go through, so you make sure that you address all those and go, yeah, this could help us, or no, we're going to take a pass. Yeah, absolutely. I think the road analogy is a great one. And I think that's a huge benefit of having purview over all of the components of the data stack centrally.
Starting point is 01:02:21 But time will tell and technology will tell. And unfortunately, we are out of time, but I'm so glad we got to talk about the data mesh buzzword and dig a little bit deeper into that. Always fun to kind of talk about the buzzwords du jour. James, thank you so much for taking the time to join us on the show. We learned a ton.
Starting point is 01:02:41 Really fun to hear about all the cool stuff at Microsoft and all the cool stuff at Microsoft and all the cool stuff you're working on at EY. And we'd love to have you back on the show sometime soon in the future. Yeah, happy to come on again. I love talking about this. I can spend hours until my voice goes out. And so thank you for having me for this hour. Absolutely. And tell us where people can read your blog. It's a great blog. We read it a lot. And that's actually, I've read a lot about data mesh on your blog. So where can people find your blog posts? Yeah, it's my name, jamesserra.com, S-E-R-R-A. You'll find a lot of posts on data architectures
Starting point is 01:03:19 and your data mesh. There's a contact me button. If you have questions for me, feel free to shoot them over and I'll be happy to answer them. Great. Well, thanks again for joining us and we'll talk again soon. Thank you for having me. Right. My takeaway is not related to DataMesh, although I'm glad that I shared some opinions with James and we were able to maybe not complain about data mesh, but point out some of the issues around it. But my main takeaway was actually something that was on one of our earliest shows, and that is all the different tools that Microsoft offers that are really cool. And Microsoft, for some reason, well, maybe not for some reason, probably a lot of the reasons we know, but kind of has this weird feel of not being cool, especially for, you know, startups or data infrastructure. But they actually offer some really cool tools.
Starting point is 01:04:17 So I was really, it was fun. And I'm really glad you asked him about some of their products. So that's my big takeaway. Kostas? Yeah, absolutely. I really enjoyed that part. It was like a pretty good introduction to all the different data infrastructure related products that Microsoft has. And yeah, we shouldn't forget, like, Microsoft is huge. And regardless of what we think about them, I mean, they have built some amazing technologies, like MS SQL is one of them, for example. And there are many companies out there using Microsoft products, right? That's how Microsoft has become so big. So
Starting point is 01:04:57 that's something that we shouldn't forget. And they also do a lot of research. That's also one of my takeaways. The other one that I found very interesting and important, I think, has to do with data governance. I think that James, with the description of, like, data lineage and data quality, security, he gave us like a good description of how complex of a thing data management is. And I think this is the space where we are going to see a lot of innovation happening in the near future. And it was very interesting to hear his opinion about that and how important it is. Absolutely.
Starting point is 01:05:36 And I know data lineage is a subject that you're particularly passionate about. That's a big, it grabs a lot of headlines. Yeah, it's the equivalent of data mesh for you. That's right. One thing, here's a quick hot take for those of you who make it to the end of the episode on the perception of Microsoft. Here's my one minute theory that I just came up with. So do you remember how we talked about BigQuery maybe having some brand perception problems because they also like people use Google Docs and Gmail. And so it's like use Google for a large scale ML project on your warehouse, because you also it's like your personal Gmail that you get a bunch of spam email to. So here's my quick one minute
Starting point is 01:06:26 theory on Microsoft. They started out, you know, they provided tons of, and they always have provided tons of like data infrastructure products and other things like that. But that was sort of a bigger deal, like several decades ago, before their consumer products gained worldwide traction, right? So I think a lot of the people working in data today, their primary interaction with Microsoft was through the Office suite, right? Which is sort of its own conversation. And so when the Office suite, which is still the most widely used business software in the world, but you have all the cool kids now using Google Docs and Microsoft Office is not cool. And so if I'm going to go choose infrastructure
Starting point is 01:07:10 for my startup, I'm not going to choose Microsoft because I have a weird taste in my mouth. Yeah, that's true. And to your point, we shouldn't forget that probably one of the most sophisticated and most used data manipulation software out there is Excel. Yeah. So we should never forget that.
Starting point is 01:07:35 Like regardless of what we are doing, I mean, many very serious decisions about our lives every day are based on stuff that is happening on Excel. So never forget that. Never forget. Get some t-shirts that say Excel, never forget. Yeah, let's do that. Well, this is your little bonus round with some one minute theories and a t-shirt idea. Thank you again for joining the show. We'll have more interesting guests and potentially surprise hot takes at the end of the show for you coming up soon. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric at datastackshow.com.
Starting point is 01:08:29 That's E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
