The Data Stack Show - 52: Discussing Data Warehouses, Lakes, and Meshes with James Serra of EY
Episode Date: September 8, 2021

Highlights from this week's conversation include:

James' background at Microsoft and current work with EY's data fabric (2:22)
The external and internal facing components of EY's data fabric (6:39)
The importance of data lineage (11:29)
The most important requirements for data quality (15:32)
Looking at the data capabilities of Microsoft (21:30)
The data warehouse, explained (29:00)
Using a data warehouse or a data lake (34:33)
Defining the buzzword data mesh (51:13)
The problem with data mesh (59:31)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are
run at top companies.
The Data Stack Show is brought to you by RudderStack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome back to the show.
Today, we're going to talk to James Serra,
and there are lots of interesting things to discuss.
He has a great blog.
We read it consistently.
And I'm excited to ask him,
this came up a couple of episodes ago, but it's a buzzword
that is kind of all over the data space.
And James has written a lot about it, but data mesh.
And I have been forming my own opinions on data mesh as a concept in the data space.
And James has some strong opinions about it as well.
So that's what I want to ask him about. I know there are some strong
opinions on the show, but I may even let some of my nascent opinions on data mesh come out.
Costas, what do you want to ask James about? Yeah, I'm very interested to ask him about the
industry as a whole, to be honest. I mean, he's working at Ernst & Young and he's probably involved in pretty big projects and in projects
with companies that we don't probably hear that much about here in Silicon Valley. So yeah, I'd
love to hear how's the experience working with the rest of the industry out there trying to become
data-driven, what kind of technologies they are using, if there are any differences compared to the technologies that we see here
and all that stuff.
So yeah, that's what I find very super fascinating.
And I'm happy that I will have the opportunity to discuss with him about that.
Great.
Well, let's jump in and talk to James.
Let's do it.
James, welcome to the show. We're really excited to dig into a number
of topics with you. So thanks for giving us the time. Yeah, thanks for inviting me. Glad to be
here. All right. Well, just give us a brief background. You have a long history working
with data. So tell us where you've been and what you do today. Sure. I currently am a data platform
architect lead at EY, Ernst & Young. I've been here about five months. My main focus here is
to build this product internally. It's a data fabric and the idea is you want to collect tons of data.
It could be third-party data, EY internal data,
or client data, into this data fabric, and make it available for other products that EY sells to customers, as well as use it for
understanding our own internal metrics.
So it's a very large project.
It's about 200 people.
And it's very interesting because we work closely with Microsoft. We're building on the Azure stack.
And it's unique in that something at this large a scale has not been done much.
And so with Microsoft's help, we hope to have this built out within the
next few months. And before EY, I was at Microsoft for seven years in various different roles,
last being at the Microsoft Technology Center in New York City, where I spent every day engaged
with different customers, whiteboarding data platform type solutions.
It could be that they come in and they want to say, as an example,
learn more about a modern data warehouse and what that looks like.
And through discovery and asking a lot of questions,
I would come up with a high-level architecture with products
that would fit their particular use case.
Because it was always very challenging.
There could be many, many products
that do the same thing on Microsoft.
And so I wanted to help narrow them down
and make sure they make the right decisions.
They don't know what they don't know.
So it was very much an educational session
for each of the customers in various different industries,
various different sized customers.
And I was always in pre-sales technical roles at Microsoft.
And so this role at EY is a great experience
because I'm on the board side of things.
Before I came to EY, I spent many years
in Microsoft databases, data warehouses.
I had experience architecting and developing solutions
whose main goal was to collect data and help customer companies out there
make better business decisions. And through the years I was also a DBA for many years, and I started back
in SQL Server 1.0 on OS/2 back in 1989, I think it was.
I have a long history with working on the data platform stack.
Super interesting.
One question before we dig in.
We want to talk about a lot of warehouse stuff because you produce some great material on your blog, which we'll put a link to in the show notes.
But one question on the project,
if you can talk about it. One thing that's really interesting when you described the project that
stuck out to me was that there are multiple vectors of both internal and external facing parts of the project, it sounds like. And just to be specific on that,
there is both sort of first party data and then also third party data, which isn't necessarily
uncommon, but usually the most common use case we see around that is you have first party data
and you want to augment it in some way with some set of third party data. But then also, it sounds like the project itself will serve both the business,
but also be included in sort of products or customer facing products that you sell, right?
So sort of an internal data use case, and then also an external data use case.
Can you talk any more about that?
And I think the main question that comes to mind is that seems fairly complex dealing with those multiple vectors and multiple types of data and multiple audiences for the data.
Yeah, it is complex.
And then you add in the security that is needed at an extreme level to deal with data that
is client data.
And in a regulated company like EY, there's various rules and regulations you have to
follow.
And then of course, each customer's data that you collect, they don't want other people
seeing it.
They should not.
So there was a really high level of security and a lot of challenges
with that. But the main idea is let's
aggregate all this data together and make it available
to the product. So as an example, EY has
many products they sell, and a product that
a customer may be interested in could take data that the
customer has, it could take third-party data, to your point, and they could aggregate them together
and make, it could be machine learning models, it could be reports and dashboards,
that that company could use to maybe find out more about their supply chain, where they could increase profits.
They could use that data to find fraud
or money laundering that's going on if they're a bank.
They could use that data to find competitors
that are gaining on them in the industry.
So there's many dozens of use cases.
Well, all those products need data
and you don't want a situation where a new product
comes along, it creates its own ingestion platform, ingests its own third-party data and client data
while it's already been done with many other products. So it's unifying that experience and
having one ingestion platform that'll collect this third-party data. In addition, think of the data savings, the licensing savings,
you know, on third-party data.
A company like EY has tens of millions of dollars it spends on third-party
data sets.
And there's likely a lot of repeat data sets where people didn't know that
these other data sets already existed in EY.
So we'll have one place where we collect all this data.
Then we have a data explorer slash marketplace type environment
where anybody can go and search the data we have
and they'll say, oh, look, we have this data.
And here's the hooks into that data.
So what happens is it's a great product accelerator.
If somebody comes up with a new idea for a product
and they say we need 10 different data sets
and client data,
they can go and find out
that's already existing in this data fabric
and they can quickly ingest that data
and use it and get insights of that
and build their product
and go to market a lot quicker.
So that's a big idea
in this data fabric we're building
because think of the challenge of ingesting thousands of files from many different customers.
And you have to clean this data and join it and aggregate it and secure it.
You don't want everybody kind of reinventing the wheel and doing their own thing.
So this is built for multiple different products and also for internal use. Maybe somebody wants to look at all
this data we've collected in various engagements that EY has had and say, well, let's see
where we can optimize things. Let's collect these metrics and maybe build some machine
learning models on that. And well, we need the data. So let's have it in one unified
place. And that's what the data fabric gives them. So it's quite challenging because of
all these various different data sets
and client data has much different security requirements
and third-party data sets.
So we're going through all those challenges now
and it's been a great experience
and working closely with Microsoft
to see the various products that they have
and where the gaps are
when you're dealing with other products
outside of Microsoft.
James, I have a question for you.
I'm listening to you describing all this quite complex architecture so far.
And I'm wondering, I mean, one thing is like, okay, we want to ingest data.
We have data coming from many various places.
We are going to store this data in one place,
and this is going to solve problems around creating data silos
and giving access to all the data to the whole organization
and built on top of that first-class security
so we can ensure security and privacy around the data.
I was wondering, as we enable more and more use cases around data and more and more people and organizations at the company are able to access this data and process them, how important is it to keep track of what's going on with this data?
And more specifically, what I'm referring to is data lineage.
So, first of all, how important do you think this is when it becomes a problem and how
do you deal with it?
Yeah, I would say the biggest gap that I've seen customers have, especially when I was at
Microsoft working with them on building data warehouses, is they didn't give enough time to data governance.
And they really need to spend a lot of time thinking through the data governance
piece, which includes data lineage, as you asked, data quality, data security, data access, all
these things can be quite complex. And frequently customers just did not put enough time in the
project plan for all those different areas on there.
And data lineage is a big one because when the end user gets that report,
they may look at a particular number and go,
I'm not quite sure it's accurate.
Where did it come from?
And you want to be able to respond to them and show them the various stages this went through.
And so data lineage is a big part of what we're implementing.
And there are various products that track data lineage, and I'll throw out the one we're using,
which is Azure Purview, which has not GA'd yet. And there's many other great products
outside the Microsoft realm for this that'll track that this data came from this particular
data source. It went and was transformed and cleaned via this procedure
and then landed in, say, a data lake.
It then was moved into, say, a relational database
before it was then moved into something like Power BI
where it became a data set that was used for a report
that was used for a dashboard.
And if you can't get that answer quickly to the end user,
you're going to lose their confidence in what you're giving them.
So the challenge is that data lineage is not something where
you can just press a button and scan all these data sources
and come up with a lineage on there.
There's a lot of work that has to be done behind the scenes.
And you may have to send this information to a data lineage tool if, say, you're changing
data inside a stored procedure, because it's too much for some product
to scan a stored procedure and tell you everything it's doing. So we have to set
up guidelines. And if you're transforming the data, you have to
call these APIs in, say, Purview, to
tell it what you're doing in there. And so this becomes a lot
of oversight and governance in there. So it's coming up with these particular frameworks and guidelines
in there, but then someone's going to oversee it. Maybe you have a center of excellence where
anything that's submitted has to follow these rules. And part of one of those rules is that
it's got to send a lineage
over to the particular product.
So this gives you a nice clean way of seeing everything.
And also that helps in making sure
as you're building this along, you're not missing steps
or not properly cleaning something
or avoiding duplication of data
in these individual source systems that come in there.
Because in most cases, the data that you're pulling into this data warehouse,
you could have dozens and dozens of different sources on this.
It's really important to have that lineage to track where it starts
and where it ends.
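To make that lineage-reporting idea concrete, here is a minimal sketch of an ETL step pushing its own lineage metadata to a catalog, along the lines of the guidelines James describes. The endpoint URL, payload shape, and record_lineage helper are hypothetical placeholders rather than the actual Azure Purview API; in practice you would call whatever lineage API your catalog exposes.

```python
# A minimal sketch of an ETL step reporting its own lineage to a catalog.
# The endpoint, payload shape, and helper name are hypothetical placeholders,
# not the actual Azure Purview API.
import requests

LINEAGE_ENDPOINT = "https://example-catalog.internal/api/lineage"  # hypothetical

def record_lineage(source: str, process: str, target: str, token: str) -> None:
    """Report that `process` read from `source` and wrote to `target`."""
    payload = {
        "inputs": [{"qualifiedName": source}],
        "process": {"qualifiedName": process},
        "outputs": [{"qualifiedName": target}],
    }
    response = requests.post(
        LINEAGE_ENDPOINT,
        json=payload,
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    response.raise_for_status()

# Called from the wrapper around, say, a stored procedure that transforms data:
# record_lineage(
#     source="datalake/raw/orders/2021-09-01.parquet",
#     process="sp_clean_orders",
#     target="warehouse.dbo.FactOrders",
#     token="<access token>",
# )
```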
Yeah, makes total sense.
It's a very interesting topic. And I'm glad that you broke down data
governance into different pieces, because I started thinking, you mentioned data quality,
and data quality, at least in the companies in Silicon Valley, is a pretty hot thing lately.
You see companies, I think one of the catalog companies just raised another $100 million
or something. You have companies like Bigeye raising money from Sequoia. And in general,
everyone is looking into the data quality problem and trying to solve it. From your experience,
if you had to describe the two, three more important requirements around data quality that a new
product should address. What that would be based on your experience so far?
Yeah, when I look at data quality, the first challenge comes up that a customer has to answer
is who owns this data? And I've been in rooms where there were almost fistfights resulting from
trying to answer that question, because whoever owns the data is the one responsible for the data quality.
As to collecting this data into a data warehouse, I can't tell you how many times a customer said, oh, our data is perfectly clean.
And I would say, I'll bet you a hundred bucks I'm going to find some problems with it.
And sure enough, as soon as that data comes in, you find that, oh, well, in the order entry system, in order to get past a required
field, they had to put in a birth date, so people were putting in people who were born in the future
or people who were 200 years old. And so you have to get this data and then clean it. So this is part
of the data quality. Now you can plug those holes and you can revert back to the source systems,
but the damage has already been done. So somebody has got to clean it all.
So there's a lot of questions that kind of go back to the source system,
the owners of that and ask them, what do I do in these situations?
If the birth date is not valid in there, should I put an alternate?
What's it going to be?
There's going to be a lack of conformity if you're pulling in data from
different source systems and they have customers in there, one of those systems could use abbreviations
for a state and others could use the full name.
If you're generating reports, you have to have one common standard.
That's a big part of that. Somebody then is going to define the standard.
Usually, you may have a center of excellence team that goes through and says, okay, you need to conform
everything and this is what we're going to do.
Now add the complexity of master data management.
That's going to be part of the data governance in there.
I'm collecting those customer data.
The last thing you want to happen is you create a report, and the end user looks at it and goes, well, wait a minute.
Why is this person in here twice?
Their names are misspelled, but they're really the same people.
Now you've lost their trust, and it's going to be hard to gain that back.
So you have to think through this data governance.
That's why I say you spend a lot of time in it.
So mastering the data has got to be another important part of the data governance in there.
And even the data quality, well, how do you know the data is bad?
Is it a null or is it a zero?
What does that mean in there?
So a lot of investigation has got to be done with this.
And this is where you want to work closely with your end users.
Get them involved in the process early.
Ask them, how do I know this data is valid?
How do I clean it?
Do these numbers look right?
So they're not left at the end of it going, well, here's the report.
And they go, well, I have no input into this.
So they don't feel like they were part of it.
And I always say, always get those customers, the end users, involved early on in there.
Because then they'll be rooting for you if they're part of the process, as opposed to having this almost negative reaction to things that are just handed to them.
Because it involves a lot of change from what they may have been doing previously.
And it's hard for people to embrace the change
if you haven't made them part of the process in there.
And then I will say the last thing
that when people build out data warehouses
is you want to have this one version of the truth.
And I've had situations where I've found people
creating reports that were not accurate
because they were, in some ways,
changing the numbers to make themselves look better.
And once you centralize the data in a data warehouse and come up with, say, one formula
for all these various metrics and KPIs in there, you're going to have a possible lot
of disputes on what those metrics or KPIs should be.
So again, you get these rooms and you have these arguments in there, but in the end, you will have this one version of the truth. So people can
be confident that they're getting the same answers to the questions they're asking and not having
different answers to the one question in there. So it all revolves around data governance. I wish
I could say there's this magic button that can go through your data and clean it up,
but there's not.
There's no shortcuts to this.
It's a lot of time and effort to get the data quality right.
But in the end, it's going to be worth it.
But you have to put that in your project plan
and spend a lot of time on data governance.
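To make those checks concrete, here is a minimal sketch of the kinds of rule-based data quality tests James describes above (birth dates in the future, implausible ages, non-conforming state codes, null versus zero), using pandas. The column names and thresholds are illustrative assumptions; the real rules have to come from the owners of the source systems.

```python
# A minimal sketch of rule-based data quality checks like the ones described above.
# Column names and thresholds are illustrative assumptions, not a standard.
import pandas as pd

STATE_ABBREVIATIONS = {"NY", "CA", "TX", "WA"}  # truncated list for the example

def quality_report(customers: pd.DataFrame) -> dict:
    today = pd.Timestamp.today()
    birth = pd.to_datetime(customers["birth_date"], errors="coerce")
    return {
        # Dates typed in just to get past a required field
        "born_in_future": int((birth > today).sum()),
        "older_than_120_years": int((birth < today - pd.DateOffset(years=120)).sum()),
        # Lack of conformity: one system sends "NY", another sends "New York"
        "nonstandard_state": int((~customers["state"].isin(STATE_ABBREVIATIONS)).sum()),
        # Is it missing, or genuinely zero? Flag both for the data owner to decide
        "null_order_total": int(customers["order_total"].isna().sum()),
        "zero_order_total": int((customers["order_total"] == 0).sum()),
    }

# Example usage against a hypothetical extract:
# customers = pd.read_parquet("datalake/raw/customers/")
# print(quality_report(customers))
```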
Yeah, yeah.
I think those are some great points around data quality. And I would also add that
there is a reason that we have all these different parts under
data governance. And the reason that we have them under the umbrella of data governance is because,
for example, data quality and data lineage, it's important to have both together, right? Like one supplements the other in terms of the end goal.
Same with data access and all that stuff.
So yeah, that's a super interesting topic and a super hard topic also.
I think the industry now is trying to figure out the right ways to implement all these
methodologies and functionalities at a large scale.
And I think we are going to see a lot of interesting new companies trying to tackle these problems in the future.
But talking about new companies, I want to ask you something about a company that at
least in Silicon Valley, we keep forgetting when we are talking about data, and this is
Microsoft.
So you have a lot of experience with Microsoft and their products.
And actually, it's interesting because for the database systems, at least,
Microsoft is supposed to have probably one of the most complete database systems, which is MS SQL.
I mean, it might be a pain to manage, but in terms of the capabilities that the system has and the functionality that it has, it's probably the most advanced database in the market right now.
But can you give us a little bit more information around the products Microsoft has around data, like data warehouse, for example?
What's the data warehouse with Microsoft if someone wants to go to Azure today and what other tools and products they offer for all that stuff that we discussed so far?
Sure.
And I had this discussion many times with customers because, again, they were confused why there's so many products that Microsoft has.
And, okay, what's your use case?
And I'll narrow down that product list for you to then go and do research on there.
If we look at the OLTP side,
you have your SQL server,
you have your SQL database,
you have those relational databases
that have been around for forever,
especially meaning SQL server,
and then SQL database,
which has many different flavors of that,
is a PaaS solution instead of an IaaS solution
that you get with SQL Server in a VM.
But those are mostly for OLTP,
which sometimes you can get away with a data warehouse in there
if it's small, let's say under four terabytes.
And that only applied to customers who were very small customers
who didn't see a lot of growth in the data they're collecting.
Once you get over four terabytes or around there,
you want to start looking at a data warehouse solution.
And that's where, in the Microsoft realm,
you get into Azure Synapse Analytics. That is the tool of choice, I will say, in Microsoft
for large amounts of data for that data warehouse.
I have my history at Microsoft when I first started seven years ago
was on the parallel data warehouse.
That was Microsoft's on-prem data warehousing solution on there. It's like a Teradata or a Netezza, with MPP, massively parallel
processing technology. So that technology gives you an advantage over the traditional SMP
technology in that it can handle massive amounts of data. It distributes the data,
distributes the queries. It could be a long conversation just in itself on how that works.
But this opened the door for queries to go anywhere from 20 to 100 times faster
than a traditional SQL Server query on there.
Well, that product eventually migrated into SQL Data Warehouse in Azure.
And that has been around for a number of years.
And that product then
morphed into Azure Synapse.
And that technology
is still in Synapse under
a relational
pool that they have,
a dedicated pool in there. But that
product also added a bunch of other features.
It has a serverless pool. It has
Spark clusters in there. It's got Data Factory built
in. So it's a great tool if you're going to build out a data warehouse.
Everything is under a single pane of glass. And that's where Synapse
has tremendous value for customers to
enhance their time to market,
or time to build a solution in Synapse,
because of that integration of all those products in that single pane of glass.
And they still have that MPP technology in that dedicated pool.
And so that's the go-to with customers.
And within there, you can even make the argument, when we get into the serverless option,
that instead of having a dedicated pool that can be very costly,
and maybe I don't want to use it for data warehouses that are small,
I can use the serverless option and only pay per query.
And so maybe I can even then open this up to smaller databases,
smaller data set sizes in there.
A lot has to do with the customer
and what their current skill set is.
Are you SQL Server developers?
And that's going to make the transition pretty easy
into Synapse in there.
And so I asked, during discovery, a lot of questions of customers
and then would see if it would be a good fit.
And usually it doesn't take more
than a couple of days for anybody who's used to SQL server,
SQL database to move to something like Synapse.
And so that product is what I would say is the go-to,
and in, I would say, almost 90% of cases with customers,
Synapse was the solution.
It is really fun.
We talked, Costas, I don't know if you remember,
but we talked to a startup company who is building a product
in the medical space, actually.
And they were building on a Microsoft stack.
And it was great to hear.
That was an early episode, I think.
But it was really fun to hear about that
because, like you said,
it's really easy to forget, especially in the world of data where it's all these new fancy
tools and new fancy startups, that Microsoft has some really awesome technology. So James,
thanks for giving us some detail there and reminding us of that.
Yeah, sure. It's always interesting with customers. They don't know what they don't know. So they come in and they think, well, we should do everything in SQL Server.
And wait a minute, we have these PaaS solutions like SQL Database, which has many flavors: it has serverless, it has managed instance, it has hyperscale, so we can handle databases in OLTP that can be extremely large. And the challenge is the technology is changing so quickly
that even though my full-time job at Microsoft
was keeping up with the data platform, I can barely do that.
And so you can't expect customers to keep up with it all.
So they would come to Microsoft,
and they had cloud solution architects
and MTC architects like myself that would educate them,
or they'd go to partners and help educate them.
Because the reason data warehouses fail in most cases
Because the reason data warehouses fail in most cases is just customers use the wrong technologies for their use case. And I would see customers who use a certain product and they would go,
why don't you use this other product? And they'd go, well, we didn't even know that.
Well, okay. That's the reason. And so it's really important up front to be aware of all the products and their use cases.
So choose it early and don't run into a mistake where you're a few months in, many months in,
many millions of dollars to spend and you realize, ah, this is not the right product.
And I go back to the beginning.
Wise words.
Well, let's switch gears a little bit.
As we were prepping for the show,
we talked about,
which I love that we started out
talking about sort of an extremely complex project
with all different types of data
and all different types of users.
And then talked about the complexity of data governance
and data lineage at scale. Let's step back a bit because something that you've written about a
good bit is actually just the fundamentals of the data warehouse. And you have a great
post on your blog and a great video on YouTube just that I think is called Data Warehouse Explained. And I'd love for you to
just give us an overview of that. And as I was saying before the show, I think
we get exposed to so many new interesting technologies in the data space that it's easy to
sort of assume that we know the fundamentals of a tool that we use
every day. And so I think zooming out and getting context for that is helpful no matter where you're
at in terms of working with data. So James, give us a high level overview of the data warehouse
explained. Yeah, sure. And it's particularly true of the smaller companies who are just beginning their
journey of trying to get better insights and make better business decisions through data on there.
And it could be that they have some source system. It could be a homegrown thing that's OLTP that
they collect all this data maybe about customers. They could be using some CRM or ERP
system like an SAP and they say, well, we want to generate some additional reports and we may want
to combine what we have with multiple source systems. Could be even, hey, why are our sales
slow in certain areas of the country? Well, maybe it's something weather related.
So we need to combine our data with weather data
or competitive data on there.
Well, okay.
So we want to generate better reports.
Well, what you don't want to do is try to cram that data
into, say, SAP or your homegrown application
and just hammer it with reporting on there
because you're going to make the end user very angry.
And that's the first problem I see with customers trying to do reporting
on live production systems is they spike the CPU.
People start getting angry at IT.
What's going on here?
Somebody wrote a query that was malformed.
Man, if I had a dollar for every time I did the kill command as a DBA, I'd be rich
right now. And so
you need to offload
the data from a production system. Now, you
can replicate that
data, and there's various
ways of doing it in SQL Server,
but a better way is
to take that data and copy
it into some location
where you can make it better optimized for
queries in there.
So I can put different indexes on it.
I can lay it out in a certain way.
I can position it in a certain way.
I could also change the field names and the table names to make it easier for people
to understand.
And if it comes from some European system, you may have some
really cryptic names. Because the idea is
you want to have self-service BI. You want to create a
warehouse that has
tables in it that are very easy
for an end user to go to a tool
and just click and
drag those fields onto a
report and build it out without having to
get IT involved. So you need
to make it more presentable
by copying out of that source system into that data warehouse in there. Also, you can have a lot
more compute on top of that data in the data warehouse in there. You can ingest many different
sources of data. You can do the cleaning of the data in there. You can master the data in there.
And that gives you protection against, say, a source system upgrading.
Because if you're running reports against a source system
and they upgrade to a new version, reports may break.
Well, if you copy that data into
a data warehouse,
well, the ETL into the data warehouse may break
with the upgrade, but at least the data in there
is okay, and
you're not going to have this huge
problem of having to go back and rewrite all these various reports
with your queries on there.
And it also allows you to clean the data
and find things that may result in holes in the source system
that you can go back to the source system and say,
you need to plug this hole in there because the data is not clean.
And by having that data in that data warehouse,
you have one version of the truth.
And that can be used
as the basis to
create all the reports and dashboards.
And you can put that data in the data warehouse
in a third normal form
that has many relational tables
that can be joined together
to produce those
queries. But a lot of times customers will go one step further and they will create a
star schema,
which takes that data in those multiple tables and joins it together.
So you have these fact and dimension tables.
So you have a lot less complexity because somebody's done the work to create
those joins in there.
And so again,
that end user can very quickly and easily generate reports off that.
Now there's other steps you can take. You can aggregate it. You can put it into a product like
Azure Analysis Services, where it's a cube and it aggregates that data. So it's also for performance
reasons. And you can quickly get answers to queries that may take quite a long time. You can put hierarchies in there.
And so there's all this additional steps you can take.
Now, you may be saying, well, this is additional cost and complexity. Well, it is, but there's a reason for that: you are making
it very easy for that end user to have reports that are not only easy
to create, but very performant.
So there's that trade-off of cost and complexity, but it's worth it because you will have the
speed and the simplicity for your end users on there.
And so this is a lot of what I explain to customers as that data moves through this
modern data warehouse and lands on all these things and copies all these things.
The end result is going to be worth it.
But you have to do the work up front.
James, you gave a very good description of what a data warehouse is.
But I would also like to ask you about the concept of the data lake,
something that you hear more and more lately.
And can you tell us a few things about what a data lake is? What are the differences compared to a data warehouse?
And when someone should consider one or the other?
Sure.
And that's a very hot topic.
If we go over what I just mentioned and putting everything in a relational database, that
was the way it was for many years.
But there were problems arising on that.
The first is you
have to have this maintenance window where we have to knock end users off the system, because if I'm loading
all this data and I need to clean it and master it and do all these other things, that's a lot of CPU, a
lot of processing that's going to be done. And many times I see maintenance windows over three hours,
four hours, they go to eight hours in there. And what happens if you have somebody who wants access to data 24/7? What happens if you kick them off, but then there's
a problem and you run over the maintenance window? Maybe you have to tell them you can't get on the system
until it finishes fixing this bug or whatnot in there. So along came the data lake to help with
some of those problems in there. You can think of the data lake, and there's many reasons
for the data lake, but one of them could be
I want to offload
all those transformations of data,
that staging area that you have
in a relational database, and put that
into a data lake. So the data
is copied into that data lake instead,
and I put compute
on top of that data lake and I do all those
transformations without
affecting the data
warehouse, the relational data warehouse in there. And then that maintenance window essentially goes
away or just maybe a few minutes where you load the data after it's been cleaned in the data lake.
So the data lake becomes that staging area. And so that's one huge reason right there for a data lake.
Others are, I can hoard data in a data lake because if you look at the data lakes,
the cost can be very, very cheap, especially compared to putting in a relational database.
And so I could, as opposed to a relational database where it's very costly and I have to
delete data that's older or only keep data in there if I'm absolutely sure I need it,
I can just dump all this data in a data lake and down the road can see if I need it
or I can keep a complete history of that.
And because the data lake is schema on read, meaning I can put data in there
without any upfront work.
It's like a glorified file folder on your laptop.
Create folders, put the data in there, as opposed to a relational database
where it's schema on write,
meaning I have to go in there,
create a database, create a table,
create a field in there,
write the ETL to land it in there.
And so it's a lot of extra work in there.
So I can put that data in the data lake very quickly.
And then somebody who has a skillset
to read that data in a data lake
can go in there and look at the data
and investigate it and see if it's even valuable before you go
through the work of putting in a relational database, which I spent many times as a DBA
doing all this work for an end user to put data in the database. And then they go and tell me,
oh, it turns out we don't need that data, or it's not relevant, or it doesn't give us the value we
thought. Wow, that's just weeks out of my life that are gone now.
I can instead just dump that into the data lake,
and if they have that skill set, they can query that data
and see if it's important before I do all that work to it.
Or maybe they just need a one-time report,
or maybe they're data scientists and they need to build a machine learning model.
So now they have that data lake to do it in there.
So it's kind of the best of both worlds by having that quick access to it.
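A minimal sketch of that schema-on-read, "glorified file folder" idea: files are simply dropped into a landing folder and an analyst can peek at them before anyone invests in relational modeling. The local path stands in for a data lake container and is purely illustrative.

```python
# A minimal sketch of schema-on-read: look at what arrived before doing any
# up-front modeling. The local path stands in for a data lake container.
from pathlib import Path
import pandas as pd

landing_zone = Path("datalake/raw/vendor_feed")  # the "glorified file folder"

# No CREATE TABLE, no ETL up front -- just see what landed and whether it is
# worth the work of modeling it into the relational data warehouse.
for file in sorted(landing_zone.glob("*.csv")):
    sample = pd.read_csv(file, nrows=100)
    print(file.name, list(sample.columns))
    print(sample.head(3))
```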
However, you still, in most cases, want to have a relational database for a few reasons.
One of them is in the data lake, the metadata is separate from the data.
So it can be quite confusing and challenging for end users to make sense of the data if the metadata is not along with it. Now, this is changing, and products like Synapse have ways of making it easier
to make sense of that data.
But in the end, it's just files sitting in a folder system.
And so that could be too challenging for end users in there.
It could also have less security on there.
If I'm dealing with a file folder structure in there, I could put security on a file. But what happens if that file needs access
by many different users who only should see certain rows in there? Maybe it's separated
by department. Well, you can't do that in a data lake. There's no row-level security like
there is in a relational database. There's no column-level security and all
this additional security that has been part of relational databases for many years. Yeah,
there's certain workarounds in the data lake that give you some of that, but it's very
challenging, a lot of complexity, a lot of extra costs. So a lot of customers said, I'm
going to use the data warehouse as that security layer and that presentation layer. And I will use the data lake for the cleaning and
transforming of the data, and for use by power users. So the question, and I've argued about this for
many years, is whether you could use just the data lake.
For example, you can use T-SQL and Synapse on data sitting in the data lake.
And that was the big problem before was a customer said,
well, I want to use data in the data lake,
and you're telling me I have to use something like Hive SQL or Spark SQL.
I just want to use regular T-SQL.
And as much as SQL could have been similar, it still wasn't enough. And products that Microsoft
had like U-SQL failed because it just was too different. And so it gives you the benefit of
using T-SQL. So what you can actually do is create a view on top of a file, and then you have the
metadata in that view and you can use regular
SQL. And that made it a lot easier and opened up the door for customers to say, well, maybe I'll
just keep everything in a data lake, because you also have this serverless component that gets
scaled up and down and you only pay for the query, so I can save money that way. But the bottom line is
it still can be very confusing to have data in a data lake.
If you're dealing with many sources, many files, many folders, still in a large majority of time, you want to have a relational database with it.
But I can see a little bit of movement into getting away with just a data lake, especially
when you look at things that Databricks has come into play with, their data lakehouse
and their Delta Lake, which I bet you can talk more about too.
But understand that the data lake is not
what people thought of when it first came out, this land of rainbows
and unicorns where you just dump data in there and the magic comes out and it's all cleaned
and governed in there. It's more work to use a data lake in there, but you'll get a lot more
benefits out of your solution if you have a data lake and a data warehouse. But realize it doesn't
speed up the process of data governance in there. It adds more to it, but in return, you can get a lot more value out of your data.
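To illustrate the view-over-the-lake pattern James describes, here is a hedged sketch that creates a T-SQL view over Parquet files using the OPENROWSET form that Synapse serverless SQL pools support. The storage URL, workspace name, database, and credentials are placeholders, and the exact connection details would differ in a real environment.

```python
# A hedged sketch of the "view over the lake" pattern: a T-SQL view over Parquet
# files so report authors get metadata and plain SQL without moving the data.
# Storage URL, workspace, database, and credentials below are placeholders.
import pyodbc

CREATE_VIEW_SQL = """
CREATE VIEW dbo.vw_sales AS
SELECT *
FROM OPENROWSET(
    BULK 'https://<storage-account>.dfs.core.windows.net/lake/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS [result];
"""

connection = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<workspace>-ondemand.sql.azuresynapse.net;"  # serverless endpoint (placeholder)
    "DATABASE=lakehouse;UID=<user>;PWD=<password>"
)
cursor = connection.cursor()
cursor.execute(CREATE_VIEW_SQL)
connection.commit()

# From here, end users just query the view with regular T-SQL, for example:
# SELECT region, SUM(amount) FROM dbo.vw_sales GROUP BY region;
```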
It's interesting hearing you give these explanations. Hearing you describe
all of the practical uses and value you can get from a data warehouse,
it almost feels like data warehouse is a strange term. When you think
about a warehouse, at least the initial thing that comes to my mind is you're just storing a
bunch of stuff in a warehouse, right? And almost every part of the description you gave was actually
really active, right? You can do this, you can do this. It makes this process easier. There's
these sort of levels of security, which is really interesting. I guess maybe it's
more akin to maybe an Amazon warehouse where you have all these robots driving extreme efficiency
on the floor of the warehouse as opposed to just storing stuff.
One question, and I want to, yeah, we have plenty of time to cover the last topic that we wanted to
cover. But one question before we leave the data warehouse, data lake discussion. At scale,
it certainly makes sense to have a data lake and a data warehouse. We probably don't have time to
get into the details of the data lake house and some of the new architectures that we're seeing.
But one thing we've talked about on the show that I think is helpful is in the life of an organization, you go through phases where you hit breakpoints on needing to implement new technology or sort of scale or business reasons where you may want to
implement new technology. And we've talked about how, okay, two guys in a garage as a startup,
they're just querying their production database because they don't have enough data for it to
be worth it to add additional infrastructure. And then at the extreme scale, you have
companies with multiple data warehouses, multiple data lakes, data marts, complex orchestration, etc. In terms of a warehouse and a data lake, would love your perspective on which one comes first? And when does it make sense to augment with the additional tool? And I know that's a little bit of a loaded question because there's a lot of
dependencies, but we just love your high level thoughts on that.
Yeah, sure.
Most of the customers I saw have been down the road for a number of
years and they're having pain points.
Maybe they had just a relational database and they're going, well,
my queries are taking forever. I have this maintenance window.
I need to load more data.
The DBA is saying we have no more space,
no more compute to do all that.
And now the report starts suffering.
You can't augment with additional data.
So you have all these challenges in there.
And that's the case of a traditional data warehouse
is you have these limits, especially if you're on-prem.
And then the modern data warehouse came out.
You can think of it as I'm migrating to the cloud,
because in the cloud I have unlimited compute and storage
and can also then use some additional tools
that make your life easier.
There's SQL Database or Synapse, which are platform-as-a-service solutions
on there. And then you can start adding, using additional tools to master the data, to clean
the data. And in the end, a modern data warehouse has five steps. You ingest the data, you store it, you transform it, you model it into a
form that's easier to use in a relational database, and then you visualize it
on there. And then along the way, there may be machine learning you're using on there.
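As a rough illustration of those five steps, here is a compact sketch of the flow as plain Python functions, with local files standing in for the lake and the warehouse. All paths and column names are illustrative assumptions, not a prescribed design.

```python
# A compact sketch of the five steps: ingest, store, transform, model, visualize.
# Paths and column names are illustrative assumptions.
import pandas as pd

def ingest(source_csv: str) -> pd.DataFrame:            # 1. ingest from a source system
    return pd.read_csv(source_csv)

def store(df: pd.DataFrame, lake_path: str) -> None:    # 2. store raw, schema-on-read
    df.to_parquet(lake_path, index=False)

def transform(lake_path: str) -> pd.DataFrame:          # 3. transform / clean
    df = pd.read_parquet(lake_path)
    return df.dropna(subset=["order_id"]).drop_duplicates(subset=["order_id"])

def model(df: pd.DataFrame) -> pd.DataFrame:            # 4. model into an easy-to-query shape
    return df.groupby(["order_date", "region"], as_index=False)["amount"].sum()

def visualize(df: pd.DataFrame) -> None:                # 5. visualize (stand-in for Power BI)
    print(df.sort_values("amount", ascending=False).head(10))

# raw = ingest("orders_export.csv")
# store(raw, "datalake/raw/orders.parquet")
# visualize(model(transform("datalake/raw/orders.parquet")))
```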
So the idea is I need to
collect all this data. And a lot of customers, that's their first challenge.
And I have these four stages of maturity. The first one is, I have this data that's sitting everywhere. It's
structured, but it's locally managed and you have spreadmarts and Excel spreadsheets. So stage two,
where most customers are at, is you need to essentially co-locate the data. And it's always
surprising how many customers are not through stage two yet.
And that could be creating a modern data warehouse, putting all the data in one central location, and then starting reporting off of that.
And that's great.
And it's sort of a rear view mirror approach.
I can use that data to see where I've been and see trends.
But the next stage, stage three, is predictive analytics. I want to take all that data I've captured and I want to do predictive analytics on there.
Maybe I want to use that to predict customer churn and take actions beforehand instead of being reactive.
I can be proactive.
Maybe I want to see when a part's going to fail and change that part through machine learning telling me that it's going to fail before it fails in there.
And then the next stage after that is transformative,
where you want to take data no matter what the size, the speed,
or the type of data and collect it all at a very large scale.
And this is where we get into showing customers the art of the possible.
If you ask an end user what would
they like in addition to what they have now, what would make it better, and they're using Excel,
they're just going to ask, tell you that they want additional features in Excel.
They may not be aware of a product like Power BI or some of the machine learning. And I always say,
show them the icing on the cake up front.
Give them the art of the possible.
They're going to look at those power reports and dashboards
and those machine learning tools to model them,
and they're going to go, oh, holy, I'm completely shocked.
You can see light bulbs going off in their head.
Sometimes you can physically see them because they get all these ideas.
They had no idea you could get all the value out of that.
And they start going, well, I see so many ways I can save money with my company.
I can see so many ways I can take shortcuts into generating reports. All this machine learning
stuff is awesome. You start showing the industry models that they can create and they just go
crazy because you're making their life easier in there. But then you have to tell them, okay,
well, to do that, you have to get to stage two at least and collect all that data. And it's a lot of work on there,
but you're now getting buy-in from the end users. You're getting buy-in from the business units
that may unlock some budget. And so I saw this trend of talking more with end users
that would come to me than IT because IT saw everything as just additional work.
And they may not be so excited about building this modern data warehouse.
The end users, they see the value.
They don't care about the technical details that have been passed on IT,
but they now see what they can get out of this.
And especially if you prototype things,
use something like Power BI that makes it easy,
they can quickly see and touch and feel that reports.
And then they can say, this is awesome,
this is what we want. And so that gives me a level set to say, this is where customers are. If they
start out new, they're going to use a data lake in almost every case, and they're going to use the data
warehouse, the relational one, in almost every case too. If they've come from a traditional setup where they just used a data warehouse,
they usually want to incorporate a data lake. And there's ways of incorporating it where it's not
everything's going to the data lake at first and maybe just new data sets that they haven't been
able to ingest. And that goes to the data lake first. And so it's a little bit of variation
until they eventually get to the ultimate solution of having everything go over to the data lake and then some of the data going into a relational database in there.
It's easy to think about the sort of technological or data scale triggers that might necessitate
augmenting your stack, but that doesn't take into consideration trust, which has been a really big
thing on the show really since we began, of people who are going to consume data products that whatever your architecture is
produces. And I think the reporting example is great where it's okay, can we actually deliver
real value with this component of the stack to sort of an end user consumer within the business?
And then that of course justifies augmenting the stack for more complex use cases in the future. And that's just really helpful. I think that was, I think it was a really helpful
way to think through it. We're closing in on the end here. And one of the subjects that,
that we wanted to get your thoughts on is what we'll call sort of a data buzzword. And it came up on an episode, maybe two or three
episodes ago. And it was a term that I was really surprised we hadn't covered yet on the show.
And you've written a lot about it. So the term is data mesh. And I'll say the same thing I said
as we were prepping for the show, data mesh is one of
those things that it sounds cool. We all think it probably is pretty cool. But if you ask the
average person to define it, could you just define data mesh for me in a couple sentences?
It's actually kind of, it's kind of hard. And there are parts of it that are still sort of ambiguous on a practical
level. So can you give us your take on data mesh? And then we'll dig into a couple questions from
there. Yeah, I was unaware of that data mesh buzzword until maybe eight months or, well, maybe a year
ago, when it first came around. It was very confusing, to your point.
And this is one of the challenges with the data mesh is how can you have a new
way of building a solution if nobody can agree on what the term means.
And I think it's got a way to go because I'm seeing people call everything a
data mesh now.
And the bottom line of a data mesh
is really focused on organizational change,
not a technical change.
The idea of a data mesh is a mind shift
where you go from a centralized storing of the data
to decentralized.
So everything I've been talking about has been copying all the data into a central location, a data lake.
Well, why not, and this is the data mesh theory, why not have all these various organizations in your company have data as a product, have a data domain, where instead of, say, HR and payroll and a homegrown application
that could be something maybe dealing with customer orders,
instead of copying all that into a central location,
you keep it decentralized and you have each of those teams
in those orgs who know the data best
keep the data in their organization, and you as IT give them the rules and sort of like
a contract that they have to follow to govern the data, to clean the data, and master the data.
But the data is kept distributed.
So you're reducing the amount of ETL to copy the data to a central location.
You're allowing the people who know the data best
to create the reports and dashboards.
And you're reducing the bottleneck of IT having to do everything.
The idea being we can scale better now
because we're not limited to IT being the bottleneck.
We can have all these organizations
where now you embed IT-like people in these organizations
and they're all off doing their own thing.
And so it becomes decentralized ownership
instead of centralized ownership.
You have less pipelines going to a central location and the more local pipelines in there.
You think of data as a product by each of these organizations, and you now have cross-functional domain teams
instead of one siloed data engineering team.
And that is the definition that I would say most people agree on,
but there's many, many different exceptions that people make to it,
which is why we see a lot of issues with the confusion
to it. And then while all that sounds great in theory, to implement that technology can be very
challenging. I don't think even technology is even there yet. And then the reason why I have a lot of
concern about the data mesh is because, while it sounds great in theory to give each of these different domains
their own responsibility, imagine you're a large company and you have dozens of these domains.
And now you're going to tell all of them to control their own data, to give them extra work.
And you have to give them the benefits of why they're
going to do that, and they're going to be thinking in their own terms of, I'm just going to collect
what data satisfies my own needs. They're not thinking enterprise-wide. HR may not be thinking
of how to combine their data with all these other domains in there. And so somebody's got
to have that enterprise view and somebody's got to collect all that data.
And that's where it gets extremely challenging in there.
So while I like the idea of a data mesh,
I see it only used for maybe 1% of customers
because there is so much upfront work
to make that organizational change
that, for many companies, it's not going to work.
And you also have to be at a size where you have this complexity and these challenges of scale,
which, again, 99% of companies don't have.
Many of the current solutions scale very well. They will continue to scale very well.
I've seen Microsoft have many petabytes of data and then make it work. So sometimes the argument in data mesh is things
are not scaling, but they are scaling. And sometimes I feel like they're creating a
panic point where there's not one on there. So that's where, and I put in my latest blog a lot of the challenges I see
with the data mesh, but I'm hopeful that for certain customers, it's going to be worth it,
that extra development time. And they're going to wind up getting a lot more benefit out of their
data if they build this and it works correctly. Yeah, super interesting topic. And it's been interesting to consider it
and have a couple conversations on the show. And I think you hit the nail on the head when you said,
conceptually, if you just say, decentralize your data, and it sort of has these effects of democratized access and all these different
components, it actually creates a lot of complexity practically in the stack for
most companies, at least as it seems to me.
And one of the concepts actually that's come up on the show
a lot over the episodes is that many times, especially when you're dealing with a sort of
particularly critical or high scale data concerns, it's like simpler is often better. And a refrain that we've heard a couple of times
is, yeah, the way we do that is kind of boring, but guess what? It works,
it's reliable, and it's going to deliver on the mission critical things for our customers or
internal stakeholders, et cetera. And so you see that, and also the tooling around
centralization is getting better and better and actually making things a lot simpler, right?
We didn't get into sort of what Databricks and Snowflake are doing around combining functionalities,
but things that were once harder becoming easier in the context of centralization. So
it is interesting. It kind of reminds me, I don't know if you remember, and I'm far from an expert on organizational design,
but I remember maybe five years back, there was a really big push for this
organizational design called holacracy. And I remember it's kind of like data mesh, where on the outset it was like,
yeah, that sounds really great. And I happened to be really close to a really large scale company
that was implementing this. And on the ground, practically all the employees just said,
this is way too complicated. Can I just go talk to my manager? And so it kind of feels the same
way, but at the same time, time will tell. And there
are certainly things that we said 10 years ago, because technologies didn't exist that do today,
and they changed the way that we thought about things. So we will certainly see where things
land. But I will say one thing that is neat to hear you point out is that you've actually seen it happen on the ground at a real company,
which we haven't talked to someone who's seen that before.
Yeah, you really hit the nail on the head. It's a lot of change. It's a lot of complexity.
The problem I have with data mesh is sometimes it's presented to be
almost an easy button.
And as customers get into it, they realize it's more work.
And if you look at some of the use cases, and there's not a lot that I've seen yet of implemented data meshes, people were spending, in some cases, years building a data mesh, even before data mesh was a word.
Because of the complexity and difficulty of getting
all the domains within a company, and sometimes there's dozens of those domains, to buy into
the data mesh. And the problem is, if you just have one that says, I'm not going to do a data
mesh, you're telling me I've got to do a little extra work, you tell me it's going to take a year or two,
and I've got to get work done now, so forget the data mesh.
Well, now you have a data silo.
Now how do you deal with that?
Yeah.
And if everybody's going off doing their own thing, even though they said they'd be part
of the data mesh, and somebody's using SQL Server, somebody's using Oracle, you have
everybody just coming up with their own technology solutions in there. And you're the person
that's got to collect all this data and make sense out of it into one, so now you're opening up a lot of extra work in there. And then even the skill
set challenges: again, those domains now have their own IT, like people to go and build these solutions
in there. And we're seeing at EY that to find the talent that can do that is so
difficult, and now you're asking to find even more talent in there.
And they may not be as skilled and have the expertise as somebody in IT.
So now they may build something that's suboptimal.
Somebody had a great analogy.
It's like telling all these cities to go and build their own roads.
Well, I kind of think I can build a road.
I can dig a hole. But the end result is I may have some cities with roads that are not built well,
and they may do it a completely different way than the other cities. And so you have this huge mess.
And now you have to say all your cities have to combine all your roads together from one city to
the other. Well, who's going to do that? They're all going to say, it's not my responsibility.
Well, then IT's got to go and do it. And they got to combine
all the roads together. So it could take a lot of extra time, a lot of extra buy-in.
Again, it could be worth it, but you have to know these things up front. That's why I tried to put in my
blog all the concerns you have to go through, and make sure that you address all those and go,
yeah, this could help us or no, we're going to take a pass.
Yeah, absolutely. I think the road analogy is a great one.
And I think that's a huge benefit of having purview over all of the components of the data stack centrally.
But time will tell and technology will tell.
And unfortunately, we are out of time,
but I'm so glad we got to talk about the data mesh buzzword
and dig a little bit deeper into that.
Always fun to kind of talk about the buzzwords du jour.
James, thank you so much for taking the time
to join us on the show.
We learned a ton.
Really fun to hear about all the cool stuff at Microsoft
and all the cool
stuff you're working on at EY. And we'd love to have you back on the show sometime soon in the
future. Yeah, happy to come on again. I love talking about this. I can spend hours until my
voice goes out. And so thank you for having me for this hour. Absolutely. And tell us where people
can read your blog. It's a great blog. We read it a lot. And that's actually,
I've read a lot about data mesh on your blog. So where can people find your blog posts?
Yeah, it's my name, jamesserra.com, S-E-R-R-A. You'll find a lot of posts on data architectures
and data mesh. There's a contact me button. If you have questions for me, feel free to shoot them over
and I'll be happy to answer them. Great. Well, thanks again for joining us and we'll talk again
soon. Thank you for having me. Right. My takeaway is not related to data mesh, although I'm glad that
I shared some opinions with James and we were able to maybe not complain about data mesh, but point out some of the issues around it.
But my main takeaway was actually something that was on one of our earliest shows,
and that is all the different tools that Microsoft offers that are really cool.
And Microsoft, for some reason, well, maybe not for some reason, probably a lot of the reasons we know, but kind of has this weird feel of not being cool, especially for, you know, startups or data infrastructure.
But they actually offer some really cool tools.
So I was really, it was fun.
And I'm really glad you asked him about some of their products.
So that's my big takeaway.
Kostas? Yeah, absolutely. I really enjoyed that part. It was like a pretty good introduction to all the
different data infrastructure-related products that Microsoft has. And yeah, we shouldn't
forget, Microsoft is huge, and regardless of what we think about them, I mean, they have built some amazing technologies,
like MS SQL is one of them, for example. And there are many companies out there
using Microsoft products, right? That's how Microsoft has become so big. So
that's something that we shouldn't forget. And they also do a lot of research.
That's also one of my takeaways. The other one that
I found very interesting and important has to do with data governance. I think that
James, with the description of data lineage, data quality, and security, gave us like a good
description of how complex of a thing data management is. And I think this is the space where we are going to see a lot of innovation happening
in the near future.
And it was very interesting to hear his opinion about that and how important it is.
Absolutely.
And I know data lineage is a subject that you're particularly passionate about.
That's a big one, it grabs a lot of headlines.
Yeah, it's the equivalent of data mesh for you.
That's right. One thing, here's a quick hot take for those of you who make it to the end of the episode on the perception of Microsoft. Here's my one minute theory that I just came up with.
So do you remember how we talked about BigQuery
maybe having some brand perception problems because people also use Google Docs
and Gmail? And so it's like, use Google for a large-scale ML project on your warehouse,
when it's also your personal Gmail that you get a bunch of spam email to. So here's my quick one-minute
theory on Microsoft. They started out, you know, they've always provided tons of data infrastructure products and other things like that. But that was
sort of a bigger deal, like several decades ago, before their consumer products gained worldwide traction, right? So I think a lot of the people
working in data today, their primary interaction with Microsoft was through the Office suite,
right? Which is sort of its own conversation. And the Office suite is still
the most widely used business software in the world, but you have all the cool kids now using Google Docs,
and Microsoft Office is not cool.
And so if I'm going to go choose infrastructure
for my startup, I'm not going to choose Microsoft
because I have a weird taste in my mouth.
Yeah, that's true.
And to your point, we shouldn't forget
that probably one of the most sophisticated
and most used data manipulation software out there is Excel.
Yeah.
So we should never forget that.
Like regardless of what we are doing, I mean, many very serious decisions about our lives every day are based on stuff that is happening on Excel. So never forget that.
Never forget. Get some t-shirts that say Excel, never forget.
Yeah, let's do that.
Well, this is your little bonus round with some one minute theories and a t-shirt idea. Thank you again
for joining the show. We'll have more interesting guests and potentially surprise hot takes at the
end of the show for you coming up soon. We hope you enjoyed this episode of the
Datastack Show. Be sure to subscribe on your favorite podcast app to get notified about new
episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by RudderStack,
the CDP for developers.
Learn how to build a CDP on your data warehouse
at rudderstack.com.