The Data Stack Show - 58: Data Federation is No Longer The "F" Word with Scott Gnau of InterSystems
Episode Date: October 20, 2021

Highlights from this week's conversation include:

Solving problems with data has been a long-time passion of Scott's (2:52)
Day-to-day use of data at InterSystems (6:25)
The technical aspects involved in constructing a data fabric (17:52)
Companies at a variety of maturity levels can adopt a data fabric (26:49)
A paradigm shift in the marketplace (28:39)
Comparing and contrasting data fabric and data mesh (30:49)
Sharing data across the business and not having it siloed in different departments (39:46)
Privacy and security within a data fabric (41:22)
The future of data fabric and pushing the edge (43:17)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are
run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome back to the Data Stack Show. This week, we're talking with Scott from InterSystems,
and he's the VP of Data Platforms there. And he has a really long history of working
in data. So he was at Teradata for a long time. He was the CTO at Hortonworks,
and has just done a number of things that I think give him really interesting perspective on how the industry has changed over time.
And is today doing some really interesting things at InterSystems, namely sort of promoting this concept of a data fabric, which could be really interesting. This is not going to surprise Kostas, you or probably our listeners, but I just love talking with people who have worked in data from the very beginning of, I guess, what
we could call like the modern data industry, which actually goes back only a few decades,
amazingly enough.
And I always love hearing people's perspective when they look back through all the changes
that have happened.
So that is what I'm going to ask.
No surprise.
But as always, I think we'll get some interesting insights.
Yeah.
I mean, I'm also waiting to hear many stories of how things have changed.
I mean, he's been in this space for many decades.
And I think it's going to be great to hear from him how things have changed from like
going from mainframes to the cloud and then to the data fabric. And of course, I want to learn
more about the data fabric itself. Like what is this new thing? We have data meshes, data fabrics,
data lakes, data warehouses, lake houses, and who knows what else.
You can get the real story right here on the Data Stack Show.
Yeah, yeah. And yeah, I'd love to know more and, like, see how much of it is more of an architectural pattern and how much of it is an actual technology that is implemented, and what's the impact that it has. And I think we have the right person to answer all these questions. So let's go and chat with him. Let's do it.
Scott, welcome to the show. Really excited to chat with you about lots of different
topics. We probably won't get through all of them, but I really appreciate you taking the time.
Thanks for having me.
You have an incredible resume. We talked a little bit about Hortonworks with the kind of a connection
that we have there from the East Coast, but would
love to just hear about your background, how you got into data, and then what you're doing today.
Yeah, I mean, sometimes it sounds like it's a plan, but it really isn't. Just solving problems
with data has always been a passion of mine, even from the first assignments that I had that
weren't necessarily very sophisticated analytically, but involved a lot of data and being able to resolve that data into some
sort of a decision or action quickly. And I started my career in a massively parallel processing kind
of environment back in the dark ages, like the nineties, when the world's largest data warehouse at that time was, do you want to guess?
30 gigabytes. And that was huge. And it took racks and racks and racks of space to pull this
together. But the point is, there was a lot of information, there's a lot of intelligence in
that data. And I really started my career with the notion of parallel processing to kind of
break that down into hundreds and thousands of parallel threads so that the decision, so that the analytics could actually run really quickly without requiring mainframe class kind of compute.
And I always found that really interesting, not just because scientifically of doing it and the physics and all of the programming that goes into it and the analytics, obviously.
One of the things that thrilled me is that when you do it and you do it right, sometimes the answer you get back is completely unexpected and you learn something from your data.
And that's actually the cool thing.
Later on, when I moved into more of a big data world, it was the same problem to solve,
but with much more variety of data. It's no longer just transactions
and purchases and customers, but web logs and social sentiment kinds of data that can be
entered into those analytics to get a much more thorough view of what's happening and
much better kind of decision. So interesting. It is so crazy to think about. It wasn't actually that long ago when 30 gigabytes seemed like such a huge amount of space.
And now everyone's phone has more space than that, which is wild. And what do you do? What do you do today working in data day to day?
So now I'm here at InterSystems, which gives me an opportunity that really feels like a synthesis of all the experiences that I've had across my career, whether it be massively parallel processing, highly efficient transaction processing, a large variety of data, or adding in new kinds of analytics altogether. That is really the mission that we're on and that I'm working on here at InterSystems in our data platform organization.
Our technology at InterSystems actually started in the healthcare world.
And you might imagine there's a lot of data in healthcare.
It's really varied.
It could be your x-ray image.
It could be physician notes.
It could be the payment that you made for the visit that you just took, which is structured
transactional,
and synthesizing all of that together for better treatments and better outcomes is a use case.
And the physics behind that use case is very similar now to what we're seeing expand into extended analytics that take advantage of lots of different data of different origin
and delivering analytics or insights directly at the time of interaction
with a consumer or directly to someone's device. Super interesting. Could you just give our
listeners a quick example of some of the customers and use cases, just so they have
sort of a practical knowledge of what your work looks like day-to-day, you know, sort of in the life of businesses and consumers?
Sure. You know, day-to-day here at InterSystems and with our data platform,
we think that we capture more than half of North America's electronic medical records,
and certainly a large percentage outside of North America. So just think about any time
you're interacting with a physician or at a hospital or getting a treatment or a service
or have an insurance claim, that information is flowing through our technology and being used
for not just keeping track of you and your treatments and all of those things, but also
being used in many instances to provide for better outcomes, better treatments, better proactive
kinds of treatments, as well as from an operational perspective.
A lot of our clients will use that technology to optimize their own operations.
How many folks do I need on call at what period of time?
Is there seasonality and all those things so that we can line up the supply chain and
all of those things?
We also have a decent footprint in the financial services industry and capital markets.
And so about 15% of global equity trades, again,
go through systems that are managed by the InterSystems IRIS data platform. And again,
you think about the synthesis of that very high volume, can't lose it. It's got to scale.
And I've got to make some decisions about pricing and adjudication in a very fixed amount of time.
Those are the kinds of problems that we solve with our technology.
Sure. That's incredible.
I mean, just thinking about the scale of 50% of EMRs
sort of interacting with your platform.
Well, Scott, one thing that we chatted about before the show,
which I just want to dive right into,
is this concept of a data fabric.
And we love breaking down terminology on the show. So recently
we talked about the term data mesh, and there's a lot of people excited about this term data mesh.
And you've talked a lot about this concept of a data fabric. So break it down for us. What is
a data fabric? Well, data fabric is kind of a logical construct that we like to use and think about that kind
of sets the bar to help enable our clients and folks in the industry to be successful.
And I'll back up and I'll talk first about the requirements and then that'll kind of
lead into how we think about and why we think about a data fabric as kind of a concept, right?
First, and we hit on it, and when I was talking about the introduction, right, there's
data volumes and data variety, it's just like, off the charts, crazy, right? Everything now has a digital footprint. And the devices in our hands are compute devices,
and they're creating digital footprints and all kinds of new data connected and on the web and social media and all of that interaction data, as well as some of the more
traditional transactional systems that folks have, whether it be stock trades or your checking
account or retail purchases. So first and foremost, data is just everywhere. It's high in
variety and it's extremely high in volume and it can be very volatile.
And so when you think about that, that's different than certainly 10, 15, or 20 years ago,
when the majority of data was kind of created inside of a corporate firewall,
largely by mainframes connected to PCs. It was very structured, transactional,
and very controllable. Now it's kind of out there.
And so one of the results of that is kind of traditional processing of, hey, let me consolidate the data into a place and try to normalize it and then do something with it just doesn't work. It's
just kind of physically impossible, A, to do it, B, just to keep up with it. So that means now it's
more important to think about connectivity
of data than consolidation of data. So one of the key underpinnings of a successful data fabric
is the notion of data connectivity. Can I really play it where it lies? Can I get access to it
in a very seamless fashion? Okay. So there's that. Another thing that's happening, obviously,
and we see it on the nightly news and people talking all the rage about artificial intelligence and machine learning and deep learning.
Because there are massive amounts of compute available in the world today and massive amounts of bandwidth available, there's a whole new class of analytics that it's possible to actually deploy. Not only is it possible to actually deploy,
but it's possible to actually get a relevant answer and use it for a much more sophisticated
kind of analytic and ultimately drive a better insight, a better connectivity with a client,
customer, or prospect. And so analytics are no longer just aggregating and summarizing and
joining tables, but now include all of these
other kinds of capabilities as well. So there's a requirement for some sort of flexibility
in the model of what kind of analytic can I run, when can I run it, and how can I interject new
analytics as they're invented into those pipelines in real time without starting over. So that's another construct of what we're talking about in a data fabric.
Certainly on the first point, I talked about data variety, the variety of data and tomorrow
there'll be some other kind of data that we hadn't thought of, right?
And so it's no longer possible, and it's no longer efficient, to just have a tool that is a SQL engine
or a NoSQL engine or this or that. You got to think about your data fabric as being able to
consume and store any kind of data in its natural format without having to change it when you store
it. You don't want to convert it into rows and columns. You don't want to apply any change to it.
And well, why is that? Well, number one, if you're going to connect back to it,
you want to see it in its native state. But more importantly, if you're going to generate trust across this ecosystem, you've always got to be able to map back to the origin of the data and
how the data came to you. And if you make changes to it, you can't do that and you can't build that
level of trust. Think about running a machine learning algorithm to optimize a treatment for a patient.
You really want to trust that the data came from the right place and that the prediction
that you've made is accurate.
It's a life or death kind of scenario.
So being able to have that kind of construct in it.
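[Editor's note: to make the "store it in its native format and keep the trail back to the source" idea concrete, here is a minimal, hypothetical sketch in plain Python. It is not InterSystems' actual API; the store, field names, and example payloads are invented for illustration.]

```python
import hashlib
import json
from datetime import datetime, timezone

# Hypothetical in-memory "fabric" store: raw payloads are kept byte-for-byte,
# and provenance metadata travels alongside them instead of being lost in a transform.
FABRIC_STORE = []

def ingest(raw_payload: bytes, source_system: str, source_uri: str) -> dict:
    """Store a record exactly as it arrived, plus metadata that lets any
    downstream analytic map a result back to its origin."""
    record = {
        "payload": raw_payload,  # unchanged, native format
        "provenance": {
            "source_system": source_system,
            "source_uri": source_uri,
            "ingested_at": datetime.now(timezone.utc).isoformat(),
            # A content hash makes later tampering or drift detectable.
            "sha256": hashlib.sha256(raw_payload).hexdigest(),
        },
    }
    FABRIC_STORE.append(record)
    return record["provenance"]

# Example: an HL7-style message and a JSON claim land side by side, untouched.
ingest(b"MSH|^~\\&|LAB|HOSP|ORU^R01|...", "lab_feed", "hl7://lab/msg/123")
ingest(json.dumps({"claim_id": 42, "amount": 180.5}).encode(), "claims_db", "claims://42")
print(FABRIC_STORE[0]["provenance"])
```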
And I'd say kind of the last thing is that you've got to think about being able to deploy insights at various places along the chain.
So it's no longer relevant to run a bunch of batch nightly uploads. And over the weekend,
you run some data mining algorithms and Monday morning, knowledge workers show up and they do
something, right? You've got to deliver a recommendation that's relevant to a device, to a consumer while they're interacting with you.
And so, I mean, just the raw physics of that and kind of speed of light means that you've got
processing out to the edge and that you've got to have the capability and the sophistication to
deliver in real time or in the right time.
So if you take kind of those constructs that I just described as four main kind of constructs, that's what we think of as kind of the requirement set for a modern data fabric, where you're
able to weave together different kinds of data with different timeliness from different
sources in the cloud on-prem with different kinds of
analytics with different destinations. And one of the things that we were talking about,
we were talking earlier was, so what about the cloud in all of this? Well, the cloud is actually
kind of a culprit to some degree. Number one, because you can now create almost infinite
compute resources on demand, which means there's a whole new set of analytics that's possible that wasn't possible before, and it's affordable. That's actually really cool.
The other thing is with all of these connected devices and everything that's going on with
cloud-based technologies, it just is a further distribution of data. And for the first time,
you know, in a generation, right, there is a whole class of data that will actually live its entire life cycle
only in the cloud. And that leads to the ability to do connectivity in a very seamless and
transparent way that generates trust and traceability back to the source.
Interesting. What an interesting concept. I've never thought about being at a point where
data will live its entire life cycle in the cloud. And that's
just so interesting to consider. But one point you made, Scott, that I'd love to dig into a little
bit, I know Kostas probably has a bunch of questions, but I'll retain the microphone for
just a minute longer. So one thing you said was an emphasis on connectivity, connecting data because it's coming from various sources in different formats. But a lot of companies, in the context of the cloud data warehouse that you mentioned, are still just trying to collect all their data in one place, right? It's like,
if we can just get all of our data into Snowflake or BigQuery, we'll get so many answers. And so
that seems to be a trend that's still pretty strong, but it may depend on the size of the
company and sort of the complexity. But I just love to dig into that a little bit more since we do see a trend towards
companies working really hard to collect data where you're saying connectivity is actually
sort of the bigger problem, it seems like. Yeah, I think connectivity is more sustainable,
right? When you think about consolidation of data, there are a whole lot of aspects to it.
One is just the sheer movement of data, which is expensive and time consuming. Just the latency of data movement can mean that you're getting access to the data too late in that whole consolidation kind of a scenario. When the consolidation fails or when there's a failed network port or something like that,
then human beings have to get involved.
And human beings are more expensive than software typically, have to get involved to kind of
resolve what happened, what's going on.
If data gets out of sync when you're consolidating it, then you violate
some of the trust you've built up because you may get different answers at different points
along the pipeline. So I think really solving that problem and thinking about judiciously
using consolidation versus doing connectivity becomes a really important new paradigm.
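[Editor's note: a rough, hypothetical illustration of "connectivity over consolidation." The sketch queries two sources where they live and moves only small summaries, instead of copying every row into a central store first; sqlite3 in-memory databases stand in for whatever the real systems of record would be.]

```python
import sqlite3

# Two independent "systems of record"; in reality these might be an EMR
# database and a claims system, each reached over its own connection.
orders = sqlite3.connect(":memory:")
orders.execute("CREATE TABLE orders (customer_id INT, total REAL)")
orders.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 99.0), (2, 15.5), (1, 40.0)])

crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (customer_id INT, name TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])

def spend_by_customer():
    # Push the aggregation down to the source so only a tiny summary moves,
    # then stitch it to the other source's data at the "fabric" layer.
    totals = dict(orders.execute(
        "SELECT customer_id, SUM(total) FROM orders GROUP BY customer_id"))
    names = dict(crm.execute("SELECT customer_id, name FROM customers"))
    return {names[cid]: amount for cid, amount in totals.items()}

print(spend_by_customer())  # {'Ada': 139.0, 'Grace': 15.5}
```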
And so certainly there's a lot of buzz in the market about folks moving data and analytics to the cloud.
And isn't this really great?
I think that that will abate very quickly because in the end, it'll still have the limitations that I described, but because it's still a consolidation play, you just happen to be consolidating in a different place that happens perhaps to be a bit more expensive than the place you used to be consolidating.
Interesting.
I have like one main question right now,
which is about data fabric.
And so it's kind of like,
it sounds like a great idea, right?
Like, yeah, it makes total sense
that instead of like replicating the data
from all the different places where it lives
and like trying to move it into one place
and all these things,
we can just connect the data together
and work on top of that.
But how is this data fabric created and implemented on a more technical level?
What are the components of a data fabric?
So, and that's a really good question.
So obviously what I described is kind of a logical construct.
And you think about it from an architectural perspective, where the rubber meets the road
is how the heck do I go implement this, right? And so there are a lot of different folks
and a lot of different companies
talking about different ways to go do it.
I would say that in many cases, certainly the cloud vendors are saying, yeah, you can go build this kind of stuff. You basically cobble together a collection of seven or eight different technologies and you can kind of get this functionality.
And we see people doing that and trying to make that successful.
Certainly, the data fabric definition that I provided is kind of the bar that we set
all of our technology investments towards at InterSystems with the IRIS data platform, aiming to be able to do that with a single set of technology and provide a little bit lower risk
to our customers. But like I say, it's more a logical construct and then it becomes kind of
the bar that we set for ourselves when we're making investment decisions in the technology
and the flexibility that we create. And like any set of blueprints that you get from an architect,
right, you can choose your materials and build out the structure differently. Certainly, we like to think that we can compete with the set of
materials that we bring, which is very simple and easy to support. Is there a set of like, let's say,
fundamental components that this architecture has? Something that like, let's say, you cannot have a
fabric without at least these components.
Yeah, I mean, bringing it down a little bit further, certainly you need persistence.
You need pipelines, transports, and then you basically need the calculate functions.
And I think about it mostly in kind of a microservices kind of architecture. We say, if you're able to move stuff, if you're able to persist stuff, and then if you're able to calculate, whether that's an analytic or a transformation or whatever,
and you have those, and you can kind of cobble those three services together, pretty much like
our DNA is made up of four base materials in different combinations, you can make very complex
and interesting things. It's the same thing in this data fabric.
And what you need from the underlying technology of your data fabric and the standards that you choose is to be able to host those different things
and combine those things up and then manage them
as end-to-end applications that you build.
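[Editor's note: a toy sketch of the "three base materials" framing above: a persist service, a transport (pipeline) service, and a calculate function, composed into one small end-to-end flow. The names and shapes are illustrative, not a real product API.]

```python
from typing import Callable, Iterable, List

# Persist: anything that can durably keep records (here, just a list).
class Persist:
    def __init__(self) -> None:
        self.rows: List[dict] = []
    def write(self, row: dict) -> None:
        self.rows.append(row)

# Transport: moves records from a source to a sink, applying a calculation
# along the way.
def transport(source: Iterable[dict], sink: Persist,
              calculate: Callable[[dict], dict]) -> None:
    for row in source:
        sink.write(calculate(row))

# Calculate: any analytic or transformation expressed as a function.
def enrich(row: dict) -> dict:
    return {**row, "high_value": row["amount"] > 100}

readings = [{"id": 1, "amount": 250.0}, {"id": 2, "amount": 40.0}]
store = Persist()
transport(readings, store, enrich)  # compose the three "base materials"
print(store.rows)
```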
That's super interesting.
Is there like some kind of relationship between a data fabric
and what technically used to be called query federation? Because you mentioned a lot of
being able to have this kind of decentralized architecture where, from what I understand at
least, I have an analytical function and I can run it wherever the data is, instead of having to get the data in one place and execute my query there. So I remember, for example, Presto. Okay, of course,
like Presto works in your own like environment. So it wasn't so decentralized. But in a way,
the whole idea of Query Federation was that instead of copying the data and bringing into
one place, let's execute the query there where the data lives, get the results back and somehow connect and consolidate the results.
Is there some kind of like relationship there between the two ideas?
Yeah, there is. And I'll have to tell you, you may edit this out.
But for the first 25 years of my career, federation was the F word to me because it never really worked, right?
But when you think about the ability to do connectivity, you now have a new set of tools that can make that kind of a use case, although there are different terms, data virtualization
and other things come up. You now have some more tools that make it more of a reality.
I think there really are two things that we kind of hold ourselves accountable for at InterSystems. One is actually, yes, being able to push the processing to the data or the data to the processing if you want, but typically you want to push the processing to the data because that's the
cheapest thing to do and then get the result sent in a pipeline somewhere else, right? And then the
second thing that we do that I think is kind of unique in
the industry is that we are multilingual in the kind of process that we allow to run against our
data. So we're multilingual, meaning you can speak SQL, Java, or Python, and interact with the data.
And we think that's really important in the data fabric construct, because what I described is,
all these new analytics.
So it's no longer just a SQL statement, but you might want to bring in some machine learning thing that's written in Python.
And you just want to push that out and have the technology stack figure out how to run that process in an optimal way and then get the answer back. We're starting to see some interesting use cases from our customers who are
able to do this because certainly in a traditional machine learning data science model that doesn't
use a data fabric or InterSystems tech, there's this huge data extract and you extract a bunch
of data and give it to the data scientists and then they run their stuff and they find something
that's interesting. It's like, okay, we think this is interesting. Then they wipe the data out and they go get more data and they run it and they get the answer, but then they have to take the answer and manually put it back. Think about all the latency that's created there. If you can just run the machine learning model on the data where it lies, even if your process was inefficient, and ours isn't, but even if it was inefficient, you're removing all that latency from the process.
You now have the reality of being able to have
a much better decision in time to do something about it.
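[Editor's note: a hedged illustration of "run the model where the data lies." Instead of extracting a table for the data scientist, the sketch registers a Python scoring function with the database and applies it inside the query; SQLite's create_function stands in for the multilingual push-down a real platform would provide, and the model itself is a hand-coded stand-in.]

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE visits (patient_id INT, age INT, prior_visits INT)")
db.executemany("INSERT INTO visits VALUES (?, ?, ?)",
               [(1, 72, 9), (2, 35, 1), (3, 58, 4)])

def readmission_risk(age: int, prior_visits: int) -> float:
    """Stand-in for a trained model; a real one would be loaded, not hand-coded."""
    return min(1.0, 0.01 * age + 0.05 * prior_visits)

# Register the Python "model" so it runs next to the data rather than after
# a bulk extract, removing the extract/score/write-back latency loop.
db.create_function("risk", 2, readmission_risk)
for row in db.execute(
        "SELECT patient_id, risk(age, prior_visits) FROM visits ORDER BY 2 DESC"):
    print(row)
```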
Yeah, that's very interesting.
And this kind of, that's why I have like a very specific question
about machine learning specifically,
but is this model like applicable both for training
and inference or it's more about one of them? Because I can imagine that like inference can
happen much easier at the edge, let's say, or like on a mobile application or like whatever,
but training, because it's kind of like a little bit more complicated, more iterative kind of process. Can it happen? Can all the use cases around working with data be served on a data fabric?
We actually support all of those
modes. And I think when you think about the fabric that you deploy and the technologies that you
choose, you really think about all those modes really needing to exist as close to the data as
possible. What are the most common use cases that you have seen deployed on the data
fabric? Is it machine learning? Is it like more BI related use
cases? What have you seen your customers doing? I'd say it's kind of like
the market, right? The market, everybody gets BI. Human beings
kind of think relationally to some degree and they're used to
interacting with those tools.
So just like that's kind of the base of experience in the industry right now, you probably see
more of a predominance of BI algorithms than the others. But the machine learning stuff is starting
to grow. And I think, gosh, I remember in the late 80s, BI was a new concept, right? And yes, I'm that old, sorry.
But it was a new concept.
And it took a while for people to catch on that it actually really worked and was very
meaningful, right?
Kind of people think, well, I know my business.
I don't need BI to tell me my business, right?
We're seeing some of that now with some of the early machine learning stuff.
And certainly some of the early adopters get it and so on and so forth.
But if you look at kind of the middle of the market, kind of the mid to late adopters, they're still just playing with it.
They haven't fully bought in. But when they do, that capability will become even more important.
You'll see that volume grow. I think also that being able to combine those modes is also a really interesting thing. Thinking even about some of the most simplistic cases:
I want to do pricing adjudication on a transaction. And into that, I want to factor the risk capital
impact that it will have on my business. And I want to consider the total relationship
that I have with the customer. And oh, by the way, I want to run an ML model that maps the vector of
the securities pricing of the underlying security to predict whether or not this will be a good
transaction. And then I can put all that together in a couple of milliseconds and price the
transaction. It combines all of that technology together.
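[Editor's note: a hypothetical, drastically simplified version of the pricing example above: one call that folds a risk-capital charge, the customer relationship, and a model prediction into a single price, the kind of decision that has to finish in milliseconds. All names and numbers are invented for illustration.]

```python
from dataclasses import dataclass

@dataclass
class Quote:
    customer_id: int
    notional: float

RELATIONSHIP_DISCOUNT = {1: 0.02, 2: 0.0}  # illustrative customer-relationship data

def risk_capital_charge(notional: float) -> float:
    return 0.001 * notional  # toy stand-in for the risk-capital impact

def predicted_move(symbol_history: list) -> float:
    # Stand-in for an ML model over the underlying security's price vector.
    return (symbol_history[-1] - symbol_history[0]) / symbol_history[0]

def price(quote: Quote, symbol_history: list) -> float:
    base = quote.notional * (1 + predicted_move(symbol_history) * 0.1)
    base += risk_capital_charge(quote.notional)
    base *= 1 - RELATIONSHIP_DISCOUNT.get(quote.customer_id, 0.0)
    return round(base, 2)

print(price(Quote(customer_id=1, notional=10_000), [101.0, 102.5, 104.0]))
```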
Okay, that's super interesting. From your experience with the customers you're working with implementing a data fabric, do you think that today there is a specific type of company or organization that is, let's say, more ready or more mature to implement and adopt a data fabric? Or is it something that you think can benefit, or even be implemented in, any company?
I'm seeing it across the board.
I would say that in some of the less mature companies,
since they're kind of coming in at this point in history,
it's almost like the de facto requirement that they're building from, versus a more mature company that's got all kinds of legacy applications and legacy businesses that certainly can't be compromised. I see them doing things a little bit more incrementally and thinking about how do I go transition? And the cool thing about the data fabric is, if you weave it correctly with the right technology,
it'll plug into the legacy stuff and
leave that kind of unadulterated and start to build new applications in this space and it'll
start to take on critical mass. And at some point you'll kind of see that cross the chasm and that'll
be now the de facto standard. Yeah, that's interesting. And I'll go back again to the components of the data fabric, mainly because, I mean, if someone follows the market and the news and what is happening out there, they might be aware of things like the data lake and the data warehouse; now we have the lakehouse, which is like a hybrid between the two. I don't know what will follow after this, but
companies are out there investing right now huge amounts of both money and effort to implement all these architectural patterns, right? How do you see these fitting under the concept of the data fabric? Is there some kind of conflict there?
Is the data fabric like something that sits on top?
How do you see it?
And also like give us some best practices and some like advice
on how we should think as architects,
as data architects with all these different components.
I think it's really the next generation architecture, right?
There was a time when data marts were the state-of-the-art architecture for analytics, and then that became enterprise data warehouse,
and then that became data lake. And then, and I think this is just kind of what comes after,
right? And as the industry matures, and by the way, each of those things in and of themselves
were extremely relevant when they came to market. But the market, because it's
changing rapidly, and there's this huge volume explosion, a variety explosion of data. And also
because the bar is set higher, because you and I, and all of us are much more educated consumers,
we expect that the folks that we interact with will understand us better, right? So all these
things kind of come together and say that the ball has moved. And you mentioned it: a lot of people are now saying, oh, I'm
going to run my data warehouse in the cloud. That's interesting, but it's kind of like putting
new leather seats in a 40-year-old car that the transmission just fell out of. Sure, your ride
will be more comfortable and that's interesting, but is that really your sustainable mode of transportation? And so I really think about it as kind of a paradigm shift, not to use a buzzword,
but a paradigm shift in the marketplace where all of these things were relevant at a time and
data lakes were very relevant at a time, right? Because I got to capture all the data and figure
out what it is and understand it. The thing that sometimes is missing in data lakes is the notion of traceability and connectivity for trust, and they become data swamps. And so there's that.
And again, not bad technology, not bad concepts for the time, but I think the world and the market
has moved on. And this is a new place, whether you call it data fabric or data mesh or some other
thing, I think whatever that thing is, is really driven by kind of the
four underlying pressures that are happening in the marketplace that make each of those
previous technologies less interesting to go solve the entire problem.
Yeah, yeah.
You mentioned another technology that I'm still trying to figure out, to be honest,
which is data mesh.
So what's the relationship between the data mesh and the data fabric?
Or like, where are the differences or the overlap?
Well, again, I think it depends who you're talking to.
So I defined what I meant by data fabric
and that's what I mean.
I have heard people use data mesh and other terms
to kind of describe 80% of what I'm describing
and then another 20%.
And so I think, again, the big notion is
that the world's moving on from data lakes
and certainly from data warehouses
into kind of this next generation data infrastructure.
And whatever you call it,
it's going to be driven by the marketplace requirements,
which some of what I think I described for you.
And folks who figure out, number one,
folks who figure out how to deploy and actually build out that architecture to make their business
successful, I think will be much more successful than those who don't. And I think technology
vendors like us who can actually provide a better mousetrap will get some good attention
as the market kind of moves into that space.
Scott, one question about enterprise scale. When you think about the work that you do in the healthcare industry, the work that you do in capital markets, that's massive scale. And you have
worked in data for a long time. And so I'd love for you to speak to those in our audience who
are hearing what you're saying and they probably agree in theory that like that makes sense,
but then they're kind of facing the day-to-day of like, okay, like my charge is to go implement
or sort of get value out of the data lake and the data warehouse setup that is
what our company is sort of implementing. But I'd love for you to speak to them and sort of like,
how do you manage the, because there's sort of a long tail of the market where if you're not sort
of solving these extremely complicated problems at scale, maybe some of those tools are sufficient, at least for
sort of the problems that you're solving. How do you, how would you tell someone like that to think
about the future? And how do you prepare for that? And when do you begin to sort of tactically
think about things like migration, adopting new technologies, all that sort of stuff?
Yeah, so I think there are a couple of things, right? And you're also spot on. It can be a daunting task to go sell a vision of this nature inside of an organization that tactically needs to get things done, right? So, just like early in my career,
I talked about breaking things down into small problems and doing parallel processing. Break it down into small problems. There are the drivers
that I described. And go look at some of those small problems and figure out, okay, is data
variety, volume, and location going to impact my application? If not, okay, I'm not going to worry
about that for today. But I certainly need to at least have adjudicated that decision.
I think also the notion, and this requires certainly corporate CTO buy-in and things of that nature, is one of the really cool things about cloud is it's easy to spin up applications quickly using a collection of microservices.
There's no big capital acquisition,
et cetera, et cetera, et cetera. The problem is that also creates sprawl and silos in a way that
we've never seen before, right? And again, for my entire career, the marketplace clients have
been talking about the problem of data silos, right? And I think that in today's world, it's even harder, right? Because it's
easier to create them. And there's no, you know, in the 80s, this was back when you had to go to
capital committee and get approval and move data and da, da, da, da, da. And there was still data
silos. Today, you don't need any of that approval and you can create your own thing. And so there
are more and more and more. And so my point there is certainly try to take
the long view and say, I can't afford to go build a five-year plan to go build a data fabric because
I have to run my business. I get that. But there are very easy architectural decisions that you can
make to make that transition easier and to avoid the continuation of this data sprawl and data
silo population where you end up with disconnected
data that you can't analyze, that ends up being extremely expensive, and potentially redundant.
And ultimately, when you start to look at it, and from that perspective, the ROI on at least
agreeing to a data fabric kind of architecture becomes very easy to justify. And then it becomes
how do I technically go solve this problem while
new applications are going to use this architecture and legacy applications are not? And just because
you've decided to use an architecture doesn't mean you have to slow down rolling out solutions. It
just means that the choices that you make on storage and transport and the algorithms and
the actual technology standards that you choose are a forethought and
not an afterthought. Sure. That's interesting. Two things there that I just want to reiterate
that I think were really helpful. One is, I don't know if it's just subconscious. I'm sure I was
exposed to some sort of marketing messaging with all of these cloud tools, but the cloud kind of had a promise of like helping solve the data
silo issue. And in many ways, it's refreshing to hear you say it's worse now because anyone can
go into AWS and spin up whatever service they need for whatever they're doing. And then all
of a sudden you, even a small to midsize company have sort of these pockets of replicated technology that are sort of
managing data independently of each other, which is super interesting. And then the other one is
that when we think about technology migrations and you think about something like an on-prem to a
cloud where it really is sort of a major overhaul, right? Like there is a massive migration.
If everything's in the cloud,
I think your point about saying
you can make decisions now in the cloud that aren't,
it's not like you're migrating the entire infrastructure
of all the technology of the company.
You're dealing with, like you said,
sets of microservices
and you can choose to construct those
in a way that sort of paves a path
as opposed to thinking about it as, okay, you go from data lake and warehouse to
fabric, and it's a massive one-time, you know, sort of painful migration.
Yeah, and it's just kind of like good programming practice to not put constants
into your program, but actually always point to variables so you can change your mind later,
right? Being able to think about it as disaggregated from a specific cloud vendor,
but more as an entity of its own becomes very freeing because then the cloud is a source,
a provider, but it also avoids potential lock-in or other downstream impacts because you've
actually up-leveled the whole architecture.
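[Editor's note: a small sketch of "point to variables, not constants" applied to cloud choice. The application codes against a storage interface, and which provider backs it is configuration, so swapping or adding a cloud doesn't ripple through the code. The classes are illustrative placeholders, not real SDK clients.]

```python
from typing import Protocol

class ObjectStore(Protocol):
    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...

class InMemoryStore:
    """Stand-in for any concrete backend (S3, GCS, Azure Blob, on-prem)."""
    def __init__(self) -> None:
        self._blobs = {}
    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data
    def get(self, key: str) -> bytes:
        return self._blobs[key]

def archive_report(store: ObjectStore, report_id: str, body: bytes) -> None:
    # Application logic never names a vendor; the "constant" became a variable.
    store.put(f"reports/{report_id}", body)

backend: ObjectStore = InMemoryStore()  # chosen by config, not hard-coded
archive_report(backend, "q3", b"...")
print(backend.get("reports/q3"))
```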
And I think that's important. I mean, just as human nature and the nature of business,
right? Just through my career, right? Most large companies, you say, well, what's your BI tool?
Well, we have all of them. Well, what's your database standard? Well, we have all of them.
And you say, well, what's your cloud standard? Well, we have all of them.
It's going to happen.
Or you acquire a business that had a different cloud.
And you want to get the lifeblood of the data and the intelligence and the insights that can be driven from it.
If you've up-leveled your architecture and you think about it in a virtualized abstract across multiple clouds, that's also very valuable in terms of future-proofing what you're
rolling out. And again, I'm not here to say one cloud vendor is bad and one cloud vendor is good,
or it's got nothing to do with that. It's just the nature of the market is going to dictate
that it's going to change. And it may change suddenly and without a whole lot of notice
and without a whole lot of logic.
And if it disrupts the value chain of the insights that you're driving and the interactions you're having with your customers, that's very bad.
Scott, when we are discussing data meshes, one of the definitions of a data mesh that comes up many times is that a data mesh is, let's say, 80% about an organizational architecture
and not a data architecture.
It says a lot about how companies should be working with the data or how they should be
organized around the data.
And I'd like to ask you, companies have to change in order to adopt this new paradigm,
right?
What do you think are the main changes that a company has to make, especially the bigger ones that are more difficult to change, in order to maximize the value that they can get from something like the data fabric?
Or maybe there isn't one, but if there is something that also has to change in how the company is structured, what is it?
I think one of the things that is really important in that scenario is really a C-level kind of discussion or board level kind of discussion, right?
And that is data about my business belongs to my business and not to an individual department,
right?
And I think most companies, most large companies, are at that point now and kind
of get it competitively, especially thinking about all the new fintech stuff because they're not
segregated by business unit. They're innovating across all this data that's available.
And then that leads into more pragmatically kind of the notion of balancing between security and
privacy and purpose and access to the data, right?
So if all of my data is my business's asset and you take that to its logical conclusion, you say, I'd
want everybody in the company with a need to know to have access to everything.
Okay.
Oh my God.
How do I manage that in a world where, you know, security and privacy and cyber attacks,
all that kind of stuff.
So that's why I say, I think this is a C-level kind of discussion that has to happen of, okay, we agree that this is a corporate asset. Here's how we
intend to use it. Here's for what purpose we intend to use it and kind of set that vision at the top
level so that it can then be applied to different rule sets and different implementations of those protections and what use cases are considered possible versus not.
You mentioned privacy and security.
Do you see any implications around that
when someone implements a data fabric
or is it actually like a better architecture
to promote both privacy and security?
It's an architecture and then it becomes implementation.
So just because I
have access to data for my job, I may not actually be able to see your discrete record or identify it
with you, but it's important for me to see the diagnosis, the outcome, the treatment, et cetera,
et cetera, so I can aggregate that with a whole bunch of others to understand trends in that space. And so that's what, when I talk about kind of at a C-level
data usage, just kind of policy statements become really important because that can then
frame it, right? So yeah, I mean, very few employees, except maybe an attending physician
would need to know that this is Kostas's information, because he's sitting here in front of me.
But there are a lot of use cases where your data, not associated with you necessarily personally, can be used for managing and anticipating supply chain and what people need to be on and appointment scheduling and all those kinds of things.
And inside of the data fabric, certainly,
there are plenty of technologies that can be deployed to kind of protect that.
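[Editor's note: a hypothetical sketch of the access-policy point above: an analyst role only ever receives de-identified aggregates, while a treating-clinician role can see the single identified record it needs. Roles, fields, and records are invented for illustration.]

```python
from collections import Counter
from typing import Optional

RECORDS = [
    {"patient": "A", "diagnosis": "flu", "outcome": "recovered"},
    {"patient": "B", "diagnosis": "flu", "outcome": "recovered"},
    {"patient": "C", "diagnosis": "asthma", "outcome": "ongoing"},
]

def query(role: str, patient: Optional[str] = None):
    if role == "attending_physician" and patient is not None:
        # Need-to-know access to one identified record.
        return [r for r in RECORDS if r["patient"] == patient]
    if role == "analyst":
        # Only aggregates leave the fabric; identities never do.
        return Counter(r["diagnosis"] for r in RECORDS)
    raise PermissionError(f"role {role!r} is not allowed this access")

print(query("analyst"))                           # Counter({'flu': 2, 'asthma': 1})
print(query("attending_physician", patient="C"))  # the one identified record
```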
Great. And one last question from me, and then Eric can continue with his questions.
Are there any limitations in the decentralization of a data fabric? And what I mean by that,
I can think of saying
instead of moving all the data
from our databases
into a data warehouse,
we can, let's say,
federate or connect directly
to these databases
and execute the queries.
But can we push this even farther
and get these queries or these analyses and functions executing on a mobile device?
Is it something that's applicable
also in IoT cases where you have the edge and you need to do some processing there? What are
the limits and what do you see happening in the next couple of years? Yeah, I mean, the limits are
how far out to the edge you want to go, right? So the edge isn't one thing. It's like the boundary of an amoeba and it's changing all the time, right? And it's expanding because edge devices are getting
smarter and smarter and smarter. And so the edge is moving and ebbing and flowing. And like I said,
ultimately five years from now, we'll probably be talking about some other really cool stuff that
you can do further out in the edge because the edge has broadened its boundary. But certainly
playing the data as close to where it's created and where it lives is the important concept there.
And so certainly from an IoT use case, you try to push, there's not a single edge like the end
device, but there are multiple layers of the edge and you just try to push out as far as is
appropriate. And yeah, so certainly data fabric
and data fabric architectures and technologies
need to take that into account.
And yeah, I mean, think about ARM processors
and how much more powerful they're getting.
And I had a Raspberry Pi sitting here somewhere
that runs a complete image of our database.
And it's like, okay, great.
If that makes sense,
and there's an analytic we can push out there
and there's data that's contained that's been consumed into that device, then we want to be
able to make that happen. Well, we're actually getting close to time here. Scott, one thing I'd
love to do, you have seen major life cycles in the world of data. And one thing I'd love for you to do
is just give some advice to our listeners, especially maybe those who are
early in their career or who especially may be aspiring to a leadership role in data. And what
are the types of things that you would encourage them to be thinking about now or sort of lessons
that you've learned that might be helpful to them? It's not an assembly line kind of job,
meaning it's not repetitive, right?
One of the things that I was interested in reading about a couple of years ago was that data scientist was a new hip job to go get a degree in. And why was that? Because every day you come to work, it's a different job. So if you like variety and creativity, it's a great place, because I don't see any slowing of the rate of change in any of the aspects of what's happening in our environment. So there's definitely that.
And I'd say the other thing, and I actually learned this in university writing papers: you can often find ways to make data tell whatever story you want to tell. Try not to do that, because if you're going to be successful, you've got to learn stuff from the data that you didn't expect. And that's when you're really doing your job well.
Really good advice. And actually, I was talking with an advisor earlier this week and he was
talking about how do you do reporting really well? And he said, the thing that makes it really hard is that you can tell whatever story you want,
which is true. Just, I hope my boss isn't watching this because then they'll never
trust any report that I bring in. Yeah, that's really great advice and I really appreciate that
insight. Well, Scott, it's been really wonderful to have you on the show. I loved learning about Data Fabric.
It was really helpful for you to break that down
and love just learning from you in general
about all the amazing things that you've done
with data in your career.
So thanks for giving us the time to be on the show.
Thanks very much.
It was fun being here
and hopefully we'll see you all again soon.
What a great show.
My takeaway is very specific,
but I don't know if I've ever heard anyone say
the cloud is making the data silo problem worse.
And I think that's because there's so many cloud tools
that maybe promise to solve that problem.
And I just found that very refreshing
because I think a lot of people sort of experienced that pain,
although it may not be as challenging
from a pipeline perspective to solve for that
as it was in on-prem days without sort of streaming tools.
But yeah, that was just really interesting.
So that's my takeaway, Kostas.
Yeah, absolutely.
I don't think I can agree more with you.
And he's right.
Like, I even built a business because of that, right?
Like, we started Blendo to consolidate the data from the cloud to the cloud.
So there are business opportunities everywhere.
Yeah, he's right.
Like, I mean, just because you have the cloud
and you can move everything in the cloud
doesn't mean that like you don't still have silos, right?
And maybe the problems are even bigger there
because at least like back in the days
where you only had your mainframes behind your firewall,
you had total control.
With the clouds, you don't.
When you are using a cloud-based ticketing system or CRM, you don't have that much control
over what the interfaces are, what the data will look like, how fast you can access the data and all that stuff, which introduces some very interesting challenges out there.
So yeah, that's very interesting.
And I have to say that after the conversation we had with Scott today, I'm very, very curious
to see how these new patterns like the fabric or the data mesh are going to evolve.
There are some very interesting technical challenges there.
There's a lot of value there if we manage to implement them, obviously.
But as with everything else,
usually reality is a little bit different
than what we have in our minds.
And I think it's going to be,
we're going to have a couple of very exciting years
in front of us in terms of like new technologies
and how they are going to be implemented.
No question.
Well, thank you for joining us on the Data Stack Show
and we'll catch you on the next one.
We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite
podcast app to get notified about new episodes every week. We'd also love your feedback. You
can email me, Eric Dodds, at eric@datastackshow.com. That's E-R-I-C at datastackshow.com.