The Data Stack Show - 58: Data Federation is No Longer The "F" Word with Scott Gnau of InterSystems
Episode Date: October 20, 2021

Highlights from this week's conversation include:

Solving problems with data has been a long-time passion of Scott's (2:52)
Day-to-day use of data at InterSystems (6:25)
The technical aspects involved in constructing a data fabric (17:52)
Companies at a variety of maturity levels can adopt a data fabric (26:49)
A paradigm shift in the marketplace (28:39)
Comparing and contrasting data fabric and data mesh (30:49)
Sharing data across the business and not having it siloed in different departments (39:46)
Privacy and security within a data fabric (41:22)
The future of data fabric and pushing the edge (43:17)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are
run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome back to the Data Stack Show. This week, we're talking with Scott from InterSystems,
and he's the VP of Data Platforms there. And he has a really long history of working
in data. So he was at Teradata for a long time. He was the CTO at Hortonworks,
and has just done a number of things that I think give him really interesting perspective on how the industry has changed over time.
And is today doing some really interesting things at InterSystems, namely sort of promoting this concept of a data fabric, which could be really interesting. This is not going to surprise Kostas, you or probably our listeners, but I just love talking with people who have worked in data from the very beginning of, I guess, what
we could call like the modern data industry, which actually goes back only a few decades,
amazingly enough.
And I always love hearing people's perspective when they look back through all the changes
that have happened.
So that is what I'm going to ask.
No surprise.
But as always, I think we'll get some interesting insights.
Yeah.
I mean, I'm also waiting to hear many stories of how things have changed.
I mean, he's been in this space for many decades.
And I think it's going to be great to hear from him how things have changed from like
going from mainframes to the cloud and then to the data fabric. And of course, I want to learn
more about the data fabric itself. Like what is this new thing? We have data meshes, data fabrics,
data lakes, data warehouses, lake houses, and who knows what else.
You can get the real story right here on the Data Stack Show.
Yeah, yeah. And yeah, I'd love to know more and, like, see how much of it is more of an architectural pattern and how much of it is an actual technology that is implemented, and what's the impact that it has. And I think we have the right person to answer all these questions. So let's go and chat with him. Let's do it.
Scott, welcome to the show. Really excited to chat with you about lots of different
topics. We probably won't get through all of them, but I really appreciate you taking the time.
Thanks for having me.
You have an incredible resume. We talked a little bit about Hortonworks with the kind of a connection
that we have there from the East Coast, but would
love to just hear about your background, how you got into data, and then what you're doing today.
Yeah, I mean, sometimes it sounds like it's a plan, but it really isn't. Just solving problems
with data has always been a passion of mine, even from the first assignments that I had that
weren't necessarily very sophisticated analytically, but involved a lot of data and being able to resolve that data into some
sort of a decision or action quickly. And I started my career in a massively parallel processing kind
of environment back in the dark ages, like the nineties, when the world's largest data warehouse at that time was, do you want to guess?
30 gigabytes. And that was huge. And it took racks and racks and racks of space to pull this
together. But the point is, there was a lot of information, there's a lot of intelligence in
that data. And I really started my career with the notion of parallel processing to kind of
break that down into hundreds and thousands of parallel threads so that the decision, so that the analytics could actually run really quickly without requiring mainframe class kind of compute.
And I always found that really interesting, not just because scientifically of doing it and the physics and all of the programming that goes into it and the analytics, obviously.
One of the things that thrilled me is that when you do it and you do it right, sometimes the answer you get back is completely unexpected and you learn something from your data.
And that's actually the cool thing.
Later on, when I moved into more of a big data world, it was the same problem to solve,
but with much more variety of data. It's no longer just transactions
and purchases and customers, but web logs and social sentiment kinds of data that can be
entered into those analytics to get a much more thorough view of what's happening and
much better kind of decision. So interesting. It is so crazy to think about. It wasn't actually that long ago when 30 gigabytes seemed like such a huge amount of space.
And now everyone's phone has more space than that, which is wild. And what do you do? What do you do today working in data day to day?
So now I'm here at InterSystems, which gives me an opportunity that really feels like a synthesis of all the experiences that I've had across my career, whether it be massively parallel processing, highly efficient transaction processing, a large variety of data, or adding in new kinds of analytics altogether. That is really the mission that we're on and that I'm working on here at InterSystems in our data platform organization.
Our technology at InterSystems actually started in the healthcare world.
And you might imagine there's a lot of data in healthcare.
It's really varied.
It could be your x-ray image.
It could be physician notes.
It could be the payment that you made for the visit that you just took, which is structured
transactional,
and synthesizing all of that together for better treatments and better outcomes is a use case.
And the physics behind that use case is very similar now to what we're seeing expand into extended analytics that take advantage of lots of different data of different origin
and delivering analytics or insights directly at the time of interaction
with a consumer or directly to someone's device. Super interesting. Could you just give our
listeners a quick example of some of the customers and use cases, just so they have
sort of a practical knowledge of what your work looks like day-to-day, you know, sort of in the life of businesses and consumers?
Sure. You know, day-to-day here at InterSystems and with our data platform,
we think that we capture more than half of North America's electronic medical records,
and certainly a large percentage outside of North America. So just think about any time
you're interacting with a physician or at a hospital or getting a treatment or a service
or have an insurance claim, that information is flowing through our technology and being used
for not just keeping track of you and your treatments and all of those things, but also
being used in many instances to provide for better outcomes, better treatments, better proactive
kinds of treatments, as well as from an operational perspective.
A lot of our clients will use that technology to optimize their own operations.
How many folks do I need on call at what period of time?
Is there seasonality and all those things so that we can line up the supply chain and
all of those things?
We also have a decent footprint in the financial services industry and capital markets.
And so about 15% of global equity trades, again,
go through systems that are managed by the InterSystems IRIS data platform. And again,
you think about the synthesis of that very high volume, can't lose it. It's got to scale.
And I've got to make some decisions about pricing and adjudication in a very fixed amount of time.
Those are the kinds of problems that we solve with our technology.
Sure. That's incredible.
I mean, just thinking about the scale of 50% of EMRs
sort of interacting with your platform.
Well, Scott, one thing that we chatted about before the show,
which I just want to dive right into,
is this concept of a data fabric.
And we love breaking down terminology on the show. So recently
we talked about the term data mesh, and there's a lot of people excited about this term data mesh.
And you've talked a lot about this concept of a data fabric. So break it down for us. What is
a data fabric? Well, data fabric is kind of a logical construct that we like to use and think about that kind
of sets the bar to help enable our clients and folks in the industry to be successful.
And I'll back up and I'll talk first about the requirements and then that'll kind of
lead into how we think about and why we think about a data fabric as kind of a concept, right?
First, and we hit on it, and when I was talking about the introduction, right, there's
data volumes and data variety, it's just like, off the charts, crazy, right? Everything now has a digital footprint. And the devices in our hands are compute devices,
and they're creating digital footprints and all kinds of new data connected and on the web and social media and all of that interaction data, as well as some of the more
traditional transactional systems that folks have, whether it be stock trades or your checking
account or retail purchases. So first and foremost, data is just everywhere. It's high in
variety and it's extremely high in volume and it can be very volatile.
And so when you think about that, that's different than certainly 10, 15, or 20 years ago,
when the majority of data was kind of created inside of a corporate firewall,
largely by mainframes connected to PCs. It was very structured, transactional,
and very controllable. Now it's kind of out there.
And so one of the results of that is kind of traditional processing of, hey, let me consolidate the data into a place and try to normalize it and then do something with it just doesn't work. It's
just kind of physically impossible, A, to do it, B, just to keep up with it. So that means now it's
more important to think about connectivity
of data than consolidation of data. So one of the key underpinnings of a successful data fabric
is the notion of data connectivity. Can I really play it where it lies? Can I get access to it
in a very seamless fashion? Okay. So there's that. Another thing that's happening, obviously,
and we see it on the nightly news and people talking all the rage about artificial intelligence and machine learning and deep learning.
Because there are massive amounts of compute available in the world today and massive amounts of bandwidth available, there's a whole new class of analytics that it's possible to actually deploy. Not only is it possible to actually deploy,
but it's possible to actually get a relevant answer and use it for a much more sophisticated
kind of analytic and ultimately drive a better insight, a better connectivity with a client,
customer, or prospect. And so analytics are no longer just aggregating and summarizing and
joining tables, but now include all of these
other kinds of capabilities as well. So there's a requirement for some sort of flexibility
in the model of what kind of analytic can I run, when can I run it, and how can I interject new
analytics as they're invented into those pipelines in real time without starting over. So that's another construct of what we're talking about in a data fabric.
Certainly on the first point, I talked about data variety, the variety of data and tomorrow
there'll be some other kind of data that we hadn't thought of, right?
And so it's no longer possible, and it's no longer efficient, to just have a tool that is a SQL engine
or a NoSQL engine or this or that. You got to think about your data fabric as being able to
consume and store any kind of data in its natural format without having to change it when you store
it. You don't want to convert it into rows and columns. You don't want to apply any change to it.
And well, why is that? Well, number one, if you're going to connect back to it,
you want to see it in its native state. But more importantly, if you're going to generate trust across this ecosystem, you've always got to be able to map back to the origin of the data and
how the data came to you. And if you make changes to it, you can't do that and you can't build that
level of trust. Think about running a machine learning algorithm to optimize a treatment for a patient.
You really want to trust that the data came from the right place and that the prediction
that you've made is accurate.
It's a life or death kind of scenario.
So being able to have that kind of construct in it.
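[Editor's note: to make the "store it in its native format and keep the trail back to the source" idea concrete, here is a minimal, hypothetical sketch in plain Python. It is not InterSystems' actual API; the store, field names, and example payloads are invented for illustration.]

```python
import hashlib
import json
from datetime import datetime, timezone

# Hypothetical in-memory "fabric" store: raw payloads are kept byte-for-byte,
# and provenance metadata travels alongside them instead of being lost in a transform.
FABRIC_STORE = []

def ingest(raw_payload: bytes, source_system: str, source_uri: str) -> dict:
    """Store a record exactly as it arrived, plus metadata that lets any
    downstream analytic map a result back to its origin."""
    record = {
        "payload": raw_payload,  # unchanged, native format
        "provenance": {
            "source_system": source_system,
            "source_uri": source_uri,
            "ingested_at": datetime.now(timezone.utc).isoformat(),
            # A content hash makes later tampering or drift detectable.
            "sha256": hashlib.sha256(raw_payload).hexdigest(),
        },
    }
    FABRIC_STORE.append(record)
    return record["provenance"]

# Example: an HL7-style message and a JSON claim land side by side, untouched.
ingest(b"MSH|^~\\&|LAB|HOSP|ORU^R01|...", "lab_feed", "hl7://lab/msg/123")
ingest(json.dumps({"claim_id": 42, "amount": 180.5}).encode(), "claims_db", "claims://42")
print(FABRIC_STORE[0]["provenance"])
```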
And I'd say kind of the last thing is that you've got to think about being able to deploy insights at various places along the chain.
So it's no longer relevant to run a bunch of batch nightly uploads. And over the weekend,
you run some data mining algorithms and Monday morning, knowledge workers show up and they do
something, right? You've got to deliver a recommendation that's relevant to a device, to a consumer while they're interacting with you.
And so, I mean, just the raw physics of that and kind of speed of light means that you've got
processing out to the edge and that you've got to have the capability and the sophistication to
deliver in real time or in the right time.
So if you take kind of those constructs that I just described as four main kind of constructs, that's what we think of as kind of the requirement set for a modern data fabric, where you're
able to weave together different kinds of data with different timeliness from different
sources in the cloud on-prem with different kinds of
analytics with different destinations. And one of the things that we were talking about,
we were talking earlier was, so what about the cloud in all of this? Well, the cloud is actually
kind of a culprit to some degree. Number one, because you can now create almost infinite
compute resources on demand, which means there's a whole new set of analytics that's possible that wasn't possible before, and it's affordable. That's actually really cool.
The other thing is with all of these connected devices and everything that's going on with
cloud-based technologies, it just is a further distribution of data. And for the first time,
you know, in a generation, right, there is a whole class of data that will actually live its entire life cycle
only in the cloud. And that leads to the ability to do connectivity in a very seamless and
transparent way that generates trust and traceability back to the source.
Interesting. What an interesting concept. I've never thought about being at a point where
data will live its entire life cycle in the cloud. And that's
just so interesting to consider. But one point you made, Scott, that I'd love to dig into a little
bit, I know Kostas probably has a bunch of questions, but I'll retain the microphone for
just a minute longer. So one thing you said was an emphasis on connectivity, connecting data because it's coming from various sources in different formats. But a lot of companies, in the context of the cloud data warehouse that you mentioned, are still just trying to collect all their data in one place, right? It's like,
if we can just get all of our data into Snowflake or BigQuery, we'll get so many answers. And so
that seems to be a trend that's still pretty strong, but it may depend on the size of the
company and sort of the complexity. But I just love to dig into that a little bit more since we do see a trend towards
companies working really hard to collect data where you're saying connectivity is actually
sort of the bigger problem, it seems like. Yeah, I think connectivity is more sustainable,
right? When you think about consolidation of data, there are a whole lot of aspects to it.
One is just the sheer movement of data, which is expensive and time consuming. Just the latency of data movement can mean that you're getting access to the data too late in that whole consolidation kind of a scenario. When the consolidation fails or when there's a failed network port or something like that,
then human beings have to get involved.
And human beings are more expensive than software typically, have to get involved to kind of
resolve what happened, what's going on.
If data gets out of sync when you're consolidating it, then you violate
some of the trust you've built up because you may get different answers at different points
along the pipeline. So I think really solving that problem and thinking about judiciously
using consolidation versus doing connectivity becomes a really important new paradigm.
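[Editor's note: a rough, hypothetical illustration of "connectivity over consolidation." The sketch queries two sources where they live and moves only small summaries, instead of copying every row into a central store first; sqlite3 in-memory databases stand in for whatever the real systems of record would be.]

```python
import sqlite3

# Two independent "systems of record"; in reality these might be an EMR
# database and a claims system, each reached over its own connection.
orders = sqlite3.connect(":memory:")
orders.execute("CREATE TABLE orders (customer_id INT, total REAL)")
orders.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 99.0), (2, 15.5), (1, 40.0)])

crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (customer_id INT, name TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])

def spend_by_customer():
    # Push the aggregation down to the source so only a tiny summary moves,
    # then stitch it to the other source's data at the "fabric" layer.
    totals = dict(orders.execute(
        "SELECT customer_id, SUM(total) FROM orders GROUP BY customer_id"))
    names = dict(crm.execute("SELECT customer_id, name FROM customers"))
    return {names[cid]: amount for cid, amount in totals.items()}

print(spend_by_customer())  # {'Ada': 139.0, 'Grace': 15.5}
```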
And so certainly there's a lot of buzz in the market about folks moving data and analytics to the cloud.
And isn't this really great?
I think that that will abate very quickly because in the end, it'll still have the limitations that I described, but because it's still a consolidation play, you just happen to be consolidating in a different place that happens perhaps to be a bit more expensive than the place you used to be consolidating.
Interesting.
I have like one main question right now,
which is about data fabric.
And so it's kind of like,
it sounds like a great idea, right?
Like, yeah, it makes total sense
that instead of like replicating the data
from all the different places where it lives
and like trying to move it into one place
and all these things,
we can just connect the data together
and work on top of that.
But how is this data fabric created and implemented on a more technical level?
What are the components of a data fabric?
So, and that's a really good question.
So obviously what I described is kind of a logical construct.
And you think about it from an architectural perspective, where the rubber meets the road
is how the heck do I go implement this, right? And so there are a lot of different folks
and a lot of different companies
talking about different ways to go do it.
I would say that in many cases, certainly the cloud vendors are saying, yeah, you can go build this kind of stuff. You basically cobble together a collection of seven or eight different technologies and you can kind of get this functionality.
And we see people doing that and trying to make that successful.
Certainly, the data fabric definition that I provided is kind of the bar that we set
all of our technology investments towards at InterSystems with the IRIS data platform, aiming to be able to do that with a single set of technology and provide a little bit lower risk
to our customers. But like I say, it's more a logical construct and then it becomes kind of
the bar that we set for ourselves when we're making investment decisions in the technology
and the flexibility that we create. And like any set of blueprints that you get from an architect,
right, you can choose your materials and build out the structure differently. Certainly, we like to think that we can compete with the set of
materials that we bring, which is very simple and easy to support. Is there a set of like, let's say,
fundamental components that this architecture has? Something that like, let's say, you cannot have a
fabric without at least these components.
Yeah, I mean, bringing it down a little bit further, certainly you need persistence.
You need pipelines, transports, and then you basically need the calculate functions.
And I think about it mostly in kind of a microservices kind of architecture. We say, if you're able to move stuff, if you're able to persist stuff, and then if you're able to calculate, whether that's an analytic or a transformation or whatever,
and you have those, and you can kind of cobble those three services together, pretty much like
our DNA is made up of four base materials in different combinations, you can make very complex
and interesting things. It's the same thing in this data fabric.
And what you need from the underlying technology of your data fabric and the standards that you choose is to be able to host those different things
and combine those things up and then manage them
as end-to-end applications that you build.
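[Editor's note: a toy sketch of the "three base materials" framing above: a persist service, a transport (pipeline) service, and a calculate function, composed into one small end-to-end flow. The names and shapes are illustrative, not a real product API.]

```python
from typing import Callable, Iterable, List

# Persist: anything that can durably keep records (here, just a list).
class Persist:
    def __init__(self) -> None:
        self.rows: List[dict] = []
    def write(self, row: dict) -> None:
        self.rows.append(row)

# Transport: moves records from a source to a sink, applying a calculation
# along the way.
def transport(source: Iterable[dict], sink: Persist,
              calculate: Callable[[dict], dict]) -> None:
    for row in source:
        sink.write(calculate(row))

# Calculate: any analytic or transformation expressed as a function.
def enrich(row: dict) -> dict:
    return {**row, "high_value": row["amount"] > 100}

readings = [{"id": 1, "amount": 250.0}, {"id": 2, "amount": 40.0}]
store = Persist()
transport(readings, store, enrich)  # compose the three "base materials"
print(store.rows)
```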
That's super interesting.
Is there like some kind of relationship between a data fabric
and what technically used to be called query federation? Because you mentioned a lot of
being able to have this kind of decentralized architecture where, from what I understand at
least, I have an analytical function and I can run it wherever the data is, instead of having to get the data in one place and execute my query there. So I remember, for example, Presto. Okay, of course,
like Presto works in your own like environment. So it wasn't so decentralized. But in a way,
the whole idea of Query Federation was that instead of copying the data and bringing into
one place, let's execute the query there where the data lives, get the results back and somehow connect and consolidate the results.
Is there some kind of like relationship there between the two ideas?
Yeah, there is. And I'll have to tell you, you may edit this out.
But for the first 25 years of my career, federation was the F word to me because it never really worked, right?
But when you think about the ability to do connectivity, you now have a new set of tools that can make that kind of a use case, although there are different terms, data virtualization
and other things come up. You now have some more tools that make it more of a reality.
I think there really are two things that we kind of hold ourselves accountable for at InterSystems. One is actually, yes, being able to push the processing to the data or the data to the processing if you want, but typically you want to push the processing to the data because that's the
cheapest thing to do and then get the result sent in a pipeline somewhere else, right? And then the
second thing that we do that I think is kind of unique in
the industry is that we are multilingual in the kind of process that we allow to run against our
data. So we're multilingual, meaning you can speak SQL, Java, or Python, and interact with the data.
And we think that's really important in the data fabric construct, because what I described is,
all these new analytics.
So it's no longer just a SQL statement, but you might want to bring in some machine learning thing that's written in Python.
And you just want to push that out and have the technology stack figure out how to run that process in an optimal way and then get the answer back. We're starting to see some interesting use cases from our customers who are
able to do this because certainly in a traditional machine learning data science model that doesn't
use a data fabric or InterSystems tech, there's this huge data extract and you extract a bunch
of data and give it to the data scientists and then they run their stuff and they find something
that's interesting. It's like, okay, we think this is interesting. Then they wipe the data out and they go get more data and they run it and they get the answer, but then they have to take the answer and manually put it back. Think about all the latency that's created there. If you can just run the machine learning model on the data where it lies, even if your process was inefficient, and ours isn't, but even if it was inefficient, you're removing all that latency from the process.
You now have the reality of being able to have
a much better decision in time to do something about it.
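[Editor's note: a hedged illustration of "run the model where the data lies." Instead of extracting a table for the data scientist, the sketch registers a Python scoring function with the database and applies it inside the query; SQLite's create_function stands in for the multilingual push-down a real platform would provide, and the model itself is a hand-coded stand-in.]

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE visits (patient_id INT, age INT, prior_visits INT)")
db.executemany("INSERT INTO visits VALUES (?, ?, ?)",
               [(1, 72, 9), (2, 35, 1), (3, 58, 4)])

def readmission_risk(age: int, prior_visits: int) -> float:
    """Stand-in for a trained model; a real one would be loaded, not hand-coded."""
    return min(1.0, 0.01 * age + 0.05 * prior_visits)

# Register the Python "model" so it runs next to the data rather than after
# a bulk extract, removing the extract/score/write-back latency loop.
db.create_function("risk", 2, readmission_risk)
for row in db.execute(
        "SELECT patient_id, risk(age, prior_visits) FROM visits ORDER BY 2 DESC"):
    print(row)
```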
Yeah, that's very interesting.
And this kind of, that's why I have like a very specific question
about machine learning specifically,
but is this model like applicable both for training
and inference or it's more about one of them? Because I can imagine that like inference can
happen much easier at the edge, let's say, or like on a mobile application or like whatever,
but training, because it's kind of like a little bit more complicated, more iterative kind of process. Can it happen? Can all the use cases around working with data be served on a data fabric?
We actually support all of those
modes. And I think when you think about the fabric that you deploy and the technologies that you
choose, you really think about all those modes really needing to exist as close to the data as
possible. What are the most common use cases that you have seen deployed on the data
fabric? Is it machine learning? Is it like more BI related use
cases? What have you seen your customers doing? I'd say it's kind of like
the market, right? The market, everybody gets BI. Human beings
kind of think relationally to some degree and they're used to
interacting with those tools.
So just like that's kind of the base of experience in the industry right now, you probably see
more of a predominance of BI algorithms than the others. But the machine learning stuff is starting
to grow. And I think, gosh, I remember in the late 80s, BI was a new concept, right? And yes, I'm that old, sorry.
But it was a new concept.
And it took a while for people to catch on that it actually really worked and was very
meaningful, right?
Kind of people think, well, I know my business.
I don't need BI to tell me my business, right?
We're seeing some of that now with some of the early machine learning stuff.
And certainly some of the early adopters get it and so on and so forth.
But if you look at kind of the middle of the market, kind of the mid to late adopters, they're still just playing with it.
They haven't fully bought in. But when they do, that capability will become even more important.
You'll see that volume grow. I think also that being able to combine those modes is also a really interesting thing. Thinking even about some of the most simplistic cases:
I want to do pricing adjudication on a transaction. And into that, I want to factor the risk capital
impact that it will have on my business. And I want to consider the total relationship
that I have with the customer. And oh, by the way, I want to run an ML model that maps the vector of
the securities pricing of the underlying security to predict whether or not this will be a good
transaction. And then I can put all that together in a couple of milliseconds and price the
transaction. It combines all of that technology together.
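[Editor's note: a hypothetical, drastically simplified version of the pricing example above: one call that folds a risk-capital charge, the customer relationship, and a model prediction into a single price, the kind of decision that has to finish in milliseconds. All names and numbers are invented for illustration.]

```python
from dataclasses import dataclass

@dataclass
class Quote:
    customer_id: int
    notional: float

RELATIONSHIP_DISCOUNT = {1: 0.02, 2: 0.0}  # illustrative customer-relationship data

def risk_capital_charge(notional: float) -> float:
    return 0.001 * notional  # toy stand-in for the risk-capital impact

def predicted_move(symbol_history: list) -> float:
    # Stand-in for an ML model over the underlying security's price vector.
    return (symbol_history[-1] - symbol_history[0]) / symbol_history[0]

def price(quote: Quote, symbol_history: list) -> float:
    base = quote.notional * (1 + predicted_move(symbol_history) * 0.1)
    base += risk_capital_charge(quote.notional)
    base *= 1 - RELATIONSHIP_DISCOUNT.get(quote.customer_id, 0.0)
    return round(base, 2)

print(price(Quote(customer_id=1, notional=10_000), [101.0, 102.5, 104.0]))
```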
Okay, that's super interesting. From your experience with the customers you're working with implementing a data fabric, do you think that today there is a specific type of company or organization that is, let's say, more ready or more mature to implement and adopt a data fabric? Or is it something that you think can benefit, or even be implemented in, any company?
I'm seeing it across the board.
I would say that in some of the less mature companies,
since they're kind of coming in at this point in history,
it's almost like the de facto requirement that they're building from, versus a more mature company that's got all kinds of legacy applications and legacy businesses that certainly can't be compromised. I see them doing things a little bit more incrementally and thinking about how do I go transition? And the cool thing about the data fabric is, if you weave it correctly with the right technology,
it'll plug into the legacy stuff and
leave that kind of unadulterated and start to build new applications in this space and it'll
start to take on critical mass. And at some point you'll kind of see that cross the chasm and that'll
be now the de facto standard. Yeah, that's interesting. And I'll go back again to the components of the data fabric, mainly because, I mean, if someone follows the market and the news and what is happening out there, they might be aware of things like the data lake and the data warehouse; now we have the lakehouse, which is like a hybrid between the two. I don't know what will follow after this, but
companies are out there investing right now huge amounts of both money and effort to implement all these architectural patterns, right? How do you see these fitting under the concept of the data fabric? Is there some kind of conflict there?
Is the data fabric like something that sits on top?
How do you see it?
And also like give us some best practices and some like advice
on how we should think as architects,
as data architects with all these different components.
I think it's really the next generation architecture, right?
There was a time when data marts were the state-of-the-art architecture for analytics, and then that became enterprise data warehouse,
and then that became data lake. And then, and I think this is just kind of what comes after,
right? And as the industry matures, and by the way, each of those things in and of themselves
were extremely relevant when they came to market. But the market, because it's
changing rapidly, and there's this huge volume explosion, a variety explosion of data. And also
because the bar is set higher, because you and I, and all of us are much more educated consumers,
we expect that the folks that we interact with will understand us better, right? So all these
things kind of come together and say that the ball has moved. And you mentioned it: a lot of people are now saying, oh, I'm
going to run my data warehouse in the cloud. That's interesting, but it's kind of like putting
new leather seats in a 40-year-old car that the transmission just fell out of. Sure, your ride
will be more comfortable and that's interesting, but is that really your sustainable mode of transportation? And so I really think about it as kind of a paradigm shift, not to use a buzzword,
but a paradigm shift in the marketplace where all of these things were relevant at a time and
data lakes were very relevant at a time, right? Because I got to capture all the data and figure
out what it is and understand it. The thing that sometimes is missing in data lakes is the notion of traceability and connectivity for trust, and they become data swamps. And so there's that.
And again, not bad technology, not bad concepts for the time, but I think the world and the market
has moved on. And this is a new place, whether you call it data fabric or data mesh or some other
thing, I think whatever that thing is, is really driven by kind of the
four underlying pressures that are happening in the marketplace that make each of those
previous technologies less interesting to go solve the entire problem.
Yeah, yeah.
You mentioned another technology that I'm still trying to figure out, to be honest,
which is data mesh.
So what's the relationship between the data mesh and the data fabric?
Or like, where are the differences or the overlap?
Well, again, I think it depends who you're talking to.
So I defined what I meant by data fabric
and that's what I mean.
I have heard people use data mesh and other terms
to kind of describe 80% of what I'm describing
and then another 20%.
And so I think, again, the big notion is
that the world's moving on from data lakes
and certainly from data warehouses
into kind of this next generation data infrastructure.
And whatever you call it,
it's going to be driven by the marketplace requirements,
which some of what I think I described for you.
And folks who figure out, number one,
folks who figure out how to deploy and actually build out that architecture to make their business
successful, I think will be much more successful than those who don't. And I think technology
vendors like us who can actually provide a better mousetrap will get some good attention
as the market kind of moves into that space.
Scott, one question about enterprise scale. When you think about the work that you do in the healthcare industry, the work that you do in capital markets, that's massive scale. And you have
worked in data for a long time. And so I'd love for you to speak to those in our audience who
are hearing what you're saying and they probably agree in theory that like that makes sense,
but then they're kind of facing the day-to-day of like, okay, like my charge is to go implement
or sort of get value out of the data lake and the data warehouse setup that is
what our company is sort of implementing. But I'd love for you to speak to them and sort of like,
how do you manage the, because there's sort of a long tail of the market where if you're not sort
of solving these extremely complicated problems at scale, maybe some of those tools are sufficient, at least for
sort of the problems that you're solving. How do you, how would you tell someone like that to think
about the future? And how do you prepare for that? And when do you begin to sort of tactically
think about things like migration, adopting new technologies, all that sort of stuff?
Yeah, so I think there are a couple of things, right? And you're also spot on. It can be a daunting task to go sell a vision of this nature inside of an organization that tactically needs to get things done, right? So, just like early in my career,
I talked about breaking things down into small problems and doing parallel processing. Break it down into small problems. There are the drivers
that I described. And go look at some of those small problems and figure out, okay, is data
variety, volume, and location going to impact my application? If not, okay, I'm not going to worry
about that for today. But I certainly need to at least have adjudicated that decision.
I think also the notion, and this requires certainly corporate CTO buy-in and things of that nature, is one of the really cool things about cloud is it's easy to spin up applications quickly using a collection of microservices.
There's no big capital acquisition,
et cetera, et cetera, et cetera. The problem is that also creates sprawl and silos in a way that
we've never seen before, right? And again, for my entire career, the marketplace clients have
been talking about the problem of data silos, right? And I think that in today's world, it's even harder, right? Because it's
easier to create them. And there's no, you know, in the 80s, this was back when you had to go to
capital committee and get approval and move data and da, da, da, da, da. And there was still data
silos. Today, you don't need any of that approval and you can create your own thing. And so there
are more and more and more. And so my point there is certainly try to take
the long view and say, I can't afford to go build a five-year plan to go build a data fabric because
I have to run my business. I get that. But there are very easy architectural decisions that you can
make to make that transition easier and to avoid the continuation of this data sprawl and data
silo population where you end up with disconnected
data that you can't analyze, that ends up being extremely expensive, and potentially redundant.
And ultimately, when you start to look at it, and from that perspective, the ROI on at least
agreeing to a data fabric kind of architecture becomes very easy to justify. And then it becomes
how do I technically go solve this problem while
new applications are going to use this architecture and legacy applications are not? And just because
you've decided to use an architecture doesn't mean you have to slow down rolling out solutions. It
just means that the choices that you make on storage and transport and the algorithms and
the actual technology standards that you choose are a forethought and
not an afterthought. Sure. That's interesting. Two things there that I just want to reiterate
that I think were really helpful. One is, I don't know if it's just subconscious. I'm sure I was
exposed to some sort of marketing messaging with all of these cloud tools, but the cloud kind of had a promise of like helping solve the data
silo issue. And in many ways, it's refreshing to hear you say it's worse now because anyone can
go into AWS and spin up whatever service they need for whatever they're doing. And then all
of a sudden you, even a small to midsize company have sort of these pockets of replicated technology that are sort of
managing data independently of each other, which is super interesting. And then the other one is
that when we think about technology migrations and you think about something like an on-prem to a
cloud where it really is sort of a major overhaul, right? Like there is a massive migration.
If everything's in the cloud,
I think your point about saying
you can make decisions now in the cloud that aren't,
it's not like you're migrating the entire infrastructure
of all the technology of the company.
You're dealing with, like you said,
sets of microservices
and you can choose to construct those
in a way that sort of paves a path
as opposed to thinking about it as, okay, you go from data lake and warehouse to
fabric, and it's a massive one-time, you know, sort of painful migration.
Yeah, and it's just kind of like good programming practice to not put constants
into your program, but actually always point to variables so you can change your mind later,
right? Being able to think about it as disaggregated from a specific cloud vendor,
but more as an entity of its own becomes very freeing because then the cloud is a source,
a provider, but it also avoids potential lock-in or other downstream impacts because you've
actually up-leveled the whole architecture.
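[Editor's note: a small sketch of "point to variables, not constants" applied to cloud choice. The application codes against a storage interface, and which provider backs it is configuration, so swapping or adding a cloud doesn't ripple through the code. The classes are illustrative placeholders, not real SDK clients.]

```python
from typing import Protocol

class ObjectStore(Protocol):
    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...

class InMemoryStore:
    """Stand-in for any concrete backend (S3, GCS, Azure Blob, on-prem)."""
    def __init__(self) -> None:
        self._blobs = {}
    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data
    def get(self, key: str) -> bytes:
        return self._blobs[key]

def archive_report(store: ObjectStore, report_id: str, body: bytes) -> None:
    # Application logic never names a vendor; the "constant" became a variable.
    store.put(f"reports/{report_id}", body)

backend: ObjectStore = InMemoryStore()  # chosen by config, not hard-coded
archive_report(backend, "q3", b"...")
print(backend.get("reports/q3"))
```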
And I think that's important. I mean, just as human nature and the nature of business,
right? Just through my career, right? Most large companies, you say, well, what's your BI tool?
Well, we have all of them. Well, what's your database standard? Well, we have all of them.
And you say, well, what's your cloud standard? Well, we have all of them.
It's going to happen.
Or you acquire a business that had a different cloud.
And you want to get the lifeblood of the data and the intelligence and the insights that can be driven from it.
If you've up-leveled your architecture and you think about it in a virtualized abstract across multiple clouds, that's also very valuable in terms of future-proofing what you're
rolling out. And again, I'm not here to say one cloud vendor is bad and one cloud vendor is good,
or it's got nothing to do with that. It's just the nature of the market is going to dictate
that it's going to change. And it may change suddenly and without a whole lot of notice
and without a whole lot of logic.
And if it disrupts the value chain of the insights that you're driving and the interactions you're having with your customers, that's very bad.
Scott, when we are discussing data meshes, one of the definitions of a data mesh that comes up many times is that a data mesh is, let's say, 80% about an organizational architecture
and not a data architecture.
It says a lot about how companies should be working with the data or how they should be
organized around the data.
And I'd like to ask you, companies have to change in order to adopt this new paradigm,
right?
What do you think are the main changes that a company has to make, especially the bigger ones that are more difficult to change, in order to maximize the value that they can get from something like the data fabric?
Or maybe there isn't one, but if there is something that also has to change in how the company is structured, what is it?
I think one of the things that is really important in that scenario is really a C-level kind of discussion or board level kind of discussion, right?
And that is data about my business belongs to my business and not to an individual department,
right?
And I think most companies, most large companies, are at that point now and kind
of get it competitively, especially thinking about all the new fintech stuff because they're not
segregated by business unit. They're innovating across all this data that's available.
And then that leads into more pragmatically kind of the notion of balancing between security and
privacy and purpose and access to the data, right?
So if all of my data is my business's asset and you take that to its logical conclusion, you say, I'd
want everybody in the company with a need to know to have access to everything.
Okay.
Oh my God.
How do I manage that in a world where, you know, security and privacy and cyber attacks,
all that kind of stuff.
So that's why I say, I think this is a C-level kind of discussion that has to happen of, okay, we agree that this is a corporate asset. Here's how we
intend to use it. Here's for what purpose we intend to use it and kind of set that vision at the top
level so that it can then be applied to different rule sets and different implementations of those protections and what use cases are considered possible versus not.
You mentioned privacy and security.
Do you see any implications around that
when someone implements a data fabric
or is it actually like a better architecture
to promote both privacy and security?
It's an architecture and then it becomes implementation.
So just because I
have access to data for my job, I may not actually be able to see your discrete record or identify it
with you, but it's important for me to see the diagnosis, the outcome, the treatment, et cetera,
et cetera, so I can aggregate that with a whole bunch of others to understand trends in that space. And so that's what, when I talk about kind of at a C-level
data usage, just kind of policy statements become really important because that can then
frame it, right? So yeah, I mean, very few employees, except maybe an attending physician
would need to know that this is Kostas's information, because he's sitting here in front of me.
But there are a lot of use cases where your data, not associated with you necessarily personally, can be used for managing and anticipating supply chain and what people need to be on and appointment scheduling and all those kinds of things.
And inside of the data fabric, certainly,
there are plenty of technologies that can be deployed to kind of protect that.
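[Editor's note: a hypothetical sketch of the access-policy point above: an analyst role only ever receives de-identified aggregates, while a treating-clinician role can see the single identified record it needs. Roles, fields, and records are invented for illustration.]

```python
from collections import Counter
from typing import Optional

RECORDS = [
    {"patient": "A", "diagnosis": "flu", "outcome": "recovered"},
    {"patient": "B", "diagnosis": "flu", "outcome": "recovered"},
    {"patient": "C", "diagnosis": "asthma", "outcome": "ongoing"},
]

def query(role: str, patient: Optional[str] = None):
    if role == "attending_physician" and patient is not None:
        # Need-to-know access to one identified record.
        return [r for r in RECORDS if r["patient"] == patient]
    if role == "analyst":
        # Only aggregates leave the fabric; identities never do.
        return Counter(r["diagnosis"] for r in RECORDS)
    raise PermissionError(f"role {role!r} is not allowed this access")

print(query("analyst"))                           # Counter({'flu': 2, 'asthma': 1})
print(query("attending_physician", patient="C"))  # the one identified record
```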
Great. And one last question from me, and then Eric can continue with his questions.
Are there any limitations in the decentralization of a data fabric? And what I mean by that,
I can think of saying
instead of moving all the data
from our databases
into a data warehouse,
we can, let's say,
federate or connect directly
to these databases
and execute the queries.
But can we push this even farther
and get these queries or these analyses and functions executing on a mobile device?
Is it something that's applicable
also in IoT cases where you have the edge and you need to do some processing there? What are
the limits and what do you see happening in the next couple of years? Yeah, I mean, the limits are
how far out to the edge you want to go, right? So the edge isn't one thing. It's like the boundary of an amoeba and it's changing all the time, right? And it's expanding because edge devices are getting
smarter and smarter and smarter. And so the edge is moving and ebbing and flowing. And like I said,
ultimately five years from now, we'll probably be talking about some other really cool stuff that
you can do further out in the edge because the edge has broadened its boundary. But certainly
playing the data as close to where it's created and where it lives is the important concept there.
And so certainly from an IoT use case, you try to push, there's not a single edge like the end
device, but there are multiple layers of the edge and you just try to push out as far as is
appropriate. And yeah, so certainly data fabric
and data fabric architectures and technologies
need to take that into account.
And yeah, I mean, think about ARM processors
and how much more powerful they're getting.
And I had a Raspberry Pi sitting here somewhere
that runs a complete image of our database.
And it's like, okay, great.
If that makes sense,
and there's an analytic we can push out there
and there's data that's contained that's been consumed into that device, then we want to be
able to make that happen. Well, we're actually getting close to time here. Scott, one thing I'd
love to do, you have seen major life cycles in the world of data. And one thing I'd love for you to do
is just give some advice to our listeners, especially maybe those who are
early in their career or who especially may be aspiring to a leadership role in data. And what
are the types of things that you would encourage them to be thinking about now or sort of lessons
that you've learned that might be helpful to them? It's not an assembly line kind of job,
meaning it's not repetitive, right?
One of the things that I was interested in reading about a couple of years ago was that data scientist was a new hip job to go get a degree in. And why was that? Because every day you come to work, it's a different job. So if you like variety and creativity, it's a great place, because I don't see any slowing of the rate of change in any of the aspects of what's happening in our environment. So there's definitely that.
And I'd say the other thing, and I actually learned this in university writing papers: you can often find ways to make data tell whatever story you want to tell. Try not to do that, because if you're going to be successful, you've got to learn stuff from the data that you didn't expect. And that's when you're really doing your job well.
Really good advice. And actually, I was talking with an advisor earlier this week and he was
talking about how do you do reporting really well? And he said, the thing that makes it really hard is that you can tell whatever story you want,
which is true. Just, I hope my boss isn't watching this because then they'll never
trust any report that I bring in. Yeah, that's really great advice and I really appreciate that
insight. Well, Scott, it's been really wonderful to have you on the show. I loved learning about Data Fabric.
It was really helpful for you to break that down
and love just learning from you in general
about all the amazing things that you've done
with data in your career.
So thanks for giving us the time to be on the show.
Thanks very much.
It was fun being here
and hopefully we'll see you all again soon.
What a great show.
My takeaway is very specific,
but I don't know if I've ever heard anyone say
the cloud is making the data silo problem worse.
And I think that's because there's so many cloud tools
that maybe promise to solve that problem.
And I just found that very refreshing
because I think a lot of people sort of experienced that pain,
although it may not be as challenging
from a pipeline perspective to solve for that
as it was in on-prem days without sort of streaming tools.
But yeah, that was just really interesting.
So that's my takeaway, Kostas.
Yeah, absolutely.
I don't think I can agree more with you.
And he's right.
Like, I even built a business because of that, right?
Like, we started Blendo to consolidate the data from the cloud to the cloud.
So there are business opportunities everywhere.
Yeah, he's right.
Like, I mean, just because you have the cloud
and you can move everything in the cloud
doesn't mean that like you don't still have silos, right?
And maybe the problems are even bigger there
because at least like back in the days
where you only had your mainframes behind your firewall,
you had total control.
With the clouds, you don't.
When you are using a cloud-based ticketing system or CRM, you don't have that much control
over what the interfaces are, what the data will look like, how fast you can access the data and all that stuff, which introduces some very interesting challenges out there.
So yeah, that's very interesting.
And I have to say that after the conversation we had with Scott today, I'm very, very curious
to see how these new patterns like the fabric or the data mesh are going to evolve.
There are some very interesting technical challenges there.
There's a lot of value there if we manage to implement them, obviously.
But as with everything else,
usually reality is a little bit different
than what we have in our minds.
And I think it's going to be,
we're going to have a couple of very exciting years
in front of us in terms of like new technologies
and how they are going to be implemented.
No question.
Well, thank you for joining us on the Data Stack Show
and we'll catch you on the next one.
We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite
podcast app to get notified about new episodes every week. We'd also love your feedback. You
can email me, Eric Dodds, at eric@datastackshow.com. That's E-R-I-C at datastackshow.com.