The Data Stack Show - 84: Why Are Analytics Still So Hard? With Kaycee Lai of Promethium

Episode Date: April 20, 2022

Highlights from this week's conversation include:

- Kaycee's background and career journey (2:34)
- Why analytics are hard (7:28)
- Defining "data management" (11:47)
- Defining "data virtualization" (15:57)
- The relationship between data virtualization and ETL (18:34)
- Where a company should invest first (21:40)
- Building without a Frankenstein stack (25:19)
- How Promethium solves data stack issues (27:53)
- Giving context to data (35:14)
- Cataloging: background, at Promethium, future (39:29)
- Who uses data catalogs (48:00)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. On April 27th, we have another Data Stack Show live stream coming up. We love doing these because we get to record a show live, and you get to join and ask your questions in real time. It's super fun. We are tackling the subject
Starting point is 00:00:39 of data quality on this show, and I'm super excited. We have people from Bigeye, Metaplane, and Lightup, and we're working on getting Great Expectations in the mix. But Costas, my question for you is: data quality companies seem to be exploding across the data landscape. Why do you think this has become such a big deal and such a hot topic, with such a proliferation of companies starting in the space? Yeah. I mean, you know, as we make it easier and easier for people to get access to their data, they start focusing on implementing insights and reports and even data applications, whatever we want
to call them, on top of that. And then everyone realizes that, you know, we still have the classic garbage in, garbage out kind of situation, right? If your data is bad, no matter how good your dashboards or your applications are, in the end the outcome is going to be bad too, right? So making sure that we have a good understanding of the quality of the data, and how much we can trust the data that we have, is quite important.
And it's actually pretty tough to do. Even defining what data quality is, is not an easy task. It's also not easy to figure out where data quality should live. Is it a pipeline thing? Is it a data warehouse thing? Is it at the collection level of the data? Or is it at the BI level? Maybe it's in every place.
Starting point is 00:02:11 We don't know. We're still like working on trying to figure out the answers to these questions. And it's a space right now where like there's a lot of things happening, a lot of innovation. And I think it's going to be awesome like to have all these great people in one place
Starting point is 00:02:27 and see exactly, hear from them, their experience, what made them get into this problem space, what they've learned and what are the challenges. So I think there's plenty to learn and I'm very, very excited and looking forward to chatting with all these people. Absolutely.
Starting point is 00:02:44 You can register at datastackshow.com slash live. That's datastackshow.com slash live. And if you register, you'll be entered to win one of our nifty drop mechanical keyboards. I just plugged mine in yesterday, Costas, and it is awesome. So definitely make sure you register and we will catch you on the next Datastack Show live stream. Welcome to the Dat Sack Show. Today, we're going to talk with Casey from Prometheum and a really interesting background. And I'm always interested in Casas by talking to people
Starting point is 00:03:14 who build technology based on not just sort of seeing like a market opportunity maybe, or, you know, thinking of a cool technology, but who have worked in context around the problem and just repeatedly experienced different kinds of pain that relate to the same problem. And that's what Casey experienced. And that's why he built Parmethium. I'm really interested. He talks a lot about in just some of his blogs and the materials, like analytics are pretty difficult, even though we live in an age of like modern tooling. And I want to ask him why that is. I think it's something that, you know,
Starting point is 00:03:51 different people in different roles and companies feel different pain around, but it can be kind of hard to articulate, like why are analytics still actually pretty hard, you know, and why are they huge projects, even at, you know, mid-sized companies? So anyways, that's my question. I think it's a great opportunity to learn more about a new term,
Starting point is 00:04:09 which is data fabric. So I'd love to learn more about it and put some context around it. Why we need the new term and what it means and how it relates to the rest of the technologies that we use. And also revisit and all the terms, which I have the feeling that it is related to the data fabric and not the data catalog. We have talked about many different, let's say, data management, data governance-related tools so far. I think data cataloging is not something that we have touched that much. Although I think it's quite important.
Starting point is 00:04:42 And I'd love to hear from Casey about the data catalog, how to use, why to use, and its evolution into the data fabric. Casey Weadey- All right. Well, let's dive in and talk with Casey. Let's do it. Casey, welcome to the Data Sack Show. Casey Weadey- Hey, thanks for having me. All righty.
Starting point is 00:05:01 Well, let's start where we always do. I'd love to hear about your background and then what led you to Promethean. Yeah, thanks. So my background, a little bit mixed. Got a little bit of go-to-market as well as product, as well as financial analysis kind of all mixed in. And it probably explains how I got into the data management space and how I became a founder of a data analyst company. So I started my career actually as a business and data analyst. The guy crunching numbers, getting insights. And as I like to say, the guy getting yelled at by my executives for always taking too long with those insights, which led me to do everything from take a SQL class and learn how ETLs work and why data warehouses were structured the way they
Starting point is 00:05:45 were and why I couldn't get a data mart refresh every minute, why it had to be every three months. So my journey kind of led from there into being more on the go-to-market side with sales, business development, marketing, and then eventually back to product management. And 20 years later, after being an analyst, I somehow ended up as president COO of a data catalog company selling data management tools. And one of the things I realized when doing that was that the problem not only didn't go away, it actually got a lot worse in 20 years. So when I was a young guy, crunching numbers, I was lucky enough to have one data warehouse, one BI tool. And most customers we talk to today, unfortunately, for them to be competitive and leverage their data,
Starting point is 00:06:32 they have to get data from multiple databases, SaaS applications, data warehouse, data lakes, multiple clouds. And to make it all worse, they can't even standardize on a single BI tool. And so this is a challenge that I saw a lot in my old job as president and CEO of Waterline Data. And it led me to want to find a way where, gosh, can we just make analytics easy for people, please? And can we make it so that way it doesn't matter what type of data source you have, it doesn't matter what kind of BI tool you have, can we actually streamline this process so that way it doesn't matter what type of data source you have. It doesn't matter what kind of BI tool you have. Can we actually streamline this process so that way you don't
Starting point is 00:07:09 pay a tax just to try and use your data? And that's been sort of where I exercise the product management background in me as well as kind of the go-to-market in terms of figuring out that product market fit and how do you actually deliver a product that hasn't been built before because the old way was simply creating more of the same problems over and over again. Definitely. Okay. I have so many questions there, but I have to ask a question. So I snuck around on your LinkedIn and noticed that early, early in your career, you were an analyst at the federal reserve and so i'm just interested to know what did you work on like what types of problems you're trying to solve and then did you discover anything like that really interested you or surprised you in that role i'm not sure i'm at liberty to say, Eric. I'm kidding.
Starting point is 00:08:06 Wasn't that exciting? Trust me. We work for the Fed. You actually do a lot of macroeconomic analysis, right? Looking at housing trends and big trends, stuff like that. Specifically, I was also looking at things that were affecting the banking landscape, like things that were driving M&A, regulations, how some of those regulations can monitor and enforce the monetary policies. So I would say that was the day job. I was in the statistics department, so I was doing a lot of number crunching. Believe it or not, in my spare time,
Starting point is 00:08:43 I realized someone should actually build a database of all the different M&A activity that's happening. And so I actually found time to actually do that. And that's where I kind of really got interested in the whole. No way. Yeah, I know. You know, we worked with government. You actually have a lot of time. It's not like that as the founder of.
Starting point is 00:09:00 That's so interesting. I don't know what you're talking about. Don't know what I'm talking about. Thank you for entertaining me. Okay. i want to dig into a question i i think you know you said analytics is hard and people experience that in so many different ways right i mean on sort of the i'll use a marketing example you know because i'm a marketer by trade but i was like okay i'm just trying to get like these events into Google analytics and get my Google analytics accurate. And you know, it's like, okay, well that's painful. Right. But then on the other end of the spectrum, it's like, okay,
Starting point is 00:09:32 I have, you know, legacy systems. I have new systems. I have multiple lines of business. I have, you know, all this sort of stuff. Right. And it's really fragmented. Yeah. Could you help us understand what, like from your perspective, why is analytics hard? And I agree with you. It seems crazy that it's still hard today because the tooling has gotten, the different tools have gotten way better in many ways, but it is still hard for sure. It is. I looked at it in a couple of different ways. One way I looked at it is that, so the analytics landscape has changed a lot as we morph from like everyone just put everything first in their databases and they said, hey, don't do it in database, put it in your data warehouse.
Starting point is 00:10:18 And then from there, we had new paradigms with data lakes, with Hadoop, and then with cloud and so forth. And that's okay. I feel like, lakes, with Hadoop, and then with cloud and so forth. And that's okay. Like, I feel like, okay, those are shifts that we can deal with, right? The thing that made it worse, in my opinion, is, you know, the vendors, these damn vendors
Starting point is 00:10:35 who made that data management tools. If you look, it's kind of crazy, right? It's like, hey, I'm only going to do this piece of the whole data management process. And I'm only going to do it for this platform, for this type of data. And I don't know who started that trend, but it became Vogue to start doing that. Like, well, hey, if those guys can only do it for RDBMS, I'm going to do it for HDFS, or I'm going to do it for time series, or I'm going to do it for, you know, whatever, just stay on AWS. So I think the challenge has been a lot of the data management tools only do part of the workflow and only address part of the environment or part of a data
Starting point is 00:11:16 type. Now, that may not be so bad if you never, ever change your data infrastructure. So if you said, I forever this cloud, this environment, this data warehouse. Awesome. The problem is that never happens, right? It's every day. There's a batter, newer data warehouse, data lake, some SaaS application out there, and I've never seen it. So we're going to keep what we have.
Starting point is 00:11:38 Like we're never going to innovate and get something new. The problem is once the business unit starts consuming analytics, consuming reports, and it starts getting operationalized, good luck telling someone that you're going to shut that down as you buy something new. Never happens, right? It never happens. You end up keeping it and then you say, oh, for the new stuff, I'm going to move it onto a new platform. And so what happens is you end up having to support the legacy data management tools on top of the legacy analytics structure and so forth.
Starting point is 00:12:11 So this is where it gets hard. And then to make it worse, the knowledge of the human being doesn't necessarily go back 30 years, especially the tech knowledge. So today's grad, if you threw a mainframe at them, they would kind of look at you funny. Like, what are you, what are you doing? Why are you giving me this?
Starting point is 00:12:29 Right. So, and the tech stack is moving so quickly. So I also find that it's also very hard for data teams to even do this. And so this is why it's become super challenging. And then the last part of the nail on the coffin is I think, you know, 10, 20, 30 years ago, it was okay to make data-driven decisions once in a while, right? And that was kind of the norm. But I think companies have shown us that, hey, if you can be data-driven, like Amazon, like Facebook, like the Googles of the world, you can go out and really kick butt and do really, really well. And so as companies
Starting point is 00:13:04 now realize, I have to be data-driven. You look at the pandemic, it's actually kind of taught us a lesson. You can't make a decision in three months. You're going to be around in three months. So it's forced people now into this mad rush of, oh my gosh, I have to somehow make it work with all this legacy, this stuff I have to deal with and the knowledge gap that I have. So I think that in my opinion is one of the leading factors of why things are as hard as they are today.
Starting point is 00:13:33 Henry Suryawirawan- Casey, can we, we started the conversation and you used the term data management a few times. Casey Weadey- Because I'm old. Well, you know, with it, it's also wisdom. There's also wisdom there. So I would say you're wise, but can we spend like some time like defining what data management is based on like your experience so far, because you know, I, I feel like one of the, not exactly problems, but like something that's very interesting, like with this industry is that we use a lot of different terms, and the semantics are not very clear. Everyone has a slightly different meaning of how we use it.
Starting point is 00:14:13 Drives me nuts. Drives me absolutely nuts. I know what you're talking about. Yeah. So what define data management? Okay. My high-level definition of data management is all the stuff you do after the data lands in the database, the warehouse, the life. So after it lands, the write has been committed. It's all the stuff that you do to get the insights. That's my rough overview definition of data management, right? And so it is the ETL process. It is the data cataloging process. It is the prep.
Starting point is 00:14:51 It is the modeling. It is the query, query optimization, query federation, the SQL, the process, and even to some extent, the visualization process, right? But I would say it's, I think the hard part when people think of fork, the hard part when people have a negative reaction to the word data management or a strong visceral reaction to data management, it's because they're reliving some traumatic events they used to experience through those processes that I just talked about.
Starting point is 00:15:21 Henry Suryawirawan, Yeah. Is there like a minimum set of, let's say, activities that every company needs to have? I mean, I would assume that, okay, if you would like to consume data, some kind of visualization tool, it's going to be, right, or a database system. Well, what's in your experience is like, let's say, let's name it like the minimum viable data stack, right?
Starting point is 00:15:48 Yeah. How do you define it? David Pérez- Yeah. Well, so I'm going to start at the two ends, right? The first end is basically where the data originates, right? And so this is SaaS applications, RDBMS, the data sources. And I even would even include like data with like a data warehouse there. I know, yes include data with Lake and Data Warehouse there. I know, yes, you pull data and put it in there,
Starting point is 00:16:08 but for the purpose of analytics, I kind of think of data sources as anything that stores, houses, or generates data. So I kind of put that in one end. And then at the end of the stack, on the other end, is your typical BI tool visualization. But that's even evolved right
Starting point is 00:16:25 like i would say the last few years we've gone beyond that to the dashboard isn't enough anymore people want narration people want storytelling people you know there's this trend that hey maybe i don't want to have to go look at a dashboard every single time i figure out what's going on maybe i just want you to tell me maybe i want the tool to tell me this is what I should care about. Right. So, but I would say, you know, let's call that the insight part for the, for the moment.
Starting point is 00:16:51 Right. And so for the most part, every organization has somewhat figured that out. Right. They figured out where the data is coming from, where it's stored. They figured out kind of how to visualize it. And for the most part, anything in between, I think that's where it gets messy, right? I've seen everything from like data scientists and data engineers
Starting point is 00:17:10 doing crazy Python and, you know, extraction with R and Scala to, you know, cobbled up bespoke solutions that you may have had three SIs over 20 years come in and build for you. I think the best practice has been to, you know, put in different data management tools, like, you know, it might have a data catalog for discovery, for governance. You might have a tool to do, you know, prep modeling, obviously an ETL or a
Starting point is 00:17:37 pipelining tool to get the data into a data warehouse or somewhere, somewhere else, and, and from there, it's kind of, we've also seen kind of things like data virtualization, uh, technologies as well. So I would say for the most part, right. I would classify it as you have your discovery governance layer, right? You have your prep modeling layer and you have your, I'll call it access access layer and the access is what I would lump ETL, what I would lump the moving of the data,
Starting point is 00:18:06 the pipeline data, as well as the query of the data. I would lump, you know, virtualization there as well. So I would say these are the three broad categories, right,
Starting point is 00:18:14 in the middle of, you know, the data sources and the, you know, BI visualization and analytics tools. Mm-hmm. Mm-hmm.
Starting point is 00:18:21 That's very interesting. And, okay, I have another question for another term Jeff mentioned, and that another question for another term. Yeah, Jeff mentioned that that's like data virtualization. So yes, what's that? Wow, how much time do we have? There's like it that term is so misused, like, you know, the storage guys have what have a definition of data virtualization. I'm sure the
Starting point is 00:18:42 virtualization, the VMware guys have a different definition. then there's data virtualization in more of the data management space. It's been around since Cisco had a version of that a while back, popular ones, Romeo, Starburst obviously talks about that. So I'll talk about the more recent as well as the more relevant to analytics definition. And that is really the ability to use a layer to allow you to have virtual access to the data sources where you don't have to do the ETL first. You don't have to load the data first before you can query it. And being able to also do federated queries, because if you look at something like Starburst, you actually abstract out the SQL query execution engine, right? So away from the underlying data sources.
Starting point is 00:19:29 So when you can do that, then you can actually push a lot of the operations like joins, aggregation, so forth away from the underlying data sources. And you can actually do parallel processing or the SQL execution so you can get better performance. But it also means it is now possible to actually run the query where you're joining data from multiple sources, which before that, you know, we would never think about that, right?
Starting point is 00:19:50 We would say, oh my gosh, no way. I have to do the ETL, you know, transform everything, land it into, you know, one single data warehouse and do that. And so I think, you know, when I say data virtualization, I'm really talking about kind of more recent incarnations data virtualization, right? A la D kind of more recent incarnations, data virtualization,
Starting point is 00:20:05 right? A la Dremio, a la Starburst, those type of technologies. Did you see, and I'm assuming, let's say the role of like a data engineer who is like pretty new in this discipline. I hear you describing, let's say this data stack and you mentioned both data virtualization and also ETA, but when I see you describing virtualization, it makes me feel like we don't really need ETA, right, if we have, let's say, I don't know, like an idea of virtualization there. So how do you see the relationship there between the two? And I want to ask you to give me like a definition that's like as
Starting point is 00:20:53 pragmatic as possible, right? Like, what do you think at the end is possible out there? Are we going to kick out the idea and give you our notes? Yeah, well, it's, it's a good question, Kosas, because I think my thinking actually has evolved, right? I would say two years ago, I was probably, you know, if you, if you had a video of me somewhere, you know, I was probably out there protesting the sign, you know, that's the ETL, no ETL, we've got data virtualization networks are fast, you
Starting point is 00:21:24 know, CPU memory on servers are good enough. We've got data virtualization, networks are fast, you know, CPU, memory on servers are good enough. We don't need it. Well, I have to say in the last year or so, I've had to change my mind and it comes from just practical experience with customers, right? So I'll tell you what I mean. So I think data virtualization is fantastic when you're exploring, when it's ad hoc, when you have an idea, you're not sure yet. What data virtualization allows you to do is get you a quick way to validate,
Starting point is 00:21:52 you know, is this the data you're looking for? Will it answer your question without waiting for the complex task, waiting for the data to be loaded and so forth. So that's awesome, right? The harsh reality is that physics exists, right? I love, I love my brothers at Starburst, right? And all the data virtualization helpings. But when you get into customer environment and they say, hey, I've got 12 billion rows across these two tables, one's in the cloud
Starting point is 00:22:22 and one's an on-prem Postgres database. And I need you to do this, show this query. And I don't want it to take more than a minute. Like it, physics, man, like, look, man, it's just no matter how many nodes they add, how much memory, like you still, there's still that extra hop that you're going to take. Right. And so what I've kind of changed my thinking as I've seen in customers' appointments is
Starting point is 00:22:47 the data virtualization is a good way to start off with if it's an ad hoc stuff. But when you're talking about operational pipeline, operational jobs, operational analytics, and you have SLAs to meet in terms of time and in a big data sense, you're not going to win versus a dedicated query. You're just not going to win, especially if it's against the data warehouse that is being tricked out.
Starting point is 00:23:12 And the data engineers know exactly how to tune it for performance. And so this is where I do see them coexisting. I don't see it as a replacement. I'd say, hey, look, the best practice I always tell our customers is use the data virtualization and make sure this is what you want very quickly. And then if this is what you want, you know, build the pipeline, right? But now you have full confidence that this pipeline is actually going to deliver exactly what you want, which is a lot better than before you trial and error with the ETL, the complex pipeline, right? Only have it break multiple times before you figure out, you know, this is the one that you want.
Starting point is 00:23:48 So it actually is a nice marriage. And I actually think that is a good way to actually combine the two technologies to get the best of both worlds. Yeah, makes a lot of sense. Makes a lot of sense. Where do you think that like a company should start from? Let's say, like you have like an medium-sized, like small company that's at the point where they want to start implementing some kind of data initiative and build their data warehouse, wallet, all that stuff. Where should they invest first? Is it the annual virtualization or both? Yeah. And then this is where I'm going to be a little controversial because I know I'm going to say is you're never going to find in any book or manual that you read out there. I know every book, manual or consultant you talk to is going to say, start the day at
Starting point is 00:24:35 the warehouse, start the day late. Move everything into one single place so you can find everything. That's what everyone is going to tell you and that is the current core conventional wisdom. The problem with that approach that we've seen time and time again is two problems. One, take it from an old infrastructure guy, moving data sucks, it's hard, it's complex, things break. So you're going to have to have a long project that probably won't even finish on time after you actually build a data warehouse in the end. Two, take it from an ex-catalog guy, whatever you just moved in there, you're probably not going to be able to find 80% of that stuff. So your users are now really mad
Starting point is 00:25:10 at you for the next 18 months, why they can't leverage data beforehand. So this is where conventional wisdom breaks. And look, it was relevant 20 years ago when you didn't have that many different data sources. It's got in that much data. But when you now't have that many different data sources. It's got that much data. But when you now have millions and billions and tens of billions of tables and multiple data sources and types in a single place, this is just the problem you're going to run into. And so my suggestion is don't start with that. Do that last. Actually start with, if you can, a data discovery process, right? And data discovery process meaning, you know, and some will use a catalog, but it doesn't have to be, but a way to which you know where your data assets are, number one.
Starting point is 00:25:56 When I say data assets, I mean the whole gamut, right? I mean tables, views, and queries, right? Start with knowing what you have, where it is, and then start with knowing what people are actually using. So you have a way to actually prioritize. Because a lot of people, when they think about doing these types of data, LinkedIn, migration, data warehouse migrations, they think they have to move everything. And I can tell you, nobody uses every single table.
Starting point is 00:26:18 Nobody actually uses every single query. In fact, most people have a lot of orphan queries, stale queries, or even stale ETL jobs or enough stale tables. Start with the ones that people actually care about and people are actually using and use that as a basis to say, okay, this is what we want. This is what we want. Let's really make sure we know how to optimize that from a discovery, governance, and performance
Starting point is 00:26:41 perspective. And if you can do that and you know people are going to use it, then actually building your data, like your data warehouse, first with that set of data is going to give you the best experience. It's going to give you the fastest experience of getting that data,
Starting point is 00:26:57 like data warehouse up and running. And your customers are actually really happy with you because they're not waiting 18 months for you to tell them, okay, it's ready to use. So I would say start with that discovery process to rationalize what you have and where it is and why people are using it and what are the most popular ones. Then from there, like I said, the data virtualization is great for you to validate and then having
Starting point is 00:27:19 a data warehouse or data lake for that fast local performance of the next. So that's kind of the three steps that I would recommend. Casey, question on that. So this is such an interesting topic. So you talked a lot in the context of, okay, you kind of already have these disaggregated sources and the go-to conventional wisdom today is just get everything collected into a warehouse or a data lake.
Starting point is 00:27:44 Let's just imagine a world where you can start over from scratch, right? You don't have the, which I know, do you know, entertaining here, but yeah. But let's say you are, you know, you are starting out and you didn't have to deal with, you know, sort of this legacy,
Starting point is 00:28:03 you know, sort of integration debt and technical debt, you know, from a Franken legacy, you know, sort of integration debt and technical debt, you know, from a Frankenstein stack and all that sort of stuff. Would that change the way that you approach, you know, sort of augment or building out or sort of, you know, scaffolding the analytics, you know, infrastructure and practice inside of a company? Temporarily, yes. And why I say temporarily, yes, is I've seen many examples
Starting point is 00:28:25 where people say, hey, we're building it from scratch. The problem is, where does the data come from? And there's only so much of it you can control for that's your internal data. The minute a business starts expanding, we have to take on new partnerships. Oh, hey, look, their data is from another source and they have to pipe that to us. Yep. Um, the minute marketing starts going, Hey, I actually need this type of third-party data, new social media feeds. There's new data being added.
Starting point is 00:28:53 Right. And, and what happens is this is just a cycle that has played itself out over and over again is you can probably get that started for like your core main app, your next native and the minute someone says we're growing as a company, we're coming out with a new line of business, a new app, someone there in the IT and our development stack goes, well, I don't want to be on the same stack as yours. No way, man. I want independence to be able to do my own thing. So what happens is it's a temporary solution where you get to say, I'm going to read this iPhone from scratch. Eventually you get into the world there.
Starting point is 00:29:28 Oh my gosh, I do have data in multiple places. Sure. They might be newer systems, right? You could, you could say, Hey, all my data are now all cloud data, you know, platforms and data warehouses, but it's still, there's still separate formats. There's still separate APIs you got to connect to. And if you try to do cross-source analytics,
Starting point is 00:29:46 you're still going to run into a problem that I just talked about. For sure. For sure. Temporary. And you could be, you could be living in bliss
Starting point is 00:29:52 for a little bit, but eventually you got to pay the pilot for a man. Yeah. It's like, well, it's like, you know, you start a new job
Starting point is 00:29:59 and like your calendar's empty and you're like, your inbox is empty and you're like, wow, I have so much time just to work on stuff. And then, you know, two or three weeks later, you know, the train is off the tracks. Okay.
Starting point is 00:30:12 I'd love to hear. So we've talked a ton about the problem. This has been super helpful. How do you solve some or all of those types of issues with Prometheum? And how does the product do it? Like, what's your approach? Yeah. Well, number one, I would say, whatever you think you've figured out,
Starting point is 00:30:32 there's nothing more humbling than actually going out to customers and getting kicked in the face. And so we've had the luxury of getting kicked many times. You know, I used to be much better looking. You have some aggressive customers. Have you worked with the new engineers?
Starting point is 00:30:49 No, I'm kidding. No, no, it's all good. I think, you know, what you think you can do in the real world, it's always very different, right? And so one of the things that we realized very early on
Starting point is 00:31:02 was you got to connect to every data source out there for the most part, right? You get like just assume every customer you're going to walk into that has this problem probably has a smattering of relational data sources, data lakes, data warehouses, cloud, Hadoop, you name it, times two, right? That's good. Just assume that's going to happen. So right off the bat, that means you do have to know how to connect everything very quickly. And when I say connect, I actually mean being able to figure out what they have or being able to show people what they have very quickly.
Starting point is 00:31:33 So the old way of, well, I'm going to load everything into me. I am going to connect, but I'm going to copy everything into me. Well, that's a horrible idea, right? Because you're now actually creating yet another data silo. And number two, the performance impact to actually go scan every system to do that, it's gone awful. So some of the earlier versions of data catalogs went through that problem. And I can tell you, a lot of times, it would take six months to just finish scanning, right?
Starting point is 00:31:58 By which time, you're now behind by six months. So Promethean, we've actually figured out not only how to connect, but how to very quickly within minutes kind of give you a logical view of every table, every query, every view that you have. And then you got to figure out when you deal with enterprises that not everyone's a good citizen,
Starting point is 00:32:17 not everyone puts all their data in the database, this data warehouse. So you have to figure out how to get them for like Git repos. I kid you not, right? And being able to do the same thing. So just in the ability to connect and kind of give you this normalized view is one way that Prometa can do. And literally in minutes, in literally one day, we've had customers tell us that you've shown me in 15 minutes what my existing legacy data catalog
Starting point is 00:32:40 guys took a year and a half to show, right? So get that global visibility one day very quickly. Then the next thing you need to do is help them understand the meaning behind the data and what it can be used. I think this is where some of the drawbacks of a lot of data catalogs have is like, yeah, they can tell you the metadata information and so forth, but like, is that really that, that helpful?
Starting point is 00:33:06 If I'm trying to know, can I use this table to answer a specific question? Right. Or is it more helpful if I tell you this table has been used to answer these five questions that are actually very similar to the one that you're asking. So that ability to actually extract context and how it's actually being used is super important. And then the last part that I think is even more important is the ability to actually let you use the data.
Starting point is 00:33:30 So a lot of the metadata tools, they're only metadata only, or if they do have some quote-unquote preview, it's very light. It's a small subset, and you have to move data from them to preview, or you have to pay huge tax. So this is where Prometheum has actually figured out a very lightweight view of actually seeing what's inside preview it, but then as than when you need, let you actually work with the data, right? Actually let you join, let you build queries, let you build this
Starting point is 00:33:56 visualization very quickly on the fly. And that's a whole different experience because before what people are used to is I found it, I needed to use it. Let me go call someone else and let's hope they have access to it and let's hope they can get it for me. Let's hope that they validate or I can't find it. It hasn't been built. Crap. I can't do it.
Starting point is 00:34:15 Let me go get someone else to do it. And so, you know, being able to actually do that actually all the way through is where Promethean shines and then the part that I think a lot of folks overlook is you have to make this into a seamless workflow because today these all exist as separate processes potentially done by separate people using separate tools. I don't naturally assume what I find in my catalog where I go fly, I can instantly build on the fly, query, virtualize. That doesn't exist today normally. And so how do you make that not just easy and clear and intuitive, but also performance? You got to make sure that it performs fast so that for us, our goal has always been get to the answer in three minutes. Try as hard as you can.
Starting point is 00:35:06 Have a single easy workflow, but get to the answer in three minutes. And what we found is with analytics, you don't necessarily need to answer the question right away. With analytics, when I was an analyst, most oftentimes, whatever you think that you're going to answer, it actually wasn't what you're going to answer until you start working the data. And we look at the data and be like, ooh, hey, I didn't think about that. Or, ooh, wait a minute, there's something else.
Starting point is 00:35:34 Or, wow, this is wrong. And so the faster you can get to those points of iteration, as I call it, the better your analysis will be. The longer it takes, this is where things start getting hairy. It's really like, well, maybe I could convince my boss to just accept this. All right, let's use the data from three years ago. So those are the things that Promethean can actually help is not only giving you that fast connection understanding, but actually allow you to actually work with it all in a single platform with end-to-end collaboration between the business person and the data team. Here's the thing.
Starting point is 00:36:12 As a data analyst, I have no idea what the marketing guy really is going to do with it until the very end, right? And as the marketing guy, you have no idea how to do the gnarly extractions, right? And pull the data from different data sources.
Starting point is 00:36:25 So why do we actually make you wait until someone finished that task that only for Eric to say, hey man, this ain't it. This is not what I need. Why not have them collaborate in real time together? Right? And that's kind of sort of this new era
Starting point is 00:36:40 of collaborative analytics that we can just bring into the table. Well, the dirty secret is that the marketing person may not know at all what they want to do anyways. I won't tell if you won't tell. I know Costas
Starting point is 00:36:59 has a bunch of questions. One specific question for me before I hand the mic over. You mentioned giving context to the data, right? So show me everything that I have, right? Which is really useful. I mean, goodness gracious, like even in small companies like ours, it's like there's nooks and crannies already in the warehouse, you know?
Starting point is 00:37:19 So that's helpful. And then you talked about, you know, this table has been used to answer five other questions like this one. Yeah. That on the surface feels like it's, there's a very high level of subjectivity and like context there. How do you do that? I mean, are you like, you know, sort of diffing SQL queries that have been run on the table or like, how does that even work? Wow. How much time do you have?
Starting point is 00:37:52 This is actually part of the secret sauce, right? Of Remedium and one that we actually have a patent on. And so we figure out kind of both ways. One is how do you actually figure out the semantic and also the context, right? Of a query or table, right? How do you figure out the relationship that has with other tables and other queries and then, you know, cause you also have to, it's almost kind of like. Understanding it from a graph perspective, right? A graph database perspective to understand these multiple relationships
Starting point is 00:38:22 could actually exist between multiple objects. And the object could be a table, it could be a query, it could be a view, right? It could be a tag, right? It could be a BI tool, right? And so figuring out how all these semantic objects actually map to each other is hard, but actually it's very useful, number one. But number two is also taking advantage of crowdsourcing, right? Knowing what people have rated, reviewed, frequency of access, those type of metrics
Starting point is 00:38:49 come in play. So one of the things I learned earlier on is that very rarely can you rely on one metric to determine viability or relevance, right? Oftentimes in an organization, we look for multiple data points. We look for for has this been used by someone i trust number one right so who actually uses frequency right when was last time how often it was used that hence give people a level of comfort and crowdsourcing right four stars thumbs up thumbs down believe it or not that gives people comfort and then you you know
Starting point is 00:39:24 some people want to get a little deeper. They want to look at a minute. Show, tell me what actually came from, show me what happened. Show me the transformation actually happened and prove that way. I can get a sense of comfort, how it was actually built. And so with Promethean, we actually realized number one, every organization has multiple things they look at and it's never just one, unfortunately. But what we found is that everyone probably uses the same six or seven things and they might just
Starting point is 00:39:54 weigh them differently. And so we've actually figured out, number one, not only how to get those things, but how to actually create an algorithm to rank, right? Based upon those six or different things in terms of relative importance and then have it tuned. So it's kind of like our own little page rank, if too well, right? That kind of determines the level of accuracy and there's a scoring behind it. And if you don't like it, right, you vote it down. If you don't like it, you don't access it. And so it's always live. And this is where customers have actually started seeing a lot of value because it's not static. The problem with data catalogs and data governance tools is
Starting point is 00:40:32 it's kind of static. It's what someone says or it's that profile that data says. But if you don't actually know how it's actually being used and not just itself, but also parts of it and so forth, you don't really get a complete picture. And so this is where we've been able to do this. So you don't realize it, but as you're using Chromiki to answer questions and very quickly build things out, you're actually contributing to the governance as well because you're actually contributing to one of those factors
Starting point is 00:40:58 in the scoring. Fascinating. That's a very interesting case. I don't remember if we talked in this show before about data catalogs. So I'm going to ask you a bit more about it because I also understand that the cataloging process is something that is quite important in general, but also it has like a central role in the product itself, if I understand correctly. Right. So can you give us a little bit of, let's say, background about cataloging? Because
Starting point is 00:41:35 it's a really new term, right? Like, as you said, you worked in the past with catalogs, you had high expectations at some point in your career as well as catalogs. You got hurt by them, probably have some kind of trauma. How did it start? How would you like and how would you like to play with the innovation from Promethean? And also, if possible, talk a little bit about the future too. Yeah, that's a good question. So I think, you know, catalog started decades ago as kind of just a way for, yeah, DBAs to be able to annotate things or find things, right.
Starting point is 00:42:13 And we, you know, they heard terms like data dictionary, right. From you to just putting in little things that, that, that helped me understand what this term actually means, what does this column actually means. And then people started adding in things like lineage, right, to really understand how the, you know, because as things move from the sources, some into the data warehouse, transformation can take place. And so you want to understand, hey, how did the transformation happen?
Starting point is 00:42:37 So lineage started becoming a big thing. And then you have things like data quality score, et cetera, that allow people to rank, you know, trustworthiness of the data and so forth. So I would say they all kind of started with a very heavy governance influence for data catalog. That's where most of them actually have that background. The one thing that most catalogs have in common is really search. We ask most people, why do you want to have?
Starting point is 00:43:07 No, the number one reason is always search. And I would tell you that once someone's actually bought and implemented a catalog, if you ask them, hey, which feature do you actually use, right? Search and tagging is like 80 to 90%. The rest is like, and the reason is sometimes the rest either doesn't work that well, it's hard to implement, but for the most part, I would say you get a catalog to number one, find where things are, find where they come from and a way to put some sort of information that helps you assess
Starting point is 00:43:38 whether or not this is good or bad. That's the high level, you know, view of kind of how catalogs usually work. And then what we find is that recently, people, because like I said, the problem of multiple data sources, multiple data types and so forth, people are asking more from their catalog. I want to profile, I want to see cardinality, I want to see statistics and so forth. And you can do that. Where the catalog stopped, unfortunately, was always, am I either going to find new things that are already there?
Starting point is 00:44:15 That means good data sets that you can use, right? Or I'm going to find new things that are raw, tables which you probably wouldn't know what to do with. And so a catalog never allowed you to actually build. It never allowed you to actually experiment with the data other than saying, Hey, I found it in costa said, this is good, Eric said, this is bad. Yeah. They each gave it four stars and it comes from this source and so forth. What happens is that the user has to make a lot of interpretations before they even know whether or not this, you can actually use it.
Starting point is 00:44:47 And so if you look at most usage, most catalogs are being used by a data governance team. Really to, you know, quote unquote, manage whether or not something should be used, whether it poses a risk, who should access it and so forth. And the reason why is because that next step of actually using it for analytics, which requires you to actually work with the data, it's a separate persona. It's a separate requirement that a catalog doesn't do, right? The catalogs for it's like 80, 90% metadata, right? It doesn't do that building part. And so it doesn't worry about performance.
Starting point is 00:45:22 It doesn't worry about scale. It doesn't worry about, you know, can you actually answer questions and query optimization and all that stuff. So because of that, you know, take away the marketing. It's actually not as useful as you might think for analytics, right? Because if you can't do all that, how are you going to really help a data analyst, right? Or a data engineer determine this is the data set to use to answer a question. Mm-hmm. So that's kind of, I would say catalog circa 2017, 2018. And so where Promethean's gone is we realized that the most natively intuitive thing to everyone, regardless of
Starting point is 00:46:00 function, regardless of creed, et cetera, is search. Nobody needs to really teach you how to use search. It's the most natural thing to use. So we didn't start out wanting to build a catalog. What we realized was almost everyone can search. Almost everyone's used to the notion of tags. Almost everyone's used to the notion of ratings and reviews. I know four stars means something's better than two stars. I know thumbs up means it's better than two stars. I know thumbs up means it's better than thumbs down, right?
Starting point is 00:46:26 And so if you can leverage the catalog as an entry point or use that capability as an entry point and build on top of that. So once I find something or once I think these five things are what I want, then add the building part to it, right? Then figure out a way for people to prep and model, query, and see the results. Then figure out a way for people to prep and model, query, and see the
Starting point is 00:46:45 results. Then you have a way to very quickly get most people to actually be able to work with the data as opposed to having to stop post-discovery and then having to go and ask someone else to use another tool. This is where
Starting point is 00:47:01 you've seen the term data fabric come up. I'll talk about data fabric, data pipeline, data mesh, and data catalog, right? Because they're actually not the same. But the problem is the marketing is so darn confusing, right? So right now I'm seeing a lot of catalog guys go, ah, I'm a data fabric. Okay. So the way to think about a data fabric in terms of what it should do and what it needs to have is actually a fabric. Yes, it needs to have the ability to connect to multiple data sources.
Starting point is 00:47:31 It needs to have a catalog-like functionality or metadata, in fact, metadata governance. But it actually needs to have the data modeling and access layer that you can actually use and then be able to have some sort of coordination orchestration layer of saying, this is who uses it. This is how you should use it. This is what you should do next, right? That's kind of the broad overall definition of what a data fabric is, you know, Gardner's definition as well.
Starting point is 00:47:59 Now we've taken that and kind of modified it a bit in the sense that we think that the access layer should be both direct and federated, right? Because if you still require people to move data, it's not going to be a good experience. And we also believe that you do need visualization because for a lot of people, that is a better way to validate whether or not this is what you're looking for or not. I challenge anyone to say, I can send you a 50-page SQL query and you can tell me this is the data you're looking for.
Starting point is 00:48:29 Me, in cost, because you look like a smart guy, I think you can do it. But if you gave it to me, I wouldn't be able to do it with you, my friend. But I can look at a pie chart and I can say, yeah, it's probably looking pretty good. Or the narration and the storytelling to be able to tell you the value. So we actually kind of go above and beyond what the standard partner definition of a data fabric is. Now that means
Starting point is 00:48:50 it's doing all these things that cataloging is not doing. Cataloging is stopping at the metadata management and discovery. It's not getting to the access layer, it's not getting to prep or visualization. Now data pipeline is just the moving of the data, right? And if you think about, you know, my dad was an English teacher, so I probably spend way too much time analyzing words and their meanings. So a pipeline has a connotation of it's steel, it's rigid, right?
Starting point is 00:49:13 Once I put it in, it's just going to move. Whereas fabric, it's loose, it's flexible, right? And that is because a fabric is as flexible as a question. If I ask some questions, a wrong question, I ask another one. I'm going to iterate. I'm going to change a question. A fabric allows you to kind of on the fly, very quickly change what you're looking for, very quickly build what you're looking for, have that flexibility.
Starting point is 00:49:38 So this is kind of how I think we think about the world and data mesh. Our friends at Starbirds I know have done a lot of work on the data mesh. I think there's a fabric and a data mesh kind of co-exist together. I think a data mesh is a framework, right? That encompassed a lot of things and you can have a data fabric in the data mesh framework. So that's kind of how I see those things. Yeah.
Starting point is 00:50:03 Yeah. I agree and it makes sense. And I think we are still in the process of properly defining all these terms and understanding them. One last question from me. So you mentioned at some point that the data catalogs were mainly used by the data governance people in the organization, right? After this evolution of the data catalog into the data fund, right? Who are the people who use and consume this tool? Yeah, so we're seeing data analysts and data engineers now actively using the data fabric to be able to automate
Starting point is 00:50:42 the building of data sets, automate the building of on-demand SQL queries, et cetera, right? I think the next evolution or the next iteration is, as you later on, no code, actually later on, NLP and NLG, a lot of business users, a lot of, you know, right now I would say even the kind of fairly technical business analysts could also use the data fabric. But I think the goal, at least the goal I have is, how do we get the data fabric in the hands of even the non-technical people at all, the guys that just want to ask a question
Starting point is 00:51:16 and get an insight? And that's where a lot of work around NLP, NLG, and AI, and free text search and so forth is really going to come to play and kind of take that to the next level. So that's the part that makes it very, very interesting because now the fabric can actually span to laymen, citizens, business folks, business analysts, data analysts, data engineers, and even the governance team. And I think where you can have that, that's where things start to
Starting point is 00:51:45 make sense and that you can have governance analytics, BI all under the same framework, which today it's a necessity because these crazy governance rules like GDPR, CCPA, they're really hard. They're really, really hard to actually be compliant. And I can't think of a way to do it if you still live in the cycled world that most people have, where everything is, I only do this, I only do this. It's going to be a nightmare.
Starting point is 00:52:13 And so I see the data fabric as, it's finally, there is a way to actually do this, to drive velocity in decision-making, but also do it in a way that automatically takes care of governance. I feel like we have more to chat about, to be honest. But I know that we are close to time here, so I'd like to allow Eric to ask any last
Starting point is 00:52:35 questions that you might have. So Eric, all yours. Eric Boerwinkle I think we're at the buzzer. I think, I think Brooks is telling us we have to close it down. Casey, this is, whenever we run along, we know that it's a topic that is not only deeply interesting and valuable to us, but also to our audience. So thanks for digging in. Thanks for letting us get a little technical.
Starting point is 00:52:54 And thanks for teaching us about not only data catalog, but helping to further demystify data mesh, data fabric, and all the other terms that marketers like me proliferate across the industry. Am I going to see a blog from you on the data mesh and data fabric now? Oh, man, I'd have to dig pretty deep for that one. But cool. Well, thank you again for your time.
Starting point is 00:53:18 It's been great to have you in the show. Yeah. Thank you guys for having me. I had an absolute blast. So appreciate it. Thank you, Casey. Thank you. I'm glad that Casey let me ask him about working as an analyst at the Federal Reserve.
Starting point is 00:53:29 I just had to know, and he said it was boring, which I kind of expected, but at least he built a database in his spare time, which is pretty cool. I'm glad he was a good sport. I was really interested by the sort of, it sounds like a system that learns about the value in context of data over time that they've built, the Prometheum. And I mean, it sounds like they even have a patent on it. That was pretty interesting and is a really compelling way to think about the challenge of data governance and sort of a self-optimizing system. And I don't know if we've talked to a guest who's brought up that approach yet, which is really interesting. So that was my thing to think about for the week. Ah, yeah. I totally agree. I think this time I'm going to think about the same as you do. It is a bit of surprising that we haven't heard more about this ever-changing
Starting point is 00:54:29 nature of data and how this impacts the things that we do, the products that we build, the infrastructure that we design and those things. So yeah, I think it's something very, very interesting. It's intuitively, I would say, very important. We see, I mean, probably makes more sense to face this challenge when we are talking about data cataloging, because, I mean, makes a lot of sense that this temporal nature of data is more obvious there. But I think it's something that has a much broader impact with pretty much whatever data and the products around
Starting point is 00:55:08 it. So I think we both should keep the mental note to ask more about that, I guess, from now on. I agree. All right. Well, many more great episodes coming up. Subscribe if you haven't, and we will catch you on the next show. We hope you enjoyed this episode of the
Starting point is 00:55:26 Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, ericdodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by Rutterstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rutterstack.com.
