The Data Stack Show - 193: Introducing the Cynical Data Guy: Is Data-Driven a Myth?

Episode Date: June 12, 2024

Highlights from this week’s conversation include:

Introducing a special edition of the show with the cynical data guy (0:19)
Metadata and LLMs (2:32)
Data-driven culture (8:44)
No-code orchestration too...ls (17:09)
No Code vs. Low Code (19:58)
Enterprise Challenges with No Code Solutions (20:08)
No Code Tools for Small Companies (21:40)
Inappropriate Use of Tools (23:06)
Final thoughts and takeaways (24:05)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Hi, I'm Eric Dodds. And I'm John Wessel. Welcome to the Data Stack Show. The Data Stack Show is a podcast where we talk about the technical, business, and human challenges involved in data work. Join our casual conversations with innovators and data professionals to learn about new data technologies and how data teams are run at top companies. Welcome back to the show. We have a very special kind of episode that we're going to try and start doing monthly. And you're going to love the name for this show.
Starting point is 00:00:41 It's called The Cynical Data Guy. And it's because we have a special guest who is the Cynical Data Guy. Matt Kelliher-Gibson is joining John and I. So Cynical Data Guy, welcome to the show. Thanks for having me. I'm so excited. I can't wait to dig into it. But we do need to give the listeners some context here. So how did Cynical Data Guy come about? Well, first of all, John's data consulting practice is called Agreeable Data. So we kind of jokingly call him, you know, the Agreeable Data Guy. [...] stack from deep in the bowels of corporate America doing all sorts of data stuff. And you have tons of war stories and tons of scars from trying to do data at very large organizations. And so when we talk about topics, you often have, let's say, a salty view that's been dashed against the rocks of corporate reality. I prefer to think of it as a realistic view, but others seem to disagree.
Starting point is 00:01:55 Well, of course, when John and Matt and I chat in the office about topics or LinkedIn posts that we see, we just enjoyed Matt's hot takes so much that we started jokingly calling him the Cynical Data Guy. And then of course, one day we stopped and said, this has to go on the podcast. So we got to do it. We got to do it. Okay. So here we are. Here's the format. So I'm going to act as the moderator. We have the Cynical Data Guy. We have the Agreeable Data Guy. And I just pulled some LinkedIn posts that I think are interesting topics to discuss. Okay.
Starting point is 00:02:35 So Matt, I'm going to just present some of that. I'm going to read some snippets here. Okay. Some of them I'm going to keep anonymous. Others I won't for obvious reasons. Okay. Are we ready? We're going to do, we're going to try and do three lightning rounds here.
Starting point is 00:02:49 So if I interrupt you and move on, sorry. Not sorry, I guess. Sorry, not sorry. Definitely not sorry. Let's do it. I'm definitely not sorry, and neither is Brooks. Okay, I'm going to read. Here's the first one.
Starting point is 00:03:02 You ready? Go. Yeah. Metadata will have a profound impact on the success of modern LLMs. With better assets, developers can leverage APIs to access and utilize their organization's data more efficiently in their applications, enhancing functionality and capabilities and streamlining the development of AI models. Okay. Metadata and LLMs. I'm sure that's true. Once CEOs actually start caring about metadata.
Starting point is 00:03:30 Do CTOs even care about metadata? I don't think I've ever met a person who cares about metadata, like, in an actual corporation. So is this just trying to sell software? Is this a pipe dream? I mean, everyone's trying to sell software. Is it a pipe dream? I don't know.
Starting point is 00:03:54 They'll probably say, hey, I can do it for them, but I don't know that's going to happen. What about the data quality aspect? Yeah, so how much data quality do you see corporate America really investing in? Not a lot. In your job, in your most recent job prior to RudderStack, at a publicly traded company, did you ever use the word metadata in a meeting?
Starting point is 00:04:19 Like with a business user or ever? Ever. Might have come up two, three times in a year and a half. Okay. Truthfully, yeah, it doesn't come up that much. Yeah, yeah, yeah. I remember the first time I think I heard metadata and I was like, what is that?
Starting point is 00:04:40 John, give us an agreeable take here. I think that is the right answer. But what percentage of companies have metadata in place that AI would be useful? Like today? 0.1 maybe, right? I want to be there when you meet with another company and say, let me tell you about metadata.
Starting point is 00:05:07 Metadata. I want to see how quickly their eyes glaze over. I think it's an easier sell than data quality. I think that's my take on it, right? Because data quality was a thing. It's like, oh yeah, data quality, got to get better quality data, data you can trust. That's still a thing.
Starting point is 00:05:23 But now it's like, you know, really it's about the metadata. And people don't understand metadata more than they don't understand data quality. So does that mean then metadata is just going to become like, we're just going to shove all of data quality into metadata? It's like, you know, it's about
Starting point is 00:05:40 the metadata. Like, making sure your pipelines don't break. No, I think people... This is going to sound cynical. You've corrupted someone in the first, like, minutes of the show. No, it's been longer. It's just been offline. Right. Yeah, that is. No, I think when people sell this, metadata has the benefit of being less clear, right? Because if you're data quality, it's like, oh yeah, data's wrong, data's right. Metadata is like, people will use it
Starting point is 00:06:06 and not know what it means. I think everyone's going to think you're talking about the metaverse. Is that like data in the metaverse? Or Meta, right? The company, yeah. All right. Okay, so man, co-host corruption in six minutes.
Starting point is 00:06:20 That's a record. That is a record. Technically, John and I have known each other for like eight years. Yeah. So I've been working a long time. Oh, that is true. That is very true. Yeah, he's been undermining it a long time.
Starting point is 00:06:36 Okay, so the future, the very future of LLMs, is based on an ambiguous concept that no one cares about and that no one actually has. What I'm getting from this is the key to LLMs and selling it is picking terms that nobody understands what they mean. That's probably true for a lot. Yeah, it is true. But my agreeable take on it is that there is some progress, I think, like in the BI space with
Starting point is 00:07:06 people doing some neat stuff with LLMs. Like Zenlytic, for example, has a pretty neat semantic layer that you can put on top of your data, and then the LLM interacts with the semantic layer, which does work better than, like, hey, generate SQL, like, from GPT. That is the right answer. People are doing it. But I think there's a lot of overhead. And in some ways, if you're a small company with not that much data and the semantic layer is like, well, all my data came from Shopify and my ERP,
Starting point is 00:07:40 somebody could do that for you, right? And you could have something reasonably usable. I think where you get into corporate, it's like, man, this is could do that for you, right? And you could have something reasonably usable. I think where you get into like corporate, it's like, man, this is just like an impossible amount of work. So scalability, like everything. Yeah, yeah, yeah, yeah. But you can have those early like proofs of like,
Starting point is 00:07:55 hey, it actually worked for this like smaller company. And then people will extrapolate, oh, like, yeah, it's going to take over the world. That'll be the dream of, look, we did it with one data source. How hard could it be with one? Right, right. Okay, okay. Ding!
Starting point is 00:08:12 Moving on to the next round here. I can't wait for this one. I'm just going to read a snippet here. This author will also remain anonymous. I'm going to choose a couple of pieces here. Data-driven people do not equal people looking at dashboards. You don't achieve data centricity through the wide adoption of a BI tool.
Starting point is 00:08:31 Skipping down a little bit. While access to business data is a crucial first step in achieving a data-centric outcome, it is only a small and early step in the overall journey. And then where's the zinger here? True data centricity or data drivenness is achieved when there are tangible commercial and operational outcomes stemming from the use of data at all decision-making levels in the business. Are you using data to effectively generate more value for the business?
Starting point is 00:09:01 Are the top leaders openly asking what does the data say? Or have we tested this assumption yet? Okay. So data-driven culture is a myth. No, they just don't. They don't care. Everyone says they want it, but when it really comes down to it,
Starting point is 00:09:19 you're fighting against usually a VP or someone who spent 30 years fighting their way to the top of that corporate structure and they haven't been using data. The idea that they're going to suddenly care about what the data says now is kind of naive. John?
Starting point is 00:09:38 Yeah, I don't even respond to that. He stunned you. The correct response is, why, yes, that's correct. What comes to mind is the whole, like, data-informed thing. Like, it was like, we need to be data-driven. And then it's like, let's pump the brakes a little bit and, like, go back to, like, data-informed because there's this, like, space for, like, intuition and blah, blah, blah.
Starting point is 00:10:05 Which I got to hear your take on that before I move on. But what's your take on the phrasing? Well, I mean, data-informed? What does that mean? I gave you data. Yep. I'm doing whatever I want anyways. Okay.
Starting point is 00:10:18 You were informed. Right. Right. So I think the take on it for me is data is absolutely helpful from a forensic standpoint. Like, I need to find out what happened: super helpful to have. It's helpful from a behavior standpoint of, like, almost all of us have watches now that track steps. And if you track steps, do you walk more? Yeah, you do, if you care about it, right? So I think that type of data is super useful, and that is a form of data-driven, of like, we have a goal, like we're not like really clearly tracking this activity, and we need to get here every single day. That's a wonderful use of data. The stuff beyond that, where it's predictive or it's,
Starting point is 00:11:00 you know, like recommendations, where you get into the AI/ML stuff, I think mileage may vary. Looking back on it, part of it, I've been in plenty of meetings where it was essentially pick your metric. Oh, we just did a big campaign. How did it do? I don't know. Pick the top three metrics that show the best results. That's what we're now saying. And the data says, which is amazing. Look at the number of views.
Starting point is 00:11:30 I mean, I've been in situations where, you know, literally, like you're looking at stuff like year over year or whatever, and it's down. It's like, well, it's down, but it could have been down by more. So, you know, really, this is a success and we should roll this out everywhere. And then that argument won. So is it, okay, so the, okay,
Starting point is 00:11:53 let's talk about the executive who battled their, you know, 30 years through the valley to emerge on the other side and they don't use any data. Why aren't they using data? I mean, it probably started out partially just because there wasn't as much available when they were going up, right? Yeah. I mean, this isn't like a, oh,
Starting point is 00:12:13 they just have always been data's for nerds. Yeah. I mean, they most likely had to do stuff without it. They either didn't know it was available or it wasn't available. Yeah. And they had to make decisions. And one of the things we do as we're successful is we reinforce and say, well, this is what got me here. So when you're then going to a person and saying, hey, I know your gut or whatever you've been doing or however you've been going about it
Starting point is 00:12:38 has been working for the last 20 years, but I have some numbers that say you should do the opposite. You're fighting against nature. You're not going to win. One other thing that I've noticed is that a lot of really good executives, if you break a business down into its most basic building blocks, there aren't a lot of numbers that actually drive the business forward. And there really are only a couple that are mission critical to move in the right direction from the executive standpoint.
Starting point is 00:13:16 Now, there's, of course, like a ton of data and a ton of stuff that like ladders up to that. But I think that there can often be this, you know, everyone needs to be really data-driven, meaning there needs to be this mass democratized access to easily drill down reports and all that sort of stuff. When in reality, that person's probably been successful because they know which two numbers matter and they push aside literally everything except for the stuff that moves those numbers in the right direction.
Starting point is 00:13:48 Right. Yeah. I think also it's one of those, if you are in a position where, you know, you're in a company and, like, just to be honest, the VPs are probably only going to be on your side if the numbers agree with what they want to do. Your best hope long term is to start going at people who are still early in their managerial career, where they're still forming these habits and what do they trust, and work with them. They're more likely to be open, and they're more likely to work with you and see opportunities and ways
Starting point is 00:14:21 they can make better decisions with that. I mean, it's a little bit of like, there's also probably a chance they're going to move on to another company in three years, but it's still a better approach that you're going to have than trying to really convince that 65-year-old CEO, you really got to trust my numbers right here. Yeah, I think tying things to financials
Starting point is 00:14:47 is like the best way to be data-driven in most companies because the numbers that matter are like profit and loss. Like if you're a VP over something, like it's whatever your profit and loss is going to show up
Starting point is 00:14:59 at the end of the month, at the end of the quarter. Yep. So if you can say, like, hey, these are drivers that impact P&L, then that's, I think, a conversation you can have and get a VP on your side, of like, oh, okay, cool. Yeah, we should work on this, we should track this. But what I think Matt's referring to, which, yeah, I've sat in those meetings too, where it's like, to pick on marketing, marketing's not doing well, like, again, this quarter, and they just rotate through vanity metric after vanity metric, like switching out
Starting point is 00:15:36 views and sessions and ROAS, like high ROAS on campaigns that were like a hundred dollars and high views on campaigns with awful ROAS, just like the shuffle. Totally. Totally. Just say it, Matt. You shied away from the mic. Well, I will say it also matters when you catch them because I've literally sat in meetings
Starting point is 00:15:55 where we were giving a very financially based, it had to do with a pricing thing. And the meeting started with the CEO saying, well, you know, we've got to do whatever the data tells us. We showed them why raising prices was not going to be a good idea, and literally the decision was, well, but we put it in the budget at the beginning of the year. So if you don't catch it at the right time, it's like, yeah, but the budget says that. Right. Yeah, you're not going to hit those numbers. Yeah, but that's what the budget is. I mean, we've got to go with the budget. Yeah, it has a very, like, you know, I don't make the
Starting point is 00:16:30 rules, I just think them up and write them down. Yeah, I agree, timing matters. Yeah, yeah, yeah. Okay, lightning round three. Are we ready? Go for it. Yeah. Okay, I'm going to read this one. Actually, this is great. So I'm going to mention, so this is Adam Lenning, who's a data platform engineer at Ben Labs. And I'm just going to read this. It's
Starting point is 00:16:51 kind of a short post, but it's great. And it ends on my question for the Cynical Data Guy. Ever heard of a tar pit idea? Basically, a tar pit idea is one that seems very appealing, and many people have tried to make it work, but ultimately no one has really achieved product-market fit. Already so good, huh? So as I've been thinking about it, I'm starting to believe that no-code orchestration tools and no-code ETL tools may be tar pit ideas. Tools like Fivetran, Airbyte, Gather, and literally a billion others all claim to handle moving data from A to B with low/no code, and they work in 90% of use cases. The issue lies in the last 10%, which will almost always need a code solution. While these tools may be useful, if we always need two plus tools to get data into our warehouses,
Starting point is 00:17:42 are they really worth it? I think many people would argue yes. But what's your take, Cynical Data Guy? Is that a tar pit idea? Yes. Yeah, I mean, yeah. And I think there's a lot of those out there. I mean, you know, anything text too, I feel like a lot of times it's just got that siren call for a lot of technology people, and when you actually sit there, you go, who actually cares about this? Yeah. So yeah, there's a lot of those out there, and they just get recycled over and over again. Break down the no... Are you skeptical of no code? I mean, you've built tons of data pipelines in your career, but you've also interacted with, you know, non-technical users or semi-technical users. Is the no-code thing a myth?
Starting point is 00:18:30 Yeah. I mean, let's be honest. When you're like, well, we've got non-technical users and they can do this. Do you really want non-technical users building this stuff with a bunch of building blocks? I don't think you actually want that most of the time. Eventually, you need someone with more expertise. I mean, you know, it has the feeling of basically saying, hey, we're going to build data like McDonald's,
Starting point is 00:18:56 where we're going to just have a handbook and anyone can cook a burger then. I just don't think that works. Agreeable Data Guy? On the no-code side, I really like it for, like, an embedded analyst. Like, I can think back
Starting point is 00:19:12 to a role I had where we had an analyst that sat, like, on the floor with ops, was there probably at the company like seven years, and did so much really incredible stuff with, I'd call it probably low code would be the right thing. And by the end, he could kind of code.
Starting point is 00:19:28 Did some really incredible operational stuff. From an IT and governance and quality, well, quality was decent, but IT governance perspective, scalability, I mean, awful, right? It just doesn't work. But from a business knowledge capture to something IT people could take
Starting point is 00:19:44 and then go do it the right way and scale and stuff like I think that's a great use case. And what companies end up doing is they let that get out of control. They hire a bunch of analysts and they all have their own things and like that's where it becomes like a big problem. Also there's a difference between no code and low code. Low code gives you some ability in there. No code is like just trust us. It's all gonna work. That's a problem. Which I would argue there are actually very few
Starting point is 00:20:09 no-code solutions out there. Alteryx is a great example of, like, it has a GUI, but that is not a no-code solution. They've got all sorts of little spots. But to go back to what you said, it comes down also to your discipline with it, right?
Starting point is 00:20:25 So as you said, if you have one person who's really there and they can do these things and then it gets handed off, that's great. The way this stuff typically ends up happening, and the way it gets sold to people to a certain extent, is, well, you can give this to everybody, you can get it everywhere. And now, oh, and we'll sell you the tool that will help orchestrate and collect all of these, which are all just out of control, and they're just duplicates of the same four things with slight changes in them. And it just becomes this mess that then they try to hand off to data or IT, like, you can just fix that. I think specifically in the orchestration space, I think Fivetran does a really nice job of no code. They do a really nice job.
Starting point is 00:21:08 They have the ability to do low code. Well, it's not even low code. They have the ability to do custom connectors in their platform. I think they're a really good example of, again, things break down at scale, mainly for them at cost, right? Like, it's just very expensive to run them at scale. But for small companies, they do a nice job of connectors that a lot of these small companies need, all together.
Starting point is 00:21:35 Like, yeah, it can get a little expensive, but it really is pretty much no code. And I think it's kind of like, know your level, too. It's perfectly fine if you're a small company or a startup and you just need these tools in order to get it. It's when you insist on trying to use them as you get bigger and more complex. Or the other one is, well, we bought it for one small part
Starting point is 00:22:00 of the organization. I think that's a big trap you can get into. We just need it right here, but eventually it's got to come outside if it's going to be there. A lot of enterprise stuff, there's a reason they've hired teams for a lot of this. They have a lot of edge cases. It's got to be fit to their exact specifications. You're just not going to get that with no-code stuff. Yeah. I agree with Adam's post that, like many topics, it's very dependent on the context, but completely no code.
Starting point is 00:22:33 I agree. I don't think it's realistic for this stuff. But the one thing I would say is if we always need two plus tools to get our data into warehouses, are they really worth it? But the reality in the enterprise is you're going to have far more than two different people. It's not like you can, I mean, I think that is actually one of the fallacies
Starting point is 00:22:55 of, like, you should have one single tool that handles. I just don't, I mean, is it possible? Yes. Is it reality? I don't think so. Yeah, I think it's less like, oh, we have to have one tool that does this, and it's more the idea of, like, people are going to want to use this stuff in inappropriate ways, and can you contain that? Right. You're like, well, we just use it for this. Really? Really? I mean, so like the biggest example of this to me is every, like, Jupyter notebook-
Starting point is 00:23:26 ish type thing, where they're like, oh, well, but you're not supposed to use this for production. I'm like, really? Then why does it say schedule notebook? Like, oh, it's really good in this isolated situation, but we made it so that you can use it hog wild. It just doesn't work.
Starting point is 00:23:46 It's like, this is just for development. We'll put no restrictions on it to use it in production. All right. Well, I think we're going to schedule this episode to go into production and end on a high note. Cynical Data Guy, thanks for joining us and we'll see you again in a couple of weeks. Great to be here.
Starting point is 00:24:05 The Data Stack Show is brought to you by RudderStack, the warehouse-native customer data platform. RudderStack is purpose-built to help data teams turn customer data into competitive advantage. Learn more at rudderstack.com.
