The Data Stack Show - 88: What Is Data Observability? With Tristan Spaulding of Acceldata

Episode Date: May 25, 2022

Highlights from this week’s conversation include:

Tristan’s background and career journey (2:43)
Updating old technology (11:40)
Defining “data observability” (18:44)
The primary user of a data observability tool (29:56)
Handling an incident (33:01)
Why multipliers for data observability (37:06)
Early symptoms of a data drift (43:12)
Tuning in the context of data engineering (50:11)
What keeps Tristan working with data (55:12)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by Rudderstack, the CDP for developers. You can learn more at rudderstack.com. Exciting news. We have another Data Stack show live and we're going to talk about streaming. There are huge questions. Is streaming a superset of batch? What is real time and what does that mean? Well, we've collected some of the top minds in the industry. We have people from Stripe, Materialize, open source projects that enable
Starting point is 00:00:43 streaming. And we're going to ask them a bunch of questions and give you the chance to ask your questions live as well. It's June 1st. Go to datastackshow.com slash live and register. Again, that's June 1st, datastackshow.com slash live and register. We'll see you there. Welcome to the Data Stack Show. Today, we're going to talk with Tristan from Acceldata. He works for an observability company. I'm sure you have questions about that, Costas. My burning question is actually around the type of company that he has served in his work in data engineering and ML throughout his career. So he worked at Oracle, he worked at DataRobot, and now he's at Acceldata.
Starting point is 00:01:30 And we're talking about big enterprises who are facing data challenges that, let's say, your standard Silicon Valley data startup, it's a very different customer than they're serving. And I think that market is huge, you know, relative to how we kind of think about the legacy company. So I just want to ask him about that because I think he's going to help me and our listeners, like, develop more of an appreciation for that world, you know, if they don't already live in it. So that's what I'm going to ask. How about you? Yeah, I'll be a little bit more predictable than you, I guess, in what I'm going to ask. We've been talking with quite a few other vendors in this space, like with data quality, data observability.
Starting point is 00:02:15 It's a very interesting and a very vivid market. So I want to see and ask him like how they perceive data observability, like what it is, why we name it the way we name it, and why the people are using it. And like get a little bit more into like the problem space and the product itself and see how it works and how it delivers value. Great. And you just used the word vivid to describe a market. And so you get some vocab points. I'll get some major vocab points for describing a market as vivid.
Starting point is 00:02:47 And what concerns me about that is you might actually turn into a venture capitalist if you keep using that language. Maybe one day, who knows? Okay, let's go talk about observability with Tristan. Indeed. Tristan, welcome to the Data Sack Show. So great to have you. Thanks so much for having me here. I'm excited to join the illustrious list of guests we've had here and hope to hopefully
Starting point is 00:03:11 live up to it. Oh, yeah. So many fun things to talk about. Okay. Give us your background and tell us what led you to be head of product at Accel Data. Sure. So for me, it's actually a long story that starts with being a philosophy major and getting into NinticWebs and all this stuff. But, you know, basically that took me through a path with the Indeca group at Oracle, working on sort of search, BI, analytics, big data, things like that.
Starting point is 00:03:37 And then I found this nice Boston company called DataRobot and sort of got in there, you know, one of the early product managers and sort of, you know, helped see that grow from where we were to now, you know, big company, big player in the AI, ML, auto ML space, things like that. And so the interesting thing about that though, and really the thing that led me to AccelData is that when you sit at that end of the spectrum of the data, sort of a data lifecycle spectrum. You're dealing with these ultra refined data sets, the best data sets that there are, and you're doing some really sophisticated things on top of it. But for a company that's based, you know, or was based really, really around automation and can we automate a lot of the data science lifecycle? Like what you start to notice is when
Starting point is 00:04:21 that, that works really well, but then you run out of problems because there simply isn't data sitting around that's perfect, that's pristine, even a static form, much less in an actual dynamic data pipeline that would be suitable to run out. And so looking at this on the one hand, you know, in one year I'm hearing about all these awesome developments in the machine learning world, you know, across the world, like it's incredible what's happening. And the other year I'm hearing, oh no, like we, you know, we're not ready for ML. Like we can't do that yet. So we're not, we're not ready for that. And so, you know, for me, one of the things I was sort of looking at is like, is there a way, like what are the barriers to being in a world where something like ML data robot tools like that can be used, you know, constantly nonstop because we've solved this data problem. And Excel is a really interesting company sort of sitting, you know, in the middle of this, sitting at the layer, not where, you know, it's sort of moving
Starting point is 00:05:14 data between places, but where it's kind of observing all these tools together. And so I thought this is a really interesting thing. Obviously, observability, you know, application observability, application monitoring, huge areas. I know there's a Obviously, observability, you know, application, observability, application monitoring, huge areas. I know there's a debate to be had, you know, hey, does that really apply to data as well? And I think, you know, my take, my bet, and, you know, I think people on this podcast, you know, may agree as well, but like, there's so much diversity and so much innovation happening in the data world that, you know, my belief and our belief at Accelidated very much is
Starting point is 00:05:44 like, yeah, let's go into this. Let's understand all of these rich tools that are becoming increasingly specialized, increasingly powerful, and kind of provide a common layer to understand all aspects of your data pipelines. So that's kind of my story in brief. And, you know, excited to see this. I think it's turned out to be, these pipelines are as messy as we thought and the benefits from cleaning them up are as appreciated as you would imagine. Yeah. Love it. Okay. So Costas knows exactly the question that I'm going to ask. And actually, this is in your wheelhouse because it's related to philosophy, Costostas. You know, and you come from sort of the seat of philosophy. Philosophy.
Starting point is 00:06:28 Of course, he's laughing. Okay, this is something that I've asked a ton of people on the show and it's probably one of my favorite questions because it's just so fascinating to see how people's sort of education and experience influences their work, especially when their sort of training doesn't necessarily fit a mold of what you would think of for someone who ends up, you know, in sort of a very data heavy or engineering heavy role. So how did studying philosophy, like, what are some of the main things that have influenced your work as an engineer or working with data that came from your study of philosophy? Yeah, no, I think it's an interesting question. And then there's answers at different levels of this.
Starting point is 00:07:10 So I think one of them that's, I think will resonate with everyone that works with data is you'll find that most of your time actually is in some sense about philosophical debates here. Like, well, how do you define what really is sales? And like, which way do we do this? And is the thing we really care about this or the thing we care about that? And of course you're answering, you know, at some level, this is all about answering things with data, but at the meta level of actually defining and articulating the
Starting point is 00:07:38 connection between these things and what you're looking at, what matters, it's, you know, precisely the question that the data can't answer because the data has been structured, you know, with one thing there. So I think, you know, on one side, there's a, there's all the, the fun arguments with that. I think, you know, the, the, the tactical answer for me actually was that, you know, a lot of what I ended up doing actually was this was in the, you know, maybe initial attempted heyday of the semantic lab that things like this was really trying to go out and map, you know, essentially's essentially data modeling by a different name. So map a domain, get into these nasty, you know, data legacy databases and kind of map those to these beautiful ontologies that we created in OWL and all these things, you know, but really underneath it's all this Nazi SQL query.
Starting point is 00:08:19 So, so I think that also, you know, is an element of like, you know, operating in multiple levels here and kind of connecting the things together. But for me personally, like, you know, is an element of like, you know, operating at multiple levels here and kind of connecting the things together. But for me personally, like, you know, I ended up, I went through this path of, you know, wrangling data, like just throw it out there, you know, make things work, you know, figure out what is Linux, you know, like how to work with KavanLite, things like that. But where I ended up was in product management. And so product management, I think, is a great, you know, example of sort of applied philosophy. And it's most glamorous parts, which are 2% of the time. The other 90%, 98% is a little different skill set. But, you know, I think the parts where it's relevant, you know, it really is about trying to clarify, you know, what you're working on here and what's worth doing it. And are you doing it for the right reasons or not? And so I think I've always been one of those annoying, you know, product managers, I think,
Starting point is 00:09:09 who wants to know, you know, we'll let someone get away with like, well, we're doing this because it's just easy. You know, it's like, think like that. We're going to do that. Or like, oh, we'll take up too much time. Right. And like, I think everyone who's worked in projects and worked in technical projects knows that, you know, it always takes more time than you think, you know, even taking into account that it takes more time than you think. And so I think being really careful about, you know, why are you, why is this important, something to work on? Like, why are you really, you know, the best in your category or why does it make sense to win in this dimension becomes something interesting. So I think those are all traits, but I think the biggest
Starting point is 00:09:41 trait of all was really, you know, you spend time, you know, in these philosophy classes and then like you realize this is a waste of time. And like, I want to get hands on and build something. And you become really impatient to actually build that. And you get really tired of talking about philosophical debates because you know what the point is. You don't actually ship something at the end of a philosophy class. Exactly. And that's really what it is. It's like you come in and you're like, oh, I spent all this time arguing, whatever. And then you're like, oh, wow, I can
Starting point is 00:10:08 write some code and like it does this cool stuff. Or like I can work with people who know how to write code very well or do excellent designs or sell things or make solutions work. And, you know, that's incredibly rewarding and you never want to go back to what you're doing before. Yeah, for sure. No, I love it. I've worked with some people who have studied philosophy over the years and some of the best people are like running down to the, to the, you know, sort of root of a problem or a question. And it is like, it can be difficult, right? Cause you're just like, we're trying to move fast here.
Starting point is 00:10:37 Right. But like answering those fundamental questions is so important for sort of producing, you know, ultimately what can be sort of the best outcome. Okay. important for sort of producing, you know, ultimately what can be sort of the best outcome. Okay. I know Kostas has a ton of questions on the technical side, and we want to try to define observability and do some of those things. But one thing I'd love for you to give some perspective on is that, you know, we have the benefit of talking with a wide range of people on the show, right? So we've talked with some founders who are building some like really neat tools, you know, sort of new tools, stuff that's like spun out of like Uber and
Starting point is 00:11:07 LinkedIn and open source technologies, like really, really cool stuff. You've spent a decent bit of your career sort of operating at a strata that, you know, probably the people who are working in, you know, let's say just the Silicon Valley data startup, just aren't as exposed to. So like Oracle, DataRobot, we're talking about companies who have been around a lot longer than maybe some of the customers of these newer Silicon Valley companies. So they're literally built on technology that's just been around for a lot longer, which has a lot of implications. That market's also pretty large and the data problems that they face are pretty different. And I would just love to get your perspective, what are the things that you see in terms of the challenges that that sort of,
Starting point is 00:11:56 call it enterprise strata or companies that have been around for multiple decades that have technology and run their business on technology that they adopted when it was state-of-the-art 20 years ago, but like, you know, are still running that today and trying to modernize. So any thoughts on that that would be just helpful to give us some perspective? Yeah, no, I mean, it's always been eye-opening, I think, to go around and you realize, you know, one, and you work in B2B, you know, products anywhere, like you, you realize all of these companies and all these layers of the economy, basically in like a society that maybe you didn't know about before. And so I think, you know, certainly as you work
Starting point is 00:12:35 in these, these enterprise software companies, you realize things are, you know, maybe more complex than, than at least I had thought as you get in there. And I think as you actually start working with these companies, you realize their internal landscape is incredibly complex. And so the reason we have such awesome sort of modern startup companies or companies that have now gone public and things like that, the Ubers of the world is the classic example of Facebook's. You know, these are able to move fast, you know, these, these are able to move, move fast, you know, because they don't have a lot of things weighing them down. Like they're able to do things differently. They're able to build things that they want, and then they get to a scale where they have to do things, you know, that the way that they want to, or a way that hasn't been
Starting point is 00:13:18 done before. And then you end up with, you know, Presto, Trino, you know, it was like in any, any number of other examples that came out of Uber or LinkedIn like this. And I think those become quite appealing to the next wave of the company. So the next generation comes out and says, look, like we've, you know, we, we know what you're feeling. We're operating at huge scale. Like we were, you know, we started with an app, you know, that's what we had. And then we go from there. I think, you know, if I, if I look at the challenges that these companies have and the reason it's sometimes you end up with different, you know, if I look at the challenges that these companies have and the reason it's sometimes you end up with different, you know, different fits and different product priorities for that group. I think, you know, one is maybe the obvious one, but there is a history here of technology investments. And so I think the obvious aspect of that is that, okay, there's a lot of technology around that, you know, is not maybe not the new, you know, cool one.
Starting point is 00:14:03 People aren't giving talks about it, things like that, but it's been around, it's driven the company for a long time. And there are also people there that, you know, have, have used it and know how to use it. And that's just what they, they do and things like that. And so I think, you know, one of the, like, one of the things then that you need to consider is like, obviously integrations, you know, how do you connect with those things like that? But I think more than that is actually looking at the change process. So like these companies are very smart people. Like these are giant companies, like they've been very successful. They've been around for a long time for a reason. And the people in, in sort of decision-making, you know, and leadership and architecture roles there are always thinking about how do I transition this?
Starting point is 00:14:42 And they're always thinking about previous rounds of choices that have been made and how to basically, you know, do better in the future and things like this and help with the migration. And so, you know, I think, you know, one of the interesting things about sort of the Accel Data background actually is like a lot of the core founding team, founding engineers, these were Hortonworks engineers. So they were very much, you know, building out Hortonworks and this powerful Hadoop system and helping install it and support it at these super complex installations.
Starting point is 00:15:12 And so I think, you know, they all had a very, and I had a different layer when I was at Oracle, had a different, you know, different perspective on this of the same phenomenon, right? Of people going very big on, you know, this technology that seemed to offer a lot of promise, you know, for sort of cheaper to view it, a bigger processing and things like that. And then, you know, in some cases, I think there were some successes with Hadoop, but in many cases, like now it's viewed as, you know, probably it was an over investment in many cases. And people are looking, A, like, how do I deal with that investment? So
Starting point is 00:15:42 I don't just leave it, you know, I make use of it. I train it as soon as the part that makes sense to transition. I leave the parts that are there. On the other hand, now that I'm evaluating the new wave of technologies, the modern data stack, you know, so-called, and maybe there'll be a, you know, a data stack after that, like, you're going to be facing the same discussion. Like, hey, how do I actually evaluate this? Like, and how do I adopt this in a responsible way that's not sort of like lurching from one thing to the next? And I think, so navigating these decision processes is definitely something that's quite relevant for these groups because they're looking at a much wider span and much more significant tracks here. So I think,
Starting point is 00:16:19 certainly for AccelData, one of the places where we spent some time with people is basically starting to become the experts on this modern data stack and trying to advise people on this For Acceldata, one of the places where we spent some time with people is basically starting to become the experts on this modern data stack and trying to advise people on this in an informal way. We want them to be successful running these workloads. We want them to adopt new things. And we want to stay, like everyone here, we want to stay abreast of what's the latest technology here and the best choice for what option. So I think that also, you know, these organizations will also not always be hiring, you know,
Starting point is 00:16:49 people who are going to be contributing to open source or like masters of open source, they'll be buying, but there is a role as well to kind of understand, you know, their needs and help bring that in a responsible way. So I don't know if that gives a sense, like it a little different and in some ways much more, I want to easy to lump a lot of things under the legacy umbrella when in fact, and I just appreciate it so much that you said, there are really smart people trying to navigate how do I modernize a stack that has had tens or hundreds of millions of dollars invested in it over a decades-long period and steer that and make good on that investment over time. And man, that is a really difficult challenge.
Starting point is 00:17:54 I think many people here have experience with technical refactoring. And it's a rare engineer who's able to come in and take sort you know, sort of a quote legacy code base or a really complicated pipeline or things like this and kind of, you know, gradually improve that, you know, both of what we have now and sort of restructure it in a new way. Like that's something, it's a, it's a lot more fun in many ways. And it's not necessarily easy to build something from scratch, but it's just a different skill set and things like that. So, but it's tough.
Starting point is 00:18:24 I mean, it's the price of, of success for a lot of companies is like, they built this, they've invested in technologies. Those are technology they've succeeded in fulfilling those use cases in most cases. And now, you know, you don't want to, you know, there's an exploit, explore, exploit, you know, dimension for them as well. I don't think it'd be phrased that way, but you know, I think, yeah, I mean, when they're confronted with an endless stream of a million startups pitching them on their unique thing, like how you decide between that, you know, I think becomes the key. Sure. So Tristan, I have like a couple of different questions, but I'd like to start with the basics. So let's talk a little bit about data observability, right?
Starting point is 00:19:05 So what is data observability and why we use the term observability there and not something else like data monitoring? Or I don't know. Why observability is the right word or what we are doing? Yeah. So I think, you know,
Starting point is 00:19:22 before we get into data, like I think generally, you know, people draw a distinction sometimes between monitoring and observability. And I think, you know, before we get into data, like I think generally, you know, people draw a distinction sometimes between monitoring and observability. And I think, you know, monitoring is meant to be kind of telling you that something has happened, I think in many cases, and drawing your attention to it. I think people use observability often to say, you know, this is the thing that helps you understand the internal state of what's happening. Basically, it gives you enough information, not only to see the symptom, but to go in
Starting point is 00:19:45 and find the cause very quickly. And so you see different tools, you know, do this to different degrees, but I think that's where, where that's mentioned. Now with data, I think what's interesting is that, you know, and you'll sometimes see, like you could ask a similar question around data quality versus data reliability and things like this. And so I think the interesting thing with data is that, you know, many of the use cases have existed for quite a long time. So data quality, like, and BI, you know, dashboards, reporting, things like this,
Starting point is 00:20:14 like these are not novel concepts. They're done likely much better now than they were in the past. The tools are really awesome, you know. But fundamentally, you know, it's still the same question. And the scenario is still the same. The social scenario is still the past. The tools are really awesome, you know, but fundamentally, you know, it's still the same question. And the scenario is still the same. The social scenario is still the same. You know, I'm in the room, I'm the analyst, I presented it. I'm trying to convince someone to do something. They ask that number can't be right. Like, where'd you get that? How'd you get that? Like, why is this data wrong? And so I never, you know, that's a bad feeling. You know, you don't want to
Starting point is 00:20:40 be doing that and you want to make sure it doesn't happen again. So I think that use case of kind of, Hey, I'm monitoring the data, like tell me if there's something weird that's on this, I think is an established one and, you know, fits with data quality and there's things that do that. Where I think data observability is different is it's really applied to newer, some of the newer, you know, and in some sense, more taxing use cases, like where you're actually providing a service to the outside world, whether that's a product, whether it's recommendations, whether you're literally selling your data, you know, or your analytics to a third party, which I think is, you know,
Starting point is 00:21:15 or in a marketplace like, you know, Snowflake obviously is promoting this and Databricks, things like that. So like, I think when you get into that situation, when it's not that, you know, you're going to feel awkward or like your colleagues are going to lose trust in you, it's like you're literally going to lose business because this is broken or it's delayed or it's wrong. And no one's going to tell you either. They're just not going to show up again. I think that's where understanding the internal state becomes quite important. So for us, I think observability means, you know, not just do I know that something happened, but how do I dig into the layer that I need to dig into and figure out why that is? And just to give an example of what that would mean, like, let's take the classic one of
Starting point is 00:21:52 like, hey, my pipeline is delayed. Like what's going on? Okay. So it's one thing to know. I think it's monitoring to know, hey, this pipeline's delayed. You know, let's go figure it out. That's another thing to know. Okay.
Starting point is 00:22:02 Let's go into, you know, this, this pipeline, let's say was using, you know, confluent for streaming. It was using Databricks to do some large scale aggregation. I don't know, maybe he's using Databricks for streaming too. And then it's landing something in a, you know, in a data warehouse and it's using Snowflake for that. And you're running queries against that. And that's what's like, so I think actually isolating at what point in this cycle, like this became slow, I think it's quite important. I think then digging in and saying, okay, now show me essentially the Databricks console or information gathered from it on what's going on. And, you know, was this constrained? Or like, did I have the wrong number of executors? Like, was my data skewed? Like what happened here
Starting point is 00:22:40 to cause this? Am I shuffling tons of data? Like that's the level where I think you get into true observability in the sense that that term is used, you know, in the broader, you know, non-data context too. We see it as quite intensive, I would say, or more comprehensive than monitoring. And they're like fairly serious about saying like, yeah, if this was monitoring, like that's, it's a great thing. As with BI dashboards, you know, there's better generations of that now than there were in the past. But I think, you know, we see observability as the thing that's
Starting point is 00:23:08 going to let you actually control and optimize and basically operate this in a predictable way. Is this observability more relevant to the infrastructure that handles the data or to the data itself or both is actually at the end? So, so my sense is broadly that it's both because, and I find it hard to decouple them for, for a couple of reasons. So I think one is that, you know, as you get into these data platforms, like the actual structure of the data has a huge impact on the, the basically the, the,
Starting point is 00:23:41 you know, compute layer. So like, just to take the spark example as well, like if you, if you do have data that's coming in and that data has drifted in certain ways, like that's going to make whatever configuration and whatever resource provisioning you'd have before inadequate and, and suboptimal and like at the volumes and the velocities you can be talking about, like that can be a quite significant difference. Likewise, I think as you start playing it out and you start looking at, you know, I'm doing this complex data pipeline, it's going against multiple things. Likewise, I think as you start playing it out and you start looking at, you know, I'm doing this complex data pipeline, it's going against multiple things. I mean,
Starting point is 00:24:14 timing becomes a factor as well. So if this is supposed to be read from this table at this time, and this upstream job was delayed for whatever compute reason, now you're going to have semantic problems in your data that ultimately were caused by an infrastructure issue. So I think in my view, you know, certainly there's an aspect of being able to dig into this and underneath, but, you know, even it's, it's almost hard to detangle them, even though I understand like traditionally, like it kind of has had this division of like, okay, I'm a data analyst type person. I look at this analytics engineer, I look at this and, you know, maybe I'm an IT engineer. I look at this. Analytics engineer, I look at this. And maybe I'm an IT engineer, I look at this. My read on this is that especially as we move to cloud environments where you're not asking your central IT team to manage this stuff, you as the data engineer are interacting
Starting point is 00:24:58 directly with the cloud provider and using their services, this starts to become more of a blending of skills that are, or let's say of concerns that used to be separate. Yeah, makes total sense. Can you give us a little bit of more, actually a few examples of like, what's the experience that someone has with the product itself? Let's say that I know that I have a problem with my data and my pipelines. And I'm convinced by yourselves people that observability is the solution to all my problems. So what happens next? How do I implement that? And what I'm interacting with in order to experience data observability?
Starting point is 00:25:40 Yeah. And I'll try to give a broad answer to this. I do think we have a unique take on observability and do things a certain way. I think other people's do it and I'll try to answer it in the general way here. Like, I think, I think the way that I see this happening effectively is basically I think there's two entry points that people look at. One of which I would prefer or recommend to the other, but like one of them is, is sort of connect to a, to a data source. And in many solutions, you'll see this. I think this is a little bit of a division in the industry or, you know, the micro industry that we're talking about with data observability industry. But basically, you might connect to the data source, which might be a data warehouse, or it might be a files or might be a stream. I think and you might at that point say, OK, these are all going to be basically analyzed by this data processing service.
Starting point is 00:26:25 It's going to look at, you know, on the one hand, the compute layer of where this is actually going through and, and, and sort of like the jobs that are processing, it's going to extract that from the provider. And it's going to actually look at the data itself and, you know, compute distributions on it, like analyze it for anomalies built, like everyone's building these like little simple time series models that forecast if a value is an artist, or sorry, sorry, tell you if a value is anomalous or things like that. I think that's one way to do it. I think the way to do it, that is we we've seen a lot more excitement around has been actually instrumenting the pipeline itself.
Starting point is 00:26:58 So these days, I think, you know, it's part of the unbundling or whatever, or rebundling, whatever, like, you know, there's a lot more code-based pipelines now. Like, you're not necessarily dropping in. Obviously, the drag-and-drop ETL vendors still have a huge market, you know. But, like, many new initiatives, especially ones that are serving external customers, are basically, you know, pursuing, like, code-based frameworks. You're coding it, and then you're orchestrating it with Airflow or Daxter or Prefactor. Like, the list goes on and on. And so I think the way that we've seen people get really excited about, especially the people responsible who are not writing the pipelines, but the people
Starting point is 00:27:35 responsible for keeping track of the 10,000 pipelines that are written, is actually to basically decorate these pipelines and have that information emitted back to the mothers back to the mothership here and have that actually, you know, give you this digest of the same type of information, like what's happening, you know, what, what's going on with this data set. Does it have values at a range? Is it not passing rules as well as, you know, let's get a read on the actual query statistics, you know, or the load on the database at this time and show you what that looked like. So I think there's, there's a couple of entry points that you,
Starting point is 00:28:04 that you look at here. I would say in terms of what happens, like what's the experience as a user, like supposing that you have this setup, I think there's two phases. So one phase is defining and sort of setting up, you know, what, what types of things you analyze. So sometimes people refer to these as tests or, you know or quality checks or dimensions, things like this. And it's very important. I'd say my philosophy on this is basically to try to automate as much as possible. It pains me every time someone has to write a test here. What I really want to do with data is be able to say, hey, I know that you have certain expectations of your data.
Starting point is 00:28:42 It's not going to drift. The values aren't going to be here. And you can define those things in a way that you can't actually even define them for software. So with data, you know, you can, there's a way to measure distribution drift. There's a way that forecast of things are anomalous, things like that.
Starting point is 00:28:54 And those in my view and in our view of Excel data should be applied, you know, in an automated, as automated way as possible. Once you've got all these things set up, of course, it's not always possible to automate it. Like, you know, you want to write code, like you have custom rules, you need to look across multiple columns. So there is this aspect, it's never going to be fully automated, but you want to push it as much as possible. Once you have that instrumented, it basically becomes, in some sense, a familiar
Starting point is 00:29:19 pattern. You're getting an alerting framework, you're getting told what's going on, you're getting an incident filed, you're jumping in and you're sort of seeing all these things connected together. I would say the only big difference with data that you'll sometimes see is that people will want, and we had a big sort of data seller, you know, requests and drive this feature. You'll want to actually split up some of the records, you know, that might've failed and sort of segregate them from things that go in. So, you know, there's a sense where, especially when you're dealing with files, you know, like very early in the stream, you want to filter out, you know, in quarantine or whatever
Starting point is 00:29:54 you want to call it, like basically some of the things that are a little suspicious before that ends up contaminating, you know, kind of your, your golden warehouse data or whatever's going into your model or things like that. And of course the user, a fool, going into your model or things like that. And of course the user, let's say, takes care of observability. I mean, in the infrastructure world, like things are like pretty clear there. You have like the SREs, they are like the primary consumers of like observability products, right? And for a very clear and good reason. Like when it comes to data, though, I think it's a little bit more complicated there
Starting point is 00:30:25 in terms of who are the consumers and who are all the different stakeholders there. Who is the primary user of an observability, data observability tool? Yeah. In our experience, it's data engineers,
Starting point is 00:30:36 but it's not all data engineers and it's not all the time. What I mean by that is basically, you know, I mean, data engineers, like machine learning engineers, like data scientists, like analytics engineers, like these terms are kind of telescoping or whatever you want to call it. You know, there, there, there's a, there's an explosion in specialization on these. And like, I don't know that the terms that we use for them or the job descriptions have actually perfectly lined up, like with where it actually is today, let alone where it's going to be in a few years. I've also seen job posts for data quality engineers, data reliability
Starting point is 00:31:09 engineers, things like that. So I think there's something where if you play this out a few years, there may be a specific role similar to how there are SREs now, where there's cloud ops, focused people that are focused on ops things. My know, my view at this point is, you know, it's sort of like the responsible data engineer, like, you know, who is looking at this. And I would say importantly, kind of that management chain. So I think as you're getting up and you want to view into what people are doing, that becomes the primary, you know,
Starting point is 00:31:36 user of an observability system. Like the first use case is of course, you know, hey, I've got this, I've got my pipeline, why is it broken? Let me know how to fix it as quickly as possible. I think that's a clear one. I think there are larger ones as you step back though, and the more you get removed, the more that you go from having five pipelines that are monitoring to 10,000.
Starting point is 00:32:00 I don't know if this number sounds incredible to people, but we absolutely have heard, we have customers that have 10,000 pipelines that they want us to monitor here. Like it's just insane. This is precious. Yeah. You know, it's like, it's become so easy and so powerful, the tools that like, if you're someone sitting in the central group here and try to keep tabs on it, like you have
Starting point is 00:32:19 no idea what someone downloaded and what they're running and what data they're using. And like, should they be using that data? Like someone asked to delete that data. Would you know? Like, and so, you know, in Excel data, like I would say one of the things that we found is like people, people start to get very interested when they're like, wait, you can instrument our data pipelines, you know, and sort of keep an eye on them and tell us what's going through them.
Starting point is 00:32:40 Like that starts to be quite, quite interesting, both to, you know, to, to buyers as well as to, you know, people in sort of the data governance type world. But I think the further out that you, the more that you aggregate this type of information, the more that you're actually getting, like, if you come back to these people we talked about earlier, the people judging, you know, and carefully considering the technology investments, you're starting to get a map of like what your business is actually doing with data, like which systems are being used, which systems, which pipelines actually work, which pipelines feed into others, which pipelines are reliable, who is reliable, who is not. So you start to get, you know, even though the initial engineer, initial user is like absolutely the data engineer who just wants to get their stuff working and keep it running and,
Starting point is 00:33:22 you know, be able to progress to the next thing. This does end up having different users as you grow and expand it. It does look like something very interesting, and I would like to ask a little bit more about that. So, okay, let's say we go instrument our pipelines. Now we are able to monitor things. Everything goes well as it should. And at some point, something breaks, right? And you get your notifications.
Starting point is 00:33:51 And the data engineer will go there and fix it probably. But the difference between something like data and observability of the server is that the impact that you have with data, it's much more difficult to calculate, right? To figure out what happened to the organization at the end, how many reports were wrong because of that. So how do we deal with that? Is there something that we do today? What's your experience with the organization that you have worked with? Like what happens after the incident?
Starting point is 00:34:28 So this one, you know, I think my experience with this actually comes more like from the ML world, like a data robot and looking at like what happens with when you deploy a model and you start feeding data into that model and it starts doing things. And then, you know, it's, you know, obviously the, I think the familiar, the case everyone's probably familiar with is with dashboards.
Starting point is 00:34:49 And of course this affects the dashboard report. We've seen this, you know, I think with models, what, you know, there's a, there's an increased sensitivity, like one, just because it's a kind of a newer technology and people have no concerns about it too, because of potential, you know, regulatory reputational things where you start putting out like bias data or just like obviously garbage things and people come back and mock you about it on Twitter and things like that. So I think, you know, I think broadly with data, like there's one aspect that's just, you know, being able to understand the whole flow of what's being used in that there's, I think there's no way around this other than kind of having integrations
Starting point is 00:35:25 with the various end steps here. So I think people are certainly very curious. Okay, we've got a history of historical runs here. This one went bad at this phase. And now show me everything that was impacted by this. And so this is something that we built out in Accelerator to sort of show the downstream impact across all these tables, things like that. But it's a bit of an infinite spiral here where there's the people that... There's the tables affected by it, there's the reports affected by that, there's the people who looked at those, there's the choices they made off of those, there's the customers affected by those choices. I don't think there's ever... I mean, you're right that there's never enough that you
Starting point is 00:36:04 can really trace this all the way through in the way that, you know, with software, maybe sometimes you can say, you know, these people hit this and they failed. On the other hand, I think like, you know, you can at least trace that chain through a few steps here and say, this was used in the cases I'm familiar with, like this, this data was used, like this batch of data was used by this model, which made these predictions, which were off in these ways. Let's go back, you know, and make sure like if it's these customers, like we score them again better, or we actually go back and offer them again because we gave them a terrible offer. We messed up the data going into the model.
Starting point is 00:36:40 Yep. Yep. A hundred percent. And I think that's where things are more complicated and more interesting with data because you have observability. We are talking about that, right? Because that's what actual data is doing right now. But then you start thinking, okay, if I also have other, let's say, capabilities that come from data governance, like Linux, okay? So I have, let's say, capabilities that come from data governance, like Linux. Okay, so I have, let's say, like a track of like how the data got consumed
Starting point is 00:37:11 and how it changed and like all these things. Then I can start like tracking like what's the impact there. And then if I also have like proper like auditing mechanisms that are in place, I see the people that are like getting involved and affected by that. So what are the, let's say, the parts of data governance specifically, but in broader terms, like the data stack elements that you think are important, like volume multipliers for data observability and vice versa, right? Yeah. And this has really been, I mean, I'm glad you asked this because this has really been
Starting point is 00:37:45 one of the key areas that I've been looking at and trying to think through in the time that I've been in Excel data. And I think it's one that everyone is sort of thinking through as well. And I think I would answer it basically with the theme that we've talked about throughout this discussion a little bit, which is really the theme between sort of existing and new and modern data stack. I don't know what you call the sort of like existing and new modern data stack. I don't know what you call the sort of existing data stack in a nice way, but you know, basically like the previous data stack or the current data stack and the modern new data stack. And I think
Starting point is 00:38:15 the fascinating, like it's an easy distinction where like, basically I think a lot of the data governance things are almost, you know, and you even hear it in the word lineage, like it's almost a standing still, like studying how things are fitting together and almost an academic, you know, it has very real applications in terms of data, you know, privacy, data protection, things like this. But it is almost sort of understanding like what's the intended, you know, connection between some of these things. I think where observability, you know, has proven to be quite useful additional lens on this is basically for these people that are not working in the data warehouse, like they're out writing code, they're grabbing data sets, you know, from different places. And this stuff actually did not
Starting point is 00:38:58 really go through the data catalog or things like that. It did not get classified. It did not get a million rules applied to it. And so it's basically out there in the dark as far as the data stewards are concerned. And I can tell you that's something that they're quite concerned about in many cases. some amount of data to, you know, 90% is like not that great if, you know, now there's a whole new 80% of data that you have no insight into. And so I think observability, you know, I think these things converge and they feed off, but like certainly like one aspect is just what touches what. So I think the ability to go into pipelines, basically be generic about what you pull in and what you analyze, I think is quite significant and sort of understand what the actual usage and impact was. So if a certain thing is always breaking or always causing something else, and it's an initial point, you're getting a real map of kind of the actual dynamics of how your data
Starting point is 00:39:57 is flowing versus the intended map of how things should be working in the systems you have insight on. I think more specifically, you know, if you dig into some of these things, like I see the observability sector as a bit more, I would say performance focused in the sense of looking at, you know, what's something that's going to affect the outcome of the end result here. And what I mean by that is like, again, this kind of littlest distinction, which maybe is just informed by my background and things like this, between kind of the very established BI reporting side of the world and kind of the data product, data service, ML powered thing on the other side. So when you're building and deploying a machine learning model, you very much care about subtle distinctions in how the data is distributed and how that's shifting. And is that in a harmful way with respect to the model and things like that?
Starting point is 00:40:48 I see observability tools as very much getting into this and being very dynamic because the thresholds, you know, not to go into all this stuff, you know, it's a different discussion, but like, you know, the ways that you measure this would change over time, you know, in response to seasonality and different ways that you segment the data and updates the model and all this stuff itself. So like all of those things, I don't think any, you know, in response to seasonality and different ways that you segment the data and updates the model and all this stuff and stuff. So like all of those things, I don't think any data governance tool like cares about those because those are not about the data stewards life and kind of making sure we know, you know, we've got a complete map of where things are. Those are about, you know, basically real world performance and impact on things like
Starting point is 00:41:22 that. So a little different focus areas. And I think certainly, you know, there's tons of room and it's one of the big things we think about with our partnership strategies is around, there's tons of room for a cross collaboration and connection here. So just like one example is of course, all that great stuff that the observability tools dig into could very well be populated back into a data catalog. That's the kind of, you know, owns the experience of where people look for data. And now that's enriched with some awesome information around that more detailed than
Starting point is 00:41:49 you would have gotten otherwise around this. And I think we've actually seen, you know, some moves by some of the vendors there to start bringing in some amount of, you know, observability and quality stuff. I think there's a lot more to do. Likewise, I think, you know, I mentioned, I think one of the big things we try to do to accelerate is automate as much as possible and no more. And one of the things that helps, you know, with automation and defining policies and propagating policies across data sets and across columns is metadata about those. And so when we know, you know, this data
Starting point is 00:42:18 attribute has been classified in this way, it's used in this way, it's subject to these controls, that actually can inform the intelligent creation of policies so that you as the data engineer, instrumenting your things, actually don't need to instrument at all other than say, hey, point to my observability platform and let it take care of the rest. So these are early days, I think, but it's been one of the, I think, learnings for me as I've come in, just the different sort of emphasis areas of data governance versus data observability and the different personas that end up using them. One question, if I can jump in, Costas. So Tristan, you talked about data drift,
Starting point is 00:42:52 which is an interesting topic that I don't think we've actually discussed in depth on the show, but I would guess that most of our audience has like a base sense of what that means. Where have you seen data drift start like as a problem, right? Like, and part of the emphasis behind the question is data drift is one of those things. We actually had a guest a couple of shows ago who said data is silent, which is, I thought was a very interesting way of describing like a lot of the problems or the nature of the problems. Drift is when you discover it a lot of times, at least in my experience, which is probably on a smaller scale than what you've seen, but you discover it when it's a big issue, right? Where you're just like, my goodness, and you realize like, okay, this has been
Starting point is 00:43:41 happening for a while. Where does it usually start? And like, what are some of the really early symptoms, you know, that our listeners could, could look for? Yeah. So I think the first thing I want to do is, is distinguish a couple different senses of this. You can tell I was a philosophy major now, right? But I think sometimes data drift, you know, in the data engineering world sometimes refers to like schema drift and like, Hey, we're changing the structure of this or like what type of field this is. That's actually not what I'm talking about. And I don't think it's what your, your, your, your guest was talking about. I think what we talk about, we talk about data drift in that sense is basically, you know, changes in the distribution of data. So you've got the same fields, but the patterns and the data are shifting. And I think
Starting point is 00:44:20 one of the funny things about this, like one of the things that I learned, you know, is that, you know, a lot of the banks, so in the machine learning context, like a lot of the banks, you know, and insurers, like, and sometimes banking and insurance are by their nature, like fundamentally about forecasting and predictions and things like that. So like they're extremely mature in how they govern models since they've been using them for a long, long time. And so for models, for banks and insurance companies, there's been a standard measure for quite some time called population stability index and the various forms of this, where basically you're building histograms of two data sets. One is the data when you train the model and one is the data as you just saw it or in whatever time window you care to define. And basically they, for a long, long time, like there's been measurements of this and you can quantify this and say, Hey, if it's above 0.2, like that's worrying and things
Starting point is 00:45:14 like that. And then I think one of the things that's become more sophisticated as more people have adopted machine learning and it's become more critical. It's like, there's actually a lot more techniques to do this now than maybe there once was. So there's a lot more ways to do that. There's ways to do it, you know, using models, there's ways to adjust it for time series, things like this.
Starting point is 00:45:32 So it's become almost a routine check in many machine learning contexts. I actually have always thought it should be used generally. Like that shift can actually be meaningful and like it can catch things that are't, are not obvious to people. So I totally agree that like the problem with data failures is, is precisely that they are silent unless you set up the things to, to alert them.
Starting point is 00:45:54 And so machine learning, they actually have done that at least. And, and it's, it's fairly common in like ML ops now. Now, the interesting thing about this, I would say is like, you know, you've asked exactly the right question from my perspective, which is like, when does this start? Like because that data before it got to your model, like it went through a whole data pipeline just like that. And so, and you know, incurred a lot of expense and, you know, touched a lot of things and landed a lot of warehouse.
Starting point is 00:46:18 Like why do you only get it? Why would your alerting only be at the last second, you know, on this thing when it goes into this model? And so I do think one of the opportunities for data observability companies is to bring in some of that stuff. And I know it's not necessarily what data engineers wake up every day and think about. That's exactly why I think vendors who build this in have a lot to offer to be able to tune this and offer it and just get these checks out of the box for you. And to be able to apply those on a stream of or like on a file, on a JSON file, like before it gets processed and wrangled and all that stuff and it's up in the warehouse.
Starting point is 00:46:53 I think there's a lot of things there. Yeah. I would say the other problem with that, sorry, is like, it's basically, you know, the alert fatigue thing. And I think that's obviously an issue with any system. I think it's, you know, particularly intricate problem here because like since the carbon monoxide alarm, right. Because like, you know, not to make it more dramatic than it is, but like, you know, it's like, it's something that you wouldn't sense otherwise, you don't know that it's there and you don't really necessarily know. You know, so the last thing
Starting point is 00:47:24 you want to get is a hundred things. And I've seen people do this. You get a hundred things and all my data is drifting and it's like, I don't care. Mute that. No one's going to notice. There's some other reasons it's tricky in machine learning when you wouldn't notice the effects. So I think figuring that out and basically reconciling these metrics of drift that people
Starting point is 00:47:43 compute now against the actual business relevance of it is one of the interesting UX problems that a lot of companies are kind of thinking through now. Yeah, super interesting. And I think one of the things that's interesting to me is that in the context of sort of the drift happening at the final state of the model is that even then it's sort of a variance on a baseline right and so you could have drift that you know doesn't necessarily trigger an alert but that is like damaging in some way right i mean it's all you know exactly i mean it's all you know stuff is hard to tune in like i think the other distinction is like you know, stuff is hard to tune in. Like, I think the other distinction is like, you know, we, we saw this, like when I was at a data robot, right. During 2019 and 2020
Starting point is 00:48:29 and 2021, it's like, yeah, everyone's models were drifting, you know, whether you could quantify or not, like there genuinely was a real change in the world, like of huge severity. And so like there was quantifiable and legitimate drift that you could understand. And we had people looking at that and analyzing it and adapting to it and things like that. But like, so sometimes it's real, like just the fact that there's alarm trigger doesn't mean the data is wrong. It could be like, no, this is a genuine shift that you should get. And like, it's almost as good an insight, you know, as you expect to get on an EBI dashboard. Hey, this has significantly changed.
Starting point is 00:49:01 This attribute has significantly changed and things like that. So I think it's, you know, it's one of these things where like people, when I tell them like, Hey, we're trying to automate a lot of things and make it really easy. People there, they start asking like, well, how do you automate the response? Like, how do you automatically fix that? And like, my answer is usually like, I don't, you know, I don't think you can automatically fix some of these, like where you shouldn't at this phase. Like it's very setting aside the permissions issue to actually affect these things. Like it's just there are things that are possible to automate and things that that aren't really
Starting point is 00:49:30 possible to automate well. And I think judging, you know, the impact of, of these things and is this actually wrong? Is there a reason for it? Like, I think that ultimately is a human discussion where you need to go, you know, put your shoes on and, you know, do some detective work and figure out what happened here. And there's no tool that's going to magically do that. The tool will magically tell you about it. You know, yeah, yeah.
Starting point is 00:49:54 Can we just spend a few minutes talking about tuning? So this came up, you know, we were prepping for the show and I think, you know, we kind of talked about like the analogy of an engine, like building an engine requires you like connecting certain parts. And I think, you know, a lot of people's standard definition of a data engineer is like building an engine that just runs really well, right? Like it's efficient, like it doesn't, you know, have a lot of problems. It behaves consistently. But tuning is really kind of a different skill, right?
Starting point is 00:50:19 Like we're talking about performance. We're talking about taking a system and sort of identifying the elements that can be optimized, and then like talking about the ways in which we want to optimize them. And a lot of times tuning is sort of balancing like cost benefit, right? Like we can sort of, you know, sort of accelerate one variable of the equation, but like that may come at a cost to other variables. Tell us about tuning, you know, sort of in the context of data engineering or MLOps and like, what is that skillset? What's interesting about it is, you know, it basically comes back to the diversity and the power of the tools out there. And what I mean by that is, one, like a lot of these tools are quite new and evolving quickly. And so, you know, it's not that things are static and you can master
Starting point is 00:51:03 it and, you know, you're going to be working with one vendor. It's like, you're actually like always facing this choice to some extent of like, do I use something I know how to tune well? I personally know how to use tune well from a manager that I met that my team knows how to tune well, or do I try something else that like might be a better engine, but I don't really know how to deal with it and things like this. And so I think, you know, this is a choice like everyone is facing all the time. Like, I don't know if you guys ever done this, but like, if you go and look at the DB engines, like a page that's showing the ranking of these, like plot a few databases on these, and you'll see like the curve to accelerate to the top to be, to become quite sharp.
Starting point is 00:51:40 And so, you know, tons of cool companies on there, but like you so, you know, are, you know, tons of, tons of cool companies on there, but like, you see, you know, Presto slash Trino, you know, spike up, you see Snowflake, obviously all these things like, and there's going to be more, there's more that are, that are happening now. And so I think like, this just shows to me like the, the, the ability to, to choose a better engine is always out there, but you're always facing this risk of like, yeah, I don't know how to deal with that. Like, I don't know how to tune that, you know, and, uh, and how should I do this? And so I think, you know, certainly one of the things, I'll just mention one other thing before we get to the, maybe one of the, the, the approach, I'm not saying it's solution, but one of the approaches to solve it. Sure. I think the other
Starting point is 00:52:14 aspect is like, you know, in many cases that the answer, right. Is like, oh, well there's a commercial company kind of backing that and things like that. And like, they can, they can kick, take care of tuning for you. And I think of that, which is a great, you know, it's a commercial company kind of backing that and things like that. And like, they can, they can take care of tuning for you and think of that, which is a great, you know, it's a great solution. Like it's value, value for money on that. But I do think like sometimes people are looking for control and like, what if they, they want to do it differently or they want to customize things like that or what if they switch and things of that nature. So, so I think, one of the things that we try to look at, at least in the Excel data world, is try to look into each of the major offerings and basically cloud data engines here in enough depth that if you're a
Starting point is 00:52:55 specialist in one, you can keep operating that as you want, you get all the insights you need. But if there's something that might be a better fit, you know, you can use that inside. You don't need to choose all in one. Which I know, obviously, many of the vendors, you know, not to name names, but like are trying to converge on each other's territory, whether it's data processing or, you know, analytical databases and things like that. You know, so like, I don't know that for the actual end user, like that needs to be a one or the other thing at a technical level, at least like, I think these things can be tuned, you know, through expertise, through, through data. And that's something software, you know, in my view should be able to help with if you do enough sort of research and into the actual platforms. And so you should be able to get that out of the box. I think the only other factor I'd throw out there is, is the cost aspect as well. So many things are solvable through, through paying more, you know, and I think.
Starting point is 00:53:49 All things maybe. Yeah. No, some of these queer, you know, so I think, you know, there's always a solution that's to spend more, buy more engines, you know, things like this. Like the brute force, right? Yeah. engines you know things like this like the brute force right yeah and so like at some point that's not you know it doesn't always solve solve things but like we have seen like that wasn't possible you know 10 years ago necessarily to just say like throw more at this you know it's like well i need to go back and you know buy another server that i'm going to put this thing on and it's like
Starting point is 00:54:19 well today i just say yeah let it go or like maybe don't, maybe I don't even need to do anything. It auto scales up. And then I realized after the fact, oh, wow, I just spent like $30,000 on this query or something, you know, like running this query for a month. Like, yeah. And someone knows that you'd out with. So like, yeah, I mean,
Starting point is 00:54:34 it's a really complex, like on the one hand, it's become some really powerful tools that are available and they're very easy to use. On the other hand, it's become a very complex thing just because there are so many of them
Starting point is 00:54:44 and because it's so easy to basically rack up some spend on those that might not make some of the people in finance happy. So it's going to be an interesting time. And I think we'd all say, I think, I mean, certainly my view is that this is going to expand and it should expand. I want to see these databases coming up left and right that are specialized for specific things. I want them to be really easy to use. Like that's what's going to help us accomplish these use cases better. It's just going to be the data engineer sitting in the middle of this is going to have a quite complex world to kind of navigate through. Yeah, for sure.
Starting point is 00:55:15 It should be fun. It is fun. And Brooks is telling us that we're at the buzzer here. So I have time for one more question. And this is less on the technical side, but with the show, we want to just sort of help our audience get to know the people who are doing interesting things in data beyond just like the technical components. And so my question is like, what do you really love about what you do working with data? Like what keeps you coming back to
Starting point is 00:55:42 working with data? And like, why is the problem interesting to you? Yeah. Oh, it's a, it's an interesting question. I mean, I think part of it is definitely, um, just the scene, like seeing something, some tangible improvement and tangible impact is always significant or quantifiable. I should say quantifiable maybe is the better way to put it. So I think, you know, one of the interesting things with data is like, it is structured enough that you can measure, Hey, this got this much faster where I did, I did, you know, cause I did it in a smart way. I was able to do this, do it, you know, 10 times faster than I can actually measure.
Starting point is 00:56:16 Like similarly with, you know, I think just some of the things in the machine learning world are endlessly one of the most impressive accomplishments that you know that that has happened in the world in quite some time and so you know i think on the one hand just understanding you know being able to learn about those is is just stimulating in its own right and you know proud to see people building these these products and services that that that use you know predictive technologies and things like that like in a quite smart way that can explain what they are like it's just very cool like you can spend forever learning about it. And then I think, you know, what we do day to day, like at Excel data, I think is rewarding in its own right.
Starting point is 00:56:52 Because it's basically, you know, combining those two aspects. So it's taking like, look, we have crazy data technologies, but, you know, we don't know how to apply them to only, especially for a complex like enterprise environment with a lot of investments already. Then on the one hand, you know, there's this promised land. If I could just get to it, like if I could just get all these data pipelines actually working and running and things like that, I could use all of this awesome technology to help me build things, you know, we can't even imagine really. And so I think, I don't know, it's motivating for me.
Starting point is 00:57:21 Sometimes data pipelines, like, you know, it's motivating for me. Like it's sometimes data pipelines, like, you know, it's just, it's, it's kind of like messy stuff. Like, you know, it's not always pretty and things like that and flashy and shiny, shiny lights and shiny graphical interfaces and things like that. But it's kind of the part that, that, you know, brings it all together. And so I think it's, it's rewarding to kind of bring something to that group and give them a place where they can sort of accomplish these, these pipelines a little more easily. Yeah.
Starting point is 00:57:44 I mean, it sounds like you kind of can have your cake and eat it too, because you sort of get to address problems on a philosophical level, but you don't have this like intractable, like, you know, philosophical quandary that doesn't really have a solution. You can actually like provide value and make things better and faster. And just get more engines. That's the practical. Yeah, just more engines. Tristan, this has been such a great conversation. We learned a ton.
Starting point is 00:58:12 And thank you again for taking some time and teaching us about all sorts of things. Yeah, thanks so much for having me on. Costas, you know that I love asking our guests about how what they've studied previously, whether it's academic or not, that's completely separate from data and engineering, influences what they do. And so my favorite part of the show was both hearing about that from Tristan, but then seeing. I mean, it was so fun for me when I asked a question and he said,
Starting point is 00:58:48 my response is going to be indicative of the fact that I've studied philosophy. And he really, in a very concise way, sort of broke down in some ways the problems with the question, right? It requires definition and other things like that. And so I just really appreciate that. And I'm going to continue to ask those questions, which maybe you should, because I think, you know, sort of your lineage, if I can use a term that's relevant to what we talk about on the show,
Starting point is 00:59:17 is much more philosophical than my lineage. But yes, I will continue to ask those questions. What did you take? Yeah, you should, you should. I mean, I think one of the most interesting parts of this show is making, like connecting the dots with what people were doing in their past or what they studied and how they ended up working with data. So you should definitely keep asking that.
Starting point is 00:59:42 And I'm pretty sure that we will be hearing more and more interesting stories. I would say that from my side, I got really final in the answer of like, what do you study when you want to become a product manager and that's philosophy. So... Oh, interesting. Yeah. Yeah. Like I was always like thinking of, okay, if you want to become a product manager, like what do you do? Like, where do you study? How do you do that? Right? And it made a lot of sense today. Yeah, like, you go study philosophy. And then you're like equipped with all the analytical tools that you need to keep asking why again and again and again and again until everyone like in their room is mad with you and they just want to get rid of you. And that's actually what shows you that you're doing
Starting point is 01:00:31 a good job. Yes, exactly. Exactly. So yeah, it was a very interesting point of our conversation today, like making this connection between like having these skills that you get from something, something like philosophy and actually ending up like developing and thinking and designing products. So that was like super interesting. I agree. I feel like I have a very vivid picture of how that should work. All right. Thanks for joining the Data Stack Show.
Starting point is 01:01:02 Lots of great episodes coming up. Subscribe if you haven't and we'll catch you on the next one. We hope you enjoyed this episode of the Data Stack Show. Lots of great episodes coming up. Subscribe if you haven't, and we'll catch you on the next one. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes
Starting point is 01:01:14 every week. We'd also love your feedback. You can email me, ericdodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by Rutterstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rutterstack.com.
