The Data Stack Show - 68: Season Three Recap: Holiday Edition with Eric Dodds and Kostas Pardalis

Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by Rudderstack, the CDP for developers. You can learn more at rudderstack.com. If you are going to the Data Council Austin event in January on the 27th and 28th, you're definitely going to want to meet Kostas

Starting point is 00:00:33 and me in person at El Mercado on the night of the 26th. We will buy you a drink and talk all things data. We will be on site for the conference and we're super excited to meet you. Kostas, tell me what you are most excited about asking our listeners if they actually show up to meet us in person. I don't know if I want to ask something.

Starting point is 00:00:56 I was thinking that, like, maybe I'd love to play the game where I say "this is interesting" and then we all do shots. So yeah, if you come and visit us, you will have the opportunity to play this new game that we don't have a name for yet, where I say "this is interesting"

Starting point is 00:01:12 and we all take a shot of tequila or something like that. Maybe you can play that game. I don't want to do that many shots of tequila, but we would love to meet you in person. It's going to be a great conference and we're excited to meet some of our listeners. So come by January 26th. You can reach out to us on datastackshow.com, fill out the contact form, let us know you're coming and we'll buy you a drink.

Starting point is 00:01:35 See you there. Welcome to the Data Stack Show season three recap. Kostas, I can't believe we recorded three seasons of shows. It's kind of crazy. I think we have 80 shows in the books. Of course, not all of them are quite released yet because we do recording ahead of time, but that's pretty wild. Did you think that we would get this far when we started?

Starting point is 00:01:58 Oh, yeah. No, I mean, I never expected that it's going to last that long, to be honest. Yeah, it's been quite a journey. I think that we are going to be like, you know, shows like Friends and all that stuff. We are getting closer. Okay, so I want to talk about a couple of specific themes that arose that are really interesting. But first, I want to ask you a question.

Starting point is 00:02:20 So when you messaged me on Slack and said, let's do a podcast, and we hopped on a call and we talked about it, you said, I want to talk to the people who are doing interesting things in the data space, both so we can just learn about what's happening out there and then meet the people behind it. Do you feel like you understand the data space, the data stack better as a result of doing the show? Like, what has sort of been the impact for you personally, as you think about, I mean,

Starting point is 00:02:53 you work in the data space every day, like, has it been helpful for you? Okay. That's an interesting question. I feel like I got answers, but I also, it also created like new questions.

Starting point is 00:03:03 Right. But I think at the end, what, what matters is like to try and get in contact with people and see how they are thinking and why they are doing the things that they are doing. Because at the end, the market is so young. There are so many things happening. Not all of them are going to survive.

Starting point is 00:03:19 And of course, not all of them have the best way to do things or whatever. So we don't really know yet what will be happening in a couple of years from now. But having like this kind of contact with passionate people who really love what they are doing, like, okay, we had people joining us, all of them. I mean, they have done like amazing stuff, right? Very smart people, very honest people, like with why they are doing the things that they are doing. So yeah, I think that's, for me, the most important part of this show: this kind of connection with all these people. I think it's what really keeps both

Starting point is 00:04:00 of us, I mean, okay, I'm talking more about myself, but I think that also applies to you, but I think it's what keeps us doing it. So yeah, I mean, okay, we say that we do it because we want to share things with other people, but we are also selfish, right? So primarily we do it because we're having fun and we meet all these nice people. Yeah, super fun. It is kind of a paradox because I agree with you. I think questions have been answered, but it's a paradox in that, you know, I think some of the simple, some of the more simple things where we think about technology around data warehouses or data lakes is sort of becoming crystallized across stacks across the board. You know, it's kind of like, okay, we see patterns emerging there, but then you talk with, you talk with people who have developed really groundbreaking technologies

Starting point is 00:04:49 and a lot more questions open up, right? Because these people are really sort of pushing the envelope of what can be done, which is super interesting. Okay. Let's just cover a couple quick themes here of what we talked about. The first thing I want to ask you about is, so I'm just going to rattle off the main themes from the episodes that I jotted down in my notes as I was reviewing the season. So we talked a ton about ML. So machine learning as a service, MLOps, the emphasis is kind of saying, okay, ML may be like the next step beyond analytics, right? So data stack to serve analytics. And then once you get that sorted out, it's sort of you serve ML

Starting point is 00:05:31 use cases. We talked about batch versus stream, which was super interesting. And then sort of like federated data, which was really interesting. And so sort of that tension. And then we talked a lot about observability. Actually, we talked to several companies who are trying to solve for the challenges that you run into with all this data sort of even thinking around how you deal with data is increasingly adopting thought patterns from software engineering. And that actually is reflected both, I would say, in the team structure, as well as the tools that people are trying to build. Observability, for example, right? I mean, that's sort of a direct adoption. So, as a software engineer, tell me what you think about that. That was just a consistent theme we heard throughout the entire season. Yeah. I mean, I think it's very reasonable to happen.

Starting point is 00:06:37 There is a reason that we have all these different disciplines that have the term engineering in them, from mechanical engineering to chemical engineering to software engineering to, I don't know, whatever other engineering we have. At the end, when you're engineering... Social engineering. Yeah. I mean, at the end, when you engineer something, there are some very specific principles that are shared across all the different disciplines, right? And I don't think that data would be something different. I think actually it's like an indication of this space maturing. That's what is happening right now.

Starting point is 00:07:13 So it matures. It has to be much more serious. It has to deliver much more consistent results. And that's when you start moving, let's say, from the experimentation phase to the engineering phase, where now you need to put processes in place. Now you need to ensure quality. Now you need to observe things and make sure that they work in the way that they should be working, right? So how do you do that? I don't know.

Starting point is 00:07:41 I mean, obviously, like in data, things are different compared to infrastructure observability, for example, or like whatever else. But still, the principles remain the same. We have a process. We need to observe the process. There are some data about these data or these processes. So we have some metadata that we need to track and try to reason using these numbers and see like, OK, can we trust the data? Can we trust our pipelines? Can we trust our data lake or whatever? So we are going to see more and more these principles being applied with anything that has to do with data. We have companies that are doing versioning, for example. I don't think we had anyone on this season, but there are companies out there like Pachyderm, for example, they are

Starting point is 00:08:23 doing data versioning, right? GitOps, I mean, at the end, we will get something like GitOps for data. So what I'm trying to say is that if we try to detach ourselves from what is happening and like take some distance, we will observe, and that's something that we've said many times, right?

Starting point is 00:08:39 That the data engineer today is like a role that is a hybrid between engineering and operations, right? That's again an indication that we are still early. Probably this is going to break into different roles, and then you might have DataOps and data engineering, right? Where someone is responsible for writing all the stuff that we need to execute there, and then we have someone who's operating all this software or whatever. And yeah, I think the next couple of months, maybe years, like one or two years, are going to be what defines how exactly this discipline matures. I would say, though, because you mentioned at the beginning the ML part and that we consider ML as the next step or whatever. I wouldn't say, like, if I learned something, it is that actually ML is not the next step.

Starting point is 00:09:26 Like ML is something that's out there, right? What is happening though inside the companies is that ML and analytics are kind of like two different functions, right? And in many cases, this also reflects on like the infrastructure that the companies are using. You'll see like a completely different infrastructure that ML is using compared to the BI function, for example. One is using a data warehouse, the other might be using a data lake. I think one pattern that we are going to see a lot, especially with the lakehouse paradigm or whatever, is the merge of these two into one. So we will see that everyone inside the company

Starting point is 00:10:06 is going to be using one infrastructure. And if you want, like, okay, I'll refer to a term that we usually make fun of, which is the data mesh. If there is, like, as I see it right now, value in this term, it's exactly this unification. We are going to have one infrastructure for all the data practitioners inside the company. We are not going to have them all separated as we have them right now. And I think this is happening now.

Starting point is 00:10:41 How exactly is it going to happen? Which paradigm is going to succeed at the end? Who is going to, like, if it's going to be called data mesh, data networks, I don't know. I mean, it doesn't matter. But there is a unification that's going to happen in terms of, like, how data is accessed and how it's used inside the organization. I agree with that. And I think a big driver of that is sort of right? If you think about common data schemas, tooling that can sort of enable common ML use cases on top of existing technology, there are a lot of things that are making it way more accessible, which is super exciting.

Starting point is 00:11:38 Let's talk quickly about observability. So we talked with a couple of companies, Bigeye and Lightup, and then it was a common topic in general, but what do you think about observability, right? And so, and let me, I'll give just a little bit of context here. So we said in one of our recent episodes, the stack is expanding, right? It's not contracting in complexity. It's actually expanding in complexity, which creates all sorts of problems in terms of being able to understand whether there are problems across the stack. What do you think about the observability space with data? Is that a, I mean, is it a huge need?

Starting point is 00:12:18 Do you think that those companies are solving like a really true problem? What are your thoughts? Yeah, obviously they are solving like a problem. There's no discussion on that. The thing is that I think it's still early when it comes like to how observability can be successfully implemented. I'll give an example, right?

Starting point is 00:12:39 Like we had, if you consider not just this season, but the whole show, right? In the past, we have also talked with companies like Avo, for example, right? Who are, they don't call it observability, they call it like quality, right? And they are focusing more on like

Starting point is 00:12:55 the streaming side of things, while companies like Bigeye, at least now, are focusing more on like the data that is at rest in the data warehouse to figure out what's going on there. But we see that we have, let's say, two sides of the same coin. Again, it's about data quality and figuring out if we can trust our data. Now, what's the best way to do it? Is it best to rely on an architecture where everything happens on the data warehouse?

Starting point is 00:13:27 Or you have a more decentralized architecture where quality and observability is something that is part of the whole workflow that we have and the whole stack? That remains to be seen. I think right now all these companies, they are tackling the same problem from a different angle. And at the end, the market is going to decide who's going to win based on which one of these is, let's say, the most important angle. Because at the end, what happens with markets in general is we have consolidation and we end up with a platform that does everything, blah, blah, blah, like all these things. I agree. My hot take is that I think there's going to be some combination of both. If you think about the sort of micro problem of data quality and capture, that's really, really important for certain teams in a localized sense, right? So if we have data coming in, that's driving some sort of like very personalized experience, for example, it probably makes sense to be like very rigorous on capture.

Starting point is 00:14:32 Now, I'm not saying that's not important for analytics and other things, but I think about observability as sort of a more comprehensive solution that crosses certain points of the stack as opposed to a rigorous approach to ingest. But like you said, it remains to be seen. It's a fascinating problem. I think maybe one of the most interesting ones beyond maybe data lineage, which has come up a couple of times. Yeah. Okay. That's what you saw solve at the end. But I think it would be interesting to have, and maybe that's something that we should include in our shows from now on, like to also interview, or, I mean, chat with VCs who have invested in this space, because the thing is that we are talking about product categories that are so new that, I mean, you don't know.

Starting point is 00:15:27 Whatever you are going to see today probably is not going to be true in a couple of months from now, right? Yeah. So it would be interesting to see these people that, okay, they invest their money and they have every reason to do it as early as possible, why they do it, and what is the thesis behind that for this stage of the market? Because, okay, if we are talking about data warehouses, I mean, it doesn't make sense to ask the investors. It's better to go to the companies right now.

Starting point is 00:15:55 But for observability and quality, I think that it's the right time where it's going to be much more interesting to hear what a VC has to say, not even the founder. Sure. Yeah. That's such an interesting proxy for sort of what the vision is of the problems that are being solved as people sort of look at their horizon. Yeah, because from a portfolio management perspective also, that's something that these people are doing. There are also correlations between all these different companies

Starting point is 00:16:29 and their investments. So it would also be interesting to hear on how they see this category related to other categories in data that they might also be investing in. Anyway, I think it's something that I think it's worth doing, like find someone who is very active in investing in data-related companies and get them on the show. All right. Listeners, if you hate that idea, go to datastackshow.com and fill out the form and tell us because if not, we're going to get someone on the show. Okay. Last question, because we're coming up to time here. One other subject that we discussed a lot was the modern data stack.

Starting point is 00:17:14 So we talked about this with someone who's been at Mixpanel for over a decade, and they're sort of migrating to this paradigm where they view the warehouse as an essential component of the data stack, which is really interesting for, you know, sort of a product analytics company. And then we had a panel with dbt, Databricks, Fivetran, Hinge, and then actually a VC that's pretty active. That may not have actually made it into season three. So that's a preview for everyone coming up. I have mixed feelings about the subject of the modern data stack. In one regard, I think about some of the episodes where we talked with people who just sort of assumed the basic components. You need like good ingestion. You need a single source of truth. You need to be able to move data

Starting point is 00:17:58 easily. You need sort of flexible pipelines. And that was kind of like, what are you doing with the data? And then we also had episodes where people were talking about serious problems with sort of any one of those components of the data stack. But I think probably one of the most interesting things was just hearing people who are practitioners actually trying to explain what it's like to use the modern data stack. And they just have a way lower emphasis on the tool set and more on what it enables them to do, which I think is really interesting. So with that theme, do you feel like you understand the modern data stack better? Or are you more convinced that people like me are making it into a marketing term?

Starting point is 00:18:55 Okay, I don't think that there is, let's say, some kind of clear definition of what this modern data stack is. There's no such thing. I mean, and we did, I think, a very good attempt to make things more clear on the episode that we recorded. But I think the consensus is still that, okay, it depends. And probably what is today, it's not going to be like tomorrow. And usually when you attack these kinds of problems where you have very, let's say semantic issues? Like people cannot agree on the definition. I think you need to, again, take some distance and focus not that much on the definition, but on the words that we are using and why we are using them.

Starting point is 00:19:39 And why do I say that? The most important thing at the end is that whatever is happening right now in the market is going to be a stack. And what's important about a stack? You don't have one component that can work on its own. No matter what, this is not going to be like I'm going out there and I'm buying a CRM where I go to Salesforce and that's it. It works.

Starting point is 00:20:04 In SaaS, for example, which was like, let's say the previous wave of innovation, you didn't talk about a SaaS stack. I need Shopify together with CRM and I don't know, like Marketo. Some email tool or whatever. Yeah, like you didn't need all of them together in order to have something that operates.

Starting point is 00:20:22 You could have each one of them. Did the companies at the end buy all of them? Yeah, they did, but they didn't go out there to buy them as a stack. That's what I'm trying to say. So what is, I think, very, very interesting and very, very important is that this is a space where synergies are very important. There's no one tool, one platform that will come and be like, we are doing everything.

Starting point is 00:20:43 Even if we are talking about Snowflake, right? Or, you know, like Databricks or Google even. It's not like I can go to Google right now and not use any other tool to do my job. No, you will need probably to use something for pipelining or something else like for, I don't know, for versioning or observability, whatever. So I would say that for people who are getting angry with it and thinking that this thing is like a marketing term and it's just used by the market to convince them to go and buy, don't think in this way. Don't be defensive. At the end, focus on the words that are used

Starting point is 00:21:23 and the terms that are used and try to understand how this is going to affect your work in the future. Because as a buyer, you will never be a buyer of one product. You will always have to choose many different products and how they work together is going to be important. That's why we also see that partnerships in this space are something that is starting in companies much, much earlier than what happened with the SaaS companies of the past, for example. So yeah, if you want my opinion, that's what I would say about data stack and the importance of the modern data stack.

Starting point is 00:21:58 And the rest about the definitions and who's going to be the winner of each one of the data stack parts, it remains to be seen and we will see. It doesn't matter at the end that much for the market, right? I mean, for the owners and the people who work there, it matters a lot. But for the market, it doesn't matter. I agree. Well, we're at time. Let me just do a couple of quick thank yous.

Starting point is 00:22:18 We talked with Ben, the Seattle data guy from Facebook, Ananth, who runs the Data Engineering Newsletter, and his day job is at Zendesk. Great episode. Tristan from Continual AI, James Serra at EY. Bart, who runs the Data on Kubernetes community, which is great. Of course, he mentioned Mixpanel, which is a really fun episode on the modern data stack and the warehouse.

Starting point is 00:22:42 We also talked with Pete Goddard from Deephaven, and that's a really interesting episode on sort of the difference between batch and streaming and doing stuff extremely fast. They have some pretty cool stuff going on there. Jeff Chow from Stripe, stream processing was a fascinating conversation. Definitely check that one out if you haven't. We talked with Igor from Bigeye, Scott from InterSystems, who talked about Data Federation, which is really interesting. We talked about making ETL optional, another federation conversation with Jeff, or sorry, Justin Borgman from Starburst, which was a really great conversation as well. We talked about

Starting point is 00:23:21 open source with Ashley from Benthos, really cool tool. And that was a great conversation, a great mascot for the open source project. And he's just a hilarious guy. We talked about data design with Kevin from Touchless Technology. We talked about IoT, which was a great episode, not a theme, but a great episode. And we talked with Rob from Thing Logics, and he talked about how he uses his own technology on his cattle farm in Oregon, which was amazing. We talked about ETL versus ELT with Matillion, which is a great conversation as well. We talked with Airbyte about open source and ETL, which was a super fun conversation. And we talked about data teams, which was a really

Starting point is 00:24:06 interesting conversation with Srivastan, who works at Robinhood and actually has a long history at a bunch of other data companies, which was really, really a good conversation as well. So definitely subscribe if you haven't. That's just a quick rundown of some of the highlights of season three. And we will catch you on the next one. Many, many exciting episodes that we've already recorded for season four that will come out early next year. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com.

Starting point is 00:24:53 The show is brought to you by Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
