The Data Stack Show - 69: What is the Modern Data Stack?
Episode Date: January 5, 2022

Highlights from this week's conversation include:

Panel introductions and backgrounds (2:55)
What the modern data stack means to each of our panelists (5:04)
Defining the fundamental components of a modern data stack (17:22)
How the modern stack drives insights and actions for businesses (28:03)
Getting to a uniform definition of the modern stack (33:45)
Managing the modernization of a large-scale data stack (39:09)
How testing works in the dbt context (48:44)
The relationship between the data warehouse and the data lake (52:25)
What has us most excited for the future of modern data stacks (56:02)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.
Transcript
Discussion (0)
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are
run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome back to the Data Stack Show. Today, we are recording an episode with a panel of guests.
This episode is also live streamed. So thanks for everyone who joined us on YouTube
for the live stream. We'll be doing another one of those and we'll let you know about that on
upcoming shows. This panel is pretty incredible. Back when we started the show, I would have said you were crazy if you said that we were
going to have a panel with people from DBT, Databricks, Fivetran, someone who's building
data infrastructure at Hinge, and then a VC investor from Essence VC who invests only
in data infrastructure products. So this is
pretty incredible. And I'm really excited to hear about the different ways that each of these
people talk about the modern data stack, because they each come from a different part of it on the
sort of SaaS side of the tooling
providers. But then you have someone who's actually using some of these tools to implement
stuff. And then you have someone who's trying to think about how to invest in them. And so I think
the variety of perspectives are going to be really, really helpful. So I'm pumped.
But Kostas, what are you going to ask everyone? I think I will improvise, to be honest.
As I usually do. We always do.
We always do.
But I think it's an excellent opportunity to see if the term modern data stack is just
like a marketing term, as many people say, or something more than that.
So yeah, I think we have the right panel there to figure this out.
Hopefully, at the end of this discussion, it's going to become much more clear why we
needed this term and what's the essence of the term.
So yeah, that's my goal for today, trying to understand better what the modern data
stack is.
All right.
Well, let's dig in and talk with all of these amazing thinkers about the modern data stack.
Let's do it.
Welcome to the Data Stack Show.
This is probably our most exciting episode to date.
We're also live streaming this episode, which is really exciting.
We have some of the best minds in data here to talk about the modern data stack. And I just could not be more excited.
So we have so much to get through. Let's just do
quick intros, maybe 30 seconds or a minute introducing yourself. And I'll just call out
the name since we have such a big crew here today. So Jason, do you want to kick us off?
Sure. Hi, my name is Jason Pohl. I'm a principal solutions architect here at Databricks.
I was one of the first 10 solutions architects.
So I've seen the company grow from just one instance type on one cloud to supporting all
three clouds and more instance types than I can count.
So I lead up the data management subject matter expert group for Databricks.
So anything that has to do with data engineering or data governance, I basically help enable
the field, our customers and partners on how to do it best and serve as a conduit back to product management as well.
Awesome. Amy, how about you?
Hi there. I'm Amy Deora. I head up partnerships for dbt Labs.
So I lead those relationships with other products that are integrating with dbt and then with consulting partners that are bringing dbt into industries all over the world.
Before joining dbt Labs, I worked about 15 years in data analytics and
data science consulting. Happy to be here today. We're happy to have you.
Paul, you're up next. Sure. I'm Paul Picaccio. I work at Hinge. I'm on the core data platform team
where I've built out the modernization of our pipeline, so this conversation
is really interesting to me. Yeah, we can't wait to hear what you built. All right, Brandon. Hey, thank you everyone for tuning in. I am a manager
of our technical product marketing team here at Fivetran. Prior to joining product marketing at
Fivetran, I was our first West Coast sales engineer. So really excited to see how Fivetran
has grown over these past couple of years. And I should also note when I'm talking about Fivetran,
I'm also referring to HVR, one of the companies that we've recently merged with as well.
Great. And Timothy.
Hi, everybody. I'm Timothy Chen. I'm an investor here at Essence VC,
where I invest a lot in data infrastructure, so I'm especially excited about this as well.
Before that, I was an engineer working on open source. I contributed to Spark, Kafka,
Drill, and other data-related projects.
And yeah, so definitely seeing a lot of interesting stuff happening in this space.
Great.
Well, like I said, we're all in for a real treat here.
We're not going to waste any time.
So Kostas, why don't you kick us off with the first question?
Yeah, let's start with the most important one, right?
So I'd like to ask our panel here what the modern data stack is and how they
understand it. So let's start actually with Paul, because I'd like to hear the opinion of
a stakeholder, right? Someone who benefits from it and uses it every day.
So Paul, what is the modern data stack? This is of course a loaded question
because it can mean all kinds of things.
But to me, I think it's a combination
of volume, access, and trust.
So, pulling heavy volumes of data
through in a reliable way.
And obviously trust, like it has to be secure
against intrusion and whatever,
but it also has to have high data quality.
You have to know that what is coming in
at various stages of your process is what you get out the other side. Because if you're running these experiments,
a very small margin could mean a lot to your data scientists and stakeholders and so on.
But in terms of access, I was thinking about this as we were talking earlier,
that you have sort of disparate patterns of access. You have to be able to explore
the data, not just with machine learning models or what have you, but with actual human
exploration. And you have to be able to examine your data at a point in time, so you have
repeatability concerns. But you're also thinking about, what is modern?
You have to be compliant with the law as well.
So privacy, and GDPR, and all of the ways in which you have to
now touch every piece of data across your entire stack at reliable intervals in order to ensure
that people are being protected under law. Okay, well, that's, I think, a very interesting definition.
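Paul's point about trust, that what comes in at various stages of the process is what you get out the other side, can be sketched as a simple stage-boundary check. This is a toy illustration in Python, not Hinge's actual pipeline; every name here is hypothetical.

```python
# Toy sketch of validating data quality between pipeline stages:
# compare row counts and null counts before and after each stage.

def stage_stats(rows, required_fields):
    """Summarize a batch: row count plus null count per required field."""
    nulls = {f: sum(1 for r in rows if r.get(f) is None) for f in required_fields}
    return {"row_count": len(rows), "nulls": nulls}

def check_stage_boundary(in_stats, out_stats, max_row_loss=0):
    """Fail loudly if a stage dropped too many rows or introduced nulls."""
    lost = in_stats["row_count"] - out_stats["row_count"]
    if lost > max_row_loss:
        raise ValueError(f"stage lost {lost} rows (allowed {max_row_loss})")
    for field, n in out_stats["nulls"].items():
        if n > in_stats["nulls"].get(field, 0):
            raise ValueError(f"stage introduced nulls in {field!r}")
    return True

raw = [{"user_id": 1, "event": "signup"}, {"user_id": 2, "event": "match"}]
transformed = [{"user_id": 1, "event": "signup"}, {"user_id": 2, "event": "match"}]

before = stage_stats(raw, ["user_id", "event"])
after = stage_stats(transformed, ["user_id", "event"])
print(check_stage_boundary(before, after))  # True: nothing lost, no new nulls
```

Real observability tools track far more (freshness, distributions, schema), but the core idea is the same: assert at each boundary, so a "very small margin" of bad data is caught before it reaches data scientists and stakeholders.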
And I'd like to ask next the VC of this panel,
because he's probably getting a lot of pitch decks out there where
people are trying to define themselves as the
modern data stack, or as part of the modern data stack.
So Tim, what do you see out there? How do people communicate to you
the modern data stack, and what's your opinion about it? Yeah, the modern data stack is one of
those buzzwords that you're so intrigued about that you really have no idea what it means. And
it just sounds good to keep saying it over and over, because I think it's one of those things where it feels like it's in the center of something.
You just cannot fully grasp it because everybody has their own definitions and it can be used in so many different contexts.
But back to your question about what do I see?
I think what the modern data stack is most comprised of, from what I can tell, is really the move to the cloud. Everybody
is democratizing access to all this data that they need across different functions, and
therefore there's a collection of tools, a collection of products, that everybody keeps
mixing and matching. You can sort of change and mix and match, but there are a few things that
are always sticking there, which is kind of your storage, your warehouses, and some level of different things here. And so I think, in some way,
it feels like people are trying to redefine modern data stack every single day, because, hey, I think
there needs to be a real-time streaming one. And maybe there needs to be a better way to get
analysts, a better graphical report style, right? You can
kind of stick a lot of things in here. But I think the core, from what I can tell, especially
talking to different customers and friends that are also investing in this space, what we're
seeing is modern data stack is also an opportunity where people are re-looking at all the things
they've done in the past. Like, hey, do I really need to be doing data the same way as before? Do
all the new buzzwords, like data mesh and all the stuff we're seeing right now, mean we have a
collection of tools and practices that actually help enable democratized access and get data
infrastructure and data analytics done in a very different way? But of course, "modern" has no end goal, and also no specific limitations or even requirements.
So anyway, it's a very fuzzy word, but I'll just leave it at that, because we're all trying to figure it out.
Right. Yeah, I'm glad we're trying to talk about the modern data
stack, though, not data mesh, because I think it'll at least be a little bit easier.
Otherwise we'd just reject every buzzword we could ever hear of,
and we could have a five-hour discussion too.
Yeah, maybe
we should make another panel just for
the data mesh and
see who is most confused about the term
at the end. But
I'd like to ask Brandon next.
What's interesting about Brandon, and why I'm very interested
to hear his opinion, is that in many people's minds, the modern data stack is a term that started with, or is heavily associated with, Fivetran.
But also because, as Timothy said, an important part of what the modern data stack is, is the democratization of access to data.
And I think that's a big part of the mission
and the vision that Fivetran has.
So Brandon, the stage is yours.
Tell us about the modern data stack.
Fivetran even has a modern data stack conference.
So we discussed this quite in detail across the conference
with all the attendees, all the different panels.
And I truthfully expect that we'll be talking about this
for multiple years to come, because the definition will continue
to evolve. But at some point, we might not refer to it as the modern data stack; we might refer to it
as some other type of data stack. But I think it was Notorious B.I.G. who said, more tech,
more problems. And ultimately, I think that's where the modern data stack comes from, right?
And then Timothy was talking about this too, the rise of new cloud technologies.
They make it a lot easier to scale out
what you're trying to do with your data strategy.
And with all these new technologies,
the capabilities that we can do as different data teams
across different companies continue to expand as well.
And with all of this new tech,
really it just comes to what else can we try to solve for?
So then I'm going to throw in another buzzword-y term,
machine learning, AI.
These are just generalizations of modern data problems
that people continue to run into.
And really, it starts to become: what is your modern data stack?
That really depends, in my point of view,
on what your company is trying to solve for.
If you have never utilized any sort of new technology
before, imagine your company is a single
one-person entity, and all you have is one laptop to work off of. You have no access to any cloud.
Then for you, a modern data stack might simply be a database that's set up locally on your
computer and happens to record transactions just from the one little office that you're in, with
your one little room. And really, the definition will continue to change as new companies introduce
new terms, introduce new use cases.
And I fully believe that we'll try to keep up with all that change as a company as well.
Yep. That makes a lot of sense.
All right. So next, I'd like to hear what Jason has to say. As part of Databricks, and what I find extremely
interesting about Databricks and Spark, is that we are talking about a tool that has existed for a long
time and has evolved all these years. So Jason, from your experience working in this
space all this time, what is this modern data stack?
Yeah, I think for me, the modern data stack,
I used to be a data warehouse architect years ago.
So I would work with companies.
I would use the popular ETL, BI, and databases of the day to build these data warehouses.
And I think since then,
we've had these digital-native businesses,
like Facebook, Airbnb, Uber. They've
built their entire businesses off of different tech stacks than the one that I
used to implement 15, 20 years ago. And those tech stacks were open; they came from open
source. And initially there wasn't a public cloud to go to, but now
there is. And what was really unique was they were using these tech stacks, and it was really multiple
stacks to do data processing and do historical analytics, but also do artificial intelligence
and apply machine learning to their models, to their data, to be able to optimize everything
from lead flow or lead gen to optimize ride routes for Uber.
So there's been, I think, this evolution where, as these digital natives
started up, they used open source, created their own open source projects, and then used those also
for applying machine learning. But now those same projects have been ported to the cloud. And now
I think the modern data stack is the culmination of all these open projects that are now
either hosted by the cloud providers themselves or by other cloud services
like Databricks, where we host Apache Spark and MLflow and all these other open source projects
that we've developed over the years. So I see it as a way for companies to kind of like pick and
choose whichever parts of the stack they want, the best of breed and combine them in a way that
gives them the maximum velocity for whatever they're trying to do. Okay. Okay. That's great.
Amy, I left you to be the last one because I have some very important reasons for that,
and we will see them in a bit, because there is a follow-up question.
But before we move to the follow-up question, tell us from your perspective, your personal perspective,
because you've also been in this space for a long time, and also from dbt's perspective, what this modern data stack is.
Yeah, I think of the modern data stack kind of in contrast, right, to what we had before, to how data teams were working before we had this suite of tools. And probably the biggest
change is kind of what Jason said about having really focused on best of breed tools for
each specific job that the data team does, right? So in kind of in the past, when we were looking
at different solutions, folks would look at some of maybe the informaticas of the world or these
kind of all-in-one solutions that did a lot of different things, right? And that was kind of
thought to be kind of easier and better. But now we have a data team that's choosing the very specific, best tool for whatever particular job they're doing, whether that's ingestion or transformation. Or we're even bringing in notebooks from data science
into analytics in a new way, new tools where we're bringing in data from the data warehouse
back to Salesforce and kind of back to these other applications in a way that we weren't before.
So finding those best in breed tools and being able to have interoperability. So teams have
the ability to both choose the tool that works best
for their particular use case and also change those tools, right? When folks see there's new
data warehouses on the scene, there's new different tools on the scene. And because of the interoperability
between all of those different tools in the modern data stack, folks can then choose what's best for
their use case. And a lot of innovation just happens from that choice now that the team has
in terms of the tools that they use. Yeah, that makes a lot of sense.
Anyone want to add something? Yeah. I think both Jason and Amy's
responses are good examples of how their companies have also continued to push forward the idea of
the modern data stack. Before Databricks, there was a concept of, let's say, a data lake,
and then AWS re:Invent really pushed that concept forward. And then Databricks came out with this:
okay, on top of this data lake, how are we going to make things more structured? How are we going
to add, for example, ACID transactions to your data lake? And now we have this concept
of the data lakehouse, which AWS has also started to adopt as well. And then take dbt, for example. The lines between typical
data analysts and typical data engineers are continuing to be more and more blurred with what
dbt is putting on top of traditional analyst workflows, traditional modeling. And now we have
this term analytics engineer. And these are all, in my point of view, great examples of how
technology, and the terms being pushed out as this technology evolves, continue to
make the definition of the modern data stack evolve as well.
Yeah, that's a great point. I'll go back to Amy. And I want to ask my next question. And the reason
that I'm asking this first to you, Amy, is because you are a person that works with partnerships.
I think everyone on this panel will agree that partnerships are something very important
for everyone who works in and is part of the data stack, which makes sense, right? Because
each tool needs another tool in order to deliver value at the end. That's why we have a stack and
not just one tool out there. So my question, Amy, for you and for the rest of the panel later is: what are, let's say,
the most fundamental and important parts of this stack?
Like what defines this stack?
What kind of functionality, at minimum, do we need in order to say that we have implemented
the modern data stack?
Yeah, this is a question where the answer is definitely evolving,
right? So if you asked folks maybe a year ago, they would say data ingestion, a data warehouse,
transformation, and a BI tool, right? They would say those are kind of the categories.
Now we really have a lot of innovation around a lot of that, right? So people would say ingestion,
a data warehouse or a data lake,
right? Or a query engine. There are all kinds of pieces that can serve as our data source.
Transformation. Then, we might say, in the BI layer we'll have more exploratory
analytics tools, like a notebook. We'll have traditional BI tools, like dashboards. We'll have
also what's sometimes called reverse ETL, or operational analytics,
basically this idea of being able to take data from our warehouse and put it back in source
systems. Then there are two other categories that are now part of
what most people call the modern data stack, but they're
still in a bit more of an exploratory stage. One of them is probably observability and testing,
right? So data quality, observability. There are a lot of
companies in this space and a lot of folks are figuring out exactly how this fits in the modern
data stack. But I think, to Paul's point earlier, this is going to be important, right? Kind of
understanding testing and observability. Some folks also put data privacy into that bucket as well. Then also there's probably a part that we're really excited about
at dbt, which is what some people call the metrics layer, right? So this idea of, between your data
transformation, creating your data sets that are ready for analysis and kind of your traditional
BI tools, how do we make sure that the definitions of the metrics that
we use to measure our business are consistent across all of the folks, whether they are
using that data for BI, whether they're using it for exploratory analytics, whether they're
pushing it into another system. So the metrics layer is kind of an evolving piece that I think
a lot of companies and a lot of different folks are really thinking about
as a new, evolving part of what we call the modern stack.
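Amy's metrics layer, a single place where business metric definitions live so that BI, notebooks, and reverse ETL all agree, might look like this in miniature. dbt's real metrics layer is declared in project YAML and SQL; this stdlib Python sketch, with made-up metric names and data, only illustrates the idea of one shared definition per metric.

```python
# Minimal sketch of a metrics layer: every consumer computes "revenue"
# from the same central definition instead of re-deriving it per dashboard.

orders = [
    {"order_id": 1, "amount": 120.0, "status": "completed"},
    {"order_id": 2, "amount": 80.0,  "status": "refunded"},
    {"order_id": 3, "amount": 50.0,  "status": "completed"},
]

# Central registry: the single place where each metric is defined.
METRICS = {
    # Revenue excludes refunds once, here, rather than every BI tool
    # deciding separately whether refunded orders count.
    "revenue": lambda rows: sum(r["amount"] for r in rows if r["status"] == "completed"),
    "order_count": lambda rows: len(rows),
}

def compute_metric(name, rows):
    """Look up the shared definition and apply it to a set of rows."""
    return METRICS[name](rows)

print(compute_metric("revenue", orders))      # 170.0
print(compute_metric("order_count", orders))  # 3
```

Whether the consumer is a dashboard, a notebook, or a system pushing data back into Salesforce, all of them call the same definition, which is the consistency Amy describes.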
Okay. That's, I think, a very,
very thorough definition of what it is. And it's
great to hear about all these different layers that it has,
let's say. And I think that's probably also contributing a lot to,
let's say, the problem that people have out there defining it exactly and saying what this is and why
we are using it. Because there are also, from what I understand, many different variations.
It's not like every company has exactly the same needs or it's at the same maturity level to utilize
AI and ML.
Some companies, they just build their BI layer.
It doesn't mean that there's no stack there
or that the modern data stack doesn't apply to them.
That was great.
What does Fivetran think, Brandon,
about the most important components of the data stack?
Of course, ingestion, right?
Like, we need that.
Is this correct?
Or can we live without ingestion?
What do you think?
Yeah, I mean, in my point of view, of course,
ingestion is always first and foremost, right?
Getting the data to where it needs to be
so that you can actually do what you want to do
with your data, with the complementary tools.
And that feeds back to Amy's earlier point
about interoperability,
of making sure that all of the tools that you pick are seamless, that they all work together. And many times people actually start with, let's say, the data storage layer, right? Trying to figure out how to make sure that things hold up as their cloud applications and the number of tools that they're using continue to evolve.
They want to make sure that as they're pouring that data in, whatever underlying data storage
system they're using is able to support all of that new data that they're going to be working
off of. And that supportability is broken into a few pieces. Of course, price is always going
to be a part of it. Money is always going to play a part. But the other part comes back to query optimization: how efficiently can the queries that your team
is used to running actually run on these different data storage pieces?
In my point of view, it's fine to start in that place.
Just always consider as you're building out the rest of the stack, how it's going to function
with that piece.
And of course, part of it is going to be driven by what your company wants to do.
What is prompting the move to re-evaluate certain data tools you're using?
In that case, it oftentimes starts with: what is the problem that you're trying to solve for?
Is it that your data storage doesn't work, just going really high level here? Is it that your data integration is breaking all the time, and you need a fully managed service to accommodate
changes, like API changes, across all the tools you use and whatnot? It depends on what problems you're having. Sorry, that's a bit of
a non-answer. No, no, no. I think it's a very, very good point. I have a question that is
just for you, mainly because it has to do with ingestion. I mean, Fivetran has been in this
space for quite a while. It has actually disrupted, let's say, the space. I remember, if we were talking six years ago,
seven years ago, all the noise was about
how we can get access to our data, right?
And I think a big part of the mission
that Fivetran has is to make it as easy as possible
to get access to your data.
Have you seen, after all these years,
that this goal has been achieved,
and that more and more
companies are actually focused on implementing other parts of the modern data stack,
and consider the ingestion part solved? Or do you think there's still a lot of work to be done
there? I think there's always going to be more work to do. And when we think about data integration, data replication, sometimes we just think
about it in terms of how to get data from point A to point B.
But there's a lot that goes into it, right?
Beyond just getting data from point A to point B, how can we make it as efficient as possible?
How can we make sure that we're connecting not just to maybe one endpoint through some
API, but how can we make sure we're pulling all the fields from that endpoint?
How can we make sure we're structuring that data so that, when it lands in the warehouse, whatever
queries you want to run will actually function, because the data types are cast correctly? And part
of making data integration easy and making data accessible is making sure that everyone understands
how to use the tool. So one of the core components of Fivetran is ease of use. It's
easy to use; you won't see a lot of buttons in there. When I was a sales engineer,
sometimes I actually dreaded demoing Fivetran
because people would ask all these questions
about what's going on behind the scenes.
I can show you what's going on in a workflow,
but you won't see anything in the UI
except for some GIFs that are moving back and forth
to represent data.
And the reason I say that this will always evolve
is because if the goal is to make it as easy as possible, make data integrations as easy as possible, it means abstracting away a lot of those considerations.
Things like reading API documentation should be, in theory, a thing of the past if you're using any of these tools.
Maintenance as, let's say, schemas change in the data source system.
That should also be a thing of the past.
And to make all those backend considerations work, a lot of it goes back to,
will it evolve? Yes, it will evolve because people continue to do funky things with their
source systems. People continue to adopt best of breed tools unique to their departments,
various challenges that they're trying to solve. So there will, in my opinion,
always be work for Fivetran to do to further optimize some of these backend considerations
we're making on behalf of our customers, supporting higher throughputs as the data volumes across all these sources
continue to grow as well.
And ultimately making it as out-of-the-box as possible.
So making sure that we hit all the edge cases
that our thousands of customers
are continuing to run into
with all their funky setups.
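Brandon's points about casting data types correctly and absorbing source schema changes can be sketched as schema evolution at landing time. This is not Fivetran's implementation, just a minimal illustration of the general idea, with hypothetical field names.

```python
# Toy sketch of schema-drift handling: when source records grow new
# fields, widen the target schema instead of breaking the pipeline.

def evolve_schema(schema, record):
    """Register any fields the target schema has not seen, inferring a type."""
    for field, value in record.items():
        if field not in schema:
            schema[field] = type(value).__name__  # e.g. 'int', 'str'
    return schema

def normalize(record, schema):
    """Land the record with every known column present (missing -> None)."""
    return {field: record.get(field) for field in schema}

schema = {}
batch = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com", "plan": "pro"},  # new 'plan' column
]

landed = []
for rec in batch:
    evolve_schema(schema, rec)
    landed.append(normalize(rec, schema))

print(schema)  # {'id': 'int', 'email': 'str', 'plan': 'str'}
```

A managed service does far more (renames, type widening, deletes, API version changes), but this is the shape of the "funky things with source systems" problem: the pipeline keeps landing data rather than failing when the schema drifts.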
Okay.
Jason, you are the last one from the vendors
that we have on the panel.
So what do you think, from the Databricks perspective? How would you define, let's say, the fundamental pieces of the modern data stack?
Yeah, I mean, one thing I think has changed a little bit since I started is that, when I first started at Databricks,
it was pretty common as an architecture to basically have a data lake where you store all
of your data and do a lot of the data transformations there. And then you'd offload
some of that data to a data warehouse, like Redshift or Snowflake or BigQuery or something.
And then you would basically use that for serving up the analytical queries from
business intelligence. And I think we've been kind of marching towards this confluence for a while, but now you really can achieve both things on the data lake. So we have this
data lakehouse concept that we've come up with. And the concept behind it is, you can have
the economics of a data lake and do all your ETL there, but you can also have your interactive BI
queries run on top of that data lake using SQL.
So essentially you can write your data once and then do all your use cases on top of it,
whether that's streaming or SQL for BI or machine learning or graph analysis, it doesn't
really matter.
And so I think with the modern data stack, there's a number of different tools out there
that do these different individual things, but you can kind of combine all of them on top of the lakehouse as long as they're
adhering to open standards.
And in some way, this is kind of like realizing what Google realized 20-some years ago when
they wrote the white paper for MapReduce. They realized that they had way too much
data to copy it around to do their processing,
and they were going to have to bring the processing to the data.
And that's kind of where the impetus of MapReduce started.
And then it spawned off all these other open source projects from there.
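The MapReduce idea Jason references, bringing the processing to the data, is easiest to see in its classic word-count form. A real MapReduce job shards the map and reduce phases across the machines that already hold the data; this single-process Python sketch just shows the three phases.

```python
# Classic MapReduce word count, collapsed into one process to show the
# map -> shuffle -> reduce structure. In the distributed version, map and
# reduce tasks run on the nodes where the data already lives.
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

The projects it "spawned off", Hadoop and later Spark, generalize exactly this pattern of shipping functions to distributed data rather than shipping data to a central processor.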
That's very interesting.
And I'll ask next our investor.
I think this is going to be interesting, because some people might also find good ideas to go and pitch to investors after this. So what are, from an investor's point of
view, the important parts that someone needs in order to define this data stack and implement it?
You mean from a customer point of view, or what you actually want to buy?
Yeah, from a customer's, because, okay, I guess investors
invest in ideas that have a market, so I think it's important. You know,
you're just hearing the last few folks talking about, right, we're actually talking
about different kinds of personas that are looking at this stack in different ways. When
we talk about ingestion or infrastructure or data, those are definitely more of a data engineer, infrastructure sort of point of view of people.
So that's one kind of customer, right? And then we look at the analytical users,
right? Data analysts, right? dbt has a large brand around those kinds of users coming into
this space. But what's really interesting and exciting that we're seeing as
investors is that data is truly
getting pushed out to many different kinds of functions. Lots of people are increasing budgets
to hopefully get to a data-driven enterprise, right? Or data-driven businesses. Where can we
actually leverage analytics? Where can we leverage AI? Where we currently
have to do processes manually, is there a way to do RPA? You know,
there's data driving insights, and insights driving actions, which in turn drive
automations. Can we continue to feed this loop in multiple places? And so there's also a
segment of users who are just business analysts, product managers, anybody doing any
sort of function like sales or marketing, right? This is where reverse ETL even came from. Why do we even need to have this in a modern data stack?
Well, we can't just stick the data into a dashboard. We've got to push it out to the
tools that other people use. And so if you're looking at what is necessary to implement this,
this really is a very tricky question because everyone's needs look different.
Everyone's maturity level of how they view data and the sort of complexity differs a lot. And
that's actually one of the difficulties of investing in this space. Because, hey, one person will be
maybe proposing, hey, we need an all-in-one solution that just takes modern data stack
in a box.
Install it, everything you go.
Don't worry about anything else.
Don't worry about any other vendors, right?
There are solutions, and definitely companies out there, trying to reduce all that complexity into one single-click install.
And there's pros and cons of that, right?
How do you do best of breed? How do you hide all the complexity? Can you make the Fivetran-like experience just mask every single vendor out there? That's one kind of consideration.
However, on the flip side, if you say,
hey, we want to be the most modular,
you can truly choose whatever you need.
There's a lot of engineering work
and sort of like maintenance work
and integration glue work that needs to happen.
And so there's a huge spectrum. Everyone's trying to figure it out. If you talk to any head of data, they're scratching their heads every day, still, because they're fighting this fight between, I'm hearing all these buzzwords about the modern data stack, and I think maybe I need something like this, but I'm not entirely sure. How do I mature my organization so I understand what I need first? How long does it take to implement?
What are the trade-offs?
What are the ways I should actually even think about doing this in the first
place? And it's not easy. Because, I think, you know, if you really talk to data engineers, they have one set of concerns; so do the analysts, the head of data, and truly the business.
You really need somebody that understands all those pieces in most of these companies that can
actually say, hey, here's what we're going to buy first. Let's ignore all the noise. Let's take this approach. Let's start from this business unit. Let's spend this much money. And for larger places, it might start from more foundational pieces, like the Databricks sort of things, and fight your way up; and for smaller companies, it might be starting from more defined, smaller pieces of solutions that you can kind of combine, right? So you need both, but it's hard to have one product catch all of everyone's needs, and that's the complexity of the space.
And we're still evolving, with all the new tools people are proposing.
So it's hard to answer that question, because we have so many discussions. There's no one single answer. They all relate to each other, they all talk to each other, and there's no consensus whatsoever on one single way to implement the modern data stack.
Yeah.
Tim, I have a question for you. It's actually inspired by a question that I got from the community.
There is a very specific, how to say that, like there are very specific semantics around
the term stack when it comes to developers, right?
For example, we have Jamstack, right?
This is like the latest thing.
Before that, like, I don't know, like a decade plus ago, we had LAMP, where we had like Linux
bundled together with MySQL and Apache and PHP.
So it's something that's like extremely, extremely well defined in the mind of a developer,
what a developer stack is, right? And I'm asking you because you've been a developer, so you
understand this like very well. So do you think we will reach a point where we will have something
similar also for data and that the modern data stack can become this? Yeah, it's, I think it
would be interesting because if you look at everyone's definition
of modern data stack at the moment,
there's actually a few pieces that's fairly consistent.
It hasn't changed that much.
And so you kind of have to play devil's advocate for each category a little bit, right?
You're like, okay, are there new categories that we need? Because we're constantly inserting new categories into our data stack.
You know, we had the LAMP stack. That's just four letters. At Mesosphere, when I worked there, there was the SMACK stack, right? That was only five letters. And the modern data stack is like this infinite number of letters. There's no set number of characters anymore. So it's like an unbounded string; you can just do whatever you want.
And that's why it's hard
because we're going to increase the number of categories for sure.
I don't think we're ever going to have just five.
There's more business analyst stuff.
There's AI things.
We haven't talked about catalogs or discoverability.
There's stuff we're going to just add to the stack over time. And will we see consolidation in each one of those categories? I think we will. And there's also going to be cross-functional things: my tool that does catalog also does quality; my quality tool also does catalog. Multiple vendors are showing up in different categories of problems, right? So will we get there?
Will all this shuffling end and just become one letter?
I don't think so.
Our future will be really interesting.
I do think that there's definitely going to be consolidation of logos.
Really, there's no way to have a crystal ball.
I don't know exactly what that will look like.
I do think, though, like I said, the number of categories will increase.
And so, actually, it's even debatable which categories are even worth being a category in the modern data stack in the first place, and then even figuring out what is the cross-section of things that people really like.
Yeah, 100%. And I think if someone sits down and lists all the different categories that appear, even if you take data quality, for example, there are probably subcategories inside that, you know. So yeah, absolutely. And I think that's one of the reasons people should keep in mind why it's so hard to define: it's just too early. All these vendors are appearing right now because now the market
is ready for that, but we still need time to define exactly all the categories and all these
things. Paul, you're last, but for a good reason. You, let's say, represent the most important stakeholder, which is the customer and the user. And you have implemented a data stack at a very large scale. So what is the data stack for you? And what are the essential components
of this data stack, of the modern data stack? So I think it's related to two pieces.
It's easy to talk about data in the abstract, but really like thinking about the end purpose
of all of this machination is important at each step.
Like we're doing all of this work so that people can answer questions to drive the business
to do particular things; tangible decisions have to be made as a result of all of these systems. So for me, it's mostly about trustworthiness, even over performance. Although, trustworthiness versus performance: if you don't have anything to look at, then it's very trustworthy, but it's not very useful. But yes, you have Confluent, KTables, Materialize, all of these
sort of vendors in this space are dealing with the abstraction question that Brandon
was talking about earlier.
And like, they're doing it in a way that is observable.
So for me, it's like, I want to know what is happening at each stage.
I want to have an eye into the opaque box and see what decisions it's making about my data. And so correctness and trustworthiness seem to be the most essential piece of the modern data stack to me, because if you're serving
up answers that are not provably correct, then you lose stakeholder trust.
And then also, you lead your business
to make the wrong decision,
maybe at a crucial time.
So you have to make sure
that you're telling people the truth.
Yeah, 100%.
I think that's a pretty hard problem to solve.
And I think there's a lot of debate
on what's the best way to implement quality.
I remember having discussions
where there was a question,
is quality something that can be implemented by one system
or should it be the responsibility of each part
of the data stack that you have, right?
Is the pipeline responsible for its own quality
and then the storage for its own?
I think it's a difficult problem to solve, both product-wise and engineering-wise. But that also makes it so interesting and fascinating.
Eric, the stage is yours.
Oh, well, perfect timing because I was thinking, Paul, I have a two-part question for you.
Following along with thinking kind of beyond the actual tooling
or individual componentry into sort of what is the outcome
of these things coming together.
Before we get there though,
I'd love to know from your perspective.
I loved your comment about thinking about this
in the abstract.
And when you think about the modern data stack
in the abstract, you sort of arrive on the scene
with like this amazing modern tool set, right? And in reality, different
parts of the stack become modernized at different rates. Could you speak a little bit to what that
dynamic is like sort of managing a large scale, you know, complex data stack?
Sure. I mean, I think it starts with your message bus, however you pass information from one group inside your organization to another. And, well, I guess way, way back in the day, you had maybe one database, where everybody would be like, go ask the Database, capital D. And it's like, well, that doesn't really work once you scale past enough people to sit in a room, you know?
So, being able to communicate between different parts of your organization means that, and people love to fetishize Conway's law or whatever, but you have these sub-organizations, or sub-organisms, that have to learn to trust each other as well.
So like if you're modernizing at a different speed,
you have to sort of prove that your piece is trustworthy, first of all,
but also will work with the other organization's piece.
And so, passing data back and forth, I know RudderStack does that. That's your bread and butter.
So it's being able to have inspectability into, or an eye into, the systems that communicate internally. So you're taking data in from the outside, and there's the idea of a single source of truth, or what have you, but you end up having multiple views on that data. Or, I suppose, it does come back to trust again. So that's kind of the main piece. You have your overall data strategy of, like,
this is how we're going to manage lineage, for instance,
throughout the organization.
This is how we're going to manage.
Just like knowing what did we know?
When did we know it?
How do we know that we know what we know?
That's kind of circular.
Sure.
It's so fun to talk about the specific
tech, but it's a really helpful thought pattern to put trust at the center and say, you modernize
your stack according to the ability, or sort of need, of various components to deliver trust. So, second part of the question: building trust, especially as you have a stack
that's sort of increasing in complexity is non-trivial, right? And I know there's some
tooling around that. My guess is that on the ground in a lot of companies, you have a team
that's sort of managing that across a tool set. So let's say trust is sort of a central use case.
Are there other use cases in the business?
I mean, analytics is a primary use case, but let's just sort of assume, let's abstract
the modern data stack and say, you have a great set of tools in every part.
What are the other use cases that are hard to build for?
And are those, you know, sort of still in the analytics space?
Are those enabling teams who are delivering like the actual experience to the end user?
I'd just love to know sort of, great, you have all the modern tools, but what are still
the hard problems to solve, even if you have best in class?
Well, I mean, the answer is: it depends. That's always both true and disappointing. But if you sort of focus on that "it depends," then you're focusing on flexibility as your method of delivery. I'm thinking of MLOps, for instance. Everybody seems to be reinventing MLOps for themselves and being like, our use case is super different from yours; we're doing statistics on numbers. And it's like, cool.
But so, talking to individual ML engineers, or somebody who's building out an experimentation platform, or integrating an experimentation platform, or something like that: getting them enough information to answer the questions that they need to answer now, without building a behemoth of a thing where you're like, this solves all of human knowledge, and they're like, cool, we needed you to not do that and give us an answer tomorrow. So, talking with individual people. I think that's a lot of companies' value set. At Hinge specifically, we have a lot of individual power to make what we think is necessary for the space. And so, having the flexibility within your team to make whatever "it depends" means, to deliver the answers that you need to answer. You know, yeah, that's a word.
Yeah, that's great. That's, I once had a coach who said, you know, there's not sort of like
advanced maneuvers. There's just mastery of the basics and then combining those
to apply to like a really specific situation in a really specific, you know, context on the field.
And so I'm glad that you said it, because there's not a tidy answer to that; it really does depend. And I think, to your point, if you have the core there
and a team that has a tool set that allows them to be flexible to address those needs,
I think is a great answer. All right, well, I think we're coming up on close to time here,
Costas, why don't you go ahead? I think we have time for another question or two.
Yeah, actually, we have two or three questions that are coming from the community. These are not questions that everyone has to answer; I'll try to find the right person for each one, but I think it makes sense to try and get some quick answers to these. And I'd like to start by asking
Brandon about ELT. It's something that is heavily associated with Fivetran. And many people argue that, actually, ELT is nothing new; it's something that has existed since forever. Help us understand a little bit better: at the end of the day, what's the difference between ETL and ELT, and how new is this ELT thing? Yeah, my thoughts on this are pretty similar, I'd say, to
George, our CEO's thoughts. Everyone's right: ELT is not new. You can use other tools to do ELT too.
You can script your own nightly dump: full load, dump and refresh. Effectively, that's what you're doing with ELT too. If we're just taking it at face value, taking everything in the EL and then transforming in the warehouse, you can do that with almost any tool. But the difference between what Fivetran does, the differential between what we do and the process taken at face value, is all of that backend stuff. All of the things of, hey, how can we make sure we're not just doing a full nightly pull and crashing all of your systems?
How can we make sure that we're using the most efficient way possible to actually read
this sort of information?
And how can we make sure that as your data updates over time, where I'm categorizing
updates as schema changes or updates to values in previous rows or net new rows, how can
we make sure that all those changes are pushed?
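A toy sketch of the approach Brandon describes: instead of a full nightly reload, sync only the rows changed since a high-water-mark cursor, and upsert them so updated values and net-new rows both land in the warehouse. The table and column names here are made up for illustration, and this is in no way Fivetran's actual implementation:

```python
import sqlite3

# Hypothetical "source" database standing in for a production system.
source = sqlite3.connect(":memory:")
source.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, updated_at TEXT);
    INSERT INTO users VALUES
        (1, 'a@example.com', '2022-01-01T00:00:00'),
        (2, 'b@example.com', '2022-01-03T09:30:00');
""")

warehouse = {}                        # id -> row, standing in for the warehouse table
cursor = "1970-01-01T00:00:00"        # high-water mark left by the previous sync

def incremental_sync(cursor):
    """Pull only rows changed since the cursor, upsert them, advance the cursor."""
    rows = source.execute(
        "SELECT id, email, updated_at FROM users "
        "WHERE updated_at > ? ORDER BY updated_at", (cursor,)).fetchall()
    for row in rows:
        warehouse[row[0]] = row       # upsert covers both new rows and updated rows
    return (rows[-1][2] if rows else cursor), len(rows)

cursor, pulled = incremental_sync(cursor)     # first run pulls both rows
source.execute("UPDATE users SET email = 'b2@example.com', "
               "updated_at = '2022-01-05T12:00:00' WHERE id = 2")
cursor, pulled = incremental_sync(cursor)     # second run pulls only the changed row
```

Schema changes, deletes, and exactly-once delivery are exactly the backend complexity this toy version ignores and that managed EL tools handle.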
And the evolution of the term ELT, from my point of view, will continue more and more.
Oh, sorry.
The value of the efficiency can't be overstated, because it'll only continue to be more important,
especially as we move towards more real-time analytics.
And part of that ELT too is also making sure that we're doing everything that Paul's talked
about.
And I think Paul has a very interesting point of view on the data tracking piece, right? All of that, making sure that you understand where your
data is flowing end to end, making sure that you're staying within compliance, making sure
that your data stays secure and adheres to ever-changing rules and laws as they relate to
data privacy, like GDPR or CCPA, any of those. I think ELT is nothing new, but the way that we're approaching it, and our interoperability (I really like that term) with the other tool sets, really enables the best, most efficient use of ELT, so that you have your data to work off of.
Yeah, yeah.
I totally agree with you.
What I think confuses people a little bit is that many things, probably like everything in technology, are not new. We keep reinventing things.
And there is a good reason for that. It's because we have different needs and different technologies that we have to work with.
That's why, I mean, the database was invented back in the 70s, but we are still building databases, right?
And we create new categories of databases.
The same thing is also true with ETL. The moment we had the database, we started needing ETL; we are still doing ETL, still reinventing ETL, and implementing it in different ways. So I don't think people should be
mad about using terms that have been used in the past or processes that we keep doing from the
past. We just need to understand that there are good reasons to iterate on these technologies and create a different version of it, right? At least that's
how I see it. But I totally agree with you. And I think it was a great explanation of why ELT is
something that we use more often today. Next, I'd like to ask Amy something which, being at dbt, I think she is probably the best person to answer. Traditionally, SQL has been a very, very hard language to apply all the best practices that software engineering has come up with, right? And it's one of the reasons that it was always a little bit of, let's say, a second-class citizen as a language, right?
One of the things that's really, really hard is testing. It's very hard to maintain a code base of SQL and do testing and unit tests and all that stuff that we have in software engineering.
How do you see this from the perspective of dbt?
Because my feeling is that dbt has added a lot of value on this and has changed things.
So I'd love to hear like your opinion on that.
And like, what, where do you think, what do you think is missing?
And what do you think has been solved because of dbt?
Yeah.
So, for folks that aren't familiar with testing in the dbt context: a test is any SQL statement, and the test passes if that statement returns zero rows. We use tests in dbt to look at quality, both data quality and just kind of unexpected
things that are happening in your data pipeline. And those tests can then either provide a warning to an analyst or someone that something
is unexpected kind of in this data pipeline, or also that something is broken with the
idea that you can kind of find those things before your end user of your pipeline finds
that.
Again, this is really a point of a lot of innovation in the community: people using the dbt testing functionality to implement things like test data sets that you can load in and see, okay, did this operation do exactly this at this step along the way? There are lots of folks who have posted ways and frameworks of doing this, of testing data using the dbt test framework.
So it's definitely something where we're seeing a lot of innovation and a lot of kind of great
ideas.
And we're kind of providing those tools to the community so that folks can kind of develop
that set of best practices.
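To make the pass/fail rule Amy describes concrete, here's a minimal sketch of the semantics: a test is a query that selects the bad rows, and it passes when zero rows come back. This uses SQLite and hand-rolled helpers purely for illustration; it is not dbt's actual implementation, which generates and runs queries like these from your project's schema files:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 19.99), (2, -5.0), (3, NULL);
""")

def run_test(name, sql):
    """dbt-style test: the query selects failing rows; zero rows back means pass."""
    failures = conn.execute(sql).fetchall()
    return name, ("PASS" if not failures else "FAIL"), failures

# Roughly analogous to dbt's built-in not_null and unique tests, plus a custom one.
tests = {
    "not_null_orders_amount": "SELECT * FROM orders WHERE amount IS NULL",
    "non_negative_amount":    "SELECT * FROM orders WHERE amount < 0",
    "unique_orders_id": "SELECT id FROM orders GROUP BY id HAVING COUNT(*) > 1",
}

results = {name: run_test(name, sql) for name, sql in tests.items()}
```

Here the NULL amount and the negative amount each surface as a failing row before any dashboard consumer would see them, which is exactly the "find it before your end user does" idea above.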
Okay, that's great.
I mean, what I would add is that people keep thinking that testing is still something really hard to do with SQL, and, as we said, so is applying the rest of the software engineering best practices. But this has actually changed. There's still a lot of innovation happening there.
But I think that with the introduction of DBT,
many of these problems are not that much of a problem anymore.
And it's more about trying to figure out
what's the best way to do it,
not whether we can do it or not, right?
So that's why I also wanted to ask you about that
because from my perspective,
it was one of the huge contributions of DBT
as a framework
to the community. Cool. Next one is Jason. I have a bit of a provocative question for him, but I cannot help myself; I have to ask it. Do you see a future where...
Before I ask that, many times during our conversation today, I think almost everyone mentioned that storage is probably one of the most important parts of this stack
that we are talking about.
Without storage, we have nothing.
Do you see a future where the data warehouse is not at the center of the data stack
and another technology like the lake house, for example, or the data lake,
or I don't know, something exotic that we don't even know about it might take this position that
the data warehouse today has in terms of how, let's say, important it is for the data stack?
Yeah. I mean, I think the data warehouse is, I don't know if it's ever really been in the center.
I think it's been on the edge, to be honest. Because if you ask people who have a data warehouse, do you also have a data lake? Most of them will say yes. I think if you ask them, do you have more data in your data lake or your data warehouse? Most of them would say most of the data is in the data lake. And in fact, the data that's in the data warehouse, it's not that it only resides there; it's basically just a replicated copy of data that's already in the data lake.
So I feel like the reason why people were attracted to data lakes in the first place was that it's an
easy, cheap way to put any type of data. So not just for the structured data that you might need,
but unstructured data like pictures and videos. I worked with one customer who was an advertising agency, and they were doing a campaign for a soft drink company,
and they wanted to look at social media pictures to see which pictures had that soft drink in it
and who were the demographics of the people in the picture so that they could better market their
product. And I don't know if you could do that in a data warehouse. It'd be really hard to do that.
And so I feel like those types of advanced use cases,
they're just going to get more and more.
And the people and the companies that can master that type of analysis
are basically going to get an edge over their competition,
which is why you see the top 1% of the Fortune 500 being the FAANG companies.
These are the companies that have mastered how to do this stuff.
And so I do feel like the data lake is not going to go away because that is the place to do this granular machine
learning type of analysis. And I do feel like now you can do high-performance interactive queries on the data lake; we at Databricks recently broke the data warehousing record with the TPC-DS benchmark. So we were able to prove that you can have fast queries on a data lake, and you don't need a data warehouse anymore.
So I think, if you wanted to consolidate to fewer things in that stack, I don't know why you would have to have a separate data warehouse for these things anymore.
Yeah, that's great. Exactly the answer that I was expecting, to be honest. Thank you so much. I'm personally very, very interested in this space, in what's happening today with all the innovation in data lakes. It's not just Databricks; you also see stuff like Hudi and Iceberg out there. There's a lot of innovation happening there, and it's very, very interesting to see what will happen in the future. So hopefully we'll have the opportunity in a couple of months to chat again about that.
So we are at the end here.
And I'd like to close our panel today with one last question that I'd love for you to answer with, I don't know, just one word if possible, right? And the question is: what next technology, let's say, makes you very excited? What are you waiting to see in the market around the data stack, and what would you like to share with our audience out there in terms of exciting things that are happening?
And let's start, Jason, with you.
I'm kind of interested in the governance layer for these stacks.
I think in the A16Z diagram, it's at the bottom somewhere.
But I think it's at the bottom because you kind of have to have it.
And all these different tools have got different ways of governing either their data assets or AI assets
or whatever. And so I think unifying that is going to be something that's going to be interesting.
And then just the sheer number of open source data catalogs that have come onto the scene in
the last few years, I think it's like half a dozen of them and maybe a handful of commercial
companies that are behind those,
like seeing how that plays out is going to be interesting because I'm a big fan of open source,
but I feel like it's hard to have like six open source projects of something that does the same
thing. You usually end up with like two. So it's going to be interesting to see how that whittles
down. All right. That's interesting. Amy, what about you? Yeah, I think the idea of headless BI.
So, the idea of keeping your metrics in one layer that can then feed all kinds of various BI tools, including those sets of BI tools that are very specific to industry use cases.
So I think that's going to be really interesting.
So everyone in the organization kind of being able to interrogate data using tools that are real specific to kind of their use case,
but all kind of in one source of truth. Okay. That's very exciting. Brandon.
I have a very similar answer to Jason's, maybe with a slightly different approach. I think a lot
of these data cataloging tools are very interesting to me because one of the problems that I see with
a lot of analytics teams is just the rate of onboarding, understanding what tools they're working with, understanding what field definitions are.
So any tool that could solve for that, whether it's a kind of explicit catalog tool or some
other data dictionary evolution, it would be fantastic to have.
And last but not least, Paul.
I'm split, honestly, because personally, if I didn't have to think about value to the business, I would say bespoke ML, and the people and companies like Hugging Face or something like that, which are delivering easy ML that you can just deploy, via SageMaker, via whatever tool, and get back fast answers to your questions. But I think that's getting at my real answer, which is democratization. Similar, I think, to Amy's answer, but allowing each person to ask the questions that are most pressing to them without friction. So, giving them
access to the most relevant parts of the data
without confusing them with irrelevant pieces.
Okay.
Thank you.
I'll give the microphone to Eric.
I really enjoyed this panel today.
I hope that you also had fun, and hopefully our audience will be a little bit wiser after today's panel and understand a little bit better what the modern data stack is. Eric?
Ah, yes. Well, the live stream listeners got to experience that, but the podcast listeners won't, because we'll edit it out. This has been amazing.
I learned a ton. And I think one of the big takeaways is that it's an evolving question
and we're all working on some pretty hard stuff here, but some pretty exciting stuff that's
enabling all sorts of interesting use cases, technologies, and job roles inside a company.
So thank you for your time. Thank you for helping us understand all of these things on a deeper level. And we'll catch you on the next show.
Have a good rest of your day.
What a treat to hear from so many great minds in the data space. I think one of my big takeaways
is that the data stack is changing. It has changed, right? If you think about five
years ago, there were tools that didn't even exist, right? And now a tool like DBT is a key
piece of many data stacks, right? It's changed drastically. And then hearing Timothy talk about how he views the modern data stack in the context of investing in companies, it was so interesting to me that he said, you know, you can kind of apply that term to like so many different pieces of the puzzle here or to the whole.
And so I think part of the dynamic nature of the stack and how it's changing and increasing in complexity makes it a pretty hard term to nail down.
And I really loved how, when we asked questions to Paul from Hinge, who's doing this work every day, he's a brilliant guy, which was clear from his answers.
But it was hard to answer these questions, I think, because of some of those reasons.
So it was really helpful.
And I think it's, sure, it's a marketing term,
but I also think all the factors
of sort of the dynamics and complexity
make it hard to nail down.
What do you think?
Yeah, first of all,
using marketing terms is not a bad thing.
I mean, there is a reason
that we have marketing out there
and it's not like just evil reasons
behind it. Whenever we build something new or we are, let's say, reinventing something,
we need to also invent new terms and new language so we can talk about it. So the modern data stack is another attempt towards that, nothing else. But it's very strong evidence that something is happening in this
space. And that's what we have to keep in mind. There are so many things happening. As Tim mentioned at some point, you see so many new categories appearing every day. Things that were just monoliths in the past are not monoliths anymore; the products from the past are broken down into a large number of other products.
I mean, that's a good thing.
That's an indication that
many very smart people
are trying to figure out
how to make things better, right?
And that's why I think that
even if we manage today to give a definition of what
the modern data stack is, probably like in a couple of months from now, it's going to be at
least slightly different, right? And that's fine. It's an indication, a testament, I would say, that we are living in very exciting times for anyone who's in this space and working with data. So that's what I take from our conversation today.
As always, a very thoughtful and concise reflection from Costas.
Of course, of course.
All right.
We'll have another panel coming up in early 2022,
which will be super exciting.
So we'll let you know when that's going to come up
so you can register for it.
Yeah, and that was our first one.
So if anyone who was listening to it has any suggestions or criticism or whatever, please reach out.
Oh yeah, you can go to datastackshow.com, and we have a contact form on the site now at datastackshow.com.
So please, we'd love your feedback
and ideas on a live stream that you'd want to see.
Yes, please come. We want to be friends. Absolutely. All right. We'll catch you on the next show. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on
your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric at datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by RudderStack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.