The Data Stack Show - 30: The DataStack Journey with Rachel Bradley-Haas and Alex Dovenmuehle of Big Time Data
Episode Date: March 24, 2021

On this week's episode of The Data Stack Show, Eric and Kostas are joined by the co-founders of Big Time Data, Rachel Bradley-Haas and Alex Dovenmuehle, formerly of Mattermost and, prior to that, Heroku. At Big Time Data, they work together to provide companies with the ability to derive value and insights from decentralized datasets, improve business processes through data enrichment and automation, and build a scalable foundation to enable a data-driven culture.

Highlights from this week's episode include:
- Rachel and Alex's background and their goal to make data approachable for companies everywhere (3:09)
- The data stack journey: making decisions when you're small that allow you to grow with your data and your organization (12:28)
- The problems faced when a data stack isn't nurtured early on (15:59)
- Changes in data stack technology (21:32)
- How Alex and Rachel's roles at Big Time Data differ and interact with each other (39:00)
- Client use cases (43:34)
- Comparing the stacks of seed-stage startups, mid-sized companies, and giant enterprises (48:54)

The Data Stack Show is a weekly podcast powered by RudderStack. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
The Data Stack Show is brought to you by RudderStack, the complete customer data pipeline solution.
Thanks for joining the show today.
Welcome back to the Data Stack Show.
Eric Dodds and Kostas Pardalis here.
We have an exciting guest and a surprise guest on today's show. The exciting
guest is Alex, formerly at Mattermost, who was actually our very first guest on the show,
exactly 30 episodes ago. Alex has since started a consultancy called Big Time Data,
along with Rachel. And both of them are going to join us on the show today to talk about the data stack journey, how the data stack changes over time. I'm very interested. My burning question
for them is, we see this all the time in our work and from people on the show, companies of
different sizes have different requirements around the stack that they build for customer data
infrastructure. And I want to know which tools stay the same throughout the entire journey from, you know, sort of person in a garage startup,
all the way up through enterprise level. So that's what I'm going to ask. Kostas,
what's on your mind? Actually, we're very aligned on that. That's something that I think both Alex
and Rachel are like the perfect people to chat about how from this very chaotic market of data
related technologies right now, patterns emerge and
what these patterns are and how people can use them to navigate the whole process of building
their own data stack. And there are, of course, differences between the different types of
companies, the different problems that they're trying to solve and scalability, but there are
also many commonalities. So I think we will be able to tackle this today. And of course, I'm
even more excited because Alex was the first episode of this show,
but also my very first episode for a podcast ever.
So I'm really, really happy to chat with him again.
Great.
Well, let's talk with Alex and Rachel.
All right.
We have our very first podcast guest ever from the first episode back on the show and have added another
special guest. So Alex, who was at Mattermost when we talked with him last, and Rachel,
who was also at Mattermost at the same time, have joined us to talk about the data stack journey.
Thank you so much for joining us. Oh, you're welcome. Glad to be here again.
Yes. Glad to be here for the first time.
Well, we had a great episode with you, Alex, talking about all sorts of interesting things.
I guess it was, wow, six or eight months ago now. So time flies when you're stuck in a house,
right? Yeah. Yes, indeed. Yes, indeed. Well, why don't we start out? We'd love to just get a little bit of background. So
you had different roles at Mattermost, but worked very closely together. But we'd just love a little
personal background on your history, how you ended up at Mattermost, and then what you're doing today,
which is Big Time Data, which we want to hear about as well. So Rachel, why don't you start. And Alex, of course we want our
new listeners to hear your story, but we'll let Rachel start. Yeah. Yeah. So background is in
industrial engineering at the University of Michigan, rep that to the day I die. So go blue. And really
what ended up happening: I graduated, went to Cisco, and, really honestly, I'm very lazy in the way that I never liked to do the same thing twice.
So got really into automation, data, how do you scale all that, and then wanted to go deeper on
a technical level. So ended up moving over to Heroku, which at the time was a subsidiary of
Salesforce, and did a lot of data analytics, data engineering, ended up spending a little bit more
time on the operations side. So my role grew into
really understanding how all the data in the data stack can be used to drive go-to-market motions,
automation, and scalability. And then after that, I kind of felt like I had outgrown my role there
and decided to take a risk and go to a smaller company with Alex. So we ended up going over to
Mattermost and really starting from there,
just understanding how do we start a data infrastructure from scratch, basically,
using some open source technology, some new tools we had never used before. And then also,
how do you help an organization adopt a data-driven culture and really embed that in
their day-to-day? So that's where we're at at this point. And then,
you know, once Alex is done, we'll talk a little bit about how Big Time Data came about.
Just as far as my origin story in this whole thing, I come from, you know, computer science,
full stack developer kind of background. It was really at Heroku that I got into all the data engineering things and basically modernized their data stack,
which at the time when we were there early,
and this was like six years ago, five years ago,
they were using like bash scripts to run stuff.
And like, it was just a total nightmare.
They were using Postgres as their data warehouse.
So we migrated all that over to dbt, Airflow, and Redshift.
And that's where Rachel and I first met, at Heroku.
And, you know, she was doing the analytics and operations stuff.
And it ended up being like, at one point, just basically like the two of us were tackling
all this stuff by ourselves.
And we ended up, you know, building teams there and everything like that. So then when we moved to Mattermost, like she said, they really had no infrastructure at all, so we built that all up from scratch. And then, as far as the evolution of Mattermost, it's like, we built all this stuff at Mattermost, we could see and show the value of, you know, the architecture that we were using and the technologies we were using and
how we were using it. And then what we started to notice was, you know, there's all these other
companies that have the same problems, right? They all have a bunch of data. They don't know
what to do with it or how to get the value out of it.
And so that was really what started to spark the idea of Big Time Data and us kind of going out on our own and, you know, actually spinning up a consulting company. Because like
what we really want to get to is building this for a bunch of different companies. Like
I want every company to succeed, right? I just want all of them to be able to like harness the power of their data and make their company the best it can be.
Yeah.
Just to add to that, I feel like one of the things that's really hard is people talk a
lot about data and how to build these state-of-the-art data stacks.
And, you know, for us, it feels very approachable, right?
Because we live in that every single day and just thank goodness our parents were smart and therefore we became smart as well. So it seems very intuitive,
but when we went to Mattermost, I remember thinking, oh gosh, they're going to know we're
frauds. We're really not as great as they keep saying we are sort of thing. And we got in there
and just the smallest things that we would do where we would say, oh yeah, you're just going
to do this and throw this on top of it. And it's straightforward. And they were just thinking,
oh my gosh, you're a godsend. Like, this is amazing.
I never would have done this. And you're kind of thinking, huh, that's weird. That's just something I thought everyone knew. And so as we've continued along and talked a lot about how do you scale your
operations using scripting and, you know, how do you really support self-service analytics and data governance,
we started realizing these are things that are not talked about enough, or there's almost a sense of, I'm too embarrassed to ask because it seems like everyone knows what they're doing.
And so from our perspective, it's like, we want everyone to be able to do that. We want to put
documentation out there. We want to have best practices. We want to make sure that people can
do these things because data is so important. And so that's really where my passion has come from,
from this. It's so many easy, small conversations that help people build confidence to take those
risks. And so, you know, that's one of the reasons why we're on this podcast right now is just making
sure people know all you have to do is take one step at a time towards your future goal. And it
really is approachable if you have the right people. Sure. You know, it's really interesting, Rachel,
that you mentioned people being afraid to ask questions because they think that everyone has
it figured out. And I was in consulting before joining RudderStack doing similar things, but more on sort of the MarTech side.
And it was so interesting. There's almost an imposter syndrome type dynamic in many companies
where you just have this sense that we're the only company whose Salesforce is really messed up
and who's having trouble cleaning our data and getting
insights. And the more companies that you talk to, the more you realize literally every company
has these same problems, right? It's pervasive and it's not because people aren't working hard
or they aren't smart, but technology is changing quickly. And when you have a quick growing company,
it's just, it's really hard to align both the organization and the tools and the data and
everything to make it work out correctly, especially if you don't have a playbook. So
that really resonates with me because I saw that all the time. It's just, you know,
it's almost like I'm embarrassed about the state of our situation. Yeah. And I have two things to
add to that. I think it's one of those things where
you end up finding, this is more of a psychology thing. I feel that people end up talking more
about the parts that they're comfortable with. Right. And so you all of a sudden have companies
that are doing one thing, right. But everything else is kind of crap and they're talking about
that one really great part, but you're comparing it across your entire system. And so all of a
sudden you have this perception that everyone else has everything great when in reality, it's just that one part
that they're talking about. And man, Alex knows he can get on a call and go so in depth with all
these different tools that I'm sitting there just nodding, pretending, you know, letting my imposter
syndrome get to me. But then I realized one of the reasons why Alex and I are such a great partnership
is because we don't need to know everything ourselves.
You know, we obviously have great friends over at Rudderstack.
We have great friends at DBT across the board everywhere we've been.
But that's why it's so great to have a community.
I know a lot about go-to-market motions and using data to drive that.
That's something that Alex isn't as strong about.
So it's just one of those things, you know, don't be too hard on yourself if you're not there yet and be realistic about what's really going on. And man, about the
Salesforce thing, a hundred percent, we worked at Heroku, which was part of Salesforce and we
were struggling to do it right. So I definitely, definitely get that one.
Yeah. I mean, this is, this is a little bit tongue in cheek, but I mean, it really is the reality.
But we used to joke, we used to ask people, have you ever seen a Salesforce that wasn't a mess?
Yeah, when you spin them up.
Really?
Oh, that's so good.
That is so good.
Okay, well, we have so much to talk about.
And I know Kostas has many questions.
So I'll kick off with the first question on our topic of the data stack journey. So we wanted to have both of you on the show because one,
you bring an interesting perspective of working together at multiple organizations on the data
stack, sort of from two different directions, you know, sort of the data engineering perspective and heavy technical
side from Alex's side, and then the ops sort of go-to-market alignment on your side, Rachel. And
going from Heroku, which is a huge, I mean, part of Salesforce, right? Massive company
to Mattermost, and now having consulted with a variety of organizations, you present a really interesting perspective on the best practices for building a data stack that will scale and how that needs to change over the life of an organization, right?
Because when you're just starting out and you maybe have a two-person company, your needs around the data stack are very, very different than when you get to the
size of a Heroku that's running inside of a massive enterprise like Salesforce. So I'd love to get the
perspective from both of you on what is the data stack journey? Just give us an overview of,
you know, from the perspective of a company just starting out to becoming a large enterprise,
what does the data stack journey look like? How would you define it? Yeah. So the data stack journey to me is like,
how do you build your data infrastructure in a way that can grow with your company as it's growing
and still give you all the value that you need while being efficient with costs and
like operational burden and that kind of thing. Because like you said, if you're a two-person
company, you know, having a bunch of different tools and, you know, a bunch of different
infrastructure that you're having to maintain is just going to waste your time when you should be,
you know, talking to customers or whatever. But on the other hand, it's like once you get to that Roku size,
you can really dig into optimizations that are only valuable because you're doing them over and over and over again.
And so this idea of the data stack journey is like,
how do you make those decisions upfront when you're small that allow you to grow with your
data and your organization and not shoot yourself in the foot where you're having to, you know,
spend a bunch of time doing rework or, you know, your analysts are just fighting data fires and
they can't figure out
why the data is wrong and all that kind of stuff. Yeah. And just to add to that, I think there's a
couple of different variables that come in when talking about that. You know, you're talking about,
are you willing to pay more to have more scalability because of limited bandwidth,
right? And so you're saying, oh, if I have two people and I have one tool that does it all, and maybe it's
a thousand dollars more a month versus, you know, five different tools, if you start to think, okay,
how much time is it going to take to move between them? If there's an error or something needs to be
debugged, how much longer is it going to take? Because you have to look at five different tools.
The other thing that you brought up, Alex, which I think is so important is, you know,
if you think about where you are now and where you're going to be in a year, five years, and so on, you have to think about the cost it would take to move from one to the other, right? So right now you might say, oh, it's $1,000 more a month, it's not worth it. But one year from now, if re-engineering it is going to take an entire engineer-month, is that more than $12,000? So you start to have,
like, in my mind, from the operations perspective, I start to think about the dollar amount and the
cost of an engineer's time and honestly, the morale, right? You want to keep these people
around. We all know the worst thing in the industry is losing someone when a company is so
small and they have all the knowledge that's, you know, that's a huge deal breaker.
You want to be using the tools that engineers want to be using and analysts. So you keep them
around and retain them because the loss of an engineer or an analyst is unimaginable. And I
would say, you know, close to $200,000, $500,000, depending on where you're at.
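Rachel's build-versus-buy arithmetic can be sketched as a toy back-of-the-envelope model. Every number below is hypothetical, just echoing the figures from the conversation; real tooling decisions obviously involve more than subscription fees.

```python
# Toy model of the trade-off Rachel describes: pay more per month for a
# tool that scales, or pay less now and absorb a re-engineering cost
# later. All numbers here are hypothetical.

def cumulative_cost(monthly_fee, months, one_time_migration=0):
    """Total spend over a horizon: subscription plus any migration work."""
    return monthly_fee * months + one_time_migration

# Option A: the consolidated tool costs $1,000/month more, no rework needed.
option_a = cumulative_cost(monthly_fee=1_000, months=12)

# Option B: no extra fees now, but re-engineering in a year costs roughly
# one engineer-month of fully loaded time, say $12,000.
option_b = cumulative_cost(monthly_fee=0, months=12, one_time_migration=12_000)

print(option_a, option_b)  # 12000 12000
```

On dollars alone, the two options break even within a year here, which is exactly why Rachel folds in morale and retention: the spreadsheet rarely settles it by itself.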
Absolutely. What do you think? And I'll ask one more question here and then the costs jump in.
And this may sound like a kind of an obvious question because we just see this so often,
right? With a growing company and you have good intentions and then, you know, you just don't
seem to have the time or the resources to do things the right way from the beginning.
Why do you think that happens? What are the main,
you know, maybe top two or three things that produce the downstream problems that companies
face if they aren't really careful about nurturing their data stack early on?
I mean, I think the first thing is going to be that as an organization, you're not going to have that muscle of, hey, when we implement this new feature in the product, we need to, you know, like track its usage in a decent way.
Right.
So that we can like have the insights, like, are they doing this thing right?
And, you know, so then what's going to end up happening is you're going to kind of end up like Mattermost, where it's like they had a data warehouse and they were running some queries on it.
But the data quality was kind of low.
Things were all one-offs.
And it just wasn't scalable at all.
And so then you have to really go through that whole migration process.
It's not only a technology change, it becomes a people change and an organizational change. And as a growing company, you're already dealing with so many challenges from that growth just in general, that having to deal with data growth and, you know, all that stuff just adds to it, right? So it's better to, if you can, and it doesn't even have to be crazy amounts of time that you're spending on all this stuff, right? You can just do a few things. And I think, you know, the more I've been thinking about it, it's like, can we as Big Time Data provide some tools? And this is kind of going back to the imposter syndrome thing: can we provide some guidance and tools and guides or something that can give people the confidence that, hey, I'm not totally screwing this up, even though I don't know everything about it? Like, I'm not an expert, but I know I need to do something. Right.
Yeah. And the other thing that I don't talk about,
obviously, Alex, you and I always think about these questions in different perspectives,
but I think that's great is from my perspective, I think the biggest impact is you have all these
brilliant people that need to be focusing on strategy and making sure that that business
is successful. You're going to hit that pivotal moment. And are you going to be ready to blast off? Or are you just going to be a dud? And if you have the leaders of your
organization spending their time questioning numbers, instead of focusing on strategy,
that's a big deal. You start to have a VP of marketing presenting numbers about, you know,
your pipeline, and there's a disagreement, all of a sudden you're spending the full day
trying to get ready for a board meeting, questioning how many MQLs you have instead of saying,
how are we going to present this? What are our next steps? What are we doing for the next year?
How is our product going to change as our customer evolves? Those are more important
questions than what's the definition of an MQL? Is our data from Salesforce coming in accurately? Do we have the right, you know,
triggers in our product to promote, you know, growth, all these different things, right? So
I think it's so important that you have data in the right place and the definitions and it's
trusted or else you end up spending these unaccounted for hours trying to figure those
things out. And no one tracks that anywhere. It's just something that comes as part of the job. And I think as soon as you realize that you wouldn't have to have as many
of those conversations, if you had invested a little bit upstream, you're going to regret not
having done it already. Absolutely. The board meeting scramble, that is probably a good topic.
That'd be great to collect war stories because you said that. And I think myself and
probably a lot of our listeners know exactly what you're talking about. Okay. I have one thing to
add to that. I'll just say, you know, this goes into the whole, I'll give a little shout out to
Michael Schiff. He's been a mentor and my boss and Alex's boss for a while, you know, you pay
now or you pay later is one thing he
always said. The other thing is you're training them or they're training you. And I will tell
you that at Mattermost, we've worked very closely with Emil, who's our VP of finance there. And we
have trained him how to go and get his own numbers, how to trust the data for all of the things that
he needs to present to the board. And I'll say the last board meeting, there was only one question that he had that he
reached out to me for getting for the board meeting and the ability for him to self-serve
when initially it was, you know, four to six hour calls trying to get him his numbers.
It's just been amazing.
It just kind of shows as you train them and as your data stack evolves, people are able
to trust the data and feel more confident going and getting it
themselves. Love it. That is really cool. All right, Kostas, I've been monopolizing
and I can keep going, but I'm not going to because I know you have a ton of questions.
Yeah, Eric, I think this is a common pattern lately on our shows, but it's fine. I mean,
you're asking very good questions anyway, so it's good.
For me, it's a very special episode today
because Alex was my first ever guest
in a podcast episode,
so I'm super happy and excited to have him back
and also having Rachel together
because they are both working on the data stack,
but they see it from a different perspective.
So I think it's a great opportunity
to have both perspectives at the same time. So let me start with a question about the data stack.
I mean, you've been working with data for quite a while, and you have seen the changes that have happened in the technology. So how has the data stack matured since your time at Heroku, or even earlier, if you have experience from before that?
And what are the tools that really excite you that exist today and didn't exist in the past?
Yeah. Yeah. It's crazy how much things have changed and it feels like it hasn't been that long.
And yeah, I mean, you know, going back to Heroku, like the early days, I mean, you're
talking, I already mentioned the bash scripts and stuff like that, but you know, the SQL
that we were writing, I mean, literally, and I'm not kidding, like thousand line SQL
files were not uncommon.
I don't know why you're complaining.
I really enjoyed debugging those scripts.
Yeah.
We had just amazing data quality too.
And so one of those tools
that I just preach the gospel of
everywhere I go is dbt,
which we started using at Heroku
three years ago.
Was it three or four?
Anyway, something like that.
And that was really like,
it was funny because it was actually
a data engineer on my team just sent me this link. He was like, hey, I saw this thing on Hacker News. And then we started looking at it and it was like, oh my gosh, we have to use this, what are we doing? And so then you go from, here's this thousand-line SQL file that I can't make heads or tails of (I mean, eventually I could, but every time you have to debug the thing, it takes you four hours just to remember all the nooks and crannies of the stupid thing), to, oh, I just have a 50-line dbt model and then a couple of other ones, and everything just works. It's amazing. So that's one.
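The jump Alex describes, from one thousand-line SQL file to a handful of small models that reference each other, is the core of dbt's design. Below is a minimal Python sketch of that idea, not dbt's actual implementation; the model names and SQL are invented, and real dbt resolves `ref()` to fully qualified relation names via Jinja.

```python
import re

# Each "model" is a short, named SELECT; {{ ref('...') }} declares a
# dependency on another model, just like in dbt. Names/SQL are made up.
models = {
    "stg_orders":  "select id, user_id, amount from raw_orders",
    "stg_users":   "select id, email from raw_users",
    "fct_revenue": "select u.email, sum(o.amount) as revenue "
                   "from {{ ref('stg_orders') }} o "
                   "join {{ ref('stg_users') }} u on u.id = o.user_id "
                   "group by u.email",
}

REF = re.compile(r"\{\{\s*ref\('(\w+)'\)\s*\}\}")

def deps(sql):
    """Models referenced by a given model's SQL."""
    return REF.findall(sql)

def build_order(models):
    """Topologically order models so dependencies build first."""
    order, seen = [], set()
    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for d in deps(models[name]):
            visit(d)
        order.append(name)
    for name in models:
        visit(name)
    return order

def compile_sql(name):
    """Substitute each ref() with the referenced model's name,
    standing in for the relation name dbt would compile it to."""
    return REF.sub(lambda m: m.group(1), models[name])

print(build_order(models))  # ['stg_orders', 'stg_users', 'fct_revenue']
```

The payoff is exactly what Alex describes: a change to one small model ripples through everything that `ref()`s it, instead of you hunting through a monolithic query.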
And I think the other thing
that has been really interesting is just the availability of tools that make dealing with
large amounts of data easy, like, you don't have to be a PhD person to be able to deal with big data anymore. And I think there's just been so much done there that it really helps, I mean, anybody, right? Anybody can deal with terabytes of data now, whereas before it's like, oh my gosh, I have terabytes of data, it's going to take me hours and hours to query into this stuff, and I don't know what to do. So I'll let Rachel add to that in her way.
Yeah.
I mean, one thing that you didn't call out is the biggest thing that we ended up changing
right away when you took over our data stack at Heroku, which was adding Airflow.
We used to have everything basically in one massive daily or hourly job.
And it would be like 50 in the hourly job and 120 in the daily job. They would lap themselves.
And it was utter chaos.
One thing fails, you have to kick it off by itself and have to track it to make sure it
finished.
It was terrible.
And so just getting Airflow and all of that going was a huge game changer for us.
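What Rachel describes, one task fails and you have to kick the whole monolith off by hand, is exactly what a DAG scheduler like Airflow fixes: tasks declare upstream dependencies, so a failure only requires rerunning the failed task and whatever sits downstream of it. A minimal sketch of that idea (task names are hypothetical, and no actual Airflow is involved):

```python
# Tasks declare their upstream dependencies, forming a DAG. On a failure,
# only the failed task and its downstream tasks need to rerun, rather
# than the entire daily job.
graph = {                      # task -> upstream dependencies
    "extract": [],
    "load": ["extract"],
    "transform": ["load"],
    "report": ["transform"],
}

def downstream_of(task):
    """Everything that depends, directly or transitively, on `task`."""
    out, frontier = set(), [task]
    while frontier:
        t = frontier.pop()
        for name, ups in graph.items():
            if t in ups and name not in out:
                out.add(name)
                frontier.append(name)
    return out

def rerun_set(failed):
    """On failure, rerun the failed task plus all of its downstream."""
    return {failed} | downstream_of(failed)

print(sorted(rerun_set("load")))  # ['load', 'report', 'transform']
```

With the monolithic hourly job there was effectively one node in this graph, so any failure meant rerunning (and babysitting) everything.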
And then, like Alex mentioned,
we started talking about dbt. And from my perspective, I don't think I conceptually
understood what it was doing at first. I viewed it as cool. It's a different way to organize your
code, yada, yada, yada, huge investment. This just feels like a tool that an engineer wants to use
because they're bored of their day-to-day job and they want to have a new tool to mess around with. I was so wrong. I think I was
very busy at the time and I didn't really take the time to really understand what it was. And when we
ended up moving over to Mattermost, because at that time I had a team of four and was very heads down
more on the analytics side. When we moved over to Mattermost, where we had basically no code and we were starting from scratch,
seeing how great it was to build these dependencies on top of each other and have it be so clean,
where if you just need to change one small piece of logic at the granular level,
it scales and moves throughout your entire basically data model. And so being able to see
that, I'm so glad that they still invested in it at Heroku, even though I wasn't a huge proponent of it. And I think it's one of those things where, once again, you pay now or you pay later: you're going to have technical debt, and I think dbt really helps you manage that. It really limits how complex your technical debt will get. So big fan of dbt as
well. And then, you know, I'm just a huge Looker fan girl. I can't help it. That's always been
something I've been very lucky with since we went to Heroku. Heroku had it from day one. When I did
my interview, I did stuff in Looker. I don't think I ever want
to live in a world that doesn't have Looker available for it. I just think in terms of
how they've turned a visualization tool more into a data governance tool as well that allows
self-service scalability has been a game changer in terms of making sure analysts can focus on the
important things and not become report monkeys, right? That's everyone's biggest fear being an analyst
is do they think I'm a report monkey or do I really get to drive change in the business?
And so I think Looker has enabled analysts to focus more on driving change, diving into the
data because you now do have people like VP of finance feeling comfortable going in Looker
and pulling data themselves with confidence. Yeah, that's a great point, Rachel.
I think, and I've said that in the past, that the most successful tools, in the end, don't just add value or simplify processes.
They actually promote organizational change.
And that's a very good point about Looker.
And I'm happy to hear that from you.
So from what I understand, some major changes that happened in this space are things like orchestration, which we talked about, modeling, and composability in SQL, something that was missing from the language for a long time. So my feeling is that many of the standard, let's say, DevOps or software engineering techniques that software engineers have been using for quite a long time are now entering this space, and that's a sign of maturing: okay, let's adopt methodologies, techniques, and technologies that have proved to add a lot to productivity and the way we work. What else do you think is going to be introduced? I mean, there are things like CI/CD, there are things like testing, especially, I think, testing. It's still, I mean, dbt is doing a lot of things around that, but I think it's still an immature side of the data stack. So what do you
think is going to be the next big thing, let's say, that is going to be introduced in the data
stack and is going to have a lot of impact in the everyday work of someone
who is managing and building these data stacks?
From my side, I think there's two things.
And one you touched on, which is testing,
which really is more about data quality, right?
And, you know, you see things like Great Expectations; you know, dbt has some testing stuff built in,
and they even just came out with a Great Expectations package that you can use. And I really think that's going to be, you know, like you said, bringing in actual software engineering techniques: you want to have unit tests, that kind of thing. It's been really hard to do that in your data warehouse. And so then you end up in that situation
where your VP of finance comes to you and is like,
hey, this number doesn't make sense that I'm seeing.
What's the deal here?
And then you're having to go trawl through all your data
trying to figure out what the issue is, right?
So I think that's going to be one thing
that's going to really take off.
And it should take off. Like, that should just be the way that data warehousing and data engineering is done. It's like, okay, I've developed my model, but now I need to test it and make sure that it is the way it should be. And then the second thing that I think is interesting, and I want to learn more about it and get more into it,
is getting to more real-time analytics. Not only analytics, but also doing stuff with all of the data that you have in your data warehouse that triggers something to happen in real time, whether it be marketing or something in the product, things like that. I think that could really be interesting. Like, you know, you look at Materialize, where they can basically ingest all this data from a Kafka stream, and you can write a SQL statement on top of it that, you know, updates basically in real time. It's like, what if instead of your dbt models, you know, you're having to run them incrementally every hour or whatever it happens to be, what if they just always were up to date? What if that just automatically happened? I think that would be really cool. And that's something that I'm keeping an eye on and trying to learn more about and see what value we can get out of technologies like that.
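Materialize's actual interface is SQL over streams like Kafka; as a language-neutral sketch of the underlying idea, here is a tiny "view" whose result updates incrementally with each event instead of on an hourly batch schedule. The event shape and field names are invented for illustration:

```python
# Sketch of an incrementally maintained "materialized view": instead of
# re-running a batch aggregation every hour, the result is updated as each
# event arrives, so reads are always current.

from collections import defaultdict

class RunningRevenueByPlan:
    """Rough equivalent of SELECT plan, SUM(amount) ... GROUP BY plan, kept fresh."""

    def __init__(self):
        self.totals = defaultdict(float)

    def apply(self, event):
        # Called once per event, e.g. from a Kafka consumer loop.
        self.totals[event["plan"]] += event["amount"]

    def read(self):
        # Reading is cheap: the aggregate is already up to date.
        return dict(self.totals)

view = RunningRevenueByPlan()
for ev in [{"plan": "pro", "amount": 99.0},
           {"plan": "free", "amount": 0.0},
           {"plan": "pro", "amount": 99.0}]:
    view.apply(ev)
```

The design trade-off is the one discussed here: incremental maintenance trades batch-window staleness for an always-on process.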
Because I think that's where people are going to start really looking for stuff. Yeah, a few things that come to mind for me,
I think one of the things when I think about data, and it's been great that there's a huge
growth of companies that are really focused on the data engineer, right? I feel like for a while,
it was kind of, well, they just do what we tell them to do, make the data happen. We're not really going to invest
in tools for them and whatnot. And now I think there's huge importance on it, which has been
great. So that's why you see some of these companies coming out of nowhere with a bunch
of stuff to support them and really making sure they have what they need. But with more tools
comes more issues around integrations and timing. And you start to think
about, okay, well, I'm piping in my data with a Stitch or a Fivetran, then I have to run my dbt jobs, and then I have to send that data somewhere else. And if you don't have really great scheduling or orchestration, you're all of a sudden sending stale data out because your dbt job took too long to run and it's not timed up perfectly. And so you start dealing with, like, how do you make sure that everything's kind of talking to each other, so that it is going based on dependencies and all that? And then the other thing is tool consolidation,
because I do start to worry about how much of the data stack is going to be very piecemeal. And if something goes sideways, you know,
debugging that many tools can be very difficult. And so are you going to start seeing companies have more integrations with each other and, you know, talk to each other? You think about the
Salesforce idea where you have these different packages and installations and whatnot. Are you
going to see, you know, connections between a Stitch and a dbt, or a RudderStack and a dbt? And then
are you going to see dbt have connections to another tool that then is going to write to
Salesforce, all these different things, it feels like there aren't as many really strong integrations
there yet. So while it might not be a tool itself or a product, it's how do you make sure that all
these dispersed tools are talking to each other and have really great alignment?
Because if there's any gaps in that system, the data engineer and analytics and honestly, business as a whole will suffer.
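The scheduling problem Rachel describes is really a dependency-ordering problem: run ingestion, then dbt, then the downstream syncs, rather than three independent timers that can drift. A minimal sketch with Python's standard-library topological sorter; the task names are made up, not from any real pipeline:

```python
# Express the pipeline as a DAG so each step runs only after its upstream
# dependencies finish, instead of on separate cron schedules that can drift
# and ship stale data. Real orchestrators (Airflow, Dagster, etc.) build on
# exactly this ordering idea.

from graphlib import TopologicalSorter

# Each key depends on the set of tasks listed as its value.
dag = {
    "dbt_models": {"ingest_salesforce", "ingest_product_events"},
    "reverse_etl": {"dbt_models"},
    "bi_dashboards": {"dbt_models"},
}

# A valid execution order: every task appears after all of its dependencies.
order = list(TopologicalSorter(dag).static_order())
```

`TopologicalSorter` also exposes an incremental API (`get_ready()` / `done()`) that lets independent tasks, like the two ingestion jobs here, run in parallel.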
These are some great points.
And actually, it's something that I was thinking about lately.
I mean, I totally agree with you, Rachel. I think that the way that it works right now with all the different tools and just adding more and more
tools, like for example, I think it's a very common pattern to see like companies using both
Stitch data, for example, and also Fivetran just because there are like different needs
for integration or they are trying like to control their costs and all these things. But it's good to have
many options out there. But the downside of this is that you end up having a stack that is
more fragile, right? Much more difficult to figure out where the problem is. And especially when we
are talking about tools that are cloud-based, right? It makes the whole process and trying to debug much more time consuming
and much harder, in my opinion.
But what I find more interesting,
and I would like to hear both of your opinion,
is here we are talking about data stacks
where the core of the data stack
is the data warehouse, right?
It acts as like the central repository
of all the data that we have.
And this is, of course, like a great architecture and it works really well.
That's why companies are adopting this.
But as we add more and more tools that they have to interact with it, and especially when
we are talking about real time, right?
The utilization of the data warehouse is probably going up, right? And one of the selling
points of data warehouses like Snowflake or BigQuery is that you can control your costs
because you pay as you go, right? Like you have to execute a query and then you're going to pay.
But we are reaching a point where I don't see, or I don't feel, that the data warehouse is going to be sleeping a lot, to be honest.
So at the end, we might end up in a situation
where the data warehouse is just working 24-7.
And optimizing the costs around that,
from my experience, at least, is not the easiest thing to do.
So two questions here.
I mean, first of all, I'd like to hear your opinion on that
if you agree with this.
But how do you think the data stack is going to evolve to address these things,
especially with the position of the data warehouse?
And how do you think that the data warehouses like Snowflake or BigQuery
or even Redshift can address and adapt to these new challenges?
Because, okay, traditionally, data warehouse,
it's not something that should provide responses in real time, right?
It's not something that it should be like working 24-7 naturally.
And that's how these systems were designed.
But the industry has different requirements right now.
So what do you think about this?
Yeah, so I think with Mattermost, basically, we do have an extra small virtual warehouse running basically 24-7, like you were saying.
And that's just kind of been like, well, we just kind of have to have that going, you know, all the time.
And that's just the way it is.
I think the, you know, we have spent a lot of time actually optimizing our Snowflake costs at Mattermost. And, you know, it's anything from just like warehouse optimization,
as far as like what jobs are you running against which warehouse and,
you know, how often you run them and all that kind of stuff to even
optimizing queries, right?
Because if you can take a query runtime from, you know,
10 minutes on an extra large warehouse to five minutes on an extra small,
you know, at least in Snowflake land, that's going to be quite a cost savings.
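To make that arithmetic concrete: Snowflake bills credits per hour by warehouse size, roughly doubling per size step, with an X-Small at 1 credit per hour and an X-Large at 16, per Snowflake's published rate table (check current pricing before relying on these numbers). A quick sketch of the savings in Alex's example:

```python
# Rough credit math behind the example above. Credit rates double with each
# warehouse size step; X-Small = 1 credit/hour, X-Large = 16 credits/hour.

CREDITS_PER_HOUR = {"xs": 1, "s": 2, "m": 4, "l": 8, "xl": 16}

def query_credits(size, minutes):
    """Credits consumed by a query of the given duration on the given size."""
    return CREDITS_PER_HOUR[size] * minutes / 60

before = query_credits("xl", 10)   # 10-minute query on an X-Large
after = query_credits("xs", 5)     # same query tuned to 5 minutes on an X-Small
savings_factor = before / after    # how many times cheaper the tuned query is
```

Under these rates the tuned query is 32x cheaper, which is why query optimization and warehouse right-sizing dominate Snowflake cost work (billing also has per-minute minimums and auto-suspend behavior this sketch ignores).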
So, you know, I think the, going back to kind of the Materialize idea,
I think is why I get a little bit excited about that too, because it's like,
can you use Materialize more as like your real-time data store that, you know, gives you that real-time
access. And then, you know, behind the scenes, you just are doing your regular things with
Snowflake and all that. Real quick on that, Alex, like just to go back to what you're saying,
because, you know, I think Materialize is a great option, you know, as that space continues to
evolve. But for right now, right, you start to think about and tell me if I'm completely wrong, because once again, this is why
I feel very lucky to have you. But we basically have that extra small warehouse running all the
time, which is then dumping data, obviously, like modeling this data, bringing it in. And then we
have Looker going against different, more powerful warehouses that in the moment, if someone is querying something using Looker or whatnot, you know, that we're paying more.
But that one's not up all the time because you're taking care of all of your piping data in and modeling in a different warehouse, which is up a lot of the time, but they're smaller.
And then we have a bigger warehouse that maybe is running more complex stuff, but that's only running as a user needs to access it.
Right. Yeah, yeah, exactly.
And I mean, that's where you sort of have to I think.
People who are new to Snowflake really need to understand how that pricing model works, because you can kind of rack up a lot of cost if you aren't a little bit careful with it.
The other thing I think is like, you can look into, you know, like Snowflake has their Snowpipe,
which is a lot less money as far as getting the data into the data warehouse. And then I know
BigQuery has ways for you to stream data into the warehouse as well, you know, for less cost. So,
you know, I think at the present moment, like Rachel's saying, that's sort of where we're at with all this stuff, and you just sort of have to play the game in that way. And, you know, as far as moving forward, I think we'll see more stuff like your Snowpipes and things like that, where it's like, okay, you can optimize your cost for sort of a subset of the use cases that you need it for. That's great. I have two more questions, and then I'll give the microphone back to Eric. I know he also has a lot more questions to ask. So you are both working on the data stack, but on different parts of it, right? And obviously,
you have been very successfully working on that, like all these years. Can you describe a little
bit more how your roles differ and how you interact with each other? Yeah, yeah, for sure.
So yeah, I mean, basically, the way I've kind of been seeing it is, I do whatever Rachel needs me to do, like the go-to-market things and operations and analytics things, to make it work. I'm kind of like the plumber. I also considered myself, I was building the Legos, and then she would put them together, is kind of the way I would think about it.
And that's why, like, I mean, honestly, that's kind of why we, you know, started this whole
thing is because we, like, it just, like, we just work well together and it really,
there's no gaps, right?
Like, if we were both just hardcore data engineer people, it's like, okay, yeah, there's like
some cool stuff we could do. But together, we're able to like really have a huge
impact on organizations. And I mean, you can see just based on this conversation, like,
we definitely think about things in a different way, but it ends up like, fully forming the idea.
And, you know, and the solution. I think the thing that's been really great.
And one of the reasons why I even feel comfortable going into business with Alex is we just have such
a great level of trust with each other. I think what ends up happening is I take a lot of time
to understand the needs of the business and really think through where do we need to go?
What are the things that the business doesn't even know they need from the data yet? And how do we make that happen?
And so what ends up happening is I go to Alex and I say, Oh, what about this? What about that?
Brainstorming these moonshot ideas. And Alex is absolutely brilliant in my mind. I mean,
don't, don't tell him cause you know, I don't want his ego to get too big, but I think what's
really cool is we take these different tools and anything that doesn't exist, he's able to bring a custom aspect to it.
So from my perspective, I do everything from basically what I would consider analytics
engineering all the way through process flows and Salesforce, and helping marketing define
how they want to do their pipeline and sales forecasting and all these different things, right? So we really do meet at that overlap of right where data engineering
hands it off to analytics engineering. And I've been literally in the past two weeks, I think I
have really honed in on being obsessed with the concept of analytics engineer. Because I think in
the past, either I was oblivious to it or it really is that new.
I don't think that that concept is really there. I used to call it a hybrid analyst engineer.
And I think that's those people that have the ability to map business logic to raw data and
model it in things like dbt, is where there's going to be a lot of investment. Right? We have these very strong individuals, and those are the core people that enable self-service analytics. And so from that point on is where I focus, and Alex really does everything before. But the thing that's super important that Alex does is he knows how the data needs to be ingested and kind of initially modeled for that analytics experience.
And so you end up having, if you don't have a tight interaction and relationship between data
engineering and analytics, you have people just dropping data into the data warehouse and not giving a crap about what it looks like, honestly, the quality of it, or how it's going to be used.
And then it's so inefficient for analysts to try to query it and model it. And, you know,
there goes your Snowflake cost. If all of a sudden, you know, instead of writing a few different scripts before you dump it in the warehouse, you're just dumping it in there. And then next thing you know, you're spending a thousand dollars more on Snowflake for the analyst to try to model
it and create something of it. So I think in general, that overlap there and like empathy and understanding about what we
want to do with the data has really allowed us to grow in a scalable way. Yeah, that's great.
Some great points here. So based on your experience with Big Time Data, what are some common issues that you see from your customers, and also from your prior experience, in the communication between data
engineering and data analysts?
And do you have some advice to give around that?
And if you would also like, can you tell us, as Big Time Data, how you help with that?
Because solving the technology problem is one thing,
but the technology can do nothing
if the organization is not set up right around the technology, right?
So what are your thoughts around this?
And how do you approach it as big time data?
Yeah, so the clients that we've had,
it's been interesting because most everybody is like,
hey, I know we need to have a data warehouse.
So let's just use BigQuery or Redshift or whatever.
And they'll have some data in there.
But then they're like, OK, we have a data warehouse.
Great.
But it's like, well, hold on a second there.
You're not really getting any value from this data, really.
Or you're just running one-off queries on top of it.
So it ends up becoming this thing where we come in and we're like, okay, cool.
You have some data in there.
It's like, what are you doing with it?
And they're like, well, we're trying to figure that out.
And that's what we've really been helping with is like,
hey, let's get in there.
Let's understand your data.
Let's model it.
And then let's build like a scalable analytics infrastructure
on top of it.
And then, you know, you can get into the even more fancy stuff, as far as like, you know, marketing automation and all that kind of thing. So that's
what, I mean, that's pretty much what we've seen a lot. And, you know, like you said, it's like
an organizational change as well, because one thing that we're really sensitive to is just the
trust in the data that people need to have to use the data that you're producing. Because, you know, it's like, okay, great, you have all these fancy graphs and stuff, but do people actually use that data? Are people actually trusting that data? And so that's something we're really sensitive about, is making sure that, you know, if we come into an organization,
like we're not trying to just like build something,
leave, and then like nobody really uses it. It's like, we really want it to be used long-term
and build sort of that muscle within the organization on being that data-driven.
And, you know, like Rachel was talking about earlier with Anil at Mattermost, it's like,
you know, they're trusting all this data, they're using it for board meetings, and all that kind of stuff.
Yeah, I think the one last thing I'd add to that is, I think in a lot of these companies,
you know, obviously, it makes sense when you have a limited number of individuals, you have a lot of people focused on the product and engineering, and then you have kind of this
slim, quickly moving go to market area, right? You've got a salesperson, a marketing person.
They might have other roles in the company as well, right?
Especially when you're really small.
And so they don't necessarily have the ability
to hire a data engineer.
And what ends up happening is no one's taking a step back
and saying, like, what does this data mean?
How should it be used?
You have products that's saying,
I know I should be creating a lot of data. I know that someone's going to want to use it.
I don't know what they care about. I'm just going to create a ton of data and send it into a
warehouse. And then you have people on the other side that maybe don't have the technical skills
saying, I don't know what to do with this data. I don't know what it means. And so there's this
awkward gap. And so I think what ends up happening is that gap will continue to grow. And it makes it very hard,
once again, as we talked about, to make this data accessible to the modern person or like the common
person at a company. And so if you don't add that layer that we've talked about, that the analytics
engineer has where it's saying, I can take a step back and say, this is the raw data. This is how it maps to what a customer is
doing in our product. And this is what you should care about from a business perspective. Then that
just gap continues to grow. And so I think that's really what we've seen is people trying to make
sense of the data, but really not knowing where to start. And so while it's not always the
first thing that you hire at a company, I do think it's something that you should start moving when
you hire an analytics engineer up a little bit further, you know, like get that person in there,
build that business logic sooner rather than later, or else you might suffer the consequences.
That's great. I have many more questions, especially around your experience with big time data, but I think we will need at least another
one. So, which is good. I'd love to have you back, but now I have to give the stage to Eric because
he also has questions and I think Eric, I really monopolized the conversation here.
No, it's great. And unfortunately we're coming up on time here. So I will have, I'll just throw one more question out there to wrap it up. And of course, we would
love to have you back on the show, but so many great insights and we've talked some about tools,
but just give us a breakdown and maybe we can divide into sort of three stages of companies, but just thinking about our
listeners who are probably at all different stages of companies, but give us a quick breakdown
of what is the Big Time Data stack of recommendation, for maybe sort of a seed stage or, you know, seed stage, Series A startup, to sort of a mature, like maybe mid-sized company,
you know, maybe a hundred plus employees dealing with some serious data, multiple thousands of
customers to, I'm a gigantic enterprise, you know, like Heroku that's running, you know,
maybe inside of Salesforce. What's the ideal stack and maybe we can approach
it from the standpoint of what tools are the same across all three and then what tools are different,
you know, for each stage. Yeah. So I think across all stages, of course, it's going to be dbt, exactly. You know, that's why it's such a good tool to invest in early, because you can,
like, it's going to pay dividends for years, you know, working with it. And it's also,
if you want to use dbt cloud, it's super cheap. So it's not like you're, you know,
paying through the nose for it. I think that's one thing. I think, you know, I'm not as dogmatic
about which data warehouse you pick.
I know Rachel would have a different answer.
And in some cases, like, depending on your size, if you're really small, like, I don't even know if you need a Snowflake or a BigQuery.
It's like, if your data is small enough, and by small enough, I mean, maybe a terabyte total, which is actually, you know, a pretty decent amount of data.
You know, you could just run a Postgres database, like, who cares? And then obviously, once you do get to that sort of
growth stage, and you're a little bigger, and you can pay the money, you know, go with, I would say
Snowflake would be my 1A, and then BigQuery could be my 1B, if you're, you know, if you're all up in the GCP world.
And, you know, I think then what you're going to need is,
you know, the ETL tool, just, you know, a Stitch or a Fivetran or whatever.
Stitch is pretty cheap too, so you could get away with that that way.
And then you would need a, like,
you need to get product data into your data warehouse. So, a Segment or a RudderStack, a tool like that. And then you're going to need at some point your reverse ETL, which I would say is more like a growth-stage tool. And there's so many tools out there for reverse ETL that Rachel and I are still trying to figure out which one we like the best, but there's
so many players in that space at the moment.
Yeah, I was just going to say, I feel like one of the things is like across the board,
kind of like you said, from my perspective, it would be, you know, series A seed round,
depending on your type of business, you probably don't need a snowflake
or BigQuery, probably just Postgres, Redshift, something like that. But then you start talking
about the tools that are across all of them, definitely dbt. The other one, I mean, from my
opinion, I know this is the podcast, but RudderStack. The reason I would pick RudderStack for event streaming is really because that's going to scale with you. So we talk about how much energy or effort it's going to take for you to move from one product to the next as you grow. With RudderStack, I definitely just feel like it will grow with you
as you scale price-wise. You're not going to be put in a corner as you start sending more events.
So I do feel strongly about that one as well. And then the other thing is,
I don't know how much you really need a reverse ETL when you're that small, because you basically
have one salesperson that's manually entering leads and doing that stuff, right? So at that
point, I think very early on, it's maybe not necessary. But then as soon as you start having
two to three different people, you have a third-party tool like HubSpot or Salesforce, and you're really wanting to make sure that there's enriched data based off of product usage
in there. That's when you should start really investing in it. You got people like Census,
Polytomic, I know RudderStack's coming out with their new stuff. I think overall, that's a new
space that we're going to see a lot of growth in, in terms of how do you make this data accessible
in the places where people need it most, which is sales and marketing and all of those things.
I'm trying to think about if there's other stuff that's really missing there. I guess the last
thing is data visualization. Looker's not the cheapest product. I think it really helps you
later on dealing with data governance. But when you're really small, you could probably let dbt handle a lot of that. So I'd say when you hit Series B, I would start thinking Looker. Before that, you could probably deal with, you know, Metabase, which I think is open source. They also have a cloud version. What are the other ones, Alex, that you think of from a visualization standpoint?
Yeah, I mean, I guess there's Mode.
Yeah, there's Mode.
The thing I don't like about Mode is just like
you're having to put so much SQL
into it, which again, to your point
about dbt, it's like
if you can basically make your
Mode queries really, really simple
and then have all the complexity in dbt,
then I think that allows you to scale.
Plus then, if you do switch to Looker, you're already kind of like, you already have all these dbt models that
are, you know, being used, and you can basically just, it makes your migration process a lot easier.
Yeah, and when we say data governance, I think the biggest thing that we're talking about from
a visualization standpoint is there are tools that you write one-off SQL for every
single visualization you want to create. And what ends up happening is like we mentioned technical
debt, because if a single piece of business data changes and you have to go and update your
visualizations, are you going to want to go and update a thousand visualizations to add that one
piece of logic because you decided to write custom SQL for every single thing versus with Looker, you have your own
code behind the scenes, which is called LookML, where you define all your business logic, and then
it just flows into the visualization. So it ends up being much more scalable. And that's what,
you know, we're talking about in terms of data governance, where it's so much easier to scale and you can really trust because all of the logic is owned behind the scenes in a GitHub repository.
And you make one minor change, it has to be PR approved, analytics signs off on it,
you can really trust that data as well. Yeah, that's great. Some great points. And Looker got acquired, right? And my feeling is that as we are entering, let's say, a new innovation cycle in the data space, the way
we interact with the data or the requirements that we have around the data is going to change.
I think we will start seeing new BI or visualization tools that are going to address that.
And that's something that I'm really looking forward
to see what's going to happen in this space
in the next couple of years.
And the other thing that I would like to add
based on what you said,
just to summarize and also add my feeling about that.
I mean, right now we are in a period in time
where there's like crazy hype
around anything that has to do with data.
There are literally like products coming out every day in every possible function around data from
governance, pipelines. There's also like a big part that has to do with ML and AI, which we
haven't touched and it's still quite immature, but there are even new categories like that are
formed right in there with things like feature stores, for example. So there are way too many things happening. And I
think for someone who's trying to build a new stack, it's really easy to get lost in all these
details, make the wrong choices, or be overconfident about what you can do with your data. I mean,
I was in this position, right? Like I had five customers
and I was trying to do data-driven product development,
which, okay, doesn't make sense.
So that's why I think that it's a great opportunity.
And I would advise a lot,
like all these companies,
especially at the earliest stage,
to get in contact with you at Big Time Data,
because there are many pitfalls
and a lot of advice that you can give
to navigate this space
and help them get value out of the data faster
and reduce, of course, their costs
because when it comes to data products,
mistakes cost a lot.
So Rachel and Alex,
thank you so much for being with us today.
Pretty sure we will have another show for sure.
I mean, there are many, many more things
that we have to chat about,
more business oriented things,
but also like more technical things.
I think that one hour is just not enough.
I mean, you both have like so much experience
in this space that there's so much value
that we can give to our audience.
So I'm looking forward to have another show with you
in a couple of months.
Yeah, absolutely.
We appreciate you having us on.
And yeah, I mean, like you said,
like there's so much going on in the space.
It's like exciting.
And I think like Rachel and I,
I don't think we realized it
when we started Big Time Data,
but like how much fun we're having
just like being a part of sort of this community
as it grows and just learning all that stuff.
I mean, that's why we started Big Time Data because we were having these conversations
and it just, I remember thinking, oh my gosh, the conversations I have, you know,
two or three times a week are the highlight of my week. I love talking about this stuff.
And so it was just kind of surprising how much we knew and how much fun we were having. It
was like, what are we doing? Like, let's just make this our life. So it's been very exciting. And
honestly, it's a joy to come and be able to have these conversations. You know, we have these
conversations, Kostas, off of the podcast all the time with you. So it's been very fun to just be
able to dive into this. The last thing I would add is, because as you talked about, there's so
many tools that are coming out, right? And I think one of the things that Alex and I are really going
to try to hone in on are, what are those core components that you absolutely need? And then
what are those fun little add-ons to your data stack that depending on what you're trying to do
would help you, right? And so, you know, I think that's something that we could talk about in a
future podcast. It's like, what are the core pieces and what are some different add-ons that you should
start thinking through depending on what you want to do?
And if you have someone that wants to do AI and all these different things, it's just
really fun to think about it, but there's so many tools out there.
It's really hard to know where to get started.
Yeah, yeah, absolutely.
I think that's an excellent idea, actually, for content in general, but also for another episode together: how we can compile this landscape in a way that can
be easily digested by our audience and also give them some kind of, let's say, map to navigate
this and make the right choices of the right tools depending on their needs and the market
they are in and their use cases. So I think we should absolutely do that.
And I have to say that I'm really happy to hear
that you're having fun doing all this
because I think that's the best that can happen in life, right?
Having fun while delivering a lot of value
to many people and companies.
So that's great, guys.
Well, thank you again for having us on here.
Hopefully we'll be back soon.
Absolutely.
Thank you so much.
Thank you. As always, a fascinating conversation with Alex and now Rachel. I think the big takeaway
that I had was really just reinforcement of an idea that we've heard before on the show.
And that is that the tooling is one thing, And it sounds like it's just gotten way easier to
build a scalable stack, but the people running the stack really make the difference. And it's
their commitment to shepherding the data and shepherding the tools in a way that doesn't
create future problems for the organization, which just aligns with sort of what we want to
learn about on the show, right? The people who are behind the tools. I think this was a very unique show exactly because we had the opportunity
to have two people
that have a very symbiotic relationship.
We have the data engineering
and the operations from the other side.
And I think it became extremely clear
that the success of any kind of data initiative
inside the company relies greatly
on how these people and these functions
can work together.
And of course, with Rachel and Alex, they work really, really well together. But I think it's something that whoever starts trying to build a data stack needs to have in their minds
together with the technology. Absolutely. Plus, they're pretty funny. And it's great to have
funny people on the show. All right. Well, thanks again for joining us on the Data Stack Show.
Subscribe on your favorite podcast network to get notified of new shows and we'll catch
you next time.
The Data Stack Show is brought to you by RudderStack, the complete customer data pipeline
solution.
Learn more at RudderStack.com.