The Data Stack Show - 22: Season One Recap with Eric Dodds and Kostas Pardalis

Episode Date: January 29, 2021

Season One of The Data Stack Show is in the books, and in this episode, Kostas and Eric take a look back at some of the biggest takeaways, trends, and topics from the season. With some great guests already set for season two, the next slate of episodes is shaping up to take an even deeper dive into the world of data and the people shaping it.

Key points in the conversation include:
Patterns with data warehouses and data lakes (3:38)
Looking back at the people behind the data and their stories (8:12)
Minimizing flaws while remembering that data is built by humans, for humans (11:02)
Using proven technology and making mature solutions (15:20)
Data involves a significant amount of trust (23:38)

The Data Stack Show is a weekly podcast powered by RudderStack. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show, where we talk with data engineers, data teams, data scientists, and the teams and people consuming data products. I'm Eric Dodds. And I'm Kostas Pardalis. Join us each week as we explore the world of data and meet the people shaping it. All right, we are wrapping up Season 1 of the Data Stack Show. We had 21 episodes in the first season, which is amazing. It feels like it wasn't that many, but we actually talked to a ton of different people. Kostas, how do you feel now that you have completed the first season of a podcast that you
Starting point is 00:00:47 started? Ah, it was an amazing experience. Actually, for me, it was also my first experience with podcasts. And it was a much, much better experience than I expected, to be honest, full of surprises. And as you said, like a ton of people that we had the opportunity to chat with and many different people actually in these cases. I mean, when we started this podcast, the main thing we had in our mind was that we want to focus around data, right?
Starting point is 00:01:20 And I think one of the most amazing outcomes of all these conversations that we had in this season is that at the end, data is, at least today, pretty much behind almost everything. We had the opportunity, for example, to talk with companies that, on the first look, you don't think that they are a data company or they are working with data, but at the end, they are. If you remember with Slapdash, for example, the episode, which is a productivity tool, but at the end it's built on top of a ton of data to do that.
Starting point is 00:01:52 Same thing also with Panther Labs, where they are a security tool. But as we were discussing with Jack, the main outcome of the conversation was that security is a data problem. So, yeah, it was an amazing journey, actually, and full of surprises and amazing guests. And I'm really looking forward to continue doing this on the next season. Yeah, I agree. I think the concept of everything or the reality of everything running on data is really interesting when you look back at our shows. I'm thinking about sort of two sides of that coin, and I know
Starting point is 00:02:31 there are different ways to look at it, but when we talked with Bookshop and Mason, he talked about the data that they have to wrangle just in order to display a correct listing of books on their e-commerce website for books. And that's sort of core to the product, right? They have to leverage that data in order to deliver a product that provides a good experience. But then on the other hand, I'm thinking about our conversation with Jason from Bind in the health insurance space. And he talked a lot about delivering data products to other parts of the organization, right? So of course, they work on things that feed the core product, but then you have data driving marketing, you have data driving product, you have data driving sort of testing
Starting point is 00:03:25 in different areas of the company. So it's just interesting to think about data both driving sort of the core product use cases, but also being a product itself that's consumed by other parts of the organization. Yeah, absolutely. And if you think about it, I mean, it looks like data is everywhere, which makes sense. And I think that's another great, a little bit more technical outcome from all the conversations that we had, that there are some emerging patterns out there in terms at least of the architectures that the companies are using. I'll give some examples. Even from the first episode that we had with Mattermost, for example,
Starting point is 00:04:07 we saw this pattern of building everything around the data warehouse, right? Where you have like the pipelines, which is one part of the architecture. The pipelines are pulling the data from the different sources that the company has and all the data that they need. Push it into a data warehouse. The data warehouse might be something like Snowflake, for example. And actually, one big difference compared to the past, because it's a new pattern, but more and more people prefer to just extract and load the data inside the data warehouse
Starting point is 00:04:39 and implement any kind of complex, let's say, transformation logic on top of the data warehouse, which is a result of huge changes that have happened in the space of the data warehousing, mainly the scalability of the solutions in terms both of processing storage and of course cost, which is very important. And also these platforms have become more and more powerful in terms of the expressivity that they have, like the things that you can do.
Starting point is 00:05:07 And that's where it's a point where you see technologies like dbt come into play, right? Another very common layer that emerges inside companies in terms of like building the infrastructure. With dbt, you have a layer where you can transform and model your data with all the benefits around that. And then, of course, you have the consumption part of your architecture where you have your BI tools connecting there and using something like LookML again with modeling. That has to do with modeling, but this time LookML is more on the visualization part.
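As a rough illustration of the extract-load-then-transform pattern Kostas describes here, the following is a minimal Python sketch. It is not taken from the episode: SQLite stands in for a cloud warehouse such as Snowflake, and the table names (raw_events, user_summary), the sample rows, and the helper functions are all hypothetical.

```python
import sqlite3

# Hypothetical source records; a real pipeline would pull these from an app or API.
RAW_EVENTS = [
    ("u1", "signup", "2021-01-04"),
    ("u1", "purchase", "2021-01-05"),
    ("u2", "signup", "2021-01-05"),
]


def extract_and_load(conn, rows):
    """Extract + Load: copy source rows into the warehouse unchanged."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_events (user_id TEXT, event TEXT, day TEXT)"
    )
    conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", rows)


def transform(conn):
    """Transform: build a modeled table with SQL that runs inside the warehouse."""
    conn.execute("DROP TABLE IF EXISTS user_summary")
    conn.execute(
        """
        CREATE TABLE user_summary AS
        SELECT user_id, COUNT(*) AS event_count, MIN(day) AS first_seen
        FROM raw_events
        GROUP BY user_id
        """
    )


def run_pipeline(rows=RAW_EVENTS):
    """Run load then transform, and return the modeled rows a BI tool might read."""
    conn = sqlite3.connect(":memory:")  # in-memory stand-in for the warehouse
    extract_and_load(conn, rows)
    transform(conn)
    result = conn.execute(
        "SELECT user_id, event_count, first_seen FROM user_summary ORDER BY user_id"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(run_pipeline())
```

The point of the sketch is the ordering: raw data lands in the warehouse first, and the modeling step is plain SQL executed there, which is the role a tool like dbt would play in the stack described above.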
Starting point is 00:05:43 And that's a very common pattern that we see. I think it's pretty much, it exists in pretty much like every company that we discussed with. But there is also another pattern, let's say, that emerges which has to do with data lakes. And the data lake is something
Starting point is 00:06:01 that we see coming up more and more when we are talking with organizations that they incorporate data science in their practices. And the interesting thing, if you ask me, is that the first architecture pattern is more about building the infrastructure for the internal consumption of data. If you notice, Eric, we talked about BI, right? In the first version where BI is like 100% something that's going to drive your organization and then like other departments, like your marketing, your finance. With data science, there we see how data can become a product that's exposed to your own customers. I think this is the most powerful use case around data science. Of course, you can use data science to also do things that are consumed internally, like lead scoring, for example, right? Or do some forecasting for your sales or finance. But I think that the most powerful
Starting point is 00:06:57 part of data science and how it is used is like in building products that are going to drive like the customer experience. And there, the requirements are a little bit different. We had some hints around that in the first season, but we are going to have even more exposure to these use cases in season two. It's one of the things that I'm really excited and I'm looking forward to the next season. Yeah, I agree. I think that we didn't necessarily have a plan for a breakdown of the types of roles that we would talk to in terms of guests who work in the data space. But we did talk with several people who work specifically in data science. So I mentioned Jason from Bind.
Starting point is 00:07:43 We talked with Stephen Bailey from Immuta, who does a ton of stuff in data science, and Arian Osman, who works for HomeSnap. We talked with multiple people doing data science, and it is really interesting to look at that subset of the data world where there are different requirements than, you know, sort of your quote unquote standard, you know, pulling data, processing it, and then either sending it to tools or preparing it for BI. You know, one thing that I think is really interesting that really stuck out to me was we got the chance to,
Starting point is 00:08:23 how do I want to say this, sort of look at and discuss the people behind data and behind the technology in a variety of different ways. So first of all, I think one of the things that I've come to enjoy most about the show is that there's lots of cool technology. There's lots of companies doing really neat things with data, but the people who are doing those things have really interesting stories. So I think about Andrew from Ernest who, you know, works on real estate transactions and is doing some really interesting things there. And just hearing about, you know, his work as a marine biologist, you know, working, you know, in the ocean. And that was just fascinating. Stephen Bailey, who I mentioned before, has a PhD in childhood
Starting point is 00:09:14 reading or education. That's just really, really interesting to see that. But then the other angle of that that we saw, I think, is the human element in the actual work. So, you know, we talked with Duke Haba, who was at Cognizant and has just had a long career working in AI. And the way that he addressed it is really interesting. He talked about how people fear AI or they're skeptical of AI, you know, or AI produces a bad result. And so you tend to blame the technology, but he said, there are actually people behind that. We need to remember that. And then the last, or, well, there's many more, but the last one I'll mention that stuck out was Arian Osman at HomeSnap talking about building models for
Starting point is 00:10:01 sort of predicting the time it would take to sell your home, depending on the price, you know, so you have a slider, you say, okay, if I price my home at this number, it should sell in, you know, 25 days, you know, and of course the time lengthens if you raise the price, but we talked about how the model, you know, it's, it's hard to train a model to account for human perception around things like if the price is too low or too high, you know, people have questions. And you talked about ways that you can incorporate that. But all that to say, I just really enjoy, I think one of my favorite things is getting to meet the people who are doing this work and hear about their lives, you know,
Starting point is 00:10:43 kind of outside of their data work and the things that influence their data work, you know, whether that's prior experiences or other projects. And then also just the human element of actually doing the work of data and how, you know, technology still requires, you know, a real human element in order to produce a great experience for users. Absolutely. I think you are touching a great, great point
Starting point is 00:11:05 and a great insight that we tend to forget, I think, especially as the people that we're working in technology, which is that technology in general, regardless if it is around data or not, it is built by humans and it is built for humans. So it's not going, I mean, it's not the responsibility of the technology, right? At the end, if something goes wrong,
Starting point is 00:11:29 it's because our models or our architectures, they have some kind of flaw. And of course they are going to have a flaw. There's no way that we can build something that is going to be so complex and perfect from the first time that we get it public. Iterations are needed and anything that has to do especially with data and especially anything that has to do with data science and data analytics it's something that it's the productization of these practices are something
Starting point is 00:11:58 very new. So it will take time. It will take time both for the people responsible to build them, to build the best practices and the engineering practices on how to engineer something that is going to be the best possible solution and more predictable in the outcomes, but also for the people that are using these tools and products. They have to get educated, right? Technology is still technology. And data science model is a model that predicts something very specific.
Starting point is 00:12:31 It's not a human that we have in front of us that can adapt without the intervention of other humans. So it's a very interesting point. I think we need to always remember that, as I said, these are things that are built by humans and for humans. And we have a lot of work in front of us to improve them and figure out what's the best products that we can build, which brings me to another outcome of all the conversations that we had, which has to do with the maturity of this market and the industry.
Starting point is 00:13:03 We saw where there are like specific technologies that they are really mature right now. So for example, anything that has to do with data pipelines, right? We talked about ETL that became ELT. That's a pretty mature part of the stack where I would say the products are almost like commoditized right now, right?
Starting point is 00:13:21 And there are some parts of the stack that are very mature. Same also with data warehousing, right now, right? And there are some parts of the stack that are very mature. Same also with data warehousing. But there's also a huge, huge space for new products and a lot of opportunities. And we saw that with, for example, if you remember one of the first interviews that we had that was with Meroxa, with DeVaris, about CDC, right?
Starting point is 00:13:42 CDC is very hot. I mean, it's something that we see many companies right now trying to come up with solutions and products to address CDC. Then something that is super hot is anything that has to do with data governance, which, by the way, is funny because it's something that in the past, at least, it was you would hear like data governance and would be like, okay, that's something super boring that only Fortune 500 companies care about. And suddenly it becomes something that is relevant for everyone, right?
Starting point is 00:14:14 And you see many companies trying to address, we had Immuta, right, which is just one part of data governance and they address it and that's the access control. Then we had Iteratively, our last episode, where the guys over there are working on trying to figure out how to improve the quality and control the quality of data. So that's super hot and I'm pretty sure like in the next couple of months we are going to see more and more companies appearing trying to solve these problems. And of course, just to make the connection with the architectural patterns that we were talking about earlier, there's a lot of space for products around anything that has to do with what is today called MLOps, which is all the operations around how to productize ML, machine learning and data science.
Starting point is 00:15:04 And as I said, more about this on season two, but I think it's going to be very fascinating. And just to get to something that I think it's one of your favorite outcomes, it's about the importance also of boring technology, right? What do you think about this? Yeah, that came up on multiple episodes. When we were talking with people from Bookshop, they brought up the boring stack. Sometimes I'll ask for a stack breakdown just because it's interesting to see different ways that people shape their data stacks around the needs of different businesses. And it was interesting to talk with Mason. And his response was, we actually have a pretty boring stack, but it works. And I know that we have an episode coming out in season two with a company called LeafLink, and we have a similar conversation there. And I think in that episode, I won't give too much away, but we talked about the balance between building something that is scalable and reliable that works for your users and your company now with an eye to the future, as opposed to just adopting new technology, because
Starting point is 00:16:26 it's new, right, or it has some promise, like that's not always the best decision. And so we talked to a lot of companies who, you know, we certainly people are doing really interesting things. But in many cases, they say we have a really boring, but very stable and very efficient stack. And I just found that fascinating. Pushing the limits is really cool, but I think we saw several very mature developers and data engineering and data science practitioners who have seen a lot and have implemented a lot of things and really build for something that's going to be scalable, maintainable, and provide a great experience. But that's from my perspective. I'm interested
Starting point is 00:17:10 in your thoughts around that because you have the actual experience of building all sorts of different architectures. So how did that hit you that we saw that repeated over and over across episodes? Yeah, I think at the end, I mean, from an engineering at least perspective, okay, it's always, you know, people who are in technology, they always love to play with new toys, right? And in some cases, it's really hard to control the excitement of using like the latest shiny thing that came out there and promises to solve like another problem or an existing problem in a much better way. But if you approach this from a more mature engineering approach point of view, at the end, you cannot really build something new without having a very stable foundation, right? I mean,
Starting point is 00:17:59 you can, but probably you're going to end up having some really bad nights where you won't be able to sleep because things will go really, really wrong. And it will be super hard to find resources to solve your problems. So if you want my opinion, the best way when you are trying to build a new product and solve a problem that wasn't being solved before or in a new way, one of
Starting point is 00:18:26 the best choices that you can do is the foundations of your solution to be based on proven technology. So what we actually see with these companies that they have this approach is that you have some really mature and experienced engineering teams behind that they know that I shouldn't focus on trying to debug and understand how this new shiny thing works. But instead, I should free my mind and let it focus on the problem that I'm trying to solve with my own product. So, yeah, I think it's a great sign of maturity and good engineering practice at the end.
Starting point is 00:19:07 And as we approach and chat with more companies, I think this pattern will appear more and more. By the way, just to add something here, we might think on the other hand that, yeah, but look at Google, look at Netflix, right? They are using all these new shiny tools and they are building their own shiny tools to do that. But when we get into this thought process, we forget something very important that these companies, they are addressing problems at a scale
Starting point is 00:19:40 that it's completely new. They are really pushing the frontiers of technology and what is available right now. And at the same time, they have the resources and the talent to build new solutions that can address these unique challenges that they have, right? So from starting a company to becoming a company that has to deal with the traffic, for example, all the reliability that Netflix has, that's a huge, huge road ahead, right? So that's another thing that we always need to keep in mind
Starting point is 00:20:12 when we see these amazing companies using and building all these amazing products. Speaking of Netflix, I mean, it was a real honor to speak with Ioannis from Netflix and hear about all of the various challenges they have. But when we asked him about how they evaluate using or even building new technologies, I think it can be, we can have the perception easily that, well, there's a bunch of engineers at this company and they have tons of resources and so they can just try all these new things. That's not necessarily untrue, but the reality
Starting point is 00:20:50 is that they have a very balanced approach to making those decisions, especially around building something new. And there are multiple people inside the company involved in those decisions. You know, you talked about people from the business side, even, you know, sort of thinking through like, do these things make sense for us to build, right? We have a problem, it's creating a customer experience, it's creating, you know, some sort of friction and business process, but they don't have a, you know, they're not reactive in the way that they adopt or build new technologies. And so, yeah, it was really cool to hear about that from Netflix and hear, you know, it isn't just a free for all of exploring new technology. They actually have a very principled approach to the way that
Starting point is 00:21:37 they use new stuff. Yeah, absolutely, Eric. And also, we need to keep in mind the unique nature of a company like Netflix, right? I'm pretty sure that if we were talking with someone who is from the, I don't know, like the production teams that they have, right, that they produce the shows, for example, what we would hear would be a little bit different than what we heard from Ioannis. And there is a good reason for that, because at the end, Netflix, their main business is in producing content and shows, right? Technology is there to support that. And that gives also like a different, let's say, freedom and flexibility to the teams that they work to build the technical backbone of the company. If you go, for example, to a company like Snowflake, which, by the way, is the company where Ioannis works today. So maybe in the future, we should also try to have another episode with him and see how the two environments are different or the technologies are different and the products.
Starting point is 00:22:45 I'm pretty sure that at this point, Snowflake has some very strict methodologies and processes in terms of how to introduce new technologies or how to introduce new practice or even change the core product that they have. So it's very easy to get excited when we see just one announcement or one blog post from a company,
Starting point is 00:23:06 especially at that scale. But we should always try to remember what the company does, why does that, and that at the end, each company is different. And the culture is also different. And that all these things make the whole process of how to approach new technologies very different from company to company. Without saying that one is better than the other, right? At the end, what matters is the success of the product and the company. And it seems that even companies with very different approaches, they might succeed. So that's great. Sure.
Starting point is 00:23:38 Well, one last thing before we close out the season with a season wrap-up show. The last subject I wanted to touch on was the subject of trust. And we talked about this with multiple people. I think one of the first times that it came to the forefront of conversation, actually, and we didn't necessarily use the word trust, but I think about our conversation with Axel from Pool. And aside from him telling an amazing story about Paul Graham telling him his startup idea was horrible right to his face, which is one of my favorite stories from the season, he had a really simple but really powerful piece of advice for early stage companies where he said, you need to be
Starting point is 00:24:26 very diligent about collecting the data. But he said, I always, especially in the early stage, you know, sort of pre, you know, being able to have statistical significance and make decisions based on that. He said, I always use the data as a way to figure out which customers I should talk to directly in order to learn about how they're using my product. And that sort of brought up this idea that data involves a significant amount of trust, right? So both trust in the data, as Axel pointed out. And then we also talked with multiple other guests who talked about how that works inside of an organization. So that showed up in terms of companies where we talked with the data engineer who was, I think, about Stephen Bailey from Immuta, who said that there was just a huge lack of trust in data. And then that sort of colors the way that people think about data engineering and the operations around that.
Starting point is 00:25:23 And turning that around is a really significant effort. And then we also talked with Iteratively about the impact of their work around data governance for companies that adopt them. And they really pointed to trust as well. It creates more harmony between teams because everyone is really confident around the data. And I think another component of that that came up was the people who are consuming the data are making decisions with it, right? And so the more confidence and trust that they have, the faster they can move, the better that is for the business. So that was a really powerful topic, I think, in terms of summarizing a lot of the topics that came up around data.
Starting point is 00:26:06 What did you think about trust as sort of a summary of a lot of the things we discussed with our guests? Oh, yeah, absolutely. I think that, as you said, like trust both on the data itself and also on the teams involved, because keep in mind that data inside an organization is an interdisciplinary thing, right? It's super important. And that's why we see the emergence of data governance and all these companies that are trying to tackle the different aspects of data governance, right? From access control to quality of data.
Starting point is 00:26:42 And this is something that actually I think, I mean, I keep saying that, but I'm really excited about what's going to happen in the next season. But especially as we get inside MLOps, where, as we said, machine learning is not just about the impact that you have inside the organization
Starting point is 00:27:00 with the data, but it's also how to deliver innovative products to the customer. There, trusting the data, trusting the models, trusting the products that are built on top of the data, it's going to be huge. So we will see a lot of work that's going to be done on that. And it can be like a total disaster if, for example, your models or your data are exposed to your customers in a way that it might be perceived like something wrong. Like, I'll give you an example.
Starting point is 00:27:32 We all experienced what happened at the Capitol building, right? The recommendation engine on YouTube where, I think it was on YouTube, where there was a video from all these very sad things that were happening there. And the recommendation engine was recommending products that are related with survival and guns and stuff like that, which, I mean, if you think of that just from the perspective of the data itself, it makes sense, right? They are two related concepts, but they are two very wrongly related concepts at that specific time. So the more data is exposed and become like an integral part
Starting point is 00:28:18 of the products that we build, the trust of both the data that we are using and what we build on top of that, it's going to be super, super important. And that brings me to my last highlight of all these discussions that we had the past week, which is around open source. And I think trust, it's another, I mean, open source is another way of building more trust over the technologies that we are using. And it's a very foundational component of building like technologies that we can use with our data in the best possible way.
Starting point is 00:28:54 Absolutely. Well, I am extremely excited about doing another season. We already have several episodes recorded that I'm really excited about. And I'm excited. We'll cover other topics. Please feel free to reach out to us with your feedback. We'd love to hear what we're doing well, what we're not doing well, what types of guests you would like to hear,
Starting point is 00:29:18 what types of topics you would like us to cover. So feel free to ping us on that. Kostas' email is kostas at rudderstack.com. I'm eric at rudderstack.com and we will catch you in the next season.
