The Data Stack Show - 48: Season Two Recap with Eric Dodds and Kostas Pardalis

Episode Date: August 11, 2021

Highlights from this week’s episode:
Dissecting the different team structures from organizations in season two (1:16)
The people behind the data are key to the data itself (9:17)
Open source licensing ...and the core components needed for large scale commercial viability (15:13)
Game-changing core technologies in the new data economy (22:09)
Snowflake vs. Databricks battle. "The UFC of Geeks" (25:54)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. Welcome to the Data Stack Show. This is a wrap-up of season two. That's right, we've completed two whole seasons of the show. I think we've recorded 50 episodes,
Starting point is 00:00:43 not all of those are out yet. Season two was 22 or 23 episodes. And for this one, we actually decided to turn the video cameras on. We thought that would be fun. So you can see what Kostas and I look like face to face. You can see our podcasting equipment and all that fun stuff. So we're going to do this one on video. We'll post it on YouTube. In the season recaps, we like to just do a quick overview of what we covered in the last 20 or so episodes of the show.
Starting point is 00:01:08 And so we have a list of the things that stuck out to us. Kostas, I'm just going to run down our list here because we have a lot to cover and we like to keep these fairly short. The first topic is team structures. There were a couple of things that stuck out to me across multiple episodes here. And one of them was the different structures we see between the relationship of data engineering and data science. And a couple of episodes come to mind there. One is Policy Genius versus two other companies, HomeSnap, and then The Atlantic. We talked with data scientists at all three companies, leaders in the org. And the high-level overview, and I'd just love your thoughts on this, Policy Genius sort of has purview over all the data. So data science is a component of the data
Starting point is 00:01:56 practice in the company. You can go back and listen to the episode to hear about that. At HomeSnap and The Atlantic, we talked with lead data scientists, and they really do nothing on the data engineering side. They just sort of receive the processed, cleansed data that they need in order to build their models. I think there are advantages to both. It's obviously working very well in different contexts. But what stuck out to you about that different team structure that we heard about? And there are more examples, but those are just the ones that jumped out. Yeah. And I think that this is some kind of theme that is going to be repeated in future episodes too. I think we are going to be discussing a lot about that,
Starting point is 00:02:33 mainly because anything that has to do with data and how data is positioned inside the company is still something under definition. Quite recently, we started discussing about MLOps, for example. What is MLOps? Who is responsible for MLOps? Is it going to be the data scientists? Is it going to be the data engineers? Is it going to be some different team? We don't know yet.
Starting point is 00:02:53 All these things are still emerging as companies are still trying to figure out how to work with the data, what value they can get from the data, and, of course, reorganize the company in a way that can maximize the benefits that come from using the data. I think something that makes a big difference in how data teams are structured is two things. One, how important data is to the product itself. So we're talking about Policy Genius. We're talking about a company that operates in the insurance market. Of course, data is very important there.
Starting point is 00:03:30 The data itself is actually part of the product. So it has a very core role inside the company and inside the business itself. So that changes things a lot. Now, on the other side of the spectrum, you have this amazing case of The Atlantic where you have like a very old organization which is working on like introducing data
Starting point is 00:03:53 into even the product itself. Because if you remember when we talked, like the stuff that they are doing is pretty amazing. Like they are using ML to help a lot with the experience that the end user is going to have reading The Atlantic. But still, you have an organization that needs to evolve into a data-centric, let's say, I don't like the data-driven term, but much like a data-centric organization compared to another company that's built from day zero with data at its core in terms of like the value that they provide.
Starting point is 00:04:28 So, of course, this is going to affect also how the companies are structured. Right. Yeah. That's, as we say, I mean, it's still something that we as professionals, and the market itself, are still trying to figure out. Yeah, I think it's interesting. If you think about Policy Genius, the product wouldn't work without data. Yep. And I think actually, Jenna, I've thought about this multiple times where we said, you're not really a product company, you're more of a content company. And she disagreed. She said, content is the product. And I loved that. But at the same time, you could publish plain text on the internet and you don't need any data for that. Of course, it is data. And I think Jenna's point was very
Starting point is 00:05:15 well made. But to your point, from a business model standpoint, it's not like they have to have some sort of underlying data set in order for the product to actually function. You can just produce content, which is super interesting. One other point that came up, and this came from Policy Genius, but we heard about different setups, was where do data professionals live within the organization? And the two big structures we heard about were, one, data as a centralized practice that acts as a shared resource across teams and a service center, if you will. And the other one was what the team at Policy Genius called structured embedding, where they actually assign data professionals in various capacities, right? So an analyst or a data engineer to specific teams. So you'll have people on the data team,
Starting point is 00:06:05 but they're embedded into the product team or the marketing team, which is really interesting. And that was a really interesting conversation. My question for you, Kostas, because you run an engineering org and have run engineering orgs, what are the reasons that you would want to pursue a structured embedding as your org structure as opposed to a shared service centralized model? What are the different business models or questions? That's an excellent question. And actually, it doesn't have to do only with data engineers or data analysts, right? It's a more general question of how you want to organize your engineering organization at the end.
Starting point is 00:06:49 And there are two main, let's say, directions there. One is you have each function that it's siloed in a way, right? And each silo interacts with the other one depending on their needs. So, for example, if the engineering team that builds a specific feature, they need some, I don't know, some data, okay? They will go like to the data team and be like, we need this pipeline to be built and this data to be delivered there, blah, blah, blah, all that stuff.
Starting point is 00:07:16 And the data team is going internally to prioritize this in the way that they want and implement it and deliver it, et cetera, et cetera. And that's not just with data engineering or data in general, like think about design or product, like again, it's or even like front end and backend development. And then you have this concept that I think it was pioneered by Spotify, or at least that's what I remember right now as a very good example of a team that does that, where you have like this concept of a squad, let's say,
Starting point is 00:07:47 where you have teams and they have all the functions inside them, right? Like, so you are going to have the front-end developer that's assigned to a specific team. You are going to have the backend developer. You are going to have the data engineer. You are going to have the designer. You are going to have the product owner
Starting point is 00:08:00 and all that stuff. Now, I think at the end, because, okay, there's a whole industry behind how to structure teams, with coaches, with consultants, blah, blah, blah, and all that stuff, and how to increase productivity. And then I think it's a matter of culture. The way that you structure your organization is a matter of culture. I think it has to do a lot with the early, at least, leadership team, and of course, the CEO and the culture that the CEO brings to the company. And at the end, it has to do with how you can manage communication better.
Starting point is 00:08:38 These are the two things that affect that. Outside of this, I wouldn't say that I prefer one or the other. At the end, people need to figure out how they can communicate better. And the structure is going to emerge from that. So I don't like to see things black or white, although with an engineering background, like I'm trained to do that. But unfortunately, when we are talking mainly about humans and human relationships, things are like, not just black and white. There are many shades in between with many different colors also. It's not even just gray.
Starting point is 00:09:10 So that's my opinion. And yeah, whatever works at the end and keeps people happy with less friction. That's the most important thing. Yeah. Okay. Next topic on our list was, and this has really been a recurring theme, I think, through season one and continued into season two, but I'll bring up two specific examples. The theme is the people behind the data are really the key to the data itself.
Starting point is 00:09:35 And I'll bring up two specific examples. We talked with one of the early data scientists from Shipt, which is a grocery delivery company that was purchased, I believe, by Target. So they're just absolutely huge. And then we also talked with Peter, the founder of a company called Aquarium Learning, which helps people, it's tooling for building better models. And it was really interesting to talk with both of them. So Shipt deals with a lot of out-of-stock or in-stock type predictions around people purchasing things online across grocery stores that are geographically distributed. Very complex problem.
Starting point is 00:10:13 Peter got his start in machine learning, working at a self-driving car startup that ended up becoming a gigantic organization. Just really, really interesting people. Yeah. The thing we heard from both of them was a model is a model and you always have to ask the right questions and have the right mental framework when you're building a model. And I'm just interested for you as an engineer, and I actually don't even know what your experience with building, you know, machine learning models is, but just give us your perspective on that as an engineer. What are the pitfalls? Did that really resonate with you? Do you think there are areas where
Starting point is 00:10:53 that's not necessarily true? Yeah, I think that's an excellent point, actually. And I really enjoyed like the conversation we had with Aquarium because it managed to communicate something that people tend to forget a lot when it comes to machine learning, that before you end up having a model that does its magic, you might need maybe even thousands of human beings to do very boring stuff like annotate data. And that's, I don't know, at least for the people that live in the Bay Area, in San Francisco, they might see all these Waymo cars driving around again and again and again, where they just create data, right? So you need someone to drive around to create the data set that is going to fit and build the models. And we are talking about probably millions
Starting point is 00:11:45 of hours of driving without purpose. And that's the same thing also with the pictures where if we are capable today to have models that can identify all the different breeds of dogs or, I don't know, all this crazy stuff that you see out there. It's because one way or another, we humans figured out a way to utilize humans to create these data sets. And that's a very big part of the work behind all these models. And it's something that, okay, probably it's not that sexy. It's not so exciting. We prefer to create this kind of image of this better than human intelligence that we are going to create, blah, blah, blah, like, you know, all that stuff. But at the end, there's a lot of labor behind that to happen. And I'm really happy that this is something that came up during our conversations.
Starting point is 00:12:41 And I think another theme, which is probably more profound when it comes to data and machine learning especially, but I think it's like a recurrent conversation also in engineering, right? About, yeah, why you would go and build something
Starting point is 00:12:55 that is going to replace data engineers, for example, because data engineers are building pipelines. Why you are going to automate the process of that, right? Because then you will have data engineers losing their jobs. And in the same way, we say,
Starting point is 00:13:10 why we are going to create something that does diagnosis on x-rays. Our doctors are going to lose their jobs. Sure. But if you see at the end, what happens is that all these tools, actually what they do is that they augment what the humans are doing. Right. They don't replace the humans. And I think this is obviously true in engineering. I think it's also very true about machine learning, at least today.
Starting point is 00:13:37 And this is something that I think came up in many episodes this season. And I think it's probably one of the slightly more philosophical, let's say, things that we touched, but probably one of the most interesting and most important ones that we have managed to communicate. Yeah, I agree. Because I think when you,
Starting point is 00:13:57 we talked about this a little bit in each case as well, but machine learning can tend to be an abstract concept in the mind of the average person when they think about AI. It's actually, it's one of those terms where everyone's familiar with it, but when you ask someone to put a really concise definition on it, especially your average person who isn't an engineer, it's actually pretty hard to define, but it shows up in very, almost intimate ways in your life when you think about
Starting point is 00:14:25 taking a car to get someplace that you need to go or ordering groceries because you need to make food for your family. And so I think that will become an even more important conversation in coming years because machine learning and AI are really intersecting with our lives in a lot of ways that many times you don't even necessarily know about as consumers, but there's a direct influence there, which is really interesting. Okay, next topic, unless you're done with that one.
Starting point is 00:14:56 Do you have more? I'm done. I mean, you know me. If we start discussing about this, we will probably need three episodes. Well, I'm saving the best for last. I mean, you are somewhat of a Greek philosopher yourself. So when we get onto the philosophical subjects. That's true. Okay. Another hot topic, open source. So we, there are two episodes that
Starting point is 00:15:20 come to mind. So we talked with Jim from Cockroach Labs, and he had very strong opinions about open source. We kind of talked about the whole Mongo thing and got his opinion on that. And he was great, very opinionated guy, and I loved it. And then we also talked with Sven, and he has done a lot of writing and thinking about open source business models and how they scale over time, which is, I think that episode was recently published, Slaying the Four Dragons of Data, which is a really interesting episode. Either way, they both talked about open source and they kind of had a little bit of a different take on it. So Jim from Cockroach, I think said rightly, there's an ethos behind open source where people want to help other people and provide that tooling
Starting point is 00:16:17 and those resources for free, because it's a community effort where you're trying to help make each other's jobs easier and your work easier. And then people get wrapped around the axle on the licensing conversation, right? So that's kind of where people get wrapped around the axle. And then Sven kind of had this really interesting, very interesting point around what are the core components of actually being an open source company that would allow you to become a very large organization in his terms, the next $30 billion valuation data company. So there's a lot of crossover between what they said, but what's your take on that? On Jim's question around the licensing and that's
Starting point is 00:16:57 where people get wrapped around the axle. And then also Sven's point about what are the core components that you need in order to build a company that's founded on open source that actually becomes commercially viable at a large scale. Yeah. There's also another episode that has very interesting insights about open source. And that's the one that we had with Tecton, with Willem. Yes. Because Willem is like the main contributor of Feast, which is the only open source feature store out there right now. And he also added another dimension of open source,
Starting point is 00:17:30 and especially open source that is not backed by a commercial entity, which is the abuse that some of the maintainers have to go through. Like, it's not easy. And there are open source projects out there that are maintained by big corporate entities, right? Like, when you see things like Kubernetes, it's open source. But, okay, of course, you have Google behind it, right? Like, you have engineers that are getting paid to maintain this.
Starting point is 00:18:02 And then you have other tools like Feast, for example, that's like the other extreme where you have someone who just wants to do it and build something. And he's the only maintainer, like he does it like not as part of his job. Sure. The difference there is, and the interesting part is that the people who interact with these repositories, they pretty much have like the same, let's say, expectations, regardless of who's behind the project. And that's something that's quite interesting and can be quite taxing for projects that do not have someone to sponsor the project, right? Yeah. Now, I think that as we see this new data economy getting built, we will see open source becoming a more and more important business tool, actually, like something that's going to be important to build a business.
Starting point is 00:18:57 And it's a big conversation why and how. But a very good example of this is like database systems, right? Like you mentioned CockroachDB, for example. It's outside of the big corporations of the past, like Oracle, for example, pretty much like everything exciting that happens right now in terms of a database system is open sourced. Yeah. And that's something that I think we will keep seeing that Kafka, Databricks, Spark. I don't think there's
Starting point is 00:19:26 like a specific recipe on how you can do that. But I think that it's part of like the culture of engineers and developers to interact and use open source tools. So I think that's always going to be an important dimension, especially when we build products that are going to be consumed by developers. Yeah, I think it's really interesting. I think if approached in an authentic way, when you're dealing with core data technology, it's a way to understand some of the challenges you face in building the product or some of the problems people face in using it way, way faster because the community is giving you a steady stream of feedback, which is interesting. Whereas you compare that with maybe some open source tools that
Starting point is 00:20:14 are very helpful, but aren't necessarily as core to data infrastructure inside of an organization as something like a database or a data pipeline or something like that. So super interesting. Okay. Two more questions. And I'm saving the best one for last so that we have a hard time stop on it because we've talked about it before and I know we can get long winded. Okay. Tools. I'm going to let you pick two of these to talk about, but we talked with a bunch of companies building really interesting things. We already talked about Aquarium. So I'll leave that one off the list. Meroxa is doing CDC, change data capture, stuff. Super interesting. Tecton is doing a feature store as a service. Super interesting.
Starting point is 00:20:56 Materialize is doing some really interesting things with streaming SQL and materialized views, which is fascinating. And then we also had multiple discussions about graph. Neo4j is one of the ones that comes to mind as well. We talked with Affinio as well, who is doing some interesting graph stuff on top of Snowflake. So pick two of those and not necessarily based on the company because they're all really cool. But when you look at the data landscape, or as you said, the new data economy, which is a very, very elevated term there. I'm biased. I want it to happen. Yeah. That's like a, you're getting into like Forbes territory there. So be careful. So as
Starting point is 00:21:42 you look at the new data economy trademark, which of those tools do you think as an engineer, do you say like, okay, this has the potential core technology to be really game changing or have an outsized impact? That's a very good question. I don't know if I would choose the product itself, but... It was an unfair question, but that makes it way more fun for me and the listeners. Yeah, yeah. But I would at least... Okay, at least one of the things
Starting point is 00:22:13 that I think is going to become really important and we saw with Affinio, if I remember correctly, if not, Affinio, please forgive me, is the whole concept of building data processing applications on top of a data cloud provider like Snowflake. So what we saw happening there is that Affinio has this amazing technology and algorithms for processing graph data. And instead of building their own database engine to do that, they built that on top of Snowflake. I think as a trend, it's quite early.
Starting point is 00:22:49 If it works, I think it's going to be one of the foundations of the data economy. It's going to accelerate and enable the data economy. So then I can go and write articles on Forbes and The Economist. That's my dream. That's why I do whatever I do in my life. So that's one thing. That's an amazing trend that I see and I'm very, very interested to see how it's going to work at the end. The other thing is anything that has to do with stream processing and streams in general. And there we see two things. One is the transformation of traditionally
Starting point is 00:23:26 not streaming data, like a database, for example, into a stream. That's what CDC is at the end. We take the database and turn it into a stream of changes on the state of the database. So from something where we used to only interact with the latest state that the database has, now we have like a stream there. And that's what Meroxa is doing. And that's one thing. And the other thing which is quite interesting is what I see with Materialize, where we start having some streaming processing techniques
Starting point is 00:23:58 and products that are much, much easier to work and integrate with existing database systems, like Materialize. And why I think this is important is anyone who hasn't done that in their engineering or developer career, try to take something like Spark, which can do like streaming processing or like Flink and run it and measure the time it will take you to do something with it, like something meaningful, and try to do the same thing with Materialize.
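The CDC idea described here, turning a database from "latest state only" into a stream of changes, can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not how Meroxa or Materialize work internally: the Table class, the replay function, the CountByStatus view, the event shape (op, key, before, after), and the status column are all hypothetical names made up for this sketch.

```python
from collections import Counter


class Table:
    """Toy in-memory table that records every write as a change event."""

    def __init__(self):
        self.rows = {}  # latest state: primary key -> row
        self.log = []   # append-only stream of change events (hypothetical shape)

    def upsert(self, key, row):
        before = self.rows.get(key)
        self.rows[key] = row
        op = "insert" if before is None else "update"
        self.log.append({"op": op, "key": key, "before": before, "after": row})

    def delete(self, key):
        before = self.rows.pop(key)
        self.log.append({"op": "delete", "key": key, "before": before, "after": None})


def replay(changes):
    """Rebuild the latest state by applying change events in order."""
    state = {}
    for change in changes:
        if change["op"] == "delete":
            state.pop(change["key"], None)
        else:
            state[change["key"]] = change["after"]
    return state


class CountByStatus:
    """A view kept up to date one change at a time, without re-scanning the table."""

    def __init__(self):
        self.counts = Counter()

    def apply(self, change):
        # Retract the old row's contribution, then add the new row's.
        if change["before"] is not None:
            self.counts[change["before"]["status"]] -= 1
        if change["after"] is not None:
            self.counts[change["after"]["status"]] += 1
        self.counts = +self.counts  # unary plus drops entries that fell to zero


orders = Table()
orders.upsert(1, {"status": "new"})
orders.upsert(2, {"status": "new"})
orders.upsert(1, {"status": "shipped"})
orders.delete(2)

view = CountByStatus()
for event in orders.log:
    view.apply(event)

print(replay(orders.log) == orders.rows)  # the stream carries enough to rebuild the state
print(dict(view.counts))                  # per-status counts maintained incrementally
```

The same change stream serves two consumers in this sketch: one rebuilds the table's latest state, the other keeps a small aggregate current by looking only at each event's before and after rows. That second consumer is, very loosely, the flavor of an incrementally maintained view over streaming data.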
Starting point is 00:24:29 And you will see what the difference is there. I'm not going to say that one is better than the other, but what we can see here is like a shift on how we, about the easiness and like the access that we have in processing real-time and streaming data. And I think these two things that I mentioned, these two trends that I think they are going to be important
Starting point is 00:24:54 and interesting in the next couple of months and years in terms of how new data processing paradigms are going to be introduced. Yeah, I agree. And we talked with lots of cool companies. Those are just the ones I pulled off the top of my head for the recap, but I just want to do a quick shout out for the ones that I didn't mention. Avo is a data governance tool. We talked with Stef from Avo. Really, really cool.
Starting point is 00:25:17 That was a super neat tool. We talked with Chris from DataKitchen. Let's see, Grafana. We talked with someone from Grafana. Of course, that's a tool that we all love. I mentioned Neo4j and Meroxa. I just want to make sure I mention everyone here. Pandio. Josh from Pandio. They built on top of Apache Pulsar. Super interesting for AI/ML use cases. We talked with Socrates of Clerk, and they're doing some really interesting things around authentication and user management as a service. I think those are all the tools that we've talked about. So just to our audience, if you missed any of those episodes, go back and listen. They're building some really cool stuff. Okay. Last but not least, we only have a couple of minutes here, but we had multiple episodes
Starting point is 00:26:00 where we talked about Snowflake versus Databricks. I'm surprised you didn't see that one coming, Kostas. I think this is going to turn into like the UFC of geeks, you know? The conversation is really interesting, though. And I'll just give the listeners a brief overview of the very high level. And you've written about this extensively, actually, Kostas, and Brooks can put that in the show notes. But Snowflake is approaching the industry from the data warehouse side with analytics use cases primary, and then moving into data lake, machine learning, and the ecosystem around those functions.
Starting point is 00:26:47 And Databricks is coming the other way, right? They came from Spark. They were open source. They came out of academia, as opposed to being born out of another large startup and being a venture-backed commercial entity from the beginning. And Databricks is now moving towards the warehouse and analytics use cases. So the simple question is, who's going to win? And when do we think that's going to happen?
Starting point is 00:27:14 I don't know. I mean, wearing my The Economist Forbes hat, the winner at the end is going to be the consumer, right? Like it's going to be the customer because they are going to have some excellent products. Steve Forbes could not have said it better. I mean, that is straight out of the editor's note in Forbes. So Steve, if you're listening, we'd love to have you on to get your opinion on this.
Starting point is 00:27:44 Yeah, please, please. I don't know who's going to win. What I know is that I think this whole war, or fight or whatever, for the data cloud is just beginning. We have two, let's say, main competitors there. But the most interesting thing, for me at least, for the next couple of months is the new companies that are getting into this arena. Pretty much every data lake project from Apache, if they haven't done it already, is going to turn into a commercial entity and try to build a product. And that's something we are going to be hearing more and more in the near future, like Iceberg, Hudi.
Starting point is 00:28:33 And of course, we have other companies like Dremio. We need to see what Confluent is going to do. So I think that this competition just starts now and there are many things to happen. And there are, okay, we were joking, but I think we are going to see some great new technologies that will try to become, let's say, the data cloud. Because that's what it is at the end: how we can create a platform where we can have more cases like Affinio, where we will go there and build products on top of that, just as we do with AWS right now, right? Like when we want to build our products, we need servers. We go to the cloud provider to do that. When we are moving to build data products, we need data infrastructure to do
Starting point is 00:29:18 that. So who's going to become like the leader in this? I don't know. We'll see. I think it's going to be very interesting. And I think that the most important thing to do is keep our eyes open to see the new companies that will enter the market. That's what, at least, is very, very exciting for me. Yeah, I agree. If I put my Steve Forbes hat on and do my note from the editor, I think as we think about both Snowflake and Databricks and all the tools that we mentioned before that we had the chance to learn about in the last season, the trend that I see that I think is really what excites me is, well, there are two major things. One, like you said, we're just at the beginning here. And I think it's fun to talk about Snowflake and Databricks and debate back and forth. But the bigger story is that we're at the very beginning of a decade-long trend that we're going to get to see play out in
Starting point is 00:30:11 real time, which is going to be really cool. The other thing I would say is that a lot of the challenges that all of these tools are solving is that they're removing artificial scarcity from the equation of providing value with data. And many times the artificial scarcity is paid for in engineering time. And so as a lot of these low-level plumbing problems are solved by new technologies, we are going to see, I think, people do things with data that we haven't even conceived of yet, because they will have no artificial scarcity and they can apply all of the talent and resources in order to actually create value with data. And so I think the next 10 years is going to be phenomenally
Starting point is 00:30:57 exciting. And hopefully we're still recording episodes then. Absolutely. We'll still be here. We will also be writing for the Economist and Forbes, but the podcast is going to continue. That's right. Okay. Well, that's the buzzer. One thing we want to say before we hop off of this episode, we would love your feedback. You can send that to eric at datastackshow.com. That's E-R-I-C at datastackshow.com. You can also go to datastackshow.com and contact us there. We'd love your feedback. If there's a technology or a subject that you'd like us to discuss, we'll go find an expert and talk with them. If there's something that we're doing that annoys you, we'd like to hear about that as well. No guarantees that we'll stop it. And go check out
Starting point is 00:31:46 the website to see all of the episodes you missed. We didn't get to talk about data governance. We didn't get to talk about stack architectures. There were a lot of good themes that we didn't get to in the season wrap up. So make sure to go to the website or onto your favorite podcast provider and check out what you missed. And we will catch you at the beginning of season three next time. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
