The Data Stack Show - 66: How Data Infrastructure Has Evolved and Managing High Performing Data Teams with Srivatsan Sridharan

Episode Date: December 15, 2021

Highlights from this week’s conversation include:

- Starting his career on the first-ever data team at Yelp (2:00)
- How to approach the adoption of new technology (7:04)
- When to use stream processing vs. batching (11:35)
- What is a pipeline and why is it core to a data engineer? (14:07)
- Where a new data scientist should begin their career (19:14)
- The key factors impacting a new technology decision (27:09)
- Managing team emotions in decision making (34:25)
- The unique challenge of Fintech vs. other consumer industries (45:03)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.

Transcript
[00:00:00] Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. Welcome back to the Data Stack Show. Today, we're going to talk with Sri, who is the head of data infrastructure at Robinhood. And I have tons of questions. This is not going to surprise you at
[00:00:38] all, Kostas. Before he went to Robinhood, Sri spent almost a decade at Yelp and started doing data stuff there very early on, and then spent just a really long time there sort of heading up all sorts of stuff on the data infrastructure side of things. And so I am just fascinated to hear what his experience is like spending almost a decade at a startup like Yelp, especially because he joined sort of shortly after the iPhone came to market. So he got to see so much change. So that is what I'm going to ask him about. What's on your mind? Yeah, I'll ask questions around how things have changed through all these years. I mean, he has been there doing data engineering when the term data engineer didn't even exist until today. And I'm very,
Starting point is 00:01:26 very interested to see from his perspective how technology has changed, but also how organizations have changed through these times. So yeah, I'll focus on that. And I'm pretty sure that we'll have more stuff to chat about with him. All right, well, let's dig in and talk with Sri. Let's do it. Sri, welcome to the Data Stack Show. We're super excited to talk with you. Thank you. Thank you, Eric and Costas. Thank you for having me on the show. I'm also super excited. Well, give us a background. You've worked in after grad school about 10, 11 years ago. And in my first job, when I started at Yelp, I started on a team that was the original data team. Back then, we didn't have a concept of a data team or a data engineering team. But I still remember my first,
Starting point is 00:02:20 very first project where I was building ETL pipelines and building our very first data warehousing solution. And since then, I've really been excited about the space because I've found that data forms the central fabric for everything in any organization. And I've always enjoyed kind of being in a position where I have a lot of breadth and visibility. And I thought data gave me that. And so I stayed on that track for many years as an engineer, and then transitioned into a management role, and built up the team and supported the team to help grow the data platform at Yelp. So I was there for about eight and a half, nine years in total. And then last year, I made a switch to Robinhood.
Starting point is 00:03:03 I'm in a similar role here at Robinhood supporting the data infrastructure org, working on similar space, but different set of challenges because the FinTech world is very different from the consumer web world. But yeah, my journey has mostly been in the intersection of data, people leadership and technology. Very cool. So many questions to get to, especially around going from consumer social to fintech, which is really interesting. One question, I have a few questions about your time at Yelp. So first of all, the iPhone came out, I think just a couple of years before you joined at Yelp. And so it hadn't really hit the exponential growth curve yet. And sort of mobile adoption from say late 2008, whatever, 2007 through to the last year over that decade was just mind-blowing. And Yelp, I'm guessing was primarily mobile. I could be wrong about that, especially in the later years, but I'd love to know, from a data perspective, managing infrastructure through the mobile revolution, what was that like?
Starting point is 00:04:14 Was there anything in particular that sort of, as everything shifted to mobile, were concerns from you from an infrastructure standpoint? We'd just love to know about that. Yeah, that's a great question, Eric. And I was fortunate to be in a position to see that transition happen. I still remember, I forget the year, this was early in my time at Yelp, where we were discussing as a company that, hey, we need to go mobile first. And at the time, Yelp did have a mobile app and a web app, but most of our focus and efforts was on the website. And there was a big...
Starting point is 00:04:45 That's so funny to think about. Right? It's so funny to think about it now. But I still remember being in that company all hands meeting or whatever, where the C team is like, we need to go mobile. Mobile is the thing for the future. And the skeptical person in me at the time was like, who uses an iPhone? iPhones are too expensive and nobody's going to use this, but I was wrong. And yeah, I mean, that really took off as you said, over the next several years. And the interesting things that kind of manifested in the data world was primarily two things I can think of. Number one is the sheer rate of growth of data, right? Like all of a sudden you have single digit millions of users
Starting point is 00:05:26 to double digit millions of users or even triple digit millions of users. And that's like exponential increase in data, which means that all of the systems that store data, process data, transform data, now need to adapt to this rapid change of scale and growth. The other interesting thing was now we are dealing with lossy clients, right?
Starting point is 00:05:44 So especially when it comes to tracking or understanding user behavior, like which is a pretty common pattern that most consumer web companies do. You run your A-B tests, you collect data about how people interact with the app. Doing that on a website is easy because it's not lossy or it's less lossy. Doing it on a mobile phone is very lossy. And so dealing with those challenges made things really interesting. So I'll give you a specific example, for instance, timestamps, everybody's fun topic timestamps. I remember an issue where when we are looking at computing metrics for the
Starting point is 00:06:16 company or measuring experiment results, you can't rely on timestamps coming from mobile devices, because you don't know when those messages are going to get emitted. Right. And so dealing with those timestamp issues, weren't something that we had to deal with when we were just working with web as a platform. That's just a small example, but those were kind of the big, big things that came out. Fascinating. So let's take a little bit more So you have exponential scale in data, and you really saw massive changes in infrastructure across a number of vectors. How did you approach adoption of storing data, data warehouse, pipelines, maybe buy some, maybe you build some, whatever. And then over the decade, you have these technologies come out that really make a lot of that different or sort of change the process. How did you think about, I mean, of course, I'm sure you modify the stack over time, but I'm thinking about our listeners who there's so many tools coming out and how did you decide
Starting point is 00:07:31 when it was actually time to make a change? Because there's a cost component to that, like engineering time, ROI over time. How did you approach that? Yeah, that's a great question. And there's no easy answer to that. I think, so I'll talk about kind of how I approached it and talk about my observations on how I've seen the industry evolve here. My personal kind of approach here is to not necessarily jump
Starting point is 00:07:59 to the latest and greatest immediately, because as you correctly pointed out, like if this was an open source project that I was working on or a hobby project i was working on yes i'm absolutely up for the new shiny tech i want to jump to the latest and latest and learn all of those things but i think working in an organization where you have to support existing use cases your systems have to be really really reliable i'm not very very keen on taking risks and jumping onto the latest shiny tech until it has been battle tested. And partly, this is also the appetite of the organization, right? Like, there's, there's one thing that I can think of, or my team can think of, but at the end of the day,
Starting point is 00:08:36 we need to align with what the company strategy is, and what is the company's appetite. And so perhaps if it's, it's a much larger organization, then it might be harder to jump onto the latest and greatest very quickly because the cost of migration is very high. But if it's a much smaller organization or an organization that's early in its startup journey, it might be much easier to jump to the latest and greatest. So I'll give you an example of one such decision that was easy and one decision that was hard.
Starting point is 00:09:03 The easier one was the adoption to Kafka. So this was perhaps in 2013 or 2014. At the time, LinkedIn had launched Kafka and Kafka was really taking off. And our initial version of our ETL pipeline was, we had an open source queuing system that we were using and it really wasn't scaling for us. And we were constantly dealing with issues with data loss and workers going down and basically the distributed system not being resilient enough. And then Kafka came along. And at the time we were like, if Kafka really works for a company of LinkedIn scale, it can certainly work for a company like Yelp. And so we started prototyping 2013, 2014, and,
Starting point is 00:09:45 you know, it really worked well for us. And, and we've seen Kafka to become a very prominent industry standard now. So that was kind of an easier decision to take. The harder decision was the move to streaming, which are stream processing, which, which probably happened around 2015 or 2016. Up until then, like most of the data processing systems were batch-oriented. But with Kafka picking up, there was a big cottage industry of stream processing solutions coming up, both vendor and open source. So 2016, 2017, I remember us debating whether we should invest in a stream processing solution. And Apache Flink was something that Yelp had used back then.
Starting point is 00:10:26 It was promising new tech. It solved a lot of the problems that we wanted it to solve, but the adoption was very hard because a lot of that code was written in Scala. And Yelp at the time was a Python shop. And so that one was a harder decision to take because it wasn't clear what the value proposition is going to be. We could see that it was solving some use cases for us, but we also saw barriers to adoption because stream processing was also a harder concept for people to grasp because you have to think about things like windowing, joining, which you normally don't think of in a batch processing system. So those are some things that immediately kind of jumped to my mind.
Starting point is 00:11:09 Sri, it's very interesting what you mentioned about the streaming versus the batch why did you i mean where do you need streaming processing and where do you need batch or do you need both or you can use only one of them like what's what's your opinion on that because there's a lot of baiting of like ah we should only do like streaming batching is just a subset of streaming and blah blah blah like all these things so what's your opinion and your experience yeah that's a million dollar question that's a holy wall right there costas i hope i don't antagonize our listeners here but but i think i think there's a place for all of them although that might change in the next few years. I think streaming or stream processing works really well when you have use cases that require data to be real-time, like order of millisecond latency.
Starting point is 00:11:55 Let's say if you want to do flex joins across multiple data sources and the data is changing very, very quickly and you need to be able to deliver results with order of millisecond or order of second latency, then batching doesn't really cut it. Like an example of that would be anything that's on the critical path of a user journey, for instance, right? So you take any kind of consumer, a social media product, let's say you're tracking events, the clicks that the users are making, how they're navigating through the website. And let's say if you want to provide personalized recommendations based on their behavior, and you want to be able to provide that personalized
Starting point is 00:12:33 recommendation quickly, then you can't rely on a batch processing system to do that because the data is changing very quickly, you need to provide timely recommendations and so on. And then in places where the use cases typically batch where something needs to happen in the next four hours, in the next eight hours, in the next 30 minutes, batch processing lends itself better. That said, the worlds are bridging now, right? Like we have micro-batching, like Spark streaming is really taking off. The folks at Databricks realized that Spark is a very powerful tool and Spark streaming is a micro-batch bridge for that. And then there are technologies like Beam, which tried to abstract away batching and streaming with a common API so that users don't have to worry about whether data is being batched or streamed under the hood.
Starting point is 00:13:20 So I think the industry is changing. I think it's going to converge. But right now, I do think that there is an opportunity to leverage each of them for distinct use cases. Yeah. Yeah. Super interesting. Okay.
Starting point is 00:13:34 I have a question. I started thinking about this during our conversation, to be honest. I realized that one of the most commonly used words in data engineering is the word pipeline. But we never tried to define what a pipeline is, right? We take it for granted that pipeline is something important. But I've never asked anyone, what is a pipeline? So based on your experience, and you have a very long experience, what is a pipeline? And why is such a core concept in the life of a data engineer?
Starting point is 00:14:09 Yeah, that's an excellent question. To me, when I think about it from kind of abstract first principles, a pipeline is a way to transport and transform data. And the reason why I think it is so fundamental and so important is most companies today run on data. And I would even like to claim that every single engineering team for any kind of tech company is a data team. Because anything that you do revolves around moving, transporting, storing data. And a pipeline is a very core construct for that. And I know over the years, the meaning of the word pipeline has also changed because I think 10 years ago,
Starting point is 00:14:52 building a data pipeline meant you just, I don't know, SCP your data from one place to another, right? We're not like, we can't do that in a single machine with memory. Like this is a distributed data processing problem now, which is why pipelines become much more challenging and complex over the years. Follow-up question to that.
Starting point is 00:15:13 You mentioned that when you started working, you were actually like what we define today as a data engineer, but nobody used the term back then, right? So why do we need a difference, let's say, to define a different discipline in engineering? Like how a data engineer is different compared to a software engineer or like an SRE or, I don't know, DevOps engineer or whatever. What's the difference there?
Starting point is 00:15:39 Yeah, that's a great question. I think what has happened over the years is as the industry has grown, many of these roles have become, started to become specialized. So an example of why there's a specialized skill for data engineering, as opposed to, let's say, a backend engineer or an SRE is because one of the key skills that a data engineer needs to have is the ability to debug large-scale pipelines, the ability to optimize large-scale pipelines. And that requires knowledge of how is data organized, represented, how are people querying the data, how is the data stored,
Starting point is 00:16:18 what is the business use case for that data? So all of these things, I mean, a backend engineer could do it, right? But it requires them to learn some special skills. And given that the use of data in the industry is growing more and more, I think companies are finding more value in carving this out as a niche profession. I think it's very similar to perhaps how DevOps morphed into SRE, right? Like, yeah, there was a time when everybody was the ops team would manage all services, and then they realized that doesn't work. So I think something similar where people were probably back end engineers were probably managing data. And then they realized that, okay, this is,
Starting point is 00:16:54 there's some special set of skills needed, in addition to being a good engineer, which is perhaps why that evolved that way. Yeah. Yeah. And I think also, that's something that I, I usually say when somebody asks me, like, what is a Yeah. And I think also that's something that I usually say when somebody asks me, like, what is a data engineer? I think data engineer is an interesting discipline because it's a, let's say, a hybrid between ops and software engineering, right? You have to build your pipelines, but you also have to operate your pipelines. And operating the pipelines is closer to what an SRE does, for example. Building that is closer to what an SRE does for example building that is closer to what like a software engineer does so you need to have let's say
Starting point is 00:17:35 way of thinking that it's comes from both worlds and I think we can see that also reflecting like on the tools that we see built for data engineers right like some of them are coming more from the SRE kind of space. Some of them are coming like from software engineering. So I think that's like something that's very, very interesting with data engineering. So if someone wants to start or like their career as an engineer right now, and they are considering data engineering or someone who wants to make a change, let's say, and go towards like data engineering, what's your advice to them like how they should start and what are like some fundamental let's say knowledge that someone should have yeah that's a great question as well i think the even even the role data in engineering
Starting point is 00:18:16 is is becoming so vast that there are sub roles within it right and? And so the way I like to think about it is there's a whole spectrum from the infrastructure to the user of data. So the closest to the infrastructure is like a data infrastructure engineer, like this is the person who is building things like Kafka or building libraries on top or managing these distributed systems. And then you go one layer on top, which is what is more traditionally known as a data engineer. Like this is the person who is using these technologies to build pipelines, to build data sets, to operate those pipelines. And then you go one layer above the stack. And this person sometimes is called a data scientist or a machine learning engineer or data engineer,
Starting point is 00:18:58 but these are the people who are using data to derive insights or making decisions for the company. So for someone who's considering a profession in data, I think the first kind of suggestion I would offer them is to try to figure out where they want to be in this data stack. Like, do they want to be close to the business or do they want to be close to the infrastructure? And based on that, the paths vary. So if they want to be close to the business and they want to get involved in using data to make decisions, then kind of moving to a profession like data science where educating themselves about statistics and machine learning techniques can be a good path there. Whereas someone who is interested in moving more to the infrastructure side, learning
Starting point is 00:19:41 about how some of these distributed systems work could be a good path for them. But I think at the core of it is this passion for data. I think that's what the fundamental thing is. Because if you have that passion for data, I think you discover your path depending on the company and depending on the opportunities that those companies provide. A hundred percent. I think you gave some really valuable advice there. We also have like, people might have heard like recently the terms also like analytics engineer,
Starting point is 00:20:11 something that like it's promoted a lot like by DBT because DBT, as you said, it's like a tool that is usually affects people that are closer to the business than people that are closer to like the infrastructure behind. So yeah, and then you have MLOps, for example. It's people that are closer to the business than people that are closer to the infrastructure behind. So, yeah. And then you have MLOps, for example. So there are many things that are coming up every day. So I think we will hear more and more terms around that stuff. Great.
Starting point is 00:20:37 So another question. How did you see, you mentioned some technologies like Kafka, for example, right? And you were like an early adopter of Kafka, actually, from what I understand. And so that's quite an achievement, to be honest, because Kafka is not exactly the easiest piece of technology to operate at scale, especially in the early days. So how have you seen the data stack change through all these years, right? From 2000, like a little bit after like iPhone was released until today. Yeah, yeah. I think there's been a huge tectonic shift in the industry, right? Like just like seven or eight years ago, some of these technologies were brand new.
Starting point is 00:21:20 And as you correctly said, not easy to operate, right? Not easy to scale. But I think what we've seen in the last few years is rapid expansion of these technologies and these technologies becoming heavily commoditized. The Snowflake IPO is a good example, right? Like the company performing really well and carving a niche for itself, similar to Confluent, like what started as Kafka became a big company, an enterprise company there. So I think a lot of these technologies are becoming commoditized. And I think that's a big change. And I think that's actually healthy for the industry because it reduces the barrier of entry for companies that want to get better, that want to serve their customers better with data, but maybe don't have the skill set and experience to build and operate these large scale distributed systems. And so I think what has happened is like, it's just democratized the space
Starting point is 00:22:10 and it's opened up possibilities for newer companies to emerge. Because if you look at like seven, eight years back, it was only the likes of, you know, the Googles and the Facebooks, the places that were known for really good quality engineers, like making dents in the data world. But now, if I were to start my own company, it's so easy for me, I just, obviously, I need to have the money to do that, because some of these vendor solutions might be expensive, but like, I can easily integrate with a Databricks or a Confluent or Snowflake and bootstrap my data stack. And based on your like, if you had to choose just one technology that you have worked with all these years, which one you would say it was the most influential in making this tectonic change? Yeah, it's a good question. Let me think about this. I think some of the ones that immediately come to my mind, probably not surprising to our audience is Kafka and Spark are the two big things that I see. It's very clear how they became their own companies and very successful because of that. And
Starting point is 00:23:12 the shift to Spark was interesting because when MapReduce came out many, many years ago, it was the hottest thing in the industry. And at the time I was thinking this is probably the compute framework for decades to come. And then Spark replaced it. And then it's now the compute framework for perhaps decades to come. So I think those were very, very influential. Data warehousing has been an interesting topic because I don't think there's been a clear winner in the data warehousing space. Data warehousing has been there for many, many years, right? Like we've had the likes of Informatica and IBM, and it's not a new concept. And there are different, obviously newer technologies and newer ways of doing it.
Starting point is 00:23:51 But I feel like that space is still pretty wide open. Yeah. I was sorry to jump in, Kostas. Oh, go ahead. I was just going to say one thing I would love your perspective on is you at Yelp worked at a massive scale. At Robinhood, you're working at a massive scale. You have the data and the resources to do machine learning. If you think about the last decade, are these new technologies that are making these components of dealing with data easier, is that unlocking machine learning use
Starting point is 00:24:26 cases for companies that aren't operating at that large of a scale who don't necessarily have the resources? Are you starting to see that shift happen and sort of go down market? And so machine learning use cases are now being made available or sort of are much more easily enabled for smaller companies? Yeah, I definitely think so because the barrier of entry for using data is very low. So you don't need to have a 40% engineering team to build up this data stack anymore. If you have the funding, you just use these solutions and bootstrap that very quickly. And I do think that that's opening up new opportunities for smaller companies, for sure. I was thinking as you were giving your answer about the most influential technology,
Starting point is 00:25:10 the first thing that came to my mind, it's actually Kafka. But I'm super biased because I built a company on top of Kafka. So it was like a very important piece of our architecture at Blendo. But anyway, we talked about the technology, but the technology like lives inside some context and the context usually is the company, right? So how does technology reflect to the company and vice versa? How different types of companies you have seen them like adopt different technologies, what kind of impact like choices there might have
Starting point is 00:25:45 and all these things that like, let's say, how we can approach this from more of an organizational point of view based on your experience. Yeah, yeah, it's a good question and a hard one to answer because at the end of the day, theoretically, you can evaluate
Starting point is 00:26:00 these different technologies and say, hey, they provide this capability, they provide this capability. But when it comes to the rubber hitting the road, it becomes a different ballgame because you have the complexities around what are the interests of the engineers on your team, right? Like you can't just say, use this technology. And if people don't like the technology, nothing will happen, right? And you also have to understand the lifecycle of the company, the appetite of leadership. So for instance, if the appetite of leadership is to move towards a buy solution, then you
Starting point is 00:26:36 either have to adopt that strategy or convince people up the organization to change their strategy. So a lot of these things come into play. I think what I've seen is with many of these evaluations that we have done with me and my teams over the years, at the end of the day, the final decision ends up being on three factors. Cost, how quickly can we get this up and running?
Starting point is 00:27:01 And what is the excitement level of the engineers who are working on that piece of technology? Because let's take data warehousing, for example. There isn't, I mean, there are obviously different solutions have different benefits, but not a lot of foundational differences between Snowflake or Delta Lake or Iceberg or Hoodie. They're fairly similar. There are some differences in feature sets, but at the end of the day, like how do you pick that, right? And I think that really boils down to based on the existing organization's context, how quickly, how easily can you adapt this technology, make it work
Starting point is 00:27:35 with the cloud or your data center that you have in your company, make it work with your developer tooling ecosystem, understand what your customers are passionate about, what your engineers are passionate about. So those things kind of come into play. So Sri, based on your experience, and like you're working with many engineers like every day, it's a little bit of a provocative question, but do you think that the technology there right now,
Starting point is 00:27:59 that it's much more, let's say, preferred by data engineers? Great question. Let me think about this. I don't think I've heard a consistent answer. I've heard different people wanting different things. Maybe what I would say, going back to my previous comment about Kafka and Spark, I think Kafka has become so foundational
Starting point is 00:28:21 that people don't even think about it anymore, right? Like it's a layer that exists underneath and there are abstractions that people have built on top of it. And I think Spark is another critical one. When I've interviewed engineers or when I have interviewed with other companies, I've seen this very common pattern where people kind of assume or expect that, you know, Spark and that might be something that has come up. And Airflow is another thing, right?
Starting point is 00:28:47 Like some people love it, some people hate it, but a lot of the data engineering community uses it. So those are probably things that immediately come to my mind, but I've definitely seen the jury to be divided there. That's super interesting. And sorry, Eric, I have one follow-up question. Please do.
Starting point is 00:29:02 Hey, get provocative. That's great. Provocative is good. There's a very interesting detail in the technologies that you have mentioned. All three of them are open source. You didn't say like,
Starting point is 00:29:15 for example, Snowflake, like the $100 billion like Gorillaz in their own show. Do you think that like being open source is something that is important when it comes like to the preferences that developers have? I 100% believe that because a lot of the engineers that I've worked with and being an engineer
Starting point is 00:29:33 myself, I can kind of like empathize with that, even though I'm probably a terrible engineer now. But I think it's very ingrained to us as engineers, right? Like this aspect of open source and being able to showcase to the world what we've worked on and being able to incorporate what the world has given to us. So what I've seen is companies
Starting point is 00:29:53 that are able to hire software engineers, data engineers, and so on, which have a high density of engineers. I've seen them to naturally gravitate towards these solutions that have an open source component, because it's just more appealing. As you said, $100 billion company like Snowflake is successful, and I've known a lot of big tech companies using Snowflake. But I've not seen a lot of big tech
Starting point is 00:30:16 companies yet completely using vendor solutions across the board. So you might see a company that might be using Snowflake for their data warehousing solution, but they might be using open source Spark or open source Kafka or open source Flink. And that might be because engineers are very excited to work on open source things. But companies that don't have a big engineering presence, like I like to call them companies that maybe I'll irk some people here, companies that have an IT department rather than an engineering department. I love our IT professionals. I'm just making a joke here. But I think companies that don't have a big presence of data engineers or software engineers, they perhaps would want to go to off-the-shelf solutions, which just makes it really easy for them to move forward. Yeah, those are all great points. And I'm very interested to see how Snowflake is going to respond,
Starting point is 00:31:11 let's say, in this lack of open source that they have. Because I think at some point they will do something. There is a gap there for them compared to Databricks or Confluent or, I don't know, even Google, let's say, for example. So I'm really looking into what they are going to do in the next couple of months around that. That's all, Eric. He's all yours. I promise.
Starting point is 00:31:33 No more questions from me. I wanted to jump in because we talk so much about technology, which we love, obviously. That's a huge part of what the show's about, but you have so much experience managing teams working on data and data infrastructure. And you mentioned your evaluation criteria for new tools. And I think, correct me if I'm wrong, but it's the first time we've heard someone talk about engineer excitement as a major factor in adopting a technology. And I would love to just dig in on that a little bit more. Have we heard that, Kostas? You looked unsure. Yeah, no, I don't think we have discussed this before, but I think especially among people who are responsible for hiring or who had like at some point to hire engineers the stack it's something
Starting point is 00:32:27 that is like always always important like it's there are like these jokes about like cobol for example where you have i mean being a cobol developer right now is probably the best thing you can do in your life because there are so few of them and there's so much compel code in ban seriously seriously i believe it yeah they can make like crazy amounts of money because there are so few of them, and there's so much Compile code in Ban. Seriously, seriously. Oh, I believe it, yeah. They can make like crazy amounts of money. But I don't think that's, I mean, it's easy to go and like hire Compile developers, right? So it's always important what kind of stack you have.
Starting point is 00:33:00 The stack changes. We see, for example, what happens with Rust right now as a language. What happened with Golang like a couple of years ago, what happened with Scala even before that. That's why we have products like Kafka and knew how to write code in Scala. Like Scala for like a small at least period of time, it was considered like the data engineering language. And these trends change. And as they change, you have like to keep that in mind because, yeah, you might end up like having issues like hiring. So it is important.
Starting point is 00:33:53 Yeah. Well, so this question is for both of you. How do you approach making a decision? And I'll set up a little bit of an unfair question to continue our theme of trying to be provocative. Maybe this isn't going to be provocative, but so three, let's say you are looking at adopting a new technology. The cost component makes a ton of sense. The ease of migration makes a ton of sense. You have buy-in from your boss and the other stakeholders, but the engineers aren't excited about it. How much do you wait that? And how do you think about that both from a near term and then a long term? Like maybe they're not excited now, but this is the right decision. Like how do you navigate that as a manager? Yeah, that's a really good question. I think one of the biggest mistakes you can do as a manager here, which I have done before in the
Starting point is 00:34:41 past and therefore can speak with authority is taking a decision in a vacuum and then going back to your engineering team and saying, this is the decision we've taken and then trying to convince them to buy in on that decision or even incurring the cost of people being pissed off and so on. So I think what I've learned to be a good way to approach it, I still won't call it the most optimal way, because I don't know if that's the most optimal way. But I think a good way to approach it is to include everyone in your decision making process. So not only include your stakeholders, your boss, but also include the engineers who are going to be responsible for implementing that. And then when you get all of these diverse perspectives together, consensus is much easier to be built because the people who are on the ground might have a lot more
Starting point is 00:35:30 detail about how one specific thing works better or the other. And obviously there's the excitement piece too. And the engineers can see what I am thinking or what the stakeholders are thinking. And it just builds that shared context, which makes the decision much more easier to take. To kind of answer your earlier question, which you were kind of alluding to, I think, Eric, around how do you factor in the engineering excitement towards it? Yeah. I don't think that should be the only factor, obviously,
Starting point is 00:36:00 because if you're over-indexing on your engineers' excitement, what happens if those engineers leave the company, right? Like you can't really base a decision based on what two or three people are excited about. It has to be a decision. The technology needs to have a future and you need to be confident about the technology having a future. And you need to be confident about the ability to hire people with those experiences and skills for that technology. This goes back to that Scala question, right? There are far fewer Scala developers today than there are Java and Python developers. So that's an interesting data point to consider, for instance. So at the end of the day, it's an input in the process, but it's not the be all end all. Sure. So continuing on this topic of teams, because I think it's really helpful. One thing we chatted about as we were prepping for the show is how do you think about building a team, right? So we've talked about sort of the context of operating in companies like Yelp or Robinhood, where there's a ton of both sort of technological infrastructure, team infrastructure.
Starting point is 00:37:12 You got to see that both from the beginning at Yelp, and I would guess have built sort of internal teams in different disciplines. How do you think about and help our audience think through what's the best way to think through building data infrastructure? And maybe you can help us think about that from like the startup perspective, which is very different to building like a data team inside of a larger organization. Yeah, yeah. I think given that what we've seen over the last eight years, 10 years and so on, data is a foundational fabric for every company. Every company is a data company today. And startups actually have a distinctive advantage as compared to larger companies, because you get to build things from the ground up and get things right from the get go. Many organizations make the mistake of not investing in their data stack early on. The challenge with not, it might be appealing to not invest in a data stack early on, because when you're an early stage startup, you're building the product, you're finding your product market fit. There's a lot of unknowns here and you want to be scrappy.
Starting point is 00:38:10 You don't want to be investing in pipelines and curated data sets and machine learning techniques. You're just trying to get a product out of the door. So it's understandable why companies don't invest in data early on, but the cost of not doing that can be immense. So I'll give you an example. So I think once companies kind of become larger and you're trying to introduce a data stack in a larger company, you deal with a bunch of problems. Number one, you're probably already
Starting point is 00:38:35 a data company without realizing that. And if you've not invested in your data stack, you're probably doing a lot of manual work to generate your metrics, to generate your experiments. And then you need to take a sufficiently large organization and get them to adopt a new technology, which can be very, very costly, a lot of migrations. And then if you delay that even further, you get into the situation where different teams will start building their own data stack, whether they know it or not, because the business will put pressure on them to deliver data-driven insights. So if you don't have a consolidated data stack, whether they know it or not, because the business will put pressure on them to deliver data-driven insights. So if you don't have a consolidated data stack, they would start building their siloed data stacks. And then you'll run into a situation where there are different
Starting point is 00:39:15 domains of your company will produce different data and those data won't agree with each other. You'll run into issues of schema compatibility and what's the source of truth. And these are things that I've run into and I've seen my peers and partners run into. So I think startups have a distinctive advantage of getting this right. And I'll make a kind of brief hat tip to the data mesh, which is everybody's favorite topic these days. I do think it's a great paper. And I think building a self-serve data platform and making sure that every person in the company is invested in making sure that the data is accurate and treating their data as if it were a product or an API can make a huge difference early on.
Starting point is 00:39:54 And so let's talk about maybe one of our listeners who might find themselves in a situation where they buy into that. And maybe even they're in a situation where they came from a larger organization where everyone was bought into data. They had awesome tooling. It was pretty self-serve. They were able to do really cool things. And then they go to work for an earlier stage company and they can kind of see, okay, like in six months or a year, we're really going to wish we had sort of these pieces of infrastructure in place, or even a lot of times it can be really wish we had this kind of data and had been collecting it for a long time.
Starting point is 00:40:37 And we're not doing that. That can be kind of a hard sell, right? Because it's like, okay, I want to spend money and engineering resources, which are the most valuable hours to vie for inside of a company. How do they sell that internally? Because like you said, we're trying to get a product out the door. We're trying to figure out if we have product market fit. And in many ways, it's true that the founder says that's just not as important as this. And it's, of course, more complicated than that, but how would you approach that? Yeah.
Starting point is 00:41:09 And there's merit to that, right? I think if you're a very, very early stage startup and you're in this existential crisis mode, then maybe it's not right to invest in your data stack. Then you need to figure out your story of what you're delivering and who are your customers. But I think once you've achieved some kind of a product market fit, that is a great time to push for building a data-centric culture. And the way I would approach that, or I would suggest that is, obviously you can't go to your CEO or to your boss and say, I need five people, I need $5 million to set this up, right? Obviously, that's going to be a
Starting point is 00:41:45 hard no. I think it's about finding incremental places or incremental opportunities to build towards a self-serve data platform towards the future. So one example of that could be every company has report metrics, right? And so typically what companies do in their early stages, a lot of this is manual, right? Somebody's writing a SQL query. Somebody's putting that into an Excel. Maybe there's a visualization dashboard, right? You throw that in there. And there's a good business case to be said around automating that, how to make that data
Starting point is 00:42:15 correct, how to ensure that people aren't copy pasting stuff and spending hours and hours validating data and introducing a tool to solve that problem. And then once you solve that problem, then there comes the data collection problem introduces tool to solve that data collection problem. So I think you can incrementally build this. The key thing I think is making sure that you get your entire company invested in using data. If you're able to do that, the technology and the building the platforms
Starting point is 00:42:44 becomes much much much easier yeah it's interesting the it's easy to think about okay we need to build a state a data stack and it's easy to start with here's the upfront cost right it's engineering it's technology or whatever and it is i think way better to think about it in terms of, and the metrics is a great point. It's like, how much time can we save the company? Right. And in many cases, it's probably going to break even or even like have positive ROI because you're right. I mean, someone writing SQL, I mean, really, I mean, if you think about how many amazing tools we have, like there's still a lot of companies who are like query the production Postgres, write SQL,
Starting point is 00:43:26 and then deliver an Excel file to someone who an analyst who hammers on it to sort of do the metrics. And if you can automate all that, it's a huge savings. Yep. Yep. Definitely. Sri, I have one last question because we are almost in time here. You have worked in, no, let me put it like in a different way. Usually I ask what's the difference between B2C and B2B when it comes to like the data stack that someone needs, right?
Starting point is 00:43:52 And there are like, okay, some very obvious differences there. But you have experience in two, let's say, very different types of B2C companies. You've been like in Yelp and now you're like in Robinhood, both of them like consumer facing products, many users, a lot of data, but they are very different in terms of like the type of product in the industry they come from. So what are the differences there?
Starting point is 00:44:15 How the one data stack differs from the other because of like being in a different industry? Yeah, yeah. They're definitely kind of bringing unique challenges. I think the biggest contrast that I've seen is the stakes get much higher in a fintech company. I don't mean to say that companies that are in the social media world
Starting point is 00:44:39 are the stakes aren't high there. I'm sure the stakes are high there too. But the analogy that I'd like to draw is what happens if the review that you posted doesn't show up or the like that you made on a post doesn't show up versus the transaction that you made to purchase some shares at a certain price doesn't execute, right? I mean, this is not from a data perspective. I'm just talking about the fundamental nature of the businesses. But I think what that translates to is the stakes get higher on the data side. One particular place where it manifests is correctness. So in a company that's a B2C
Starting point is 00:45:13 company, that's a consumer web, social media, and so on, you can afford, I don't want to say you can afford to be incorrect, but there's a certain tolerance to correctness, right? Like you can afford to be 99.999% correct with your data. In a fintech company, you can't afford to be anything less than 100% correct. And that introduces very unique challenges on the distributed system side, because you have to not only optimize for, you know, scale and latency and for tolerance and performance, but also for correctness. The flip side is typically large social media companies, the scale is much larger compared to fintech companies. But I don't mean this to be a sell for Robinhood. But I do think it's a unique space. Because when you merge the world of like fintech and mass consumer product, you basically get challenges of scale and correctness, which
Starting point is 00:46:01 is something very, very unique. The other thing is customer privacy, right? Obviously, it's a very big topic. It's a very important topic for everybody here. The stakes get higher again in a fintech company because privacy means a lot more now than in a different kind of social media type company, because now you're dealing with people's money and that just raises the stakes. That's super interesting. And I'm pretty sure we need probably another episode
Starting point is 00:46:28 to discuss about how you're going to have scale and consistency and like everything at the same time. So hopefully we will have the chance to do that in the future, Sri. It was a great pleasure having you here. We really enjoyed talking with you and we're looking forward to record another episode with you. Sounds great. And it was a pleasure. I think you guys asked really good provocative questions, which I appreciated. And I think it was definitely a wonderful experience for me.
Starting point is 00:47:02 That was a great conversation. I'm trying to decide what my big takeaway is, but I'm actually going to talk about something from the very end of the conversation, which is when Sri talked about how the stakes are higher in fintech than consumer. And I loved that you could tell there was a tension there. He didn't want to say that correctness was not important in a social direct-to-consumer app context. It is very important. But when you're talking about someone writing a review on a meal they had at a restaurant versus someone trying to spend their own hard-earned money to buy stock in another company. It's just a little bit of a different game. But I guess all of that to say, my big takeaway was, he said, in a consumer social
Starting point is 00:47:52 company, you can afford to be 99.9% correct or whatever, how many nines he used. But in a consumer financial company, you can't afford to be anything less than 100%. And it just really made me think about that a lot because in like so many things in life, sort of the last 1% to perfection can be the most difficult or the most complex, right? Or sort of building the infrastructure to ensure that. So that is going to consume a lot of my thought this week, 99 to 100%.
Starting point is 00:48:27 You're very philosophical today, Eric. And usually it's me who's more into that. Maybe I'm just trying to share the burden with you, Kostas. Since you... Yeah, it was a great conversation. I think there are many things to take from this conversation outside of the fact that like our relationship probably means like a consult but we had i mean there's a there's like
Starting point is 00:48:51 a wealth of information that comes like from three like even things of like today we defined what the data pipeline is why it is important like you know like things that we take for granted but actually they shouldn't be taken for granted like We should spend time on meditating on all these core concepts that we have in our discipline and always keep in mind that these things change very, very rapidly. And I think that's the biggest takeaway from this conversation like how things can change really, really fast and how you have to be, let's say, always stay relevant in this profession. Keep up to date with like whatever is happening.
Starting point is 00:49:35 And yeah, that's what I keep from his conversation. And I'm really looking forward to have another episode with him because I think we have like plenty more to chat about with him. And we also learned that if your main goal is being highly in demand and making a ton of money you should learn cobalt ah yeah yeah that's true that's true yeah cobalt is like yeah like it's the key to paradise I mean if you if you are willing to do that like sacrifice so much in your life and yeah you will will be rewarded. That's true.
Starting point is 00:50:06 All right. Well, thanks for joining us and we'll catch you on the next Data Stack Show. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
