The Data Stack Show - 37: The Components of Data Governance with Dave Melillo of FanDuel

Episode Date: May 26, 2021

Highlights from this week's episode include:
Dave's "nerdy" interests in sports statistics and data (2:12)
Trends in collecting, processing, and using data (4:45)
Finding a better term for "reverse ETL" (5:48)
The blurring of the distinction between sources and destinations (7:41)
The role of BI is changing (13:24)
Data governance and the physical execution behind it (19:00)
Data governance is defining and managing data in a logical way that is actionable by the business (23:43)
Consolidation of tools and services (28:49)
Databricks vs. Snowflake (33:49)
Dave's focus on regulatory data at FanDuel (45:47)

The Data Stack Show is a weekly podcast powered by RudderStack. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 The Data Stack Show is brought to you by RudderStack, the complete customer data pipeline solution. Thanks for joining the show today. Welcome back to the Data Stack Show. Very interesting guest, Dave from FanDuel. FanDuel is sort of a fantasy sports and sports betting suite of apps. So it's going to be really interesting to talk with Dave about that. I know they've been doing fantasy sports for a while, but I think the betting aspect is fairly new from a regulatory standpoint. So perhaps we'll
Starting point is 00:00:40 get to hear a little bit about that. But Dave has a really varied background and has worked in all sorts of contexts with data. So I think one of my burning questions, which is pretty tactical, is when you think about fantasy sports, you have to ingest data from a ton of different places. I mean, you're talking about statistics, you have games across multiple sports happening on a daily basis. And so when you think about all of this required to run a sort of consumer mobile app like that, where people are interacting every single day with data that needs to come from third parties, I always look at that and say, man, that's an interesting pipeline problem. So I want to ask about that. Kostas, what's on your mind? Yeah, first of all, I'll probably want to learn a little bit more about fantasy games, to be honest. Like, I don't know much about it. But outside of this, I mean, Dave has like a very diverse background. He has worked with data science, data engineering, he has even done work in data architecting. So I want to learn from him about his experience with all the different fields around data and also pick his brain on what's coming in the future in this space.
Starting point is 00:01:50 Great. Well, let's dive in. Dave, welcome to the Data Stack Show. We're really excited to chat with you. Thanks, Eric. I appreciate it. I'm really excited to be here. So you have a varied history with data and we want to hear all about it. So why don't you give us the brief, you know, sort of two minute overview of when you got started with data, the different companies you've been at and what you're up to today? Totally. I think I'm going to Tarantino it because you know, where I am today, it's kind of the apex of what I've been trying to do with data my whole life. I'm currently working at FanDuel, which for people who don't know is a daily fantasy
Starting point is 00:02:24 sports betting company. It's in the sports entertainment space. When I first started studying data, right, back in high school and things like that, you know, what piqued my interest was sports statistics. I've always been kind of a nerd that way. I thought I was going to graduate college and be the statistician for the New York Yankees. Unfortunately, that didn't happen. But what did happen is I was able to kind of parlay that interest in statistics and data and information technology into roles at, you know, Fortune 500 companies. I worked at software startups, kind of ran the gamut from different places that I worked at throughout my career. But everything's
Starting point is 00:03:02 revolved around data, right? The same things that I was doing at Fortune 500 companies, I was doing on the side, consulting for small businesses in my area. And so that's everything from data engineering, to data architecture, to data science, and all the fun stuff at the tip of the spear. But yeah, it all finally came full circle to me landing closer to my passion here at FanDuel where, you know, we solve everything with data. So. Very cool. And tell us a little bit. I know you haven't been there too long, but what's your role? Do you have a team? What kind of data projects are you working on at FanDuel? Yeah, FanDuel, I'm on the operational side of the business. So it's a lot of back office support, compliance support, you know, regulatory support. So it might not sound like the sexiest of roles, but it's really cool because it's at the hub of everything that FanDuel does with data.
Starting point is 00:03:57 So I get exposed to a lot of different pieces of data, not just gameplay stuff. And it's really, really interesting. And there's a team and it's growing exponentially along with the market. So it's a really fun and exciting place to be right now. Very cool. Well, I have tons of technical questions. I know Kostas does too, but one thing I would be interested to hear your perspective on is sort of major trends in the data space. And I'll name one specifically to maybe direct the question a little bit more, but we have this concept of a data mesh that seems to be becoming more popular. What kind of trends are you seeing in terms of the way that companies are sort of organizing themselves around sort of collecting, processing, and actually using data? Yeah, that's a great question.
Starting point is 00:04:53 So the trend that I've seen most strongly over the past few months, maybe six, maybe even a year, right, is that people have doubled down on technology like Snowflake, right? And like cloud data warehouses have become commonplace. And that requires a significant amount of investment from a company's perspective, right? To just spin up and migrate to Snowflake or Redshift or BigQuery or anything like that is no easy job, right? And it's no cheap job either. So as over the past, probably five, 10 years, companies have done that. They've started to understand or have like a revelation that just because everything's in that cloud data warehouse doesn't mean that the business is exposed to it.
Starting point is 00:05:28 So that leads to this whole trend of reverse ETL that has started to emerge. I don't really like the word reverse ETL. I feel like that's very much like a sales and marketing term. I really like- Oh, I'm so glad you said that because, okay, let's talk about this. I was going to ask you why you don't like it. Tell us what you would call it. I call it data portability.
Starting point is 00:05:50 And that's how I've always advertised it internally to my stakeholders and people I'm working on with projects because it's about making sure that data is portable no matter where the analysis or the data is generated, right? I think like thinking about reverse ETL is it makes sense because you can marry it to something that people are familiar with, but I'm not really sure if the concepts of ETL are actually what this is doing. So portability is the word that I use, but it's, you know, it's really all about getting data in front of people where they're working, right? On a regular basis. Because just as IT organizations have doubled down and gotten things like Snowflake and Fivetran and this whole chain of tools, you know, go to market
Starting point is 00:06:33 functions and the business have also done the same thing. You know, I'm sure that you guys can sympathize with this, but at any company that I go to, they have like a million different SaaS applications, you know, one for customer success, one for sales, one for marketing, et cetera, et cetera. And, you know, asking people in the 21st century to be swivel chairing between like a BI dashboard and Salesforce and some spreadsheets and things like that is a little bit archaic. So, you know, that's where I think this whole thing with data portability and this trend, that's what people are trying to solve. They understand that, hey, just because I have analysis in a dashboard
Starting point is 00:07:12 or in a data warehouse really doesn't mean anything to the people who are actually using this data and making it actionable. Sure. Okay, so I'm going to give you a really brief sort of three-stage history of where we've been with data, maybe even the last five years. And Kostas, I want your opinion on this too. This is me. I'm going to go off the cuff here, but you kind of have the introduction of the, this is dangerous. This is really dangerous.
Starting point is 00:07:38 It's very exciting, I think. Go, go. Do it, do it. All right. So you have the introduction of the data warehouse, right? So Redshift was sort of the first major player there. And then on the heels of that came Snowflake. And of course, BigQuery is a major player there too. And this allows you to collect all of your data in one place and sort of achieve analysis that before was much more
Starting point is 00:08:00 difficult. And then you have that, that sort of created the challenge of the second phase and tools like Fivetran and all that solved it, which was, okay, now I can collect all my data, but actually doing that is kind of hard. And so I need much, much easier ways to get all these pipelines to talk to each other and sort of integrate my stack, whether it's sort of sources to the warehouse and then also sort of sources to like SaaS tools. And that was phase two, right? So you saw the Segments and the Fivetrans and all the pipeline tools come of age over the last five years. And many are sort of mature now. And I think the third phase, and this is probably where there's sort of some prediction coming in is, and I love the term data portability, is where every source is also becoming a destination.
Starting point is 00:08:47 And so this paradigm of sort of linear collect, store, transform process and deliver is actually becoming almost bi-directional in a way where the distinction between sources and destinations is starting to blur. How did I do? Was that accurate? Oh yeah, I think you nailed it. And I mean, as you were talking, you know, one of the things that started to percolate in my mind is also this whole movement around kind of view materialization. I know DBT has come on really strong as of recently. And again, I think, you know, maybe even like the next phase of all this and what's going on in the future is all around data governance. Right. And and maybe that's the data mesh piece that you talked about at the beginning.
Starting point is 00:09:31 It's like, OK, I have all these sources. I have all these tools. How can I observe them? Make sure that they're available. How can I make sure that people know what the single sources of truth are? How can I easily create these single sources of truth from large data sets and kind of make that available to the rest of the organization? So, yeah, I think you did a great job. And I think the future, to your point, is kind of a little bit like the Wild West, right? Because all the big boulders have been solved, but people still experience pain. So, you know, I think you see different vendors kind of attacking the future from different angles, you know. Dave, I have a question and I'd like to hear from your experience.
Starting point is 00:10:11 So about reverse ETL, right? I mean, it's a new term, as you said, let's say this portability, data portability, let's call it like this. How it was done before? I mean, before the markets and why the market decided now to go after this problem? Yeah, again, I always say this. When I started this conversation about data portability and why it's emerging, I believe it's been because things like the cloud data warehouse have become very accessible, right? It's not hard. Like usually in the past, there'd be a lot of configuration,
Starting point is 00:10:49 a lot of customization, a lot of integration, but now you can white label everything, right? You just subscribe to Snowflake. Look at that. You have a cloud data warehouse. Same thing for building data pipelines. In the past, you'd have to know Airflow. You'd have to get familiar with DAGs and you'd have to build it all yourself. Now you just subscribe to Stitch for a monthly fee and you can get all of your data into your data warehouse. But now people understand, they're like, wow, so we've doubled down on this minimum viable data stack, but no one cares, right? Like my salespeople don't care
Starting point is 00:11:18 that I have a cloud data warehouse because they're still consuming content through BI dashboards or through things that we send to Salesforce. So, you know, it's completing that circuit. And that's really why I believe these portability tools or even things like DBT have really become needed because it's that bridge between all of that technical debt that you built up with this minimum viable data stack and actually making it actionable. And yeah, I don't know if that answers your question 100%, but I do believe
Starting point is 00:11:52 that you wouldn't have one without the other, right? If these cloud data warehouses, if these pipeline tools didn't exist, I don't think things in the data portability or the DBT space would be emerging as well. Yeah, absolutely. To be honest, I think and I believe actually that the real enabler here is the cloud data warehouse. I think the rest pretty much emerges because we have access to cheap storage and processing on the cloud, something that in the past we didn't. I mean, and that's what makes things easy from one side, but also complicates things. Like the cloud makes things cheaper
Starting point is 00:12:33 and more accessible, but at the same time, it complicates things by introducing many silos there, right? Like all the different SaaS applications that we have and suddenly we also have to pull data from there. And it's not just database systems in the same data center where we control everything as it was in the past.
Starting point is 00:12:49 Because of course, like ETL, it's not something new. It exists pretty much since we have database systems. So yeah, I would say that I totally agree with you. And I would probably emphasize it a little bit more like the importance of cloud data warehouses for that. So you mentioned BI. And I mean, traditionally, data warehousing was the technology that was supporting BI. Do you see the role of BI changing inside the organization?
Starting point is 00:13:16 Do you see it like going away or you see new roles outside of the BI analyst emerging? A hundred percent. And that's, you know, now I'm remembering what your last question was, right? Like, how did we solve for this before all of these great tools? And BI was the answer, right? I remember when I started my career probably closer to 10, 15 years ago, BI was the thing, right? BI was solving all these complex problems that you couldn't with spreadsheets, right? So something like QlikView and Tableau, like they were dominating the space because they made it so much easier to answer the questions that you had that you were trying to solve with spreadsheets and kind of first gen technology back then. So in that way, I totally think that BI now is changing, right? Because you don't have to do the end to end process with BI anymore. And if you are still doing that, and you're basically using something like Power BI or Tableau as like your data platform, I think that you're way behind the curve because you just can't process things as quickly,
Starting point is 00:14:22 you can't anticipate as quickly, it's not scalable. Right. And so now, yeah, it's very interesting. I don't think BI is going away, but it's just not the one-stop shop anymore. I think it's one of many tools that analysts will have to learn. And in that way, to your point, Kostas, I think the definition of an analyst is changing. You know, it used to be that you just had to be good with visualizations and creating some charts. I think like the scope of an analyst is increasing now, right? Like I think analysts nowadays have to be comfortable jumping into like a Jupyter notebook or a Databricks notebook, right? Because there you can do some ETL, you could do some transformation and, you know, set up for visualization
Starting point is 00:15:06 later down the line where I don't think it was like that before. So I totally think that there still is a role for BI. I just don't think it's going to be as critical or as pivotal as it was in the past. Yeah, I totally agree. I think that's the role of BI is transformed. Obviously, it's not going away because reporting is always going to be like the foundation of whatever we're doing, right? Like we need to understand the past
Starting point is 00:15:32 in order to act in the future and in present. So I don't think that like BI is going anywhere. It's just that instead of like the BI analysts, we will have a little bit of different roles where BI is going to be just part of the tool set that you are using as very well, you put it earlier. And talking about roles, there's this very interesting new category or new role, let's say that it's very promoted by DBT of the analytics engineer, right? What do you think about this? Like what's your definition?
Starting point is 00:16:04 What does it mean for someone to be an analytics engineer? What is this thing? Yeah, it's funny. Like, I don't know. What does it mean, right? All these data roles have been malleable from day one, right? Because when I came in, what an analyst was is not what an analyst is today. And I like this idea of an analytics engineer. I mean, what that means to me is someone who's doing the more technical work behind analytics, right? Because people say, okay, analytics, you get a data set, you chop it up, you pivot it in Excel, and you're an analyst.
Starting point is 00:16:37 It's like, well, that world has increased in scope, right? In breadth, very much so. So I think of an analytics engineer as doing those things like view materialization, even like some data governance and maybe more of what would be thought of more of a data engineer, but not like a very technical data engineer. So in that breadth, right? I think that data engineers are becoming more and more and more like developers, right? They are definitely shifting over to more of a developer persona, developer day-to-day, developer tool stacks. enough to be in the conversation with the developers, but still analytical and business mind enough to be able to match business requirements to what needs to be done on the back end to set up the business to analyze, right? So again, when I think of things that analytics
Starting point is 00:17:35 engineers are doing, it's, you know, view materialization, data governance and indexing, building data catalogs, even building maybe some observability and monitoring pieces of the stack, which, you know, that's another piece that's emerging. So, so yeah, I don't know. I'm probably not the person to define what the analytics engineer is, but that would be my best guess if I had to take it. Dave, we brought up data governance a couple of times here and I'm, I'm really interested in, so in many ways, like the, and this is, you know, unfortunately a lot of times the case where the marketing kind of leads too early with the future vision that companies can achieve.
Starting point is 00:18:16 And then, you know, you sort of like when the, when the data warehouses came out, you know, sort of, or, or in the early days when they were becoming really popular, you know, you had this whole thing of like, now you can get a 360 view of the customer. It's like, well, in reality, you needed all these pipeline tools in order to make that feasible for, you know, your average company. But to your point on data governance, and I think it's really interesting in the data mesh concept, governance becomes a problem because now you have all these different pipelines, maybe different vendors, you know, different internal builds, all that sort of stuff. And so you can sort of move data more easily and centralize it more easily.
Starting point is 00:18:50 But now you're sending it to all these different places. And so now it's sort of hard to do governance at a central level. What are the ways that you see companies solving that? I think that's a great question. And I think, you know, Kostas also kind of tipped onto this or touched onto this, I should say. I really think... Collibra, right? It was a, it's a really famous, like data governance data catalog. And all that it was really was like a fancy spreadsheet of, you know, data metric definitions and, and what they were and allow people to collaborate. Right. So in that way, I think it's bringing that concept to life and making it physical. So again, what does that really mean?
Starting point is 00:19:45 I keep coming back to view materialization, but like there is no data governance without some type of physical execution behind it. So whether that means that you're going to roll out GitOps so that everything in your GitHub repository aligns very much with all the metrics that you're creating. I mean, this whole code is documentation, I think is a piece of it as well, right? Like your code should be your data governance assets. When someone asks like what, you know, MAU monthly active users are, you shouldn't be like pointing to a cell in a spreadsheet and words that define what a monthly active user is.
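One way to picture the "code as documentation" idea from this part of the conversation is a metric whose definition lives in a view rather than in a spreadsheet cell. The sketch below is only an illustration under assumptions: the `events` table, its columns, and the sample rows are hypothetical and were not described in the episode, and SQLite stands in for a cloud data warehouse.

```python
import sqlite3

# Hypothetical definition of monthly active users (MAU). The table name,
# column names, and the rule "any event counts as activity" are assumptions
# made for this example only.
MAU_VIEW_SQL = """
CREATE VIEW monthly_active_users AS
SELECT strftime('%Y-%m', event_date) AS month,
       COUNT(DISTINCT user_id) AS active_users
FROM events
GROUP BY strftime('%Y-%m', event_date)
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with sample events and the MAU view."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_id INTEGER, event_date TEXT)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)",
        [
            (1, "2021-04-03"),
            (1, "2021-04-20"),  # same user, same month: counted once
            (2, "2021-04-11"),
            (1, "2021-05-02"),
        ],
    )
    conn.execute(MAU_VIEW_SQL)
    return conn


def monthly_active_users(conn: sqlite3.Connection) -> list:
    """Read the metric from the governed view instead of recomputing it."""
    return conn.execute(
        "SELECT month, active_users FROM monthly_active_users ORDER BY month"
    ).fetchall()


def view_definition(conn: sqlite3.Connection) -> str:
    """Return the SQL behind the view, which doubles as its documentation."""
    row = conn.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE type = 'view' AND name = 'monthly_active_users'"
    ).fetchone()
    return row[0]
```

Calling `monthly_active_users(build_demo_db())` returns one row per month, and `view_definition` returns the SQL text that someone asking "what is MAU?" could be pointed to.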
Starting point is 00:20:23 Like you should be able to point to like maybe a view that, oh, well, here's our view of active users. And this is the SQL behind it, or the Python that builds this view. And it's pulling from these tables and it has these columns and these are the, you know, the characteristics of each column and the type. Like, that's the piece of data governance that has been missing, I think, for probably a long time, is that physical piece to say, okay, yeah, you've defined it, right? And you're governing it from that aspect, but how are you making it real, right? Yeah. And that kind of goes back to something that has been a recurring
Starting point is 00:21:02 theme on the show across so many disciplines within data, whether it's data science, data engineering, data governance, is that it's an organizational and sort of cultural question first. And that is getting shared definitions around how you define the business. And then I love the analogy you gave of the physical manifestation of that. I think that's just a really helpful way to think about that. And I agree with you there. I mean, the DBT is a huge step forward in building some process and tooling around that, but I still think we have yet to see all the different things that
Starting point is 00:21:44 are going to make it way easier to do that centrally within the context of sort of the data mesh future, if we want to call it that. And you know where you hit the nail on the head is, I think all of these tools are still a little bit too technical for business users, right? Like when I think about DBT, when I think about any of the good, you know, tools that are making it easy to, you know, manifest this whole process, they're still very technical. I think the first company or, you know, vendor who comes up with like a business way or a way to empower business users to participate in that process, I think that that'll be where the major impact comes because that's what you're missing. At the end of the day, data people are data people. And it's great that that's starting to happen because I feel like in the past, you were like a marketing person that also knew how to work spreadsheets. So now you're the marketing data
Starting point is 00:22:40 person, right? And I think it's flipping now. People are understanding, like you wouldn't do that with HR, right? Like at a company, you wouldn't be like, you're marketing and you're good with people. So you're going to be our HR person. But think about it, that's the way data has been working for the better part of, you know, the 21st century. Only recently have there been college graduates, you know, graduating with analytics degrees and a concentration in statistics that is specific to programming. So it's like, I think as data people actually stake their claim and they are data people, you're going to need tools that bridge the gap between the data-minded person and the subject matter expert, you know? Yep, totally. You're so right. Before we move forward, and I have a feeling that this conversation is going to be a lot around data governance,
Starting point is 00:23:34 and for a good reason, because it's something that's, like, very, very interesting. So, Dave, can you give us, like, a bit of a definition of what data governance is? Yeah, I mean, I really think it's defining and managing your data in a logical way that is actionable by the business. I think of data governance as, for example, a lot of single source of truth projects, right? It could be as simple as customer value. Well, how do you have a data governance program around customer value? It might seem really easy.
Starting point is 00:24:09 It's like, well, the number in Salesforce is our customer value, but where did that number in Salesforce come from, right? So it's this whole data lineage that maps all the different data sources to the metric that you want to create. And not only the data lineage and where that information is coming from, but then what is the logic, right? Like, is a customer value based off of a start and end date? Is it a monthly value? Is it an annual value? And, you know, for all of those questions, how is the answer manifested? And that's where I think the documentation as code or code as documentation really plays a point. So you have this data lineage piece that traces all of the information that you're using for the metric. You have the logical piece that is using code to define what these metrics are. And that's where, again, the physical piece is really stressed. It's like, okay, well, once we have the lineage right, once we have the logic down and committed to code, how are we delivering this to stakeholders on a regular basis? Are we materializing views?
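The three components named here (lineage, logic, and delivery) can be sketched as a single metric definition kept in code. This is a hypothetical illustration, not a description of any tool mentioned in the episode: the `GovernedMetric` class, the source names, and the annualization rule are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, float]


@dataclass(frozen=True)
class GovernedMetric:
    """A metric described by where it comes from, how it is computed,
    and where it is sent. All field names are illustrative."""
    name: str
    lineage: List[str]                    # upstream sources feeding the metric
    logic: Callable[[List[Row]], float]   # the definition, committed as code
    delivery: List[str]                   # where stakeholders consume it


def annual_customer_value(rows: List[Row]) -> float:
    """Assumed rule: customer value is the monthly value times twelve."""
    return sum(row["monthly_value"] * 12 for row in rows)


# Hypothetical source and destination names, used only for illustration.
CUSTOMER_VALUE = GovernedMetric(
    name="customer_value",
    lineage=["billing.subscriptions", "crm.accounts"],
    logic=annual_customer_value,
    delivery=["materialized warehouse view", "CRM account field"],
)


def describe(metric: GovernedMetric, rows: List[Row]) -> Dict[str, object]:
    """Answer the three governance questions for one metric in one place."""
    return {
        "metric": metric.name,
        "value": metric.logic(rows),
        "sources": list(metric.lineage),
        "delivered_to": list(metric.delivery),
    }
```

With a definition like this, the question "where did that number in Salesforce come from?" has one answer to point to: `describe(CUSTOMER_VALUE, rows)` reports the value together with its sources and destinations.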
Starting point is 00:25:15 Are we using a reverse ETL tool to get it out of our data warehouse? Is there another process that we're using? That's where I think there's many solutions to the problem. But when I think of data governance, those three pieces of lineage, logic, and delivery are kind of the main components for me. Makes sense. That's very interesting. And what are the tools that today we have to implement data governance? Yeah, like I said, I think that there's like some all-in-one tools. I know Collibra is a really big player in this space.
Starting point is 00:25:55 Obviously, like there's some more legacy providers like Informatica. I know they have really robust MDM and data governance features. You know, personally, I don't really think that there's like a cool data governance platform, right? And like an emerging one that kind of fits with this minimum viable data stack, because people are kind of managing data governance in a... everything lives in this zone of our Snowflake data warehouse. And then when we clean it and we prepare information that's ready for consumption, it's in this other zone of our data warehouse. Some people I think are solving with a tool like DBT, right? If it's scheduled with DBT and then set it and forget it, then that's our data governance. And basically anything in production is governed. Anything in dev is not governed. But again, what that does is in a way it excludes the business user because unless the business user can fork a GitHub repo, can read SQL, can understand all the different programming languages and the transformations that are being done to that data, it's kind of hard. Like you need to be walked through that process. So like I said,
Starting point is 00:27:09 the first company that comes by and can map the technical pieces of data governance, the lineage, the logic, and the delivery to things that the business people would understand and also be able to contribute to, like, I think that's where you're going to get lightning in a bottle. Yeah, that's very interesting what you are saying, because you are talking a lot about, like, let's say, a governance platform, like a unifying kind of experience around governance, which is what, let's say, Informatica was trying to do, right?
Starting point is 00:27:47 Or Collibra and in general, like all these more enterprise kind of companies that we have seen so far, like in this space. IBM, I mean, all these companies had some kind of like master data management platform. But at the same time, I think that the Silicon Valley way of doing things is getting these platforms, right? And decompose them into meaningful parts and build companies pretty much
Starting point is 00:28:06 and products around that, right? So we have like, now we see companies like Immuta, for example, right? Like they just raised like Series D, $90 million. And they are working, the product is all about data access, right? And how you manage that. And then you have like a number of companies
Starting point is 00:28:23 that they are doing quality, and even more niche things than just quality, right? Like just tracking schema changes, right? Totally. So this creates a very fragmented kind of landscape with all the tools that there are out there. Do you think that this can work? Or it's like pretty much a necessity
Starting point is 00:28:42 in order to realize the real value of data governance, have just one platform that does all that stuff. I honestly think that we're on a bubble of all these different data tools. And I have to believe that there will be consolidation in the future, which I think is what you're hinting at. You know, I think you're already starting to see it with like, I think Twilio bought Segment or maybe it was the other way around.
Starting point is 00:29:06 I'm not sure. But it's Twilio. Yeah. Yeah. So, you know, that was a big, not shocking, but, you know, I thought Segment was a huge player in the space and you see them consolidate. You know, I worked at a DevOps company and they have a very similar, they have similar problems when it comes to tool chains that data does.
Starting point is 00:29:24 Like, there's different data pieces that you can do. And, you know, for DevOps, you can have, you know, five different tools just for testing, right? And so as there has been consolidation in the DevOps space, where, like, you know, Google and Microsoft start buying up these little pieces, I think it's going to happen with data. Again, if we want to map, like, the journey from BI to where we are now, like think about the huge BI vendors that have got acquired. Right. I think about Looker. I think they went to Google. Right. Yeah. And there, there've been some other, so I totally think that in the future, our conversation, like in the next five to 10 years, I don't think that we're going to be talking about a bunch of different vendors. I think we'll be talking about one or
Starting point is 00:30:09 there will be a solution that emerges and i've already seen this because i i like to work with early stage uh startups and around data you'll you'll find a a tool almost like zapier right that can almost white label all these services and put them in one place so that you're kind of working off the snowflake engine, the five Tran engine, but you're working in X tool, right. To bring it all together. I'm not sure which one's going to happen first, but it's either going to be consolidation or it's going to be some type of white labeling because there's no way that people are going to want to,
Starting point is 00:30:44 you know, switch from, from thing that people are going to want to, you know, switch from thing to thing as they're trying to go about their day, you know? Yeah. And you kind of see it broken out by business discipline because you have some companies in the space to Costas' point that are focusing on sort of like sales ops and some are sort of like marketing ops and governance there. I think, Dave, have you heard of a company called Great Expectations? No, I don't think so.
Starting point is 00:31:07 They're kind of an interesting, and our listeners, if you haven't checked them out, it's just kind of an interesting, I think it gets at some of the things you're talking about where, I mean, they're an early stage startup as well. And so they're in their own way taking a slice of the pie, but they kind of have an interesting framework for thinking about data governance and sort of managing it at the pipeline level, which is really interesting. So definitely give them a look. Definitely. No, no. And honestly, like maybe we're far away from the consolidation because it feels like I'm learning about new tools all the time. I know Presto has started to emerge, you know, to solve for big data issues. I've been speaking to
Starting point is 00:31:46 Monte Carlo because I think that's just a really interesting space around data observability, right? And it makes a lot of sense. You have all these data tools now, what if one of them fails? Would you even know? Like, are you even doing data quality checks across the whole tool chain to make sure that there's some, you know, some form of validity to everything? So yeah, to your point, I think that there's new emerging ones all the time. I just can't imagine that people will want to continue to buy more subscriptions. Someone's got to come along and consolidate for the good of the market. Yeah, it's going to be interesting. I was thinking that what's interesting with the data space is that the acquisitions actually started from the BI tools,
Starting point is 00:32:28 which probably makes sense because they are like the most mature ones. But if you think about it, it's crazy that even publicly traded companies like Tableau got acquired. Tableau got acquired by Salesforce, right? But outside of this, we haven't seen anything major happening.
Starting point is 00:32:42 And I think it's probably, okay, we have the Twilio Segment acquisition, which was pretty big, right? I think it was like 3.2 billion. But the market is in the right conditions for acquisitions. There's a lot of liquidity. There's a lot of cash. Stocks are like pretty high.
Starting point is 00:32:56 So I don't know. I really want to see what Snowflake is going to do. I don't think they have acquired anything so far. So I think we should pay, like keep our eyes on them. Oh, definitely. I would peg Snowflake as one of the consolidators. I mean, if you think about it,
Starting point is 00:33:14 it would be great to get Snowflake to acquire something like Fivetran and something like a RudderStack, a Census, a Hightouch. Because then, right, basically you have a way into your cloud data warehouse, you have the cloud data warehouse, and you have a way out of the cloud data warehouse, right? So in that way, I basically have everything I need.
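The "way in, warehouse, way out" shape described here can be sketched in a few lines. This is only an illustration under stated assumptions: Python's built-in sqlite3 stands in for a cloud data warehouse, and the table names and function names are hypothetical, not any vendor's API.

```python
import sqlite3

def load_events(conn, events):
    """Way in: land raw source records in the warehouse (the ingestion role)."""
    conn.execute("CREATE TABLE IF NOT EXISTS raw_events (user_id TEXT, amount REAL)")
    conn.executemany("INSERT INTO raw_events VALUES (?, ?)", events)

def build_model(conn):
    """Warehouse: derive a per-user summary table with plain SQL."""
    conn.execute("DROP TABLE IF EXISTS user_totals")
    conn.execute(
        "CREATE TABLE user_totals AS "
        "SELECT user_id, SUM(amount) AS total FROM raw_events GROUP BY user_id"
    )

def export_totals(conn):
    """Way out: read modeled rows so they could be pushed to a downstream tool."""
    rows = conn.execute("SELECT user_id, total FROM user_totals ORDER BY user_id")
    return [{"user_id": user_id, "total": total} for user_id, total in rows]

# sqlite3 in memory is a stand-in for the warehouse (assumption for illustration).
conn = sqlite3.connect(":memory:")
load_events(conn, [("a", 5.0), ("b", 2.5), ("a", 1.5)])
build_model(conn)
payload = export_totals(conn)
```

In a real stack each function would be a separate product (an ingestion tool, the warehouse itself, and a tool that syncs modeled data back out), which is the consolidation argument being made.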
Starting point is 00:33:36 Obviously, there's other bells and whistles that I could add to that. But I mean, you know, I could kind of plug and go and have a data platform with one vendor, you know, so sure. So yeah, it'll be very interesting to see what happens. Which makes total sense because a lot of the, especially in sort of the SMB mid market are already using all those tools, right? I mean, it's just consolidating it into one sort of one, one system. Okay. So speaking of data warehouses, and this is actually Costas for you and Dave. So Costas wrote an article recently about sort of Snowflake versus Databricks and sort of the
Starting point is 00:34:12 impending collision there, which is really interesting. We'll put it in the show notes for everyone to read. It's really an excellent piece, but Dave would love your opinion and Costas jump in here as well, because you've studied this pretty deeply. You have sort of the warehouse side, which is Snowflake, and then you have the data lake side, which is Databricks. And then you have this new emerging category, which is being called data lakehouse. So would love your thoughts on what are we going to see happen there in the next five to 10 years related to sort of all the things we've talked about?
Starting point is 00:34:43 Yeah. I'd love if Costas went first, so I could kind of copy his answer because I have thoughts on it. But if we have a subject matter expert, it'd be great for you to get us going. Yeah, I don't know if I'm an expert, but I'm very fascinated, especially from the product side of things with that stuff. And that was the whole idea of like the article and what I tried to communicate that we are actually converging into one data platform at the end. Now, how is this going to be named? Is it going to be
Starting point is 00:35:10 named data cloud, a cloud data platform, or it's going to be a data lake or a lake house or whatever, that's something that product marketing will figure out. And it's not that important. But what is important is that, and that I think resonates very well with what Dave was saying also about data governance, is that we need to have like one experience and one platform working with data and unify many of the functions that we have under one platform. That's the opportunity for the market, but also that's what is needed if you want to really create this data economy and create an industry around data.
Starting point is 00:35:48 Right now, things are extremely fragmented. For a company to manage to have a data stack, there are just way too many vendors that have to be involved there. Even for pipelines, Eric, think about it. How many different vendors someone needs to have a complete data pipeline inside the company? It's probably at least three. So everything is going to be around one platform. And what I'm thinking is that,
Starting point is 00:36:14 and I think that's also the vision that Snowflake was trying to communicate through their S-1 filing, is that there's going to be a data platform. And on top of that, there are applications that are built. So BI becomes an application, right? The pipelines are something that are working around this platform and connect to this platform in and out.
Starting point is 00:36:36 And you can build like some very interesting things over that. Like for example, you can start having marketplaces around data. And when you do that, then you have network effects, right? And that's where it gets like really, really fascinating. I think we are just at the beginning, but I think also that the direction of where we're heading is becoming more clear. Totally. And I would agree with all that. And to just pick up on it from my perspective, I think that the platform that is most wide open, I'm familiar with Snowflake. I've used it, but, you know, it feels a little bit more kludgy to me or like click and drag and drop. I know there are SQL components to it as well. But, you know, I think it's very appealing to be able to leverage your developer language skills, right?
Starting point is 00:37:43 The number one thing that I hate, and which I hope does not happen, is that someone comes up with their own syntax to manage all of this, right? I really think that the success of any platform, whether it's Snowflake or Python, is to capitalize on standard components of the data industry, right? Because again, if you think about it, like if I'm in Power BI, I need to know like DAX and their language, right? In QlikView, they had their own. And so when it came to BI, like even Looker, you have to know LookML. It's all based off of SQL and Java based languages. But I mean, it's kind of a pain if you've invested five years or, you know, you went to the Flatiron School and you learned how to code Python and then all of a sudden you're in Tableau and you have to drag things onto shelves and figure out how to create a chart by clicking on a bunch of different buttons. Right.
Starting point is 00:38:36 So when I think of what has the most potential in the future, I mean, I love the notebook infrastructure, right? I'm a big fan of Jupyter notebooks. I love the Google Colab product. And I'm a big fan of Databricks that way because it's like a blank canvas. You're still guided, but I could be using Python in one cell. I could use SQL in another. It's super flexible. I can fork different pieces of code that I find on the internet into my notebook and make it all work together. The scheduling is a little bit more technical
Starting point is 00:39:09 and less clicky. So when I think about what's going to emerge, I think it's going to be the platform that takes advantage of the popular skill sets in data and doesn't make people relearn things or learn, like, a specific way of doing things. Hopefully that makes sense. Yeah, it does. And I totally agree with what you are saying about platforms and all that stuff. I think products need to be built with the assumption that they are going to really fast become part of the workflow that the developer has, right? And not create more friction or more, let's say, mental overhead to the engineer to learn something new, right? Which, by the way, is probably something that will exist only as long as the company exists. So yeah, I totally agree with that. And I
Starting point is 00:40:04 think we will see this paradigm of the past where companies were building their own languages, like Splunk, for example, right? Like you have to use their own query language to do that. And you have people who specialize only on that, like that's what they have on their CV. I think we are going to see that less and less in the future. And it's going to be much more risky for companies to do that and try like to build a business around that, unless they do execute very, very well, dbt, for example. But what I think is very smart that dbt did is that it builds on top of an existing language, which is SQL. And they just
Starting point is 00:40:38 added enough, let's say, special sauce there from their engineering to make it easier to work with and do things that we couldn't do in the past because, okay, SQL also had a lot of issues and the ergonomics of the language were very problematic. I mean, that's amazing what they did with that. But yeah, I totally agree. I think that Python, R, Jupyter notebooks, every product in the data space needs to at least interoperate with these tools. Definitely. Yeah. And again, to your point on like, what does this become? Does it become the Delta Lake, the lake house, the, you know, the data? I've heard, remember data marts and data stores from, you know, BI times.
Starting point is 00:41:20 I mean, the architecture of this, I think is really up for grabs. And I think that's the part that needs to be bespoke, right? Because I've worked at places that have big data problems, right? I'm at a place like that now at FanDuel, right? I mean, data is the product and there's just voluminous volumes of data coming in every second, right? So there's a whole, you know, the whole data streaming thing is appropriate here, you know, data lakes talking about that's appropriate here. And so using tools like Databricks that solve for big data problems is really apropos, right? But, you know, I've used, I do a lot of consulting gigs. I've also worked at smaller startups and they don't have those
Starting point is 00:42:02 problems, right? I was at a startup where their biggest data set was the 50,000 accounts that they had in Salesforce. You know what I mean? So there wasn't necessarily a big data problem there, but I should still be able to use something like Databricks to solve for all the problems that I have at a small company that might not need a data lake. They might not need like this robust cloud data warehouse, but I can still use that tool in order to facilitate a solution, right? It won't feel like I'm using a rocket launcher to solve for, you know, something that I could with a hammer. So that's where like that flexibility piece comes in. I do believe like the architecture piece is
Starting point is 00:42:43 going to continue to be bespoke per industry, per company, per vertical, right? Because SMB companies in software development are going to have much different data needs than, you know, like a restaurant company that might be, you know, nationwide or global. And that's another trend that I see emerging. And that's why I think these data tools are so important is I think small businesses haven't even really taken full advantage of their data because they see that like, oh, well, that's a that's like a corporate that's an enterprise problem. Right. who knows SQL or Python could jump into, you know, you'd be able to solve, you know, for problems of like a local gym or like a local bar so that they can manage their data and all their data assets in their business the same way that, you know, Google or a fortune 1000 company would. So, so yeah, for me, like that question of what does the architecture look like in the future? Is it lakes? Is it warehouses? What is it?
Starting point is 00:43:46 I don't think that'll ever be standardized, but I think that like the tools that we have should be able to build a variety of those solutions. Yeah. You know, Dave, it's really interesting to think about dbt, and I know I'm not the first person to have this thought, but I think an interesting point to make for our conversation is dbt has spanned individual user to enterprise and retains its ability to add value. And that in and of itself is extremely rare to be able to serve successfully as a company or tool, to be able to serve an individual user and the enterprise, especially as you grow, because the natural need of any business is to focus on the users that it serves best, right? And so it's almost impossible to serve an individual user and an enterprise simultaneously. I mean, you have to make all sorts of choices around product features and roadmap and
Starting point is 00:44:42 marketing and all that sort of stuff. So really interesting thought there, but I agree. I think it's going to, you know, tools that sort of help democratize that and span size of business are going to be a huge part. One thing I want to do, I know we're getting close to time here. I cannot believe we're getting close to time. It feels like we just started talking. We'll have to have you back on because I have so many more questions. I'd love to ask you about FanDuel a little bit. And my question is pretty tactical, but I think it'd be interesting for our audience. You have all sorts of types of data at FanDuel. And it looks like, I mean, I'm not an expert, but it looks like you have to ingest a ton of data and statistics across a huge variety of disciplines. And so I'm interested to know, you know, even if we just think about sports like fantasy football, for example,
Starting point is 00:45:34 where and how do you ingest all of the statistics that you need in order to sort of run daily fantasy programs in the app? I mean, that seems like a major sort of data engineering pipeline challenge. Yeah. And you know, what's great about my job now and working at a big company is that I'm obfuscated from a lot of those decisions, right? I, you know, in other roles where I used to be involved in building the pipeline, building and managing the database and also producing the insights, you know, in this role, I'm really fortunate to be able to focus on delivering the insights. And so we still have to build like mini pipelines, because to your point, I still have like this
Starting point is 00:46:16 massive data lake, let's call it. And that's how I see a data lake is like all of the possible information that you could use for analysis from a standpoint. And we have to build mini pipelines so that we have dependable views that are slices of time for performance reasons and just for feasibility. But I wish I could tell you how all of the information gets into there. And to be honest with you, as the company goes through acquisitions, as the company transforms, think about it. I mean, this company FanDuel has started to work in an industry that just became legal when you're talking about sports betting, right? Like you look back five, 10 years ago, like there wasn't
Starting point is 00:46:59 legalized sports betting outside of like Las Vegas and maybe Atlantic City. I don't even know if it was in Atlantic City at that point, but to that point, it's still a little bit of the Wild West. I mean, it's a mystery to me. And that's probably because it's something that the company is solving for on a daily basis. So I wish I could answer that question, but I can confirm that there is information coming from mobile applications, from reference data that we're grabbing from databases. And again, it's a multi-product company. So you're not just talking about one application for daily fantasy, but you're talking about sportsbook, racing, poker, like there's a myriad of them. But again, I think that process of ingesting, architecting, and delivering, those are still the core tenets.
Starting point is 00:47:49 Like the things that we're doing at our group level, you know, the little mini pipelines that we're building, the little databases that we're building, and the views that we're materializing, it's the same approach that the company is using. But, you know, I wish I could answer all of it. Yeah. We'll have someone from maybe the data engineering team, if you'd make an intro for us, just because I think it'd be so interesting to hear. Anytime we talk with companies who are ingesting significant amounts of outside data and combining it with internal data, those are always fascinating pipeline conversations. Well, to close us out, could you just tell us maybe since you are sort of delivering data products and less on the pipeline side, could you just tell us about maybe one of the data
Starting point is 00:48:34 products you're delivering at FanDuel right now? Sure. We do a bunch of regulatory reporting, which again, doesn't sound very interesting, but regulators are, they're mostly accountants by trade. So they're people who know numbers and they hold us very accountable. Let's just say that, right? Because what's at stake at the end of the day is tax money that fuels everything in their state, right? So building regulatory reporting, it's not as exciting. There's not as many colors and graphs and charts and dashboards as I've been used to in my career. The focus is really more on data timeliness, let's call it, right? Sure.
Starting point is 00:49:14 All about accuracy and setting up those checks. And it's also about delivering information in very interesting ways, right? Like pushing a CSV file to an SFTP server that people could pick up. Now, in my past, I really haven't done a lot of that, right? Because I'm delivering dashboards to people. I'm delivering people analysis and prediction and things like that. Very rarely am I trying to figure out how I get a 6 million row file scheduled on a daily basis to drop into an SFTP server across multiple states, you know, and regulators on a daily basis, right? So those are really interesting challenges that we're solving for. And that's where like having a Swiss army knife tool like Databricks is super, super helpful because anything that I find online on Stack Overflow about, you know,
Starting point is 00:50:06 building those types of pipelines, I can repurpose immediately and then start deploying in our organization. So yeah, it's solving those less sexy. It sometimes feels like archaic types of information, but it's all about knowing who your audience is, right? And regulators and things like that and accountants, they want the line level data, right? There's no ifs, ands, or buts. You can't give them a fancy chart. You can't give them summary information. And on top of it, it has to be accurate
Starting point is 00:50:38 or else you're going to be spending more time doing reconciliations than you will delivering the product that you signed up for, right? So that whole regulatory reporting pipeline has been really interesting for me. And it feels a little unnatural about like what I'm doing, but, but, you know, everything changes when your audience changes. Yeah. I love it. I think it's, it's really fun for me and I hope our audience as well to hear about a different kind of data product on the regulatory side, because the requirements are very different than sort of maybe summary data
Starting point is 00:51:10 around usage, you know, where margin of error is acceptable on some level, because, you know, customer data is a little bit messy and, you know, there are sort of outliers and other things like that. But line level data on regulatory that is critical to your business continuing to function is a very different type of product to deliver. So super interesting to hear about that. So I've very much made comparisons to healthcare, right? It's like, you know, messing up insurance claims is really affecting people's lives. Like same thing here. It might seem trite, but this tax money is really important to these states, especially now with the state of the world. So there's a lot more riding on it.
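The accuracy checks described for line-level regulatory files can be illustrated with a small sketch. It assumes hypothetical wager rows and field names, writes them to CSV without aggregation, then re-reads the file and compares row count and a control total against the source before anything would be delivered. The SFTP upload itself is left out, since it would need a third-party client and credentials.

```python
import csv
import hashlib
import io
from decimal import Decimal

def write_report(rows, fieldnames):
    """Serialize line-level rows to CSV text (no aggregation)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def reconcile(csv_text, source_rows, amount_field):
    """Re-read the file and compare row count and control total to the source."""
    parsed = list(csv.DictReader(io.StringIO(csv_text)))
    # Decimal avoids float rounding drift when summing money amounts.
    file_total = sum(Decimal(r[amount_field]) for r in parsed)
    source_total = sum(Decimal(str(r[amount_field])) for r in source_rows)
    return {
        "rows_match": len(parsed) == len(source_rows),
        "totals_match": file_total == source_total,
        # A checksum lets the recipient confirm the file was not altered in transit.
        "sha256": hashlib.sha256(csv_text.encode("utf-8")).hexdigest(),
    }

# Hypothetical sample rows; real regulatory layouts vary by state.
wagers = [
    {"wager_id": "w1", "state": "NJ", "amount": "10.50"},
    {"wager_id": "w2", "state": "PA", "amount": "4.25"},
]
report = write_report(wagers, ["wager_id", "state", "amount"])
checks = reconcile(report, wagers, "amount")
```

A scheduled job could refuse to deliver the file unless both `rows_match` and `totals_match` are true, which is one way to avoid the reconciliation work mentioned above.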
Starting point is 00:51:53 It's a lot less directionally accurate. That's a word that I've used throughout my career to save my butt is directional accuracy. And I can't use that word anymore. Sure. Very cool. Well, Dave, this has been a really wonderful show. We'd love to have you back on. We'd love to get someone from the data engineering team to hear about your pipeline. So we'll be in touch. And thank you again for your time. Oh, thank you guys. This has been wonderful. And I really appreciate you having me. Well, that was a fascinating conversation. I didn't get my question answered, but maybe we'll get someone from data engineering on the show. But I love it when we have a show where we get on a topic that everyone's passionate about and has opinions about,
Starting point is 00:52:34 and we can really dig in on it. I think one of the interesting things to me from the conversation was the comment around data portability. So there's all sorts of terminologies, so data mesh and connected stack and all these different things. And the concept of data portability, I think, is a really, really helpful way to think about where things are headed, at least as far as we can see now. So that was what stuck out to me. Yeah. I mean, I think you're not alone in this, Eric. I also didn't manage to ask that many questions about fantasy games, but it doesn't matter. I think I really enjoyed the conversation. We had a lot to chat with Dave about data platforms
Starting point is 00:53:12 and what the future will look like. So that was super interesting for me. And I really liked his opinion and his view on all these things around how we are going to be using data in the future. It's funny also to interact with people who really understand the products and they don't agree with the marketing terms that we come up with while we try to market new products. So this whole thing about reverse ETL and what's the right name of it, I think he put
Starting point is 00:53:42 it very well with the term data portability. And yeah, I'm really looking forward to chatting with him again. And hopefully next time I'll manage to ask my questions around fantasy gaming. Yes, we can have it. That'd be actually fun to have an episode where we cover topics we don't know about.
Starting point is 00:53:59 All right. Well, thank you again for joining us. Subscribe on your favorite podcast app. You'll get notified of new episodes every week and we'll catch you next time. The Data Stack Show is brought to you by Rudderstack, the complete customer data pipeline solution. Learn more at rudderstack.com.
