The Data Stack Show - 37: The Components of Data Governance with Dave Melillo of FanDuel
Episode Date: May 26, 2021
Highlights from this week's episode include:
Dave's "nerdy" interests in sports statistics and data (2:12)
Trends in collecting, processing, and using data (4:45)
Finding a better term for "reverse ETL" ... (5:48)
The blurring of the distinction between sources and destinations (7:41)
The role of BI is changing (13:24)
Data governance and the physical execution behind it (19:00)
Data governance is defining and managing data in a logical way that is actionable by the business (23:43)
Consolidation of tools and services (28:49)
Databricks vs. Snowflake (33:49)
Dave's focus on regulatory data at FanDuel (45:47)
The Data Stack Show is a weekly podcast powered by RudderStack. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
The Data Stack Show is brought to you by RudderStack, the complete customer data pipeline solution.
Thanks for joining the show today.
Welcome back to the Data Stack Show.
Very interesting guest, Dave from FanDuel.
FanDuel is sort of a fantasy sports and sports betting
suite of apps. So it's going to be really interesting to talk with Dave about that.
I know they've been doing fantasy sports for a while,
but I think the betting aspect is fairly new from a regulatory standpoint. So perhaps we'll
get to hear a little bit about that. But Dave has a really varied background and has worked in all sorts of contexts with data. So I think one of my burning questions, which is pretty tactical,
is when you think about fantasy sports, you have to ingest data from a ton of different places.
I mean, you're talking about statistics, you have games across multiple sports happening on a daily
basis. And so when you think about all of this required to run a sort of consumer mobile app like that, where people are interacting every single day with data that needs to come from third parties, I always look at that and say, man, that's an interesting pipeline problem.
So I want to ask about that. Kostas, what's on your mind?
Yeah, first of all, I'll probably want to learn a little bit more about fantasy games, to be honest. Like, I don't know much about it. But outside of this, I mean, Dave has, like, a very diverse background. He has worked with data science,
data engineering, he has even done work in data architecture. So I want to learn from him about
his experience with all the different fields around data and also pick his brain on what's coming in the future in this space.
Great. Well, let's dive in.
Dave, welcome to the Data Stack Show. We're really excited to chat with you.
Thanks, Eric. I appreciate it. I'm really excited to be here.
So you have a varied history with data and we want to hear all about it. So why don't you give us the brief,
you know, sort of two minute overview of when you got started with data, the different companies
you've been at and what you're up to today? Totally. I think I'm going to Tarantino it
because you know, where I am today, it's kind of the apex of what I've been trying to do with data
my whole life. I'm currently working at FanDuel, which, for people who don't know, is a daily fantasy
sports betting company.
It's in the sports entertainment space.
When I first started studying data right back in high school and things like that, you know, what piqued my interest was sports statistics.
I've always been kind of a nerd that way.
I thought I was going to graduate college and be the statistician for the New York Yankees.
Unfortunately, that didn't happen. But what did happen is I was able to kind of parlay that interest in statistics and data and information
technology into roles at, you know, Fortune 500 companies. I worked at software startups,
kind of ran the gamut of different places that I worked at throughout my career. But everything has
revolved around data, right? The same things that I was doing at Fortune 500 companies, I was doing on the side,
consulting for small businesses in my area. And so that's everything from data engineering,
to data architecture, to data science, and all the fun stuff in the tip of the spear.
But yeah, it all finally came full circle to me landing closer to my passion here
at FanDuel where, you know, we solve everything with data. So. Very cool. And tell us a little
bit. I know you haven't been there too long, but what's your role? Do you have a team?
What kind of data projects are you working on at FanDuel? Yeah, FanDuel, I'm on the operational
side of the business. So it's a lot of back office support, compliance support, you know, regulatory support. So it might not sound like the sexiest of roles, but it's really cool because it's at the hub of everything that FanDuel does with data.
So I get exposed to a lot of different pieces of data, not just gameplay stuff. And it's really, really interesting. And there's a team and it's growing exponentially along with the market. So it's a really fun and exciting place
to be right now. Very cool. Well, I have tons of technical questions. I know Kostas does too, but
one thing I would be interested to hear your perspective on is sort of major trends in the
data space. And I'll name one specifically to maybe direct
the question a little bit more, but we have this concept of a data mesh that seems to be
becoming more popular. What kind of trends are you seeing in terms of the way that companies
are sort of organizing themselves around sort of collecting, processing, and actually using data?
Yeah, that's a great question.
So the trend that I've seen most strongly over the past few months, maybe six, maybe even a year, right, is that people have doubled down on technology like Snowflake, right?
And like cloud data warehouses have become commonplace.
And that requires a significant amount of investment from a company's perspective, right?
To just spin up and migrate to Snowflake or Redshift or BigQuery or anything like that is
no easy job, right? And it's no cheap job either. So as over the past, probably five, 10 years,
companies have done that. They've started to understand or have like a revelation that
just because everything's in that cloud data warehouse
doesn't mean that the business is exposed to it.
So that leads to this whole trend of reverse ETL that has started to emerge.
I don't really like the word reverse ETL.
I feel like that's very much like a sales and marketing term.
I really like-
Oh, I'm so glad you said that because, okay, let's talk about this.
I was going to ask you why you don't like it.
Tell us what you would call it.
I call it data portability.
And that's how I've always advertised it internally to my stakeholders and people I'm working
on with projects because it's about making sure that data is portable no matter where
the analysis or the data is generated, right?
I think thinking about it as reverse ETL makes sense because you can marry it to something that people are familiar with,
but I'm not really sure if the concepts of ETL are actually what this is doing. So portability
is the word that I use, but it's, you know, it's really all about getting data in front of people
where they're working, right? On a regular basis. Because just as IT organizations have doubled down
and gotten things like Snowflake and Fivetran and this whole chain of tools, you know, go to market
functions and the business have also done the same thing. You know, I'm sure that you guys can
sympathize with this, but at any company that I go to, they have like a million different SaaS
applications, you know, one for customer
success, one for sales, one for marketing, et cetera, et cetera. And, you know, asking people
in the 21st century to be swivel chairing between like a BI dashboard and Salesforce
and some spreadsheets and things like that is a little bit archaic. So, you know, that's where I think this whole thing with data portability and this trend,
that's what people are trying to solve.
They understand that, hey, just because I have analysis in a dashboard
or in a data warehouse really doesn't mean anything to the people
who are actually using this data and making it actionable.
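To make the data portability idea concrete, here is a minimal sketch of what a reverse ETL sync might look like. It is an illustration under stated assumptions, not how any particular vendor does it: an in-memory SQLite database stands in for the cloud data warehouse, and `FakeCrm`, the `account_health` table, and the `health_score` field are all hypothetical names invented for the example.

```python
import sqlite3


def fetch_account_scores(conn):
    """Read the analysis result that already lives in the warehouse."""
    rows = conn.execute(
        "SELECT account_id, health_score FROM account_health ORDER BY account_id"
    ).fetchall()
    return [{"account_id": a, "health_score": s} for a, s in rows]


class FakeCrm:
    """Hypothetical stand-in for a SaaS tool's API client (not a real library)."""

    def __init__(self):
        self.records = {}

    def upsert(self, account_id, fields):
        # Create the record if it is missing, then merge in the new fields.
        self.records.setdefault(account_id, {}).update(fields)


def sync_to_crm(conn, crm):
    """Push each warehouse row into the tool where people already work."""
    synced = 0
    for row in fetch_account_scores(conn):
        crm.upsert(row["account_id"], {"health_score": row["health_score"]})
        synced += 1
    return synced


def demo():
    # SQLite is only a stand-in for a warehouse such as Snowflake or BigQuery.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE account_health (account_id TEXT, health_score REAL)")
    conn.executemany(
        "INSERT INTO account_health VALUES (?, ?)",
        [("acme", 0.92), ("globex", 0.41)],
    )
    crm = FakeCrm()
    count = sync_to_crm(conn, crm)
    return count, crm.records
```

Calling `demo()` loads two rows into the stand-in warehouse and copies both into the fake CRM, so the score appears on the record a salesperson already looks at rather than only in a dashboard.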
Sure.
Okay, so I'm going to give you a really brief sort of three-stage history of where we've been with data, maybe even the last five years.
And Kostas, I want your opinion on this too.
This is me.
I'm going to go off the cuff here, but you kind of have the introduction of the, this is dangerous.
This is really dangerous.
It's very exciting, I think.
Go, go.
Do it, do it.
All right.
So you have the introduction of the data warehouse,
right? So Redshift was sort of the first major player there. And then on the heels of that
came Snowflake. And of course, BigQuery is a major player there too. And this allows you to
collect all of your data in one place and sort of achieve analysis that before was much more
difficult. And then that sort of created the challenge of the second phase, and tools like
Fivetran and all that solved it, which was, okay, now I can collect all my data, but actually doing
that is kind of hard. And so I need much, much easier ways to get all these pipelines to talk
to each other and sort of integrate my stack, whether it's sort of sources to the warehouse
and then also sort of sources to like SaaS tools. And that was phase two, right? So you saw the Segments and the Fivetrans
and all the pipeline tools come of age over the last five years. And many are sort of mature now.
And I think the third phase, and this is probably where there's sort of some prediction coming in
is, and I love the term data portability, is where every source is also becoming a destination.
And so this paradigm of sort of linear collect, store, transform process and deliver is actually
becoming almost bi-directional in a way where the distinction between sources and destinations is
starting to blur. How did I do? Was that accurate? Oh yeah, I think you nailed it.
And I mean, as you were talking, you know, one of the things that started to percolate in my mind
is also this whole movement around kind of view materialization. I know DBT has come on really
strong as of recently. And again, I think, you know, maybe even like the next phase of all this
and what's going on in the future is all around data governance.
Right. And maybe that's the data mesh piece that you talked about at the beginning.
It's like, OK, I have all these sources. I have all these tools. How can I observe them?
Make sure that they're available. How can I make sure that people know what the single sources of truth are?
How can I easily create these single sources of truth from large data sets and kind of make that available to the rest of the organization?
So, yeah, I think you did a great job.
And I think the future, to your point, is kind of a little bit like the Wild West, right?
Because all the big boulders have been solved, but people still experience pain.
So, you know, I think you see different vendors kind of attacking the future from different angles, you know.
Dave, I have a question and I'd like to hear from your experience.
So about reverse ETL, right?
I mean, it's a new term, as you said, let's say this portability, data portability, let's call it like this.
How was it done before?
I mean, before this market existed. And why did the market decide now
to go after this problem? Yeah, again, I always say this. When I started this conversation about
data portability and why it's emerging, I believe it's been because things like the cloud data
warehouse have become very accessible, right? It's not hard. Like usually in the past, there'd be a
lot of configuration,
a lot of customization, a lot of integration, but now you can white label everything, right?
You just subscribe to Snowflake. Look at that. You have a cloud data warehouse. Same thing for building data pipelines. In the past, you'd have to know Airflow. You'd have to get familiar with
DAGs and you'd have to build it all yourself. Now you just subscribe to Stitch for a monthly fee
and you can get all of your data into your data warehouse.
But now people understand, they're like, wow,
so we've doubled down on this minimum viable data stack,
but no one cares, right?
Like my salespeople don't care
that I have a cloud data warehouse
because they're still consuming content
through BI dashboards or through things that we send to Salesforce.
So, you know, it's completing that circuit.
And that's really why I believe these portability tools or even things like DBT have really become needed
because it's that bridge between all of that technical debt that you built up with this minimum viable data stack
and actually
making it actionable. And yeah, I don't know if that answers your question 100%, but I do believe
that you wouldn't have one without the other, right? If these cloud data warehouses, if these
pipeline tools didn't exist, I don't think things in the data portability or the DBT space would be emerging as well.
Yeah, absolutely.
To be honest, I think and I believe actually that the real enabler here is the cloud data warehouse.
I think the rest pretty much emerges because we have access to cheap storage and processing
on the cloud, something that in the past we didn't. I mean, and that's what makes things easy
from one side, but also complicates things.
Like the cloud makes things cheaper
and more accessible, but at the same time,
it complicates things by introducing
many silos there, right?
Like all the different SaaS applications
that we have and suddenly we also have
to pull data from there.
And it's not just database systems in the same data center
where we control everything as it was in the past.
Because of course, like ETL, it's not something new.
It exists pretty much since we have database systems.
So yeah, I would say that I totally agree with you.
And I would probably emphasize it a little bit more
like the importance of cloud data warehouses for that.
So you mentioned BI.
And I mean, traditionally, data warehousing was the technology that was supporting BI.
Do you see the role of BI changing inside the organization?
Do you see it like going away or you see new roles outside of the BI analyst emerging?
A hundred percent. And that's, you know, now I'm remembering what your last question was, right? Like, how did we solve for
this before all of these great tools? And BI was the answer, right? I remember when I started my
career probably closer to 10, 15 years ago, BI was the thing, right? BI was solving all these
complex problems that you couldn't with spreadsheets, right? So something like QlikView and Tableau, like they were dominating the space because they made it so much easier to answer the questions that you had that you were trying to solve with spreadsheets and kind of first gen technology back then. So in that way, I totally think that BI now is changing, right?
Because you don't have to do the end to end process with BI anymore. And if you are still
doing that, and you're basically using something like Power BI or Tableau as like your data
platform, I think that you're way behind the curve because you just can't process things as quickly,
you can't anticipate as quickly, it's not scalable. Right. And so now, yeah, it's very interesting. I don't think BI is going away,
but it's just not the one-stop shop anymore. I think it's one of many tools that analysts will
have to learn. And in that way, to your point, Kostas, I think the definition of an analyst
is changing. You know, it used to be that you just had to be good
with visualizations and creating some charts. I think like the scope of an analyst is increasing
now, right? Like I think analysts nowadays have to be comfortable jumping into like a Jupyter
notebook or a Databricks notebook, right? Because there you can do some ETL, you could do some
transformation and, you know, set up for visualization
later down the line where I don't think it was like that before.
So I totally think that there still is a role for BI.
I just don't think it's going to be as critical or as pivotal as it was in the past.
Yeah, I totally agree.
I think that the role of BI is transformed. Obviously, it's not going away
because reporting is always going to be
like the foundation of whatever we're doing, right?
Like we need to understand the past
in order to act in the future and in the present.
So I don't think that like BI is going anywhere.
It's just that instead of like the BI analysts,
we will have a little bit of different roles
where BI is going to be just
part of the tool set that you are using, as you put it very well earlier. And talking about roles,
there's this very interesting new category or new role, let's say, that's heavily promoted by
DBT: the analytics engineer, right? What do you think about this? Like what's your definition?
What does it mean for
someone to be an analytics engineer? What is this thing? Yeah, it's funny. Like, I don't know, what does
it mean, right? All these data roles have been malleable from day one, right?
Because when I came in, what an analyst was is not what an analyst is today. And I like this idea of
an analytics engineer. I mean, what that means to me is someone who's doing the more technical work behind analytics,
right?
Because people say, okay, analytics, you get a data set, you chop it up, you pivot it in
Excel, and you're an analyst.
It's like, well, that world has increased in scope, right?
In breadth, very much so.
So I think of an analytics engineer
as doing those things like view materialization, even like some data governance and maybe more of
what would be thought of as data engineering, but not like a very technical data engineer.
So in that breadth, right? I think that data engineers are becoming more and more and more
like developers, right? They are definitely shifting over to more of a developer persona, developer day-to-day, developer tool stacks. So an analytics engineer is technical enough to be in the conversation with the developers, but still analytical and business-minded enough to be able to match business requirements to what needs to be done on the
back end to set up the business to analyze, right? So again, when I think of things that analytics
engineers are doing, it's, you know, view materialization, data governance and indexing,
building data catalogs, even building maybe some observability
and monitoring pieces of the stack, which, you know, that's another piece that's emerging. So,
so yeah, I don't know. I'm probably not the person to define what the analytics engineer is,
but that would be my best guess if I had to take it. Dave, we brought up data governance a couple
of times here and I'm, I'm really interested in, so
in many ways, like the, and this is, you know, unfortunately a lot of times the case where
the marketing kind of leads too early with the future vision that companies can achieve.
And then, you know, you sort of like when the, when the data warehouses came out, you
know, sort of, or, or in the early days when they were becoming really popular, you know,
you had this whole thing of like, now you can get a 360 view of the customer. It's
like, well, in reality, you needed all these pipeline tools in order to make that feasible
for, you know, your average company. But to your point on data governance, and I think it's really
interesting in the data mesh concept, governance becomes a problem because now you have all these
different pipelines, maybe different vendors, you know, different internal builds, all that sort of stuff.
And so you can sort of move data more easily and centralize it more easily.
But now you're sending it to all these different places.
And so now it's sort of hard to do governance at a central level.
What are the ways that you see companies solving that?
I think that's a great question.
And I think, you know, Kostas also kind of touched on this, I should say. I really think ... Collibra, right? It's a really famous, like, data
governance data catalog. And all that it really was, was like a fancy spreadsheet of, you know,
data metric definitions and what they were, and it allowed people to collaborate. Right. So in that
way, I think it's bringing that concept to life and making it physical. So again, what does that really mean?
I keep coming back to view materialization, but like there is no data governance without some
type of physical execution behind it. So whether that means that you're going to roll out GitOps
so that everything in your GitHub repository aligns very much with all the metrics that you're
creating. I mean, this whole code-as-documentation idea, I think, is a piece of it as well, right?
Like your code should be your data governance assets.
When someone asks, like, what, you know, MAU, monthly active users, are, you shouldn't be
like pointing to a cell in a spreadsheet and words that define what a monthly active user
is.
Like you should be able to point
to, like, maybe a view and say, oh, well, here's our view of active users. And this is the SQL behind
it, or the Python, that builds this view. And it's pulling from these tables and it has
these columns, and these are, you know, the characteristics of each column and
the type. Like, that's the piece of data governance
that has been missing, I think, for probably a long time: that physical piece to say,
okay, yeah, you've defined it, right, and you're governing it from that aspect, but how are you
making it real, right? Yeah, and that kind of goes back to something that has been a recurring
theme on the show across
so many disciplines within data, whether it's data science, data engineering, data governance,
is that it's an organizational and sort of cultural question first.
And that is getting shared definitions around how you define the business.
And then I love the analogy you gave
of the physical manifestation of that. I think that's just a really helpful way to think about
that. And I agree with you there. I mean, DBT is a huge step forward in building some process
and tooling around that, but I still think we have yet to see all the different things that
are going to make it way easier to do that centrally within the context of sort of the
data mesh future, if we want to call it that. And you know where you hit the nail on the head is,
I think all of these tools are still a little bit too technical for business users, right? Like when
I think about DBT, when I think about any of the good, you know, tools that are making it easy to, you know, manifest this whole process, they're still very technical.
I think the first company or, you know, vendor who comes up with like a business way or a way to empower business users to participate in that process, I think that that'll be where the major
impact comes because that's what you're missing. At the end of the day, data people are data people.
And it's great that that's starting to happen because I feel like in the past, you were like
a marketing person that also knew how to work spreadsheets. So now you're the marketing data
person, right? And I think it's flipping now. People are understanding, like you wouldn't do that with HR, right? Like at a company, you wouldn't be like,
you're marketing and you're good with people. So you're going to be our HR person. But think about it:
that's the way data has been working for the better part of, you know, the 21st century.
Only recently have there been college graduates, you know, graduating with analytics degrees and a concentration in statistics that is specific to programming.
So it's like, I think as data people actually stake their claim and they are data people, you're going to need tools that bridge the gap between the data-minded person and the subject matter expert, you know?
Yep, totally. You're so right.
Before we move forward, and I have a feeling that this conversation is going to be a lot around data governance,
and for a good reason, because it's something
that's, like, very, very interesting.
So, Dave, can you give us, like, a bit of a definition
of what data governance is?
Yeah, I mean, I really think it's defining
and managing your data in a logical way that is actionable by the business. I think of data
governance as, for example, a lot of single source of truth projects, right? It could be as simple as
customer value. Well, how do you have a data governance program around customer value? It might seem really easy.
It's like, well, the number in Salesforce is our customer value, but where did that number in
Salesforce come from, right? So it's this whole data lineage that maps all the different data
sources to the metric that you want to create. And not only the data lineage and where that information
is coming from, but then what is the logic, right? Like, is a customer value based off of
a start and end date? Is it a monthly value? Is it an annual value? And, you know, for all of those
questions, how is the answer manifested? And that's where I think the documentation as code
or code as documentation really plays a part. So you have this data lineage piece that traces all of the information that you're using for the metric. You have the logical piece that is using code to define what these metrics are. And that's where, again, the physical piece is really stressed. It's like, okay, well, once we have the lineage right, once we have the logic down and committed
to code, how are we delivering this to stakeholders on a regular basis? Are we materializing views?
Are we using a reverse ETL tool to get it out of our data warehouse? Is there another process that
we're using? That's where I think there's many
solutions to the problem. But when I think of data governance, those three pieces of lineage,
logic, and delivery are kind of the main components for me.
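The three components named here (lineage, logic, and delivery) can be sketched in a few lines of code. This is a minimal illustration under stated assumptions rather than a description of FanDuel's setup or of any governance product: SQLite stands in for the warehouse, and the `events` table, the `MAU_LINEAGE` dictionary, and the calendar-month definition of "active" are all hypothetical choices made for the example.

```python
import sqlite3

# Lineage: which source tables and columns feed the metric (hypothetical names).
MAU_LINEAGE = {
    "metric": "monthly_active_users",
    "sources": {"events": ["user_id", "event_date"]},
}

# Logic: the metric definition committed as code instead of prose in a spreadsheet.
# Assumption: "active" means at least one event in the calendar month.
MAU_VIEW_SQL = """
CREATE VIEW monthly_active_users AS
SELECT substr(event_date, 1, 7) AS month,
       COUNT(DISTINCT user_id)  AS active_users
FROM events
GROUP BY substr(event_date, 1, 7)
"""


def build_view(conn):
    """Materialize the definition as a view in the warehouse."""
    conn.execute(MAU_VIEW_SQL)


def deliver(conn):
    """Delivery: return the governed metric for a dashboard or sync job to consume."""
    return conn.execute(
        "SELECT month, active_users FROM monthly_active_users ORDER BY month"
    ).fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_id TEXT, event_date TEXT)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)",
        [
            ("u1", "2021-04-03"),
            ("u1", "2021-04-20"),  # same user twice in April counts once
            ("u2", "2021-04-11"),
            ("u1", "2021-05-02"),
        ],
    )
    build_view(conn)
    return deliver(conn)
```

With the sample rows above, `demo()` returns `[("2021-04", 2), ("2021-05", 1)]`. When someone asks what a monthly active user is, the answer is the view definition itself, together with the lineage entry that says where its inputs come from.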
Makes sense. That's very interesting. And what are the tools that today we have to implement
data governance?
Yeah, like I said, I think that there's like some all-in-one tools.
I know Collibra is a really big player in this space.
Obviously, like there's some more legacy providers like Informatica. I know they have really robust MDM and data governance features.
You know, personally, I don't really think that there's, like, a cool data governance platform, right? Like an emerging one that kind of fits with this minimum viable data stack, because people are kind of managing data governance in a ... everything lives in this zone of our Snowflake data warehouse. And then when we clean it and we prepare information that's ready for consumption, it's in this other zone of
our data warehouse. Some people I think are solving with a tool like DBT, right? If it's
scheduled with DBT and then set it and forget it, then that's our data governance. And basically
anything in production is governed. Anything in dev is not
governed. But again, what that does is in a way it excludes the business user because unless the
business user can fork a GitHub repo, can read SQL, can understand all the different programming
languages and the transformations that are being done to that data, it's kind of hard. Like you need to be walked through that process. So like I said,
the first company that comes by and can map the technical pieces of data governance, the lineage,
the logic, and the delivery to things that the business people would understand and also be able
to contribute to,
like, I think that's where you're going to get lightning in a bottle.
Yeah, that's very interesting what you are saying,
because you are talking a lot about, like, let's say,
a governance platform, like a unifying kind of experience around governance,
which is what, let's say, Informatica was trying to do, right?
Or Collibra and in general, like all these more enterprise kind of companies that we have seen so far, like in this space.
IBM, I mean, all these companies had some kind
of like master data management platform.
But at the same time,
I think that the Silicon Valley way of doing things
is taking these platforms, right?
And decomposing them into meaningful parts
and building companies, pretty much, and products around that, right?
So we have like, now we see companies
like Immuta, for example, right?
Like they just raised like Series D, $90 million.
And they are working,
the product is all about data access, right?
And how you manage that.
And then you have like a number of companies
that they are doing quality,
and even more niche things than just quality, right?
Like just tracking schema changes, right?
Totally.
So this creates a very fragmented kind of landscape
with all the tools that there are out there.
Do you think that this can work?
Or is it pretty much a necessity,
in order to realize the real value of data governance,
to have just one platform that does all that stuff?
I honestly think that we're in a bubble of all these different data tools.
And I have to believe that there will be consolidation in the future,
which I think is what you're hinting at.
You know, I think you're already starting to see it with, like,
I think Twilio bought Segment, or
maybe it was the other way around.
I'm not sure.
But it's Twilio.
Yeah.
Yeah.
So, you know, that was a big, not shocking, but, you know, I thought Segment was a huge
player in the space and you see them consolidate.
You know, I worked at a DevOps company and they have very similar
problems when it comes to tool chains that data does.
Like, there's
different data pieces that you can do and, you know, for DevOps you can have, you know, five different
tools just for testing, right? And so, as there has been consolidation in the DevOps space, where, like,
you know, Google and Microsoft start buying up these little pieces, I think it's going to happen
with data. Again, if we want to map, like, the journey from BI to where we are now, like, think about the huge BI
vendors that have gotten acquired, right? I think about Looker. I think they went to Google, right?
Yeah. And there've been some others. So I totally think that in the future,
in our conversation, like in the next five to 10 years, I don't think that we're going to be talking about a bunch of different vendors. I think we'll be talking about one. Or
there will be a solution that emerges, and I've already seen this because I like to work with
early-stage startups around data. You'll find a tool, almost like Zapier, right,
that can almost white label all these services and put them in one place so
that you're kind of working off the Snowflake engine, the Fivetran engine,
but you're working in X tool, right? To bring it all together.
I'm not sure which one's going to happen first,
but it's either going to be consolidation or it's going to be some type of
white labeling because there's no way that people are going to want to,
you know,
switch from thing to thing as they're trying to go about their day, you know?
Yeah.
And you kind of see it broken out by business discipline because you have
some companies in the space, to Kostas's point, that are focusing on sort of
like sales ops and some are sort of like marketing ops and governance there.
I think, Dave, have you heard of a company called Great Expectations?
No, I don't think so.
They're kind of an interesting, and our listeners, if you haven't checked them out,
it's just kind of an interesting, I think it gets at some of the things you're talking about where,
I mean, they're an early stage startup as well. And so they're in their own way taking a slice
of the pie, but they kind of have an interesting framework for thinking about
data governance and sort of managing it at the pipeline level, which is really interesting. So
definitely give them a look. Definitely. No, no. And honestly, like maybe we're far away from the
consolidation because it feels like I'm learning about new tools all the time. I know Presto
has started to emerge, you know, to solve for big data issues. I've been speaking to
Monte Carlo because I think that's just a really interesting space around data observability,
right? And it makes a lot of sense. You have all these data tools now, what if one of them fails?
Would you even know? Like, are you even doing data quality checks across the whole tool chain to make
sure that there's some form of validity to everything?
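As a rough illustration of the kind of check being described here, the following is a minimal sketch, assuming two stages of a pipeline (a source and a warehouse) that can each be read as a list of rows. The function names, the `CheckResult` type, and the sample columns are all hypothetical and not taken from any tool mentioned in the conversation.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str


def check_row_counts(source_rows, warehouse_rows):
    """Flag rows that were dropped or duplicated between two stages."""
    passed = len(source_rows) == len(warehouse_rows)
    detail = f"source={len(source_rows)} warehouse={len(warehouse_rows)}"
    return CheckResult("row_count", passed, detail)


def check_not_null(rows, column):
    """Flag rows where a required column is missing or None."""
    missing = sum(1 for row in rows if row.get(column) is None)
    return CheckResult(f"not_null:{column}", missing == 0, f"{missing} missing")


def run_checks(source_rows, warehouse_rows, required_columns):
    """Run every check and return all results, passing or not."""
    results = [check_row_counts(source_rows, warehouse_rows)]
    results += [check_not_null(warehouse_rows, col) for col in required_columns]
    return results


if __name__ == "__main__":
    # Hypothetical sample data: one row is lost between source and warehouse.
    source = [{"id": 1, "amount": 10}, {"id": 2, "amount": 5}]
    warehouse = [{"id": 1, "amount": 10}]
    for result in run_checks(source, warehouse, ["id", "amount"]):
        status = "ok" if result.passed else "FAILED"
        print(f"{result.name}: {status} ({result.detail})")
```

Returning every result, rather than stopping at the first failure, is one possible design choice: it lets a single run report on the whole tool chain at once.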
So yeah, to your point, I think that there's new emerging ones all the time. I just can't imagine
that people will want to continue to buy more subscriptions. Someone's got to come along and
consolidate for the good of the market. Yeah, it's going to be interesting.
I was thinking that what's interesting with the data space is that the acquisitions actually started from the BI tools,
which probably makes sense
because they are like the most mature ones.
But if you think about it,
it's crazy that even publicly traded companies
like Tableau got acquired.
Tableau got acquired by Salesforce, right?
But outside of this,
we haven't seen anything major happening.
And I think it's probably,
okay, we have the Twilio Segment acquisition,
which was pretty big, right?
I think it was like 3.2 billion.
But the market is in the right conditions for acquisitions.
There's a lot of liquidity.
There's a lot of cash.
Stocks are like pretty high.
So I don't know.
I really want to see what Snowflake is going to do.
I don't think they have acquired anything so far.
So I think we should pay,
like keep our eyes on them.
Oh, definitely.
I would peg Snowflake as one of the consolidators.
I mean, if you think about it,
it would be great to get Snowflake
to acquire something like Fivetran
and something like a RudderStack,
a Census, a Hightouch.
Because then, right, basically you have a way into your cloud data warehouse,
you have the cloud data warehouse,
and you have a way out of the cloud data warehouse, right?
So in that way, I basically have everything I need.
Obviously, there's other bells and whistles that I could add to that.
But I mean, you know, I could kind of plug and go
and have a data platform with one vendor,
you know, so sure. So yeah, it'll be very interesting to see what happens.
Which makes total sense, because a lot of companies, especially in sort of the SMB and mid-market, are
already using all those tools, right? I mean, it's just consolidating it into sort of one
system. Okay. So speaking of data warehouses, and this is actually for you, Costas, and Dave. So
Costas wrote an article recently about sort of Snowflake versus Databricks and sort of the
impending collision there, which is really interesting. We'll put it in the show notes
for everyone to read. It's really an excellent piece, but Dave would love your opinion and
Costas jump in here as well, because you've studied this pretty deeply. You have sort of the warehouse side, which is Snowflake, and then you have the data lake
side, which is Databricks.
And then you have this new emerging category, which is being called the data lakehouse.
So would love your thoughts on what are we going to see?
What are we going to see happen there in the next five to 10 years related to sort of all
the, all the things we've talked about?
Yeah.
I'd love if Costas went first, so I could kind of copy his answer because I have thoughts
on it.
But if we have a subject matter expert, it'd be great for you to get us going.
Yeah, I don't know if I'm an expert, but I'm very fascinated, especially from the product
side of things with that stuff.
And that was the whole idea of like the article and what I tried to communicate that we are
actually converging into one data platform at the end. Now, how is this going to be named? Is it going to be
named data cloud, a cloud data platform, or it's going to be a data lake or a lake house or
whatever, that's something that product marketing will figure out. And it's not that important. But
what is important is that, and that
I think resonates very well with what Dave was saying also about data governance, is that
we need to have like one experience and one platform working with data and unify many of
the functions that we have under one platform. That's the opportunity for the market, but also
that's what is needed if you want to really create this data economy
and create an industry around data.
Right now, things are extremely fragmented.
For a company to manage to have a data stack, there are just way too many vendors that have
to be involved there.
Even for pipelines, Eric, think about it.
How many different vendors does someone need to have a complete data pipeline inside the company?
It's probably at least three.
So everything is going to be around one platform.
And what I'm thinking is that,
and I think that's also the vision
that Snowflake was trying to communicate
through their S-1 filing,
is that there's going to be a data platform.
And on top of that, there are applications that are built.
So BI becomes an application, right?
The pipelines are something that are working around this platform
and connect to this platform in and out.
And you can build like some very interesting things over that.
Like for example, you can start having marketplaces around data. And when you do
that, then you have network effects, right? And that's where it gets like really, really
fascinating. I think we are just at the beginning, but I think also that the direction of where we're
heading is becoming more clear. Totally. And I would agree with all that. And to just pick up on it from my perspective, I think that the platform that is most wide open, I'm familiar with Snowflake. I've used it,
but, you know, it feels a little bit more kludgy to me or like click and drag and drop. I know
there are SQL components to it as well. But, you know, I think it's very appealing to be able to
leverage your developer language skills, right? The number one thing that I hate, and which I hope
does not happen, is that someone comes up with their own syntax to manage all of this, right? I really think that
the success of any platform, whether it's Snowflake or Python, is to capitalize on standard
components of the data industry, right? Because again, if you think about it, like if I'm in
Power BI, I need to know like DAX and their language, right? In QlikView, they had their own. And so when it came to BI, like even Looker, you have to know LookML. It's all based off of
SQL and Java based languages. But I mean, it's kind of a pain if you've invested five years or,
you know, you went to the Flatiron School and you learned how to code Python, and then all of a sudden you're in Tableau and you have to drag things onto shelves and figure out how to create a chart by clicking on a bunch of different buttons.
Right.
So when I think of what has the most potential in the future, I mean, I love the notebook infrastructure.
Right.
I'm a big fan of
Jupyter notebooks. I love the Google Colab product. And I'm a big fan of Databricks that
way because it's like a blank canvas. You're still guided, but I could be using Python in one cell.
I could use SQL in another. It's super flexible. I can fork different pieces of code that I find
on the internet into my
notebook and make it all work together. The scheduling is a little bit more technical
and less clicky. So when I think about what's going to emerge, I think it's going to be the
platform that takes advantage of the popular skill sets in data and doesn't make people
relearn things or learn, like, a specific way of doing
things. Hopefully that makes sense. Yeah, it does, and I totally agree with what you are saying
about platforms and paradigms and all that stuff. I think products need to be built with the
assumption that they are going to very quickly become part of the workflow that the developer has, right? And not create more friction or more, let's say, mental overhead for the
engineer to learn something new, right? Which, by the way, is probably something that
will exist only as long as the company exists. So yeah, I totally agree with that. And I
think we will see this
paradigm of the past where companies were building their own languages, like Splunk,
for example, right? Like you have to use their own query language to do that. And you have people
who specialize only on that, like that's what they have on their CV. I think we are going to see that
less and less in the future. And it's going to be much more risky for companies to do that and try
like to build a business
around that, unless they execute very, very well, like dbt, for example. But what I think is very
smart about what dbt did is that it builds on top of an existing language, which is SQL. And they just
added enough, let's say, secret sauce there from their engineering to make it easier to work with and
do things that we couldn't do in the past because, okay, SQL also had a lot of issues, and the
ergonomics of the language were very problematic. I mean, that's amazing what they did with that.
But yeah, I totally agree. I think that Python, R, Jupyter notebooks: every product in the data
space needs to at least interoperate with these tools.
Definitely. Yeah. And again, to your point on like, what does this become?
Does it become the Delta Lake, the lake house, the, you know, the data?
I've heard, remember data marts and data stores from, you know, BI times.
I mean, the architecture of this, I think is really up for grabs. And I think that's
the part that needs to be bespoke, right? Because I've worked at places that have big data problems,
right? I'm at a place like that now at FanDuel, right? I mean, data is the product and there's
just voluminous volumes of data coming in every second, right? So there's a whole, you know,
the whole data streaming thing is
appropriate here, you know, data lakes talking about that's appropriate here. And so using tools
like Databricks that solve for big data problems is really apropos, right? But, you know, I've used,
I do a lot of consulting gigs. I've also worked at smaller startups and they don't have those
problems, right? I was at a startup where
their biggest data set was the 50,000 accounts that they had in Salesforce. You know what I mean?
So there wasn't necessarily a big data problem there, but I should still be able to use something
like Databricks to solve for all the problems that I have at a small company that might not
need a data lake. They might not need like this robust cloud data
warehouse, but I can still use that tool in order to facilitate a solution, right? It won't feel
like I'm using a rocket launcher to solve for, you know, something that I could with a hammer.
So that's where like that flexibility piece comes in. I do believe like the architecture piece is
going to continue to be bespoke per industry, per company, per vertical, right? Because SMB companies in software development
are going to have much different data needs than, you know, like a restaurant company that might be,
you know, nationwide or global. And that's another trend that I see emerging. And that's why I think
these data tools are so important is I think small businesses haven't even really taken full advantage of their data because they see that like, oh, well, that's like a corporate, that's an enterprise problem, right? [...] who knows SQL or Python could jump into, you know, you'd be able to solve, you know, for problems of
like a local gym or like a local bar so that they can manage their data and all their data assets
in their business the same way that, you know, Google or a Fortune 1000 company would. So,
so yeah, for me, like that question of what does the architecture look like in the future? Is it
lakes? Is it warehouses? What is it?
I don't think that'll ever be standardized, but I think that like the tools that we have should
be able to build a variety of those solutions. Yeah. You know, Dave, it's really interesting
to think about dbt, and I know I'm not the first person to have this thought, but I think an interesting point to make for our conversation is dbt has spanned individual user to enterprise and retains its ability to add
value. And that in and of itself is extremely rare to be able to serve successfully as a company or
tool, to be able to serve an individual user and the enterprise,
especially as you grow, because the natural need of any business is to focus on the users that it
serves best, right? And so it's almost impossible to serve an individual user and an enterprise
simultaneously. I mean, you have to make all sorts of choices around product features and roadmap and
marketing and all that sort of stuff. So really interesting thought there, but I agree. I think it's going to, you know,
tools that sort of help democratize that and span size of business are going to be a huge part.
One thing I want to do, I know we're getting close to time here. I cannot believe we're
getting close to time. It feels like we just started talking. We'll have to have you back
on because I have so many more questions.
I'd love to ask you about FanDuel a little bit. And my question is pretty tactical, but I think it'd be interesting for our audience. You have all sorts of types of data at FanDuel. And it looks
like, I mean, I'm not an expert, but it looks like you have to ingest a ton of data and statistics across a huge variety of disciplines.
And so I'm interested to know, you know, even if we just think about sports like fantasy football, for example,
where and how do you ingest all of the statistics that you need in order to sort of run daily fantasy programs in the app?
I mean, that seems like a major sort of data engineering
pipeline challenge. Yeah. And you know, what's great about my job now and working at a big
company is that I'm obfuscated from a lot of those decisions, right? I, you know, in other roles where
I used to be involved in building the pipeline, building and managing the database and also
producing the insights,
you know, in this role, I'm really fortunate to be able to focus on delivering the insights.
And so we still have to build like mini pipelines, because to your point, I still have like this
massive data lake, let's call it. And that's how I see a data lake is like all of the possible
information that you could use for analysis from a standpoint. And
we have to build mini pipelines so that we have dependable views that are slices of time for
performance reasons and just for feasibility. But I wish I could tell you how all of the
information gets into there. And to be honest with you, as the company goes through acquisitions,
as the company transforms, think about it. I mean,
this company FanDuel has started to work in an industry that just became legal when you're
talking about sports betting, right? Like you look back five, 10 years ago, like there wasn't
legalized sports betting outside of, like, Las Vegas and maybe Atlantic City. I don't even know if it
was in Atlantic City at that point. But to that point, it's still a little bit of the Wild West. I mean,
it's a mystery to me. And that's probably because it's something that the company is solving for on
a daily basis. So I wish I could answer that question, but I can confirm that there is
information coming from mobile applications, from reference data that we're
grabbing from databases. And again, it's a multi-product company. So you're not just
talking about one application for daily fantasy, but you're talking about sportsbook, racing,
poker, like there's a myriad of them. But again, I think that process of ingesting, architecting, and delivering, those are still the core tenets.
Like the things that we're doing at our group level, you know, the little mini pipelines that we're building, the little databases that we're building, and the views that we're materializing, it's the same approach that the company is using.
But, you know, I wish I could answer all of it.
Yeah. We'll have someone from maybe the data engineering team, if you'd make an intro for us,
just because I think it'd be so interesting to hear. Anytime we talk with companies who are
ingesting significant amounts of outside data and combining it with internal data,
those are always fascinating pipeline conversations.
Well, to close us out, could you just tell us maybe since you are sort of delivering data
products and less on the pipeline side, could you just tell us about maybe one of the data
products you're delivering at FanDuel right now? Sure. We do a bunch of regulatory reporting,
which again, doesn't sound very interesting, but regulators are, they're mostly accountants by trade. So they're
people who know numbers and they hold us very accountable. Let's just say that, right? Because
what's at stake at the end of the day is tax money that fuels everything in their state, right? So
building regulatory reporting, it's not as exciting. There's not as many colors and graphs
and charts and dashboards
as I've been used to in my career. The focus is really more on data timeliness, let's call it,
right? Sure.
All about accuracy and setting up those checks. And it's also about delivering information in
very interesting ways, right? Like pushing a CSV file to an SFTP server that people could pick up. Now,
in my past, I really haven't done a lot of that, right? Because I'm delivering dashboards to
people. I'm delivering people analysis and prediction and things like that. Very rarely
am I trying to figure out how I get a 6 million row file scheduled on a daily basis to drop into an SFTP server across multiple states,
you know, and regulators on a daily basis, right? So those are really interesting challenges that
we're solving for. And that's where like having a Swiss army knife tool like Databricks is super,
super helpful because anything that I find online on Stack Overflow about, you know,
building those types of pipelines, I can repurpose immediately and then start deploying in our
organization. So yeah, it's solving those less sexy, it sometimes feels like archaic, types of
information, but it's all about knowing who your audience is, right? And regulators and things like that and accountants,
they want the line level data, right?
There's no ifs, ands, or buts.
You can't give them a fancy chart.
You can't give them summary information.
And on top of it, it has to be accurate
or else you're going to be spending more time
doing reconciliations than you will delivering the product
that you signed up for, right?
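To make the line-level reporting idea concrete, here is a minimal sketch, assuming a daily file of individual wagers that ships with a small manifest (row count, total amount, checksum) so the receiver can reconcile it. The column names `wager_id`, `state`, and `amount`, and the functions `build_report` and `reconcile`, are hypothetical; the actual SFTP upload is left out because it would depend on a third-party library and on the regulator's setup.

```python
import csv
import hashlib
import io
from decimal import Decimal

# Hypothetical column layout for a line-level regulatory file.
FIELDS = ["wager_id", "state", "amount"]


def build_report(rows):
    """Serialize line-level rows to CSV bytes plus a manifest for reconciliation."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    payload = buffer.getvalue().encode("utf-8")
    # Decimal avoids floating-point drift when summing money.
    total = sum((Decimal(row["amount"]) for row in rows), Decimal("0"))
    manifest = {
        "row_count": len(rows),
        "total_amount": str(total),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }
    return payload, manifest


def reconcile(payload, manifest):
    """Re-read the delivered bytes and confirm they match the manifest."""
    rows = list(csv.DictReader(io.StringIO(payload.decode("utf-8"))))
    total = sum((Decimal(row["amount"]) for row in rows), Decimal("0"))
    return (
        len(rows) == manifest["row_count"]
        and total == Decimal(manifest["total_amount"])
        and hashlib.sha256(payload).hexdigest() == manifest["sha256"]
    )


if __name__ == "__main__":
    sample = [
        {"wager_id": "w1", "state": "NJ", "amount": "10.50"},
        {"wager_id": "w2", "state": "NJ", "amount": "4.25"},
    ]
    payload, manifest = build_report(sample)
    print(manifest["row_count"], manifest["total_amount"], reconcile(payload, manifest))
```

Checking the count, the total, and the checksum before the file leaves is one way to catch a discrepancy early, rather than during a reconciliation afterwards.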
So that whole regulatory reporting pipeline has been really interesting for me. And it feels a
little unnatural about like what I'm doing, but, but, you know, everything changes when your audience
changes. Yeah. I love it. I think it's, it's really fun for me and I hope our audience as well to hear
about a different kind of data product on
the regulatory side, because the requirements are very different than sort of maybe summary data
around usage, you know, where margin of error is acceptable on some level, because, you know,
customer data is a little bit messy and, you know, there are sort of outliers and
other things like that. But line level data on regulatory that is critical to your business
continuing to function is a very different type of product to deliver. So super interesting to
hear about that. So I've very much made comparisons to healthcare, right? It's like,
you know, if you can't like messing up insurance claims is really affecting people's lives. Like
same thing here. It might seem trite, but this tax money is really important to
these states, especially now with the state of the world. So there's a lot more riding on it.
It's a lot less directionally accurate. That's a word that I've used throughout my career to
save my butt is directional accuracy. And I can't use that word anymore.
Sure. Very cool. Well, Dave, this has been
a really wonderful show. We'd love to have you back on. We'd love to get someone from the data
engineering team to hear about your pipeline. So we'll be in touch. And thank you again for your
time. Oh, thank you guys. This has been wonderful. And I really appreciate you having me.
Well, that was a fascinating conversation. I didn't get my question answered,
but maybe we'll get someone from data engineering on the show. But I love it when we have a show where we get on a topic that everyone's passionate about and has opinions about,
and we can really dig in on it. I think one of the interesting things to me from the conversation was
the comment around data portability. So there's all sorts of terminologies, so data mesh and
connected stack and all these different things. And the concept of data portability, I think,
is a really, really helpful way to think about where things are headed, at least as far as we
can see now. So that was what stuck out to me. Yeah. I mean, I think you're not alone in this,
Eric. I also didn't manage to ask that many questions about fantasy games, but it doesn't matter.
I think I really enjoyed the conversation.
We had a lot to chat with Dave about data platforms
and what the future will look like.
So that was super interesting for me.
And I really liked his opinion and his view
on all these things around
how we are going to be using data in the future.
It's funny also to interact with people who really understand the products and they don't
agree with the marketing terms that we come up with while we try to market new products.
So this whole thing about reverse ETL and what's the right name of it, I think he put
it very well with the term data portability.
And yeah, I'm really looking forward
to chatting with him again.
And hopefully next time I'll manage
to ask my questions around fantasy gaming.
Yes, we can have it.
That'd be actually fun to have an episode
where we cover topics we don't know about.
All right.
Well, thank you again for joining us.
Subscribe on your favorite podcast app.
You'll get notified of new episodes
every week and we'll catch you next time. The Data Stack Show is brought to you by
Rudderstack, the complete customer data pipeline solution. Learn more at rudderstack.com.