The Data Stack Show - 69: What is the Modern Data Stack?
Episode Date: January 5, 2022

Highlights from this week's conversation include:

Panel introductions and backgrounds (2:55)
What the modern data stack means to each of our panelists (5:04)
Defining the fundamental components of a modern data stack (17:22)
How the modern stack drives insights and actions for businesses (28:03)
Getting to a uniform definition of the modern stack (33:45)
Managing the modernization of a large-scale data stack (39:09)
How testing works in the dbt context (48:44)
The relationship between the data warehouse and the data lake (52:25)
What has us most excited for the future of modern data stacks (56:02)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.
Transcript
Discussion (0)
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are
run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome back to the Data Stack Show. Today, we are recording an episode with a panel of guests.
This episode is also live streamed. So thanks for everyone who joined us on YouTube
for the live stream. We'll be doing another one of those and we'll let you know about that on
upcoming shows. This panel is pretty incredible. Back when we started the show, I would have said you were crazy if you said that we were
going to have a panel with people from DBT, Databricks, Fivetran, someone who's building
data infrastructure at Hinge, and then a VC investor from Essence VC who invests only
in data infrastructure products. So this is
pretty incredible. And I'm really excited to hear about the different ways that each of these
people talk about the modern data stack, because they each come from a different part of it on the
sort of SaaS side of the tooling
providers. But then you have someone who's actually using some of these tools to implement
stuff. And then you have someone who's trying to think about how to invest in them. And so I think
the variety of perspectives are going to be really, really helpful. So I'm pumped.
But Kostas, what are you going to ask everyone? I think I will improvise, to be honest.
As I usually do. We always do.
We always do.
But I think it's an excellent opportunity to see if the term modern data stack is just
like a marketing term, as many people say, or something more than that.
So yeah, I think we have the right panel there to figure this out.
Hopefully, at the end of this discussion, it's going to become much more clear why we
needed this term and what's the essence of the term.
So yeah, that's my goal for today, trying to understand better what the modern data
stack is.
All right.
Well, let's dig in and talk with all of these amazing thinkers about the modern data stack.
Let's do it.
Welcome to the Data Stack Show.
This is probably our most exciting episode to date.
We're also live streaming this episode, which is really exciting.
We have some of the best minds in data here to talk about the modern data stack. And I just could not be more excited.
So we have so much to get through. Let's just do
quick intros, maybe 30 seconds or a minute introducing yourself. And I'll just call out
the name since we have such a big crew here today. So Jason, do you want to kick us off?
Sure. Hi, my name is Jason Pohl. I'm a principal solutions architect here at Databricks.
I was one of the first 10 solutions architects.
So I've seen the company grow from just one instance type on one cloud to supporting all
three clouds and more instance types than I can count.
So I lead up the data management subject matter expert group for Databricks.
So anything that has to do with data engineering or data governance, I basically help enable
the field, our customers and partners on how to do it best and serve as a conduit back to product management as well.
Awesome. Amy, how about you?
Hi there. I'm Amy Deora. I head up partnerships for dbt Labs.
So I lead those relationships with other products that are integrating with dbt and then with consulting partners that are bringing dbt into industries all over the world.
Before joining dbt Labs, I worked about 15 years in data analytics and
data science consulting. Happy to be here today. We're happy to have you.
Paul, you're up next. Sure. I'm Paul Picaccio. I work at Hinge. I'm on the core data platform team
where I've built out the modernization of our pipeline, so this conversation
is really interesting to me. Yeah, we can't wait to hear what you built. All right, Brandon. Hey, thank you everyone for tuning in. I am a manager
of our technical product marketing team here at Fivetran. Prior to joining product marketing at
Fivetran, I was our first West Coast sales engineer. So really excited to see how Fivetran
has grown over these past couple of years. And I should also note when I'm talking about Fivetran,
I'm also referring to HVR, one of the companies that we've recently merged with as well.
Great. And Timothy.
Hi, everybody. I'm Timothy Chen. I'm an investor here at Essence VC,
where I invest a lot in data infrastructure, so I'm especially excited about this as well.
Before that, I was an engineer working on open source. I contributed to Spark, Kafka,
Drill, and other data-related projects.
And yeah, so definitely seeing a lot of interesting stuff happening in this space.
Great.
Well, like I said, we're all in for a real treat here.
We're not going to waste any time.
So Kostas, why don't you kick us off with the first question?
Yeah, let's start with the most important one, right?
So I'd like to ask our panel here what the modern data stack is and how they
understand it. So let's start actually with Paul, because I'd like to hear the opinion of
a stakeholder, right? Someone who benefits from it and uses it every day.
So Paul, what is the modern data stack? This is of course a loaded question
because it can mean all kinds of things.
But to me, I think it's a combination
of volume, access, and trust.
So, pulling heavy volumes of data
through in a reliable way.
And obviously trust, like it has to be secure
against intrusion and whatever,
but it also has to have high data quality.
You have to know that what is coming in
at various stages of your process is what you get out the other side. Because if you're running these experiments,
a very small margin could mean a lot to your data scientists and stakeholders and so on.
But in terms of access, I was thinking about this as we were talking earlier,
that you have sort of disparate patterns of access. You have to be able to explore
the data, not just with machine learning models or what have you, but with actual human
exploration. And you have to be able to examine your data at a point in time, so you have
repeatability concerns. But you're also thinking about, what is modern?
You have to be compliant with the law as well.
So privacy, and GDPR, and all of the ways in which you have to
now touch every piece of data across your entire stack at reliable intervals in order to ensure
that people are being protected under law. Okay, well, that's, I think, a very interesting definition.
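Paul's point about trust, that what comes in at various stages of the process is what you get out the other side, can be sketched as a simple stage-boundary check. This is a toy illustration in Python, not Hinge's actual pipeline; every name here is hypothetical.

```python
# Toy sketch of validating data quality between pipeline stages:
# compare row counts and null counts before and after each stage.

def stage_stats(rows, required_fields):
    """Summarize a batch: row count plus null count per required field."""
    nulls = {f: sum(1 for r in rows if r.get(f) is None) for f in required_fields}
    return {"row_count": len(rows), "nulls": nulls}

def check_stage_boundary(in_stats, out_stats, max_row_loss=0):
    """Fail loudly if a stage dropped too many rows or introduced nulls."""
    lost = in_stats["row_count"] - out_stats["row_count"]
    if lost > max_row_loss:
        raise ValueError(f"stage lost {lost} rows (allowed {max_row_loss})")
    for field, n in out_stats["nulls"].items():
        if n > in_stats["nulls"].get(field, 0):
            raise ValueError(f"stage introduced nulls in {field!r}")
    return True

raw = [{"user_id": 1, "event": "signup"}, {"user_id": 2, "event": "match"}]
transformed = [{"user_id": 1, "event": "signup"}, {"user_id": 2, "event": "match"}]

before = stage_stats(raw, ["user_id", "event"])
after = stage_stats(transformed, ["user_id", "event"])
print(check_stage_boundary(before, after))  # True: nothing lost, no new nulls
```

Real observability tools track far more (freshness, distributions, schema), but the core idea is the same: assert at each boundary, so a "very small margin" of bad data is caught before it reaches data scientists and stakeholders.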
And I'd like to ask next the VC of this panel,
because he's probably getting a lot of pitch decks out there where
people are trying to define themselves as the
modern data stack, or as part of the modern data stack.
So Tim, what do you see out there? How do people communicate to you
the modern data stack, and what's your opinion about it? Yeah, the modern data stack is one of
those buzzwords that you're so intrigued about that you really have no idea what it means. And
it just sounds good to keep saying it over and over, because I think it's one of those things where it feels like it's in the center of something.
You just cannot fully grasp it because everybody has their own definitions and it can be used in so many different contexts.
But back to your question about what do I see?
I think what the modern data stack is most comprised of, from what I can tell, is really the move to the cloud. Everybody
is democratizing access to all this data that they need across different functions, and
therefore there's a collection of tools, a collection of products, that everybody keeps
mixing and matching. You can sort of change and mix and match, but there are a few things that
are always sticking there, which is kind of your storage, your warehouses, and some level of different things here. And so I think, in some way,
it feels like people are trying to redefine modern data stack every single day, because, hey, I think
there needs to be a real-time streaming one. And maybe there needs to be a better way to get
analysts, a better graphical report style, right? You can
kind of stick a lot of things in here. But I think the core, from what I can tell, especially
talking to different customers and friends that are also investing in this space, what we're
seeing is modern data stack is also an opportunity where people are re-looking at all the things
they've done in the past. Like, hey, do I really need to be doing data the same way as before? Do
all the new buzzwords, like data mesh and all the stuff we're seeing right now, mean we have a
collection of tools and practices that actually help enable democratized access and get data
infrastructure and data analytics done in a very different way? But of course, "modern" has no end goal, and also no specific limitations or even requirements.
So anyway, it's a very fuzzy word, but I'll just leave it at that, because we're all trying to figure it out.
Right. Yeah, I'm glad we're trying to talk about the modern data
stack, though, not data mesh, because I think it'll at least be a little bit easier.
Otherwise we'd just reject every buzzword we could ever hear of,
and we could have a five-hour discussion too.
Yeah, maybe
we should make another panel just for
the data mesh and
see who is most confused about the term
at the end. But
I'd like to ask Brandon next.
What's interesting about Brandon, and why I'm very interested
to hear his opinion, is that in many people's minds, the modern data stack is a term that started with, or is heavily associated with, Fivetran.
But also because, as Timothy said, an important part of what the modern data stack is, is the democratization of access to data.
And I think that's a big part of the mission
and the vision that Fivetran has.
So Brandon, the stage is yours.
Tell us about the modern data stack.
Fivetran even has a modern data stack conference.
So we discussed this quite in detail across the conference
with all the attendees, all the different panels.
And I truthfully expect that we'll be talking about this
for multiple years to come, because the definition will continue
to evolve. But at some point, we might not refer to it as the modern data stack; we might refer to it
as some other type of data stack. But I think it was Notorious B.I.G. who said, more tech,
more problems. And ultimately, I think that's where the modern data stack comes from, right?
And then Timothy was talking about this too, the rise of new cloud technologies.
They make it a lot easier to scale out
what you're trying to do with your data strategy.
And with all these new technologies,
the capabilities that we can do as different data teams
across different companies continue to expand as well.
And with all of this new tech,
really it just comes to what else can we try to solve for?
So then I'm going to throw in another buzzword-y term,
machine learning, AI.
These are just generalizations of modern data problems
that people continue to run into.
And really, it starts to become: what is your modern data stack?
That really depends, in my point of view,
on what your company is trying to solve for.
If you have never utilized any sort of new technology
before, imagine your company is a single
one-person entity, and all you have is one laptop to work off of. You have no access to any cloud.
Then for you, a modern data stack might simply be a database that's set up locally on your
computer and happens to record transactions just from the one little office that you're in, with
your one little room. And really, the definition will continue to change as new companies introduce
new terms, introduce new use cases.
And I fully believe that we'll try to keep up with all that change as a company as well.
Yep. That makes a lot of sense.
All right. So next, I'd like to hear what Jason has to say. As part of Databricks, and what I find extremely
interesting about Databricks and Spark, is that we are talking about a tool that has existed for a long
time and has evolved all these years. So Jason, from your experience working in this
space all this time, what is this modern data stack?
Yeah, I think for me, the modern data stack,
I used to be a data warehouse architect years ago.
So I would work with companies.
I would use the popular ETL, BI, and databases of the day to build these data warehouses.
And I think since then,
we've had these digital-native businesses,
like Facebook, Airbnb, Uber. They've
built their entire businesses off of different tech stacks than the one that I
used to implement 15, 20 years ago. And those tech stacks were open; they came from open
source. And initially there wasn't a public cloud to go to, but now
there is. And what was really unique was they were using these tech stacks, and it was really multiple
stacks to do data processing and do historical analytics, but also do artificial intelligence
and apply machine learning to their models, to their data, to be able to optimize everything
from lead flow or lead gen to optimize ride routes for Uber.
So there's been, I think, this evolution where, as these digital natives
started up, they used open source, created their own open source projects, and then used those also
for applying machine learning. But now those same projects have been ported to the cloud. And now
I think the modern data stack is the culmination of all these open projects that are now
either hosted by the cloud providers themselves or by other cloud services
like Databricks, where we host Apache Spark and MLflow and all these other open source projects
that we've developed over the years. So I see it as a way for companies to kind of like pick and
choose whichever parts of the stack they want, the best of breed and combine them in a way that
gives them the maximum velocity for whatever they're trying to do. Okay. Okay. That's great.
Amy, I left you to be the last one because I have some very important reasons for that,
and we will see them in a bit, because there is a follow-up question.
But before we move to the follow-up question, tell us from your perspective, your personal perspective,
because you've also been in this space for a long time, and also from dbt's perspective, what this modern data stack is.
Yeah, I think of the modern data stack kind of in contrast, right, to what we had before, to how data teams were working before we had this suite of tools. And probably the biggest
change is kind of what Jason said about having really focused on best of breed tools for
each specific job that the data team does, right? So in kind of in the past, when we were looking
at different solutions, folks would look at some of maybe the informaticas of the world or these
kind of all-in-one solutions that did a lot of different things, right? And that was kind of
thought to be kind of easier and better. But now we have a data team that's choosing the very specific, best tool for whatever particular job they're doing, whether that's ingestion or transformation. Or we're even bringing in notebooks from data science
into analytics in a new way, new tools where we're bringing in data from the data warehouse
back to Salesforce and kind of back to these other applications in a way that we weren't before.
So finding those best in breed tools and being able to have interoperability. So teams have
the ability to both choose the tool that works best
for their particular use case and also change those tools, right? When folks see there's new
data warehouses on the scene, there's new different tools on the scene. And because of the interoperability
between all of those different tools in the modern data stack, folks can then choose what's best for
their use case. And a lot of innovation just happens from that choice now that the team has
in terms of the tools that they use. Yeah, that makes a lot of sense.
Anyone want to add something? Yeah. I think both Jason and Amy's
responses are good examples of how their companies have also continued to push forward the idea of
the modern data stack. Before Databricks, there was a concept of, let's say, a data lake,
and then AWS re:Invent really pushed that concept forward. And then Databricks came out with this:
okay, on top of this data lake, how are we going to make things more structured? How are we going
to add, for example, ACID transactions to your data lake? And now we have this concept
of the data lakehouse, which AWS has also started to adopt as well. And then take dbt, for example. The lines between typical
data analysts and typical data engineers are continuing to be more and more blurred with what
dbt is putting on top of traditional analyst workflows, traditional modeling. And now we have
this term analytics engineer. And these are all, in my point of view, great examples of how
technology, and the terms being pushed out as this technology evolves, continue to
make the definition of the modern data stack evolve as well.
Yeah, that's a great point. I'll go back to Amy. And I want to ask my next question. And the reason
that I'm asking this first to you, Amy, is because you are a person that works with partnerships.
I think everyone on this panel will agree that partnerships are something very important
for everyone who works in and is part of the data stack, which makes sense, right? Because
each tool needs another tool in order to deliver value at the end. That's why we have a stack and
not just one tool out there. So my question, Amy, for you and for the rest of the panel later is: what are, let's say,
the most fundamental and important parts of this stack?
Like what defines this stack?
What kind of functionality, at minimum, do we need in order to say that we have implemented
the modern data stack?
Yeah, this is a question where the answer is definitely evolving,
right? So if you asked folks maybe a year ago, they would say data ingestion, a data warehouse,
transformation, and a BI tool, right? They would say those are kind of the categories.
Now we really have a lot of innovation around a lot of that, right? So people would say ingestion,
a data warehouse or a data lake,
right? Or a query engine. There are all kinds of pieces that can serve as our data source.
Transformation. Then, we might say, in the BI layer we'll have more exploratory
analytics tools, like a notebook. We'll have traditional BI tools, like dashboards. We'll have
also what's sometimes called reverse ETL, or operational analytics,
basically this idea of being able to take data from our warehouse and put it back in source
systems. Then there are two other categories that are now part of
what most people call the modern data stack, but they're
still in a bit more of an exploratory stage. One of them is probably observability and testing,
right? So data quality, observability. There are a lot of
companies in this space and a lot of folks are figuring out exactly how this fits in the modern
data stack. But I think, to Paul's point earlier, this is going to be important, right? Kind of
understanding testing and observability. Some folks also put data privacy into that bucket as well. Then also there's probably a part that we're really excited about
at dbt, which is what some people call the metrics layer, right? So this idea of, between your data
transformation, creating your data sets that are ready for analysis and kind of your traditional
BI tools, how do we make sure that the definitions of the metrics that
we use to measure our business are consistent across all of the folks, whether they are
using that data for BI, whether they're using it for exploratory analytics, whether they're
pushing it into another system. So the metrics layer is kind of an evolving piece that I think
a lot of companies and a lot of different folks are really thinking about
as a new, evolving part of what we call the modern stack.
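Amy's metrics layer, a single place where business metric definitions live so that BI, notebooks, and reverse ETL all agree, might look like this in miniature. dbt's real metrics layer is declared in project YAML and SQL; this stdlib Python sketch, with made-up metric names and data, only illustrates the idea of one shared definition per metric.

```python
# Minimal sketch of a metrics layer: every consumer computes "revenue"
# from the same central definition instead of re-deriving it per dashboard.

orders = [
    {"order_id": 1, "amount": 120.0, "status": "completed"},
    {"order_id": 2, "amount": 80.0,  "status": "refunded"},
    {"order_id": 3, "amount": 50.0,  "status": "completed"},
]

# Central registry: the single place where each metric is defined.
METRICS = {
    # Revenue excludes refunds once, here, rather than every BI tool
    # deciding separately whether refunded orders count.
    "revenue": lambda rows: sum(r["amount"] for r in rows if r["status"] == "completed"),
    "order_count": lambda rows: len(rows),
}

def compute_metric(name, rows):
    """Look up the shared definition and apply it to a set of rows."""
    return METRICS[name](rows)

print(compute_metric("revenue", orders))      # 170.0
print(compute_metric("order_count", orders))  # 3
```

Whether the consumer is a dashboard, a notebook, or a system pushing data back into Salesforce, all of them call the same definition, which is the consistency Amy describes.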
Okay. That's, I think, a very,
very thorough definition of what it is. And it's
great to hear about all these different layers that it has,
let's say. And I think that's probably also contributing a lot to,
let's say, the problem that people have out there defining it exactly and saying what this is and why
we are using it. Because there are also, from what I understand, many different variations.
It's not like every company has exactly the same needs or it's at the same maturity level to utilize
AI and ML.
Some companies, they just build their BI layer.
It doesn't mean that there's no stack there
or that the modern data stack doesn't apply to them.
That was great.
What does Fivetran think, Brandon,
about the most important components of the data stack?
Of course, ingestion, right?
Like, we need that.
Is this correct?
Or can we live without ingestion?
What do you think?
Yeah, I mean, in my point of view, of course,
ingestion is always first and foremost, right?
Getting the data to where it needs to be
so that you can actually do what you want to do
with your data, with the complementary tools.
And that feeds back to Amy's earlier point
about interoperability,
of making sure that all of the tools that you pick are seamless, that they all work together. And many times people actually start with, let's say, the data storage layer, right? Trying to figure out how to make sure that things hold up as their cloud applications and the number of tools that they're using continue to evolve.
They want to make sure that as they're pouring that data in, whatever underlying data storage
system they're using is able to support all of that new data that they're going to be working
off of. And that supportability is broken into a few pieces. Of course, price is always going
to be a part of it. Money is always going to play a part. But the other part comes back to query optimization: how efficiently can the queries that your team
is used to running actually run on these different data storage pieces?
In my point of view, it's fine to start in that place.
Just always consider as you're building out the rest of the stack, how it's going to function
with that piece.
And of course, part of it is going to be driven by what your company wants to do.
What is prompting the move to re-evaluate certain data tools you're using?
In that case, it oftentimes starts with: what is the problem that you're trying to solve for?
Is it that your data storage doesn't work, just going really high level here? Is it that your data integration is breaking all the time, and you need a fully managed service to accommodate
changes, like API changes, across all the tools you use and whatnot? It depends on what problems you're having. Sorry, that's a bit of
a non-answer. No, no, no. I think it's a very, very good point. I have a question that is
just for you, mainly because it has to do with ingestion. I mean, Fivetran has been in this
space for quite a while. It has actually disrupted, let's say, the space. I remember, if we were talking six years ago,
seven years ago, all the noise was about
how we can get access to our data, right?
And I think a big part of the mission
that Fivetran has is to make it as easy as possible
to get access to your data.
Have you seen, after all these years,
that this goal has been achieved,
and that more and more
companies are actually focused on implementing other parts of the modern data stack,
and consider the ingestion part solved? Or do you think there's still a lot of work to be done
there? I think there's always going to be more work to do. And when we think about data integration, data replication, sometimes we just think
about it in terms of how to get data from point A to point B.
But there's a lot that goes into it, right?
Beyond just getting data from point A to point B, how can we make it as efficient as possible?
How can we make sure that we're connecting not just to maybe one endpoint through some
API, but how can we make sure we're pulling all the fields from that endpoint?
How can we make sure we're structuring that data so that, when it lands in the warehouse, whatever
queries you want to run will actually function, because the data types are cast correctly? And part
of making data integration easy and making data accessible is making sure that everyone understands
how to use the tool. So one of the core components of Fivetran is ease of use. It's
easy to use; you won't see a lot of buttons in there. When I was a sales engineer,
sometimes I actually dreaded demoing Fivetran
because people would ask all these questions
about what's going on behind the scenes.
I can show you what's going on in a workflow,
but you won't see anything in the UI
except for some GIFs that are moving back and forth
to represent data.
And the reason I say that this will always evolve
is because if the goal is to make it as easy as possible, make data integrations as easy as possible, it means abstracting away a lot of those considerations.
Things like reading API documentation should be, in theory, a thing of the past if you're using any of these tools.
Maintenance as, let's say, schemas change in the data source system.
That should also be a thing of the past.
And to make all those backend considerations work, a lot of it goes back to,
will it evolve? Yes, it will evolve because people continue to do funky things with their
source systems. People continue to adopt best of breed tools unique to their departments,
various challenges that they're trying to solve. So there will, in my opinion,
always be work for Fivetran to do to further optimize some of these backend considerations
we're making on behalf of our customers, supporting higher throughputs as the data volumes across all these sources
continue to grow as well.
And ultimately making it as out-of-the-box as possible.
So making sure that we hit all the edge cases
that our thousands of customers
are continuing to run into
with all their funky setups.
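Brandon's points about casting data types correctly and absorbing source schema changes can be sketched as schema evolution at landing time. This is not Fivetran's implementation, just a minimal illustration of the general idea, with hypothetical field names.

```python
# Toy sketch of schema-drift handling: when source records grow new
# fields, widen the target schema instead of breaking the pipeline.

def evolve_schema(schema, record):
    """Register any fields the target schema has not seen, inferring a type."""
    for field, value in record.items():
        if field not in schema:
            schema[field] = type(value).__name__  # e.g. 'int', 'str'
    return schema

def normalize(record, schema):
    """Land the record with every known column present (missing -> None)."""
    return {field: record.get(field) for field in schema}

schema = {}
batch = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com", "plan": "pro"},  # new 'plan' column
]

landed = []
for rec in batch:
    evolve_schema(schema, rec)
    landed.append(normalize(rec, schema))

print(schema)  # {'id': 'int', 'email': 'str', 'plan': 'str'}
```

A managed service does far more (renames, type widening, deletes, API version changes), but this is the shape of the "funky things with source systems" problem: the pipeline keeps landing data rather than failing when the schema drifts.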
Okay.
Jason, you are the last one from the vendors
that we have on the panel.
So what do you think, from the Databricks perspective? How would you define, let's say, the fundamental pieces of the modern data stack?
Yeah, I mean, one thing I think has changed a little bit since I started is that, when I first started at Databricks,
it was pretty common as an architecture to basically have a data lake where you store all
of your data and do a lot of the data transformations there. And then you'd offload
some of that data to a data warehouse, like Redshift or Snowflake or BigQuery or something.
And then you would basically use that for serving up the analytical queries from
business intelligence. And I think we've been kind of marching towards this confluence for a while, but now you really can achieve both things on the data lake. So we have this
data lakehouse concept that we've come up with. And the concept behind it is, you can have
the economics of a data lake and do all your ETL there, but you can also have your interactive BI
queries run on top of that data lake using SQL.
So essentially you can write your data once and then do all your use cases on top of it,
whether that's streaming or SQL for BI or machine learning or graph analysis, it doesn't
really matter.
And so I think with the modern data stack, there's a number of different tools out there
that do these different individual things, but you can kind of combine all of them on top of the lakehouse as long as they're
adhering to open standards.
And in some way, this is kind of like realizing what Google realized 20-some years ago when
they wrote the white paper for MapReduce. They realized that they had way too much
data to copy it around to do their processing,
and they were going to have to bring the processing to the data.
And that's kind of where the impetus of MapReduce started.
And then it spawned off all these other open source projects from there.
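The MapReduce idea Jason references, bringing the processing to the data, is easiest to see in its classic word-count form. A real MapReduce job shards the map and reduce phases across the machines that already hold the data; this single-process Python sketch just shows the three phases.

```python
# Classic MapReduce word count, collapsed into one process to show the
# map -> shuffle -> reduce structure. In the distributed version, map and
# reduce tasks run on the nodes where the data already lives.
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

The projects it "spawned off", Hadoop and later Spark, generalize exactly this pattern of shipping functions to distributed data rather than shipping data to a central processor.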
That's very interesting.
And I'll ask next our investor.
I think this is going to be interesting, because some people might also find good ideas to go and pitch to investors after this. So what are, from an investor's point of
view, the important parts that someone needs in order to define this data stack and implement it?
You mean from a customer point of view, or what you actually want to buy?
Yeah, from a customer's, because, okay, I guess investors
invest in ideas that have a market, so I think it's important. You know,
you're just hearing the last few folks talking about, right, we're actually talking
about different kinds of personas that are looking at this stack in different ways. When
we talk about ingestion or infrastructure or data, those are definitely more of a data engineer, infrastructure sort of point of view of people.
So that's one kind of customer, right? And then we look at the analytical users,
right? Data analysts, right? dbt has a large brand around those kinds of users coming into
this space. But what's really interesting and exciting that we're seeing as
investors is that data is truly
getting pushed out to many different kinds of functions. Lots of people are increasing budgets
to hopefully get to a data-driven enterprise, right? Or data-driven businesses. Where can we
actually leverage analytics? Where can we leverage AI? Where we currently
have to do processes manually, is there a way to do RPA? You know,
there's data driving insights, and insights driving actions, which in turn drive
automations. Can we continue to feed this loop in multiple places? And so there's also a
segment of users who are just business analysts, product managers, anybody doing any
sort of function like sales or marketing, right? This is where reverse ETL even came from. Why do we even need to have this in a modern data stack?
Well, we can't just stick the data into a dashboard. We've got to push it out to the
tools that other people use. And so if you're looking at what is necessary to implement this,
this really is a very tricky question because everyone's needs look different.
Everyone's maturity level of how they view data and the sort of complexity differs a lot. And
that's actually one of the difficulties of investing in this space. Because, hey, one person will be
maybe proposing, hey, we need an all-in-one solution that just takes modern data stack
in a box.
Install it, everything you go.
Don't worry about anything else.
Don't worry about any other vendors, right?
There are solutions, and definitely companies out there, trying to reduce all that complexity into one single-click install.
And there's pros and cons of that, right?
How do you do best of breed? How do you hide all the complexity? Can you make the Fivetran-like experience just mask every single vendor out there? That's one kind of consideration.
However, on the flip side, if you say,
hey, we want to be the most modular,
you can truly choose whatever you need.
There's a lot of engineering work
and sort of like maintenance work
and integration glue work that needs to happen.
And so there's a huge spectrum. Everyone's trying to figure it out. If you talk to any head of data, they're scratching their heads every day, still, because they're fighting this fight between, I'm hearing all these buzzwords about the modern data stack, and I think maybe I need something like this, but I'm not entirely sure. How do I mature my organization so I understand what I need first? How long does it take to implement?
What are the trade-offs?
What are the ways I should actually even think about doing this in the first
place? And it's not easy. Because, I think, you know, if you really talk to data engineers, they have one set of concerns; so do the analysts, the head of data, and truly the business.
You really need somebody that understands all those pieces in most of these companies that can
actually say, hey, here's what we're going to buy first. Let's ignore all the noise. Let's take this approach. Let's start from this business unit. Let's spend this much money. And for larger places, it might start from more foundational pieces, like the Databricks sort of things, and fight your way up; and for smaller companies, it might be starting from more defined, smaller pieces of solutions that you can kind of combine, right? So you need both, but it's hard to have one product catch all of everyone's needs, and that's the complexity of the space.
And we're still evolving, with all the new tools people are proposing.
So it's hard to answer that question, because we have so many discussions. There's no one single answer. They all relate to each other, they all talk to each other, and there's no consensus whatsoever on one single way to implement the modern data stack.
Yeah.
Tim, I have a question for you. It's actually inspired by a question that I got from the community.
There is a very specific, how to say that, like there are very specific semantics around
the term stack when it comes to developers, right?
For example, we have Jamstack, right?
This is like the latest thing.
Before that, like, I don't know, like a decade plus ago, we had LAMP, where we had like Linux
bundled together with MySQL and Apache and PHP.
So it's something that's like extremely, extremely well defined in the mind of a developer,
what a developer stack is, right? And I'm asking you because you've been a developer, so you
understand this like very well. So do you think we will reach a point where we will have something
similar also for data and that the modern data stack can become this? Yeah, it's, I think it
would be interesting because if you look at everyone's definition
of modern data stack at the moment,
there's actually a few pieces that's fairly consistent.
It hasn't changed that much.
And so you kind of have to play devil's advocate for each category a little bit, right?
You're like, okay, are there new categories that we need? Because we're constantly inserting new categories into our data stack.
You know, we had the LAMP stack. That's just four letters. At Mesosphere, when I worked there, there was the SMACK stack, right? That was only five letters. And the modern data stack is like this infinite number of letters. There's no set number of characters anymore. So it's like an unbounded string; you can just do whatever you want.
And that's why it's hard
because we're going to increase the number of categories for sure.
I don't think we're ever going to have just five.
There's more business analyst stuff.
There's AI things.
We haven't talked about catalogs or discoverability.
There's stuff we're going to just add to the stack over time. And will we see consolidation in each one of those categories? I think we will. And there's also going to be cross-functional things: my tool that does catalog also does quality; my quality tool also does catalog. Multiple vendors are showing up in different categories of problems, right? So will we get there?
Will all this shuffling end and just become one letter?
I don't think so.
Our future will be really interesting.
I do think that there's definitely going to be consolidation of logos.
Really, there's no way to have a crystal ball.
I don't know exactly what that will look like.
I do think, though, like I said, the number of categories will increase.
And so, actually, it's even debatable which categories are even worth being a category in the modern data stack in the first place, and then even figuring out what is the cross-section of things that people really like.
Yeah, 100%. And I think if someone sits down and lists all the different categories that appear, even if you take data quality, for example, there are probably subcategories inside that, you know. So yeah, absolutely. And I think that's one of the reasons people should keep in mind why it's so hard to define: it's just too early. All these vendors are appearing right now because now the market
is ready for that, but we still need time to define exactly all the categories and all these
things. Paul, you're last, but for a good reason. You, let's say, represent the most important stakeholder, which is the customer and the user. And you have implemented a data stack at a very large scale. So what is the data stack for you? And what are the essential components
of this data stack, of the modern data stack? So I think it's related to two pieces.
It's easy to talk about data in the abstract, but really like thinking about the end purpose
of all of this machination is important at each step.
Like we're doing all of this work so that people can answer questions to drive the business
to do particular things; tangible decisions have to be made as a result of all of these systems. So for me, it's mostly about trustworthiness, even over performance. Although, trustworthiness versus performance: if you don't have anything to look at, then it's very trustworthy, but it's not very useful. But yes, you have Confluent, KTables, Materialize, all of these
sort of vendors in this space are dealing with the abstraction question that Brandon
was talking about earlier.
And like, they're doing it in a way that is observable.
So for me, it's like, I want to know what is happening at each stage.
I want to have an eye into the opaque box and see what decisions it's making about my data. And so correctness and trustworthiness seem to be the most essential piece of the modern data stack to me, because if you're serving
up answers that are not provably correct, then you lose stakeholder trust.
And then also, you lead your business
to make the wrong decision,
maybe at a crucial time.
So you have to make sure
that you're telling people the truth.
Yeah, 100%.
I think that's a pretty hard problem to solve.
And I think there's a lot of debate
on what's the best way to implement quality.
I remember having discussions
where there was a question,
is quality something that can be implemented by one system
or should it be the responsibility of each part
of the data stack that you have, right?
Is the pipeline responsible for its own quality
and then the storage for its own?
I think it's a difficult problem to solve, both product-wise and engineering-wise. But that also makes it so interesting and fascinating.
Eric, the stage is yours.
Oh, well, perfect timing because I was thinking, Paul, I have a two-part question for you.
Following along with thinking kind of beyond the actual tooling
or individual componentry into sort of what is the outcome
of these things coming together.
Before we get there though,
I'd love to know from your perspective.
I loved your comment about thinking about this
in the abstract.
And when you think about the modern data stack
in the abstract, you sort of arrive on the scene
with like this amazing modern tool set, right? And in reality, different
parts of the stack become modernized at different rates. Could you speak a little bit to what that
dynamic is like sort of managing a large scale, you know, complex data stack?
Sure. I mean, I think it starts with your message bus, however you pass information from one group inside your organization to another. And, well, I guess way, way back in the day, you had maybe one database, where everybody would be like, go ask the Database, capital D. And it's like, well, that doesn't really work once you scale past enough people to sit in a room, you know?
So, being able to communicate between different parts of your organization means that, and people love to fetishize Conway's law or whatever, but you have these sub-organizations, or sub-organisms, that have to learn to trust each other as well.
So like if you're modernizing at a different speed,
you have to sort of prove that your piece is trustworthy, first of all,
but also will work with the other organization's piece.
And so, passing data back and forth, I know RudderStack does that. That's your bread and butter.
So it's being able to have inspectability into, or an eye into, the systems that communicate internally. So you're taking data in from the outside, and there's the idea of a single source of truth, or what have you, but you end up having multiple views on that data. Or, I suppose, it does come back to trust again. So that's kind of the main piece. You have your overall data strategy of, like,
this is how we're going to manage lineage, for instance,
throughout the organization.
This is how we're going to manage.
Just like knowing what did we know?
When did we know it?
How do we know that we know what we know?
That's kind of circular.
Sure.
It's so fun to talk about the specific
tech, but it's a really helpful thought pattern to put trust at the center and say, you modernize
your stack according to the ability, or sort of need, of various components to deliver trust. So, second part of the question: building trust, especially as you have a stack
that's sort of increasing in complexity is non-trivial, right? And I know there's some
tooling around that. My guess is that on the ground in a lot of companies, you have a team
that's sort of managing that across a tool set. So let's say trust is sort of a central use case.
Are there other use cases in the business?
I mean, analytics is a primary use case, but let's just sort of assume, let's abstract
the modern data stack and say, you have a great set of tools in every part.
What are the other use cases that are hard to build for?
And are those, you know, sort of still in the analytics space?
Are those enabling teams who are delivering like the actual experience to the end user?
I'd just love to know sort of, great, you have all the modern tools, but what are still
the hard problems to solve, even if you have best in class?
Well, I mean, the answer is: it depends. That's always both true and disappointing. But if you sort of focus on that "it depends," then you're focusing on flexibility as your method of delivery. I'm thinking of MLOps, for instance. Everybody seems to be reinventing MLOps for themselves and being like, our use case is super different from yours; we're doing statistics on numbers. And it's like, cool.
But so, talking to individual ML engineers, or somebody who's building out an experimentation platform, or integrating an experimentation platform, or something like that: getting them enough information to answer the questions that they need to answer now, without building a behemoth of a thing where you're like, this solves all of human knowledge, and they're like, cool, we needed you to not do that and give us an answer tomorrow. So, talking with individual people. I think that's a lot of companies' value set. At Hinge specifically, we have a lot of individual power to make what we think is necessary for the space. And so, having the flexibility within your team to make whatever "it depends" means, to deliver the answers that you need to answer. You know, yeah, that's a word.
Yeah, that's great. That's, I once had a coach who said, you know, there's not sort of like
advanced maneuvers. There's just mastery of the basics and then combining those
to apply to like a really specific situation in a really specific, you know, context on the field.
And so I'm glad that you said it, because there's not a tidy answer to that; it really does depend. And I think, to your point, if you have the core there
and a team that has a tool set that allows them to be flexible to address those needs,
I think is a great answer. All right, well, I think we're coming up on close to time here,
Costas, why don't you go ahead? I think we have time for another question or two.
Yeah, actually, we have two or three questions that are coming from the community. These are not questions that everyone has to answer; I'll try to find the right person for each one, but I think it makes sense to try and get some quick answers to these. And I'd like to start by asking
Brandon about ELT. It's something that is heavily associated with Fivetran. And many people argue that, actually, ELT is nothing new; it's something that has existed since forever. Help us understand a little bit better: at the end of the day, what's the difference between ETL and ELT, and how new is this ELT thing? Yeah, my thoughts on this are pretty similar, I'd say, to
George, our CEO's thoughts. Everyone's right: ELT is not new. You can use other tools to do ELT too.
You can script your own nightly dump: full load, dump and refresh. Effectively, that's what you're doing with ELT too. If we're just taking it at face value, taking everything in the EL and then transforming in the warehouse, you can do that with almost any tool. But the difference between what Fivetran does, the differential between what we do and the process taken at face value, is all of that backend stuff. All of the things of, hey, how can we make sure we're not just doing a full nightly pull and crashing all of your systems?
How can we make sure that we're using the most efficient way possible to actually read
this sort of information?
And how can we make sure that as your data updates over time, where I'm categorizing
updates as schema changes or updates to values in previous rows or net new rows, how can
we make sure that all those changes are pushed?
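A toy sketch of the approach Brandon describes: instead of a full nightly reload, sync only the rows changed since a high-water-mark cursor, and upsert them so updated values and net-new rows both land in the warehouse. The table and column names here are made up for illustration, and this is in no way Fivetran's actual implementation:

```python
import sqlite3

# Hypothetical "source" database standing in for a production system.
source = sqlite3.connect(":memory:")
source.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, updated_at TEXT);
    INSERT INTO users VALUES
        (1, 'a@example.com', '2022-01-01T00:00:00'),
        (2, 'b@example.com', '2022-01-03T09:30:00');
""")

warehouse = {}                        # id -> row, standing in for the warehouse table
cursor = "1970-01-01T00:00:00"        # high-water mark left by the previous sync

def incremental_sync(cursor):
    """Pull only rows changed since the cursor, upsert them, advance the cursor."""
    rows = source.execute(
        "SELECT id, email, updated_at FROM users "
        "WHERE updated_at > ? ORDER BY updated_at", (cursor,)).fetchall()
    for row in rows:
        warehouse[row[0]] = row       # upsert covers both new rows and updated rows
    return (rows[-1][2] if rows else cursor), len(rows)

cursor, pulled = incremental_sync(cursor)     # first run pulls both rows
source.execute("UPDATE users SET email = 'b2@example.com', "
               "updated_at = '2022-01-05T12:00:00' WHERE id = 2")
cursor, pulled = incremental_sync(cursor)     # second run pulls only the changed row
```

Schema changes, deletes, and exactly-once delivery are exactly the backend complexity this toy version ignores and that managed EL tools handle.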
And the evolution of the term ELT, from my point of view, will continue more and more.
Oh, sorry.
The value of the efficiency can't be overstated, because it'll only continue to be more important,
especially as we move towards more real-time analytics.
And part of that ELT too is also making sure that we're doing everything that Paul's talked
about.
And I think Paul has a very interesting point of view on the data tracking piece, right? All of that, making sure that you understand where your
data is flowing end to end, making sure that you're staying within compliance, making sure
that your data stays secure and adheres to ever-changing rules and laws as they relate to
data privacy, like GDPR or CCPA, any of those. I think ELT is nothing new, but the way that we're approaching it, and our interoperability (I really like that term) with the other tool sets, really enables the best, most efficient use of ELT, so that you have your data to work off of.
Yeah, yeah.
I totally agree with you.
What I think confuses people a little bit is that many things, probably like everything in technology, are not new. We keep reinventing things.
And there is a good reason for that. It's because we have different needs and different technologies that we have to work with.
That's why, I mean, the database was invented back in the 70s, but we are still building databases, right?
And we create new categories of databases.
The same thing is also true with ETL. The moment we had the database, we started needing ETL; we are still doing ETL, still reinventing ETL, and implementing it in different ways. So I don't think people should be
mad about using terms that have been used in the past or processes that we keep doing from the
past. We just need to understand that there are good reasons to iterate on these technologies and create a different version of it, right? At least that's
how I see it. But I totally agree with you. And I think it was a great explanation of why ELT is
something that we use more often today. Next, I'd like to ask Amy something which, being at dbt, I think she is probably the best person to answer. Traditionally, SQL has been a very, very hard language to apply all the best practices that software engineering has come up with, right? And it's one of the reasons that it was always a little bit of, let's say, a second-class citizen as a language, right?
One of the things that's really, really hard is testing. It's very hard to maintain a code base of SQL and do testing and unit tests and all that stuff that we have in software engineering.
How do you see this from the perspective of dbt?
Because my feeling is that dbt has added a lot of value on this and has changed things.
So I'd love to hear like your opinion on that.
And like, what, where do you think, what do you think is missing?
And what do you think has been solved because of dbt?
Yeah.
So, for folks that aren't familiar with testing in the dbt context: a test is any SQL statement, and the test passes if that statement returns zero rows. We use tests in dbt to look at quality, both data quality and just kind of unexpected
things that are happening in your data pipeline. And those tests can then either provide a warning to an analyst or someone that something
is unexpected kind of in this data pipeline, or also that something is broken with the
idea that you can kind of find those things before your end user of your pipeline finds
that.
Again, this is really a point of a lot of innovation in the community: people using the dbt testing functionality to implement things like test data sets that you can load in and see, okay, did this operation do exactly this at this step along the way? There are lots of folks who have posted ways and frameworks of doing this, of testing data using the dbt test framework.
So it's definitely something where we're seeing a lot of innovation and a lot of kind of great
ideas.
And we're kind of providing those tools to the community so that folks can kind of develop
that set of best practices.
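To make the pass/fail rule Amy describes concrete, here's a minimal sketch of the semantics: a test is a query that selects the bad rows, and it passes when zero rows come back. This uses SQLite and hand-rolled helpers purely for illustration; it is not dbt's actual implementation, which generates and runs queries like these from your project's schema files:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 19.99), (2, -5.0), (3, NULL);
""")

def run_test(name, sql):
    """dbt-style test: the query selects failing rows; zero rows back means pass."""
    failures = conn.execute(sql).fetchall()
    return name, ("PASS" if not failures else "FAIL"), failures

# Roughly analogous to dbt's built-in not_null and unique tests, plus a custom one.
tests = {
    "not_null_orders_amount": "SELECT * FROM orders WHERE amount IS NULL",
    "non_negative_amount":    "SELECT * FROM orders WHERE amount < 0",
    "unique_orders_id": "SELECT id FROM orders GROUP BY id HAVING COUNT(*) > 1",
}

results = {name: run_test(name, sql) for name, sql in tests.items()}
```

Here the NULL amount and the negative amount each surface as a failing row before any dashboard consumer would see them, which is exactly the "find it before your end user does" idea above.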
Okay, that's great.
I mean, what I would add is that people keep thinking that testing is still something really hard to do with SQL, and, as we said, so is applying the rest of the software engineering best practices. But this has actually changed. There's still a lot of innovation happening there.
But I think that with the introduction of DBT,
many of these problems are not that much of a problem anymore.
And it's more about trying to figure out
what's the best way to do it,
not whether we can do it or not, right?
So that's why I also wanted to ask you about that
because from my perspective,
it was one of the huge contributions of DBT
as a framework
to the community. Cool. Next one is Jason. I have a bit of a provocative question for him, but I cannot help myself; I have to ask it. Do you see a future where...
Before I ask that, many times during our conversation today, I think almost everyone mentioned that storage is probably one of the most important parts of this stack
that we are talking about.
Without storage, we have nothing.
Do you see a future where the data warehouse is not at the center of the data stack
and another technology like the lake house, for example, or the data lake,
or I don't know, something exotic that we don't even know about it might take this position that
the data warehouse today has in terms of how, let's say, important it is for the data stack?
Yeah. I mean, I think the data warehouse is, I don't know if it's ever really been in the center.
I think it's been on the edge, to be honest. Because if you ask people who have a data warehouse, do you also have a data lake? Most of them will say yes. I think if you ask them, do you have more data in your data lake or your data warehouse? Most of them would say most of the data is in the data lake. And in fact, the data that's in the data warehouse, it's not that it only resides there; it's basically just a replicated copy of data that's already in the data lake.
So I feel like the reason why people were attracted to data lakes in the first place was that it's an
easy, cheap way to put any type of data. So not just for the structured data that you might need,
but unstructured data like pictures and videos. I worked with one customer who was an advertising agency, and they were doing a campaign for a soft drink company,
and they wanted to look at social media pictures to see which pictures had that soft drink in it
and who were the demographics of the people in the picture so that they could better market their
product. And I don't know if you could do that in a data warehouse. It'd be really hard to do that.
And so I feel like those types of advanced use cases,
they're just going to get more and more.
And the people and the companies that can master that type of analysis
are basically going to get an edge over their competition,
which is why you see the top 1% of the Fortune 500 being the FAANG companies.
These are the companies that have mastered how to do this stuff.
And so I do feel like the data lake is not going to go away because that is the place to do this granular machine
learning type of analysis. And I do feel like now you can do high-performance interactive queries on the data lake; we at Databricks recently broke the data warehousing record with the TPC-DS benchmark. So we were able to prove that you can have fast queries on a data lake, and you don't need a data warehouse anymore.
So I think, if you wanted to consolidate to fewer things in that stack, I don't know why you would have to have a separate data warehouse for these things anymore.
Yeah, that's great. Exactly the answer that I was expecting, to be honest. Thank you so much. I'm personally very, very interested in this space, in what's happening today with all the innovation in data lakes. It's not just Databricks; you also see stuff like Hudi and Iceberg out there. There's a lot of innovation happening there, and it's very, very interesting to see what will happen in the future. So hopefully we'll have the opportunity in a couple of months to chat again about that.
So we are at the end here.
And I'd like to close our panel today with one last question that I'd love for you to answer with, I don't know, just one word if possible, right? And the question is: what next technology, let's say, makes you very excited? What are you waiting to see in the market around the data stack, and what would you like to share with our audience out there in terms of exciting things that are happening?
And let's start, Jason, with you.
I'm kind of interested in the governance layer for these stacks.
I think in the A16Z diagram, it's at the bottom somewhere.
But I think it's at the bottom because you kind of have to have it.
And all these different tools have got different ways of governing either their data assets or AI assets
or whatever. And so I think unifying that is going to be something that's going to be interesting.
And then just the sheer number of open source data catalogs that have come onto the scene in
the last few years, I think it's like half a dozen of them and maybe a handful of commercial
companies that are behind those,
like seeing how that plays out is going to be interesting because I'm a big fan of open source,
but I feel like it's hard to have like six open source projects of something that does the same
thing. You usually end up with like two. So it's going to be interesting to see how that whittles
down. All right. That's interesting. Amy, what about you? Yeah, I think the idea of headless BI.
So, the idea of keeping your metrics in one layer that can then feed all kinds of various BI tools, including those sets of BI tools that are very specific to industry use cases.
So I think that's going to be really interesting.
So everyone in the organization kind of being able to interrogate data using tools that are real specific to kind of their use case,
but all kind of in one source of truth. Okay. That's very exciting. Brandon.
I have a very similar answer to Jason's, maybe with a slightly different approach. I think a lot
of these data cataloging tools are very interesting to me because one of the problems that I see with
a lot of analytics teams is just the rate of onboarding, understanding what tools they're working with, understanding what field definitions are.
So any tool that could solve for that, whether it's a kind of explicit catalog tool or some
other data dictionary evolution, it would be fantastic to have.
And last but not least, Paul.
I'm split, honestly, because personally, if I didn't have to think about value to the business, I would say bespoke ML, and the people and companies like Hugging Face or something like that, which are delivering easy ML that you can just deploy, via SageMaker, via whatever tool, and get back fast answers to your questions. But I think that's getting at my real answer, which is democratization. Similar, I think, to Amy's answer, but allowing each person to ask the questions that are most pressing to them without friction. So, giving them
access to the most relevant parts of the data
without confusing them with irrelevant pieces.
Okay.
Thank you.
I'll give the microphone to Eric.
I really enjoyed this panel today.
I hope that you also had fun, and hopefully our audience will be a little bit wiser after today's panel and understand a little bit better what the modern data stack is. Eric?
Ah, yes. Well, the live stream listeners got to experience that, but the podcast listeners won't, because we'll edit it out. This has been amazing.
I learned a ton. And I think one of the big takeaways is that it's an evolving question
and we're all working on some pretty hard stuff here, but some pretty exciting stuff that's
enabling all sorts of interesting use cases, technologies, and job roles inside a company.
So thank you for your time. Thank you for helping us understand all of these things on a deeper level. And we'll catch you on the next show.
Have a good rest of your day.
What a treat to hear from so many great minds in the data space. I think one of my big takeaways
is that the data stack is changing. It has changed, right? If you think about five
years ago, there were tools that didn't even exist, right? And now a tool like DBT is a key
piece of many data stacks, right? It's changed drastically. And then hearing Timothy talk about how he views the modern data stack in the context of investing in companies, it was so interesting to me that he said, you know, you can kind of apply that term to like so many different pieces of the puzzle here or to the whole.
And so I think part of the dynamic nature of the stack and how it's changing and increasing in complexity makes it a pretty hard term to nail down.
And I really loved how, when we asked questions to Paul from Hinge, who's doing this work every day, he's a brilliant guy, which was clear from his answers.
But it was hard to answer these questions, I think, because of some of those reasons.
So it was really helpful.
And I think it's, sure, it's a marketing term,
but I also think all the factors
of sort of the dynamics and complexity
make it hard to nail down.
What do you think?
Yeah, first of all,
using marketing terms is not a bad thing.
I mean, there is a reason
that we have marketing out there
and it's not like just evil reasons
behind it. Whenever we build something new or we are, let's say, reinventing something,
we need to also invent new terms and new language so we can talk about it. So the modern data stack is another attempt towards that, nothing else. But it's very strong evidence that something is happening in this
space. And that's what we have to keep in mind. There are so many things happening. As Tim mentioned at some point, you see so many new categories appearing every day. Things that were just monoliths in the past are not monoliths anymore; the products from the past are broken down into a large number of other products.
I mean, that's a good thing.
That's an indication that
many very smart people
are trying to figure out
how to make things better, right?
And that's why I think that
even if we manage today to give a definition of what
the modern data stack is, probably like in a couple of months from now, it's going to be at
least slightly different, right? And that's fine. It's an indication, a testament, I would say, that we are living in very exciting times for anyone who's in this space and working with data. So that's what I take from our conversation today.
As always, a very thoughtful and concise reflection from Costas.
Of course, of course.
All right.
We'll have another panel coming up in early 2022,
which will be super exciting.
So we'll let you know when that's going to come up
so you can register for it.
Yeah, and that was our first one.
So if anyone who was listening to it has any suggestions or criticism or whatever, please reach out.
Oh yeah, you can go to datastackshow.com, and we have a contact form on the site now at datastackshow.com.
So please, we'd love your feedback
and ideas on a live stream that you'd want to see.
Yes, please come. We want to be friends. Absolutely. All right. We'll catch you on the next show. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on
your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric at datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by RudderStack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.