The Data Stack Show - 68: Season Three Recap: Holiday Edition with Eric Dodds and Kostas Pardalis
Episode Date: December 29, 2021
In this episode, Eric and Kostas look back over the great topics and guests from season three of the Data Stack Show. The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data. RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are
run at top companies.
The Data Stack Show is brought to you by RudderStack, the CDP for developers.
You can learn more at rudderstack.com. If you are going to the Data Council Austin event
in January on the 27th and 28th,
you're definitely going to want to meet Kostas
and me in person at El Mercado on the night of the 26th.
We will buy you a drink and talk all things data.
We will be on site for the conference
and we're super excited to meet you. Kostas,
tell me what you are most excited about
asking our listeners
if they actually show up to meet us in person.
I don't know if I want to ask something.
I was thinking that like,
maybe I'd love to play the game
where I say this is interesting
and then we all do shots.
So yeah, if you come and visit us,
like you will have the opportunity to play this new game
that we don't have a name for yet,
where I say this is interesting
and we all take a shot of tequila or something like that.
Maybe you can play that game.
I don't want to do that many shots of tequila,
but we would love to meet you in person.
It's going to be a great conference
and we're excited to meet some of our listeners. So come by January 26th. You can reach out to us
on datastackshow.com, fill out the contact form, let us know you're coming and we'll buy you a drink.
See you there.
Welcome to the Data Stack Show season three recap. Kostas, I can't believe we
recorded three seasons of shows.
It's kind of crazy.
I think we have 80 shows in the books.
Of course, not all of them are quite released yet
because we do recording ahead of time,
but that's pretty wild.
Did you think that we would get this far when we started?
Oh, yeah.
No, I mean, I never expected that it's going to last that long,
to be honest.
Yeah, it's been quite a journey.
I think that we are going to be, you know, like shows like Friends and all that stuff.
We are getting closer.
Okay, so I want to talk about a couple of specific themes that arose that are really interesting.
But first, I want to ask you a question.
So when you messaged me on Slack and said, let's do a podcast. And we hopped
on a call and we talked about it. You said, I want to talk to the people who are doing interesting
things in the data space, both so we can just learn about what's happening out there and then
meet the people behind it. Do you feel like you understand the data space, the data stack better as a result of doing the show?
Like, what's sort of been the impact for you personally, as you think about it? I mean, you work in the data space every day. Like, has it been helpful for you?
Okay. That's an interesting question. I feel like I got answers, but it also created new questions, right?
But I think at the end, what matters is to try and get in contact with people
and see how they are thinking and why they are doing the things that they are doing.
Because at the end, the market is so young.
There are so many things happening.
Not all of them are going to survive.
And of course, not all of them have the best way to do things or whatever.
So we don't really know yet what will be happening in a couple of years from now.
But having like this kind of contact with passionate people who really love what they
are doing, like, okay, we had people joining us, all of them.
I mean, they have done like amazing stuff, right?
Very smart people, very honest people, like with why they are doing the things
that they are doing. So yeah, I think that's, for me, the most important part of this show:
this kind of connection with all these people. I think it's what really keeps both
of us, I mean, okay, I'm talking more about myself, but I think that also applies to you,
but I think it's what keeps us doing it. So yeah, I mean, okay, we say that we do it because
we want to share things with other people, but we are also selfish, right? So primarily we do it
because we're having fun and we meet all these nice people.
Yeah, super fun. It is kind of a
paradox because I agree with you. I think questions have been answered, but it's a paradox in that, you know, I think
some of the simple, some of the more simple things where we think about technology around
data warehouses or data lakes is sort of becoming crystallized across stacks across the board.
You know, it's kind of like, okay, we see patterns emerging there, but then you talk with, you talk with people who have developed really groundbreaking technologies
and a lot more questions open up, right? Because these people are really sort of pushing the
envelope of what can be done, which is super interesting. Okay. Let's just cover a couple
quick themes here of what we talked about. The first thing I want to ask you about
is, so I'm just going to rattle off the main themes from the episodes that I jotted down
in my notes as I was reviewing the season. So we talked a ton about ML. So machine learning
as a service, MLOps, the emphasis is kind of saying, okay, ML may be like the next step beyond
analytics, right? So
data stack to serve analytics. And then once you get that sorted out, it's sort of you serve ML
use cases. We talked about batch versus stream, which was super interesting. And then sort of like
federated data, which was really interesting. And so sort of that tension. And then we talked a lot
about observability. Actually, we talked to several companies who are trying to solve for the challenges that you run into with all this data. And sort of even the thinking around how you deal with data
is increasingly adopting thought patterns from software engineering. And that actually is
reflected both, I would say, in the team structure, as well as the tools that people are trying to
build. Observability, for example, right? I mean, that's sort of a direct adoption. So,
as a software engineer, tell me what you think about that. That was just a consistent theme
we heard throughout the entire season. Yeah. I mean, I think it's very reasonable to happen.
There is a reason that we have all these different disciplines that have the term
engineering in them, from mechanical engineering
to chemical engineering to software engineering to, I don't know, whatever other engineering
we have. Social engineering. Yeah. I mean, at the end, when you
engineer something, there are some very specific principles that are shared across
all the different disciplines, right? And I don't think that data would be something different.
I think actually it's like an indication of this space maturing.
That's what is happening right now.
So it matures.
It has to be much more serious.
It has to deliver much more consistent results.
And that's when you start moving, let's say, from the experimentation phase to the engineering phase, where now you need to put processes in place.
Now you need to ensure quality.
Now you need to observe things and make sure that they work in the way that they should be working, right?
So how do you do that?
I don't know.
I mean, obviously, like in data, things are different compared to infrastructure observability, for example, or like whatever else.
But still, the principles remain the same. We have a process. We need to observe the process.
There are some data about these data or these processes.
So we have some metadata that we need to track and try to reason using these numbers and see like, OK, can we trust the data?
Can we trust our pipelines? Can we trust our data lake or whatever?
So we are going to see more and more these principles being applied with anything that
has to do with data. We have companies that are doing versioning, for example. I don't think we
had anyone on this season, but there are companies out there like Pachyderm, for example, they are
doing data versioning, right?
GitOps, I mean, at the end,
we will get something like GitOps for data.
So what I'm trying to say is that
if we try to detach ourselves from what is happening
and like take some distance,
we will observe,
and that's something that we've said many times, right?
That the data engineer today is like a role
that's a hybrid between engineering and operations, right?
That's again an indication that we are still early. Probably this is going to break into
different roles, and then you might have data and data engineering, right, where someone is responsible
for writing all the stuff that we need to execute there, and then we have someone who's
operating all this software or whatever. And yeah, I think the next couple of months, maybe years, like one or two years, are going to define exactly how this discipline matures.
I would say, though, because you mentioned at the beginning the ML part and that we consider ML as the next step or whatever:
if I learned something, it's that ML is actually not the next step.
ML is something that's out there, right?
What is happening, though, inside the companies is that ML and analytics are kind of like two different functions, right?
And in many cases, this also reflects on the infrastructure that the companies are using.
You'll see a completely different infrastructure
that ML is using compared to the BI function, for example. One is using a data warehouse,
the other might be using a data lake. I think one pattern that we are going to see a lot,
especially with the lake house paradigm or whatever, is the merge of these two into one.
So we will see that everyone inside the company
is going to be using one infrastructure.
And if you want, like, okay,
I'll refer to a term that we usually make fun of,
which is the data mesh.
If there is, like, as I see it right now, value in this term, it's exactly this unification.
We are going to have one infrastructure for all the data practitioners inside the company.
We are not going to have them all separated as we have them right now.
And I think this is happening now.
How exactly is it going to happen?
Which paradigm is going to succeed at the end?
Who is going to, like, if it's going to be called data mesh, data networks, I don't know.
I mean, it doesn't matter.
But there is a unification that's going to happen in terms of, like, how data is accessed and how it's used inside the organization.
I agree with that. And I think a big driver of that is sort of right? If you think about common data schemas,
tooling that can sort of enable common ML use cases on top of existing technology,
there are a lot of things that are making it way more accessible, which is super exciting.
Let's talk quickly about observability. So we talked with a couple of companies, Bigeye and Lightup, and then it was
a common topic in general, but what do you think about observability, right? And so, and let me,
I'll give just a little bit of context here. So we said in one of our recent episodes,
the stack is expanding, right? It's not contracting in complexity. It's actually
expanding in complexity, which creates all sorts of problems in terms of
being able to understand whether there are problems across the stack.
What do you think about the observability space with data?
Is that a, I mean, is it a huge need?
Do you think that those companies are solving like a really true problem?
What are your thoughts?
Yeah, obviously they are solving like a problem.
There's no discussion on that.
The thing is that I think it's still early
when it comes like to how observability
can be successfully implemented.
I'll give an example, right?
Like we had, if you consider not just this season,
but the whole show, right?
In the past, we have also
talked with companies like Avo, for example,
right? Who are,
they don't call it observability, they call it
like quality, right?
And they are focusing more on like
the streaming side of things,
while companies like Bigeye,
at least now, are focusing more on like
the data that is at rest in the
data warehouse to figure out
what's going on there. But we see that we have, let's say, two sides of the same coin. Again,
it's about data quality and figuring out if we can trust our data. Now, what's the best way to do it?
Is it best to rely on an architecture where everything happens on the data warehouse?
Or you have a more decentralized architecture where quality and observability is something that fits part of the whole workflow that we have and the whole stack?
That remains to be seen.
I think right now all these companies, they are tackling the same problem from a different angle. And at the end, the market is going to decide who's going to win based on which one of
these is, let's say, the most important angle. Because at the end, what happens like with markets
in general, we have consolidation and like we end up with a platform that does everything,
blah, blah, blah, like all these things. I agree. My hot take is that I think there's going to be some combination of both.
If you think about the sort of micro problem of data quality and capture, that's really, really important for certain teams in a localized sense, right? So if we have data coming in,
that's driving some sort of like very personalized experience, for example, it probably makes sense to be like very rigorous on capture.
Now, I'm not saying that's not important for analytics and other things, but I think about observability as sort of a more comprehensive solution that crosses certain points of the stack as opposed to a rigorous
approach to ingest. But like you said, it remains to be seen. It's a fascinating problem. I think
maybe one of the most interesting ones beyond maybe data lineage, which has come up a couple
of times.
Yeah. Okay. That's what you saw solve at the end. But I think it would be interesting to
have, and maybe that's something that we should include in our shows from now on, like to also
interview, or not interview, I mean chat with VCs who have invested in this space, because
the thing is that we are talking about product categories that are so new that, I mean, you
don't know.
Whatever you are going to see today probably is not going to be true in a couple of months
from now, right?
Yeah.
So it would be interesting to see these people that, okay, they invest their money and they
have every reason to do it as early as possible, why they do it, and what is the thesis behind
that for this stage of the market?
Because, okay, if we are talking about data warehouses, I mean, it doesn't make sense to ask the investors.
It's better to go to the companies right now.
But for observability and quality, I think that it's the right time where it's going to be much more interesting to hear what
a VC has to say, not even the founder.
Sure.
Yeah.
That's such an interesting proxy for sort of what the vision is of the problems that
are being solved as people sort of look at their horizon.
Yeah, because from a portfolio management perspective also, that's something that these people are doing.
There are also correlations between all these different companies
and their investments.
So it would also be interesting to hear on how they see this category
related to other categories in data that they might also be investing in.
Anyway, I think it's something that I think it's worth doing, like find someone who is very active in investing in data-related companies and get them on the show.
All right. Listeners, if you hate that idea, go to datastackshow.com and fill out the form and
tell us because if not, we're going to get someone on the show. Okay. Last question,
because we're coming up to time here.
One other subject that we discussed a lot was the modern data stack.
So we talked about this with someone who's been at Mixpanel for over a decade, and they're sort of migrating to this paradigm where they view the warehouse as an essential component
of the data stack, which is really interesting for, you know, sort of a product analytics
company. And then we had a panel with DBT, Databricks, Fivetran, Hinge, and then actually a VC that's
pretty active. That may not have actually made it into the season three. So that's a preview
for everyone coming up. I have mixed feelings about the subject of the modern data stack.
In one regard, I think about some of
the episodes where we talked with people who just sort of assumed the basic components.
You need like good ingestion. You need a single source of truth. You need to be able to move data
easily. You need a sort of flexible pipelines. And that was kind of like, what are you doing
with the data? And then we also had episodes where people were talking about serious problems with sort of
any one of those components of the data stack. But I think probably one of the most interesting
things was just hearing people who are practitioners actually trying to explain
what it's like to use the modern data stack.
And they just have a way lower emphasis on the tool set and more on what it enables them to do, which I think is really interesting.
So with that theme, do you feel like you understand the modern data stack better?
Or are you more convinced that people like me are making it into a marketing term?
Okay, I don't think that there is, let's say, some kind of clear definition of what this modern data stack is.
There's no such thing. I mean, and we did, I think, a very good attempt to make things more clear on the episode that we recorded.
But I think the consensus is still that, okay, it depends. And probably what is today,
it's not going to be like tomorrow. And usually when you attack these kinds of problems where you
have very, let's say, semantic issues, like people cannot agree on the definition,
I think you need to, again, take some distance
and focus not that much on the definition,
but on the words that we are using and why we are using them.
And why do I say that?
The most important thing at the end is that whatever is happening right now in the market
is going to be a stack.
And what's important about a stack?
You don't have one component that can work on its own.
No matter what, this is not going to be like I'm going out there and I'm buying a CRM where
I go to Salesforce and that's it.
It works.
In SaaS, for example, which was like,
let's say the previous wave of innovation,
you didn't talk about a SaaS stack.
I need Shopify together with CRM and I don't know,
like Marketo.
Some email tool or whatever.
Yeah, like you didn't need all of them together
in order to have something that operates.
You could have each one of them.
Did the companies at the end buy all of them?
Yeah, they did, but they didn't go out there to buy them as a stack.
That's what I'm trying to say.
So what is, I think, very, very interesting and very, very important
is that this is a space where synergies are very important.
There's no one tool, one platform that will come and be like,
we are doing everything.
Even if we are talking about Snowflake, right? Or, you know, like Databricks or Google even. It's not like I can go to Google
right now and not use any other tool to do my job. No, you will need probably to use something for
pipelining or something else like for, I don't know, for versioning or observability, whatever. So I would say that for people that they are getting angry with
and thinking that this thing is like a marketing term
and it's just used by the market to convince them to go and buy,
don't think in this way.
Don't be defensive.
At the end, focus on the words that are used
and the terms that are used
and try to understand how this is going to affect your work in the future. Because as a buyer,
you will never be a buyer of one product. You will always have to choose many different products and
how they work together is going to be important. That's why we also see that partnerships in this
space is something that is starting in companies much, much earlier
than what happened with the SaaS companies of the past, for example.
So yeah, if you want my opinion, that's what I would say about data stack and the importance
of the modern data stack.
And the rest about the definitions and who's going to be the winner of each one of the
data stack parts, it remains to be seen and we will see.
It doesn't matter at the end that much for the market, right?
I mean, for the owners and the people who work there, it matters a lot.
But for the market, it doesn't matter.
I agree.
Well, we're at time.
Let me just do a couple of quick thank yous.
We talked with Ben, the Seattle Data Guy, from Facebook,
Ananth, who runs the Data Engineering Newsletter,
and his day job is at Zendesk.
Great episode.
Tristan from Continual AI, James Serra at EY.
Bart, who runs the Data on Kubernetes community, which is great.
Of course, we mentioned Mixpanel, which is a really fun episode on the modern data stack
and the warehouse.
We also talked with Pete Goddard from Deephaven,
and that's a really interesting episode on sort of the difference between batch and streaming
and doing stuff extremely fast. They have some pretty cool stuff going on there.
Jeff Chow from Stripe, stream processing was a fascinating conversation. Definitely check
that one out if you haven't. We talked with Igor from
Bigeye, Scott from InterSystems, who talked about Data Federation, which is really interesting.
We talked about making ETL optional, another federation conversation with Jeff, or sorry,
Justin Borgman from Starburst, which was a really great conversation as well. We talked about
open source with Ashley from Benthos, really cool tool. And that was a
great conversation, a great mascot for the open source project. And he's just a hilarious guy.
We talked about data design with Kevin from Touchless Technology. We talked about IoT,
which was a great episode, not a theme, but a great episode. And we talked with Rob from
ThingLogix, and he talked about how
he uses his own technology on his cattle farm in Oregon, which was amazing. We talked about
ETL versus ELT with Matillion, which is a great conversation as well. We talked with Airbyte
about open source and ETL, which was a super fun conversation. And we talked about data teams, which was a really
interesting conversation with Srivastan, who works at Robinhood and actually has a long history at a
bunch of other data companies, which was really, really a good conversation as well. So definitely
subscribe if you haven't. That's just a quick rundown of some of the highlights of season three.
And we will catch you on the next one. Many, many exciting episodes that we've already recorded for
season four that will come out early next year. We hope you enjoyed this episode of the Data Stack
Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes
every week. We'd also love your feedback. You can email me, Eric Dodds, at eric at datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by RudderStack,
the CDP for developers.
Learn how to build a CDP on your data warehouse
at rudderstack.com.