The Data Stack Show - 22: Season One Recap with Eric Dodds and Kostas Pardalis
Episode Date: January 29, 2021

Season One of The Data Stack Show is in the books, and in this episode, Kostas and Eric take a look back at some of the biggest takeaways, trends, and topics from the season. With some great guests already set for season two, the next slate of episodes is shaping up to take an even deeper dive into the world of data and the people shaping it.

Key points in the conversation include:
- Patterns with data warehouses and data lakes (3:38)
- Looking back at the people behind the data and their stories (8:12)
- Minimizing flaws while remembering that data is built by humans, for humans (11:02)
- Using proven technology and making mature solutions (15:20)
- Data involves a significant amount of trust (23:38)

The Data Stack Show is a weekly podcast powered by RudderStack. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show, where we talk with data engineers, data teams, data scientists,
and the teams and people consuming data products. I'm Eric Dodds.
And I'm Kostas Pardalis. Join us each week as we explore the world of data and meet the people shaping it.
All right, we are wrapping up Season 1 of the Data Stack Show. We had 21 episodes in the first season, which is amazing.
It feels like it wasn't that many,
but we actually talked to a ton of different people.
Kostas, how do you feel now that you have completed
the first season of a podcast that you
started? Ah, it was an amazing experience. Actually, for me, it was also my first experience with
podcasts. And it was a much, much better experience than I expected, to be honest, full of surprises.
And as you said, a ton of people
that we had the opportunity to chat with,
and many different kinds of people, actually, in all these cases.
I mean, when we started this podcast,
the main thing we had in our mind
was that we want to focus around data, right?
And I think one of the most amazing outcomes
of all these conversations that we had in this season
is that at the end, data is, at least today, pretty much behind almost everything.
We had the opportunity, for example, to talk with companies that,
on the first look, you don't think that they are a data company or they are working with data,
but at the end, they are.
If you remember the episode with Slapdash, for example, which is a productivity tool,
but at the end it's built on top of a ton of data to do that.
Same thing also with Panther Labs, which is a security tool.
But as we were discussing with Jack, the main outcome of the conversation was that
security is a data problem.
So, yeah, it was an amazing journey, actually, and full of surprises and amazing guests.
And I'm really looking forward to continue doing this on the next season.
Yeah, I agree.
I think the concept of everything or the reality of everything running on data is really interesting
when you look back at our shows. I'm thinking about sort of two sides of that coin, and I know
there are different ways to look at it, but when we talked with Mason from Bookshop, he talked about
the data that they have to wrangle just in order to display a correct listing of books on their
e-commerce website for books. And that's sort of core to the product, right? They have to
leverage that data in order to deliver a product that provides a good experience.
But then on the other hand, I'm thinking about our conversation with Jason from Bind in the health insurance space. And he talked a lot about delivering data products to
other parts of the organization, right? So of course, they work on things that feed the core
product, but then you have data driving marketing, you have data driving product, you have data
driving sort of testing
in different areas of the company. So it's just interesting to think about data both driving
sort of the core product use cases, but also being a product itself that's consumed by other
parts of the organization. Yeah, absolutely. And if you think about it, I mean, it looks like
data is everywhere, which makes sense.
And I think that's another great, a little bit more technical outcome from all the conversations that we had, that there are some emerging patterns out there in terms at least of the
architectures that the companies are using.
I'll give some examples.
Even from the first episode that we had with Mattermost, for example,
we saw this pattern of building everything around the data warehouse, right?
Where you have like the pipelines, which is one part of the architecture.
The pipelines are pulling the data from the different sources
that the company has, all the data that they need,
and push it into a data warehouse.
The data warehouse might be something like Snowflake, for example.
And actually, one big difference compared to the past, because it's a new pattern,
is that more and more people prefer to just extract and load the data inside the data warehouse
and implement any kind of complex, let's say, transformation logic on top of the data warehouse,
which is a result of huge changes
that have happened in the space of the data warehousing,
mainly the scalability of the solutions
in terms both of processing storage
and of course cost, which is very important.
And also these platforms have become more and more powerful
in terms of the expressivity that they have, like the things that you can do.
And that's the point where you see technologies like DBT come into play, right?
Another very common layer that emerges inside companies in terms of like building the infrastructure.
With DBT, you have a layer where you can transform and model your data with all the benefits around that.
And then, of course, you have the consumption part of your architecture
where you have your BI tools connecting there
and using something like LookML, which again has to do with modeling,
but this time more on the visualization part.
And that's a very common pattern that we see.
I think it exists in pretty much every company that we talked with.
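The warehouse-centric pattern described above, extract and load raw data first, then transform inside the warehouse, can be sketched in a few lines. This is a toy illustration rather than any guest's actual stack: sqlite stands in for a warehouse like Snowflake, and the SQL model at the end plays the role a dbt model file would.

```python
import sqlite3

# In-memory sqlite as a stand-in for a cloud data warehouse.
warehouse = sqlite3.connect(":memory:")

# Extract + Load: raw events land in the warehouse untransformed,
# the way an ELT pipeline would deliver them.
warehouse.execute("CREATE TABLE raw_events (user_id TEXT, event TEXT, amount REAL)")
raw_rows = [
    ("u1", "purchase", 30.0),
    ("u1", "purchase", 12.5),
    ("u2", "purchase", 99.0),
    ("u2", "refund", -99.0),
]
warehouse.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", raw_rows)

# Transform: the modeling happens *inside* the warehouse as a SQL model,
# which is the role a dbt model file plays in a real stack.
warehouse.execute("""
    CREATE TABLE user_revenue AS
    SELECT user_id, SUM(amount) AS revenue
    FROM raw_events
    GROUP BY user_id
""")

for row in warehouse.execute("SELECT * FROM user_revenue ORDER BY user_id"):
    print(row)  # -> ('u1', 42.5) then ('u2', 0.0)
```

The point of the pattern is that the transformation logic lives next to the data, so it can be rerun, versioned, and scaled by the warehouse itself rather than by a separate processing system.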
But there is also another pattern,
let's say, that emerges
which has to do with data lakes.
And the data lake is something
that we see coming up more and more
when we are talking with organizations that incorporate data science in their practices.
And the interesting thing, if you ask me, is that the first architecture pattern is more about building the infrastructure for the internal consumption of data. If you notice, Eric, we talked about BI, right? In the first version, BI is like 100%
something that's going to drive your organization and then other departments, like your
marketing, your finance. With data science, there we see how data can become a product that's exposed
to your own customers. I think this is the most powerful use case around data science. Of course, you can use
data science to also do things that are consumed internally, like lead scoring, for example,
right? Or do some forecasting for your sales or finance. But I think that the most powerful
part of data science and how it is used is like in building products that are going to drive like
the customer experience. And there, the requirements are a little bit different.
We had some hints around that in the first season,
but we are going to have even more exposure to these use cases in season two.
It's one of the things that I'm really excited and I'm looking forward to the next season.
Yeah, I agree. I think that we didn't necessarily have a plan for a breakdown of the types of roles that we would talk to in terms of guests who work in the data space.
But we did talk with several people who work specifically in data science.
So I mentioned Jason from Bind.
We talked with Stephen Bailey from Immuta,
who does a ton of stuff in data science, and Arian Osman, who works for HomeSnap.
We talked with multiple people doing data science, and it is really interesting
to look at that subset of the data world where there are different requirements than,
you know, sort of your quote unquote standard, you know, pulling data, processing it,
and then either sending it to tools or preparing it for BI. You know,
one thing that I think is really interesting that really stuck out to me
was we got the chance to,
how do I want to say this, sort of look at and discuss the people behind data and
behind the technology in a variety of different ways. So first of all, I think one of the things
that I've come to enjoy most about the show is that there's lots of cool technology. There's
lots of companies doing really neat things with data,
but the people who are doing those things have really interesting stories. So I think about
Andrew from Ernest who, you know, works on real estate transactions and is doing some really
interesting things there. And just hearing about, you know, his work as a marine biologist, you know, working, you know, in the ocean. And
that was just fascinating. Stephen Bailey, who I mentioned before, has a PhD in childhood
reading or education. That's just really, really interesting to see that. But then
the other angle of that that we saw, I think, is the human element in the actual work.
So, you know, we talked with Duke Haba, who was at Cognizant and has just had a long career working in AI.
And the way that he addressed it is really interesting.
He talked about how people fear AI, or they're skeptical of AI, you know, or AI produces a bad result.
And so you tend to blame the technology. But he said, there are actually people behind that.
We need to remember that. And then the last, or, well, there's many more, but the last one I'll
mention that stuck out was Arian Osman at HomeSnap talking about building models for
sort of predicting the time it would take to sell your home,
depending on the price, you know, so you have a slider, you say, okay, if I price my home at
this number, it should sell in, you know, 25 days, you know, and of course the time lengthens if you
raise the price, but we talked about how the model, you know, it's, it's hard to train a model
to account for human perception around things like
if the price is too low or too high, you know, people have questions. And he talked about ways
that you can incorporate that. But all that to say, I just really enjoy, I think one of my favorite
things is getting to meet the people who are doing this work and hear about their lives, you know,
kind of outside of their data work, and the things that influence their data work, you know, whether that's prior experiences
or other projects.
And then also just the human element of actually doing the work of data and how, you know,
technology still requires, you know, a real human element in order to produce a great
experience for users.
Absolutely.
I think you are touching a great, great point
and a great insight that we tend to forget, I think,
especially as people who are working in technology,
which is that technology in general,
regardless of whether it is around data or not,
is built by humans and it is built for humans.
So it's not going, I mean,
it's not the responsibility of the technology, right?
At the end, if something goes wrong,
it's because our models or our architectures
have some kind of flaw.
And of course they are going to have a flaw.
There's no way that we can build something
that is going to be so complex and perfect
from the first time that we get it public. Iterations are needed, and especially for anything
that has to do with data, and especially anything that has to do with data science and
data analytics, the productization of these practices is something
very new. So it will take time. It will take time both for the people responsible for building them, to develop the best practices and the engineering practices
on how to engineer something that is going to be the best possible solution
and more predictable in the outcomes,
but also for the people that are using these tools and products.
They have to get educated, right?
Technology is still technology.
And a data science model is a model
that predicts something very specific.
It's not a human that we have in front of us
that can adapt without the intervention of other humans.
So it's a very interesting point.
I think we need to always remember that, as I said,
these are things that are built by humans and for humans.
And we have a lot of work in front of us to improve them and figure out what's the best products that we can build,
which brings me to another outcome of all the conversations that we had,
which has to do with the maturity of this market and the industry.
We saw that there are specific technologies
that are really mature right now.
So for example, anything that has to do
with data pipelines, right?
We talked about ETL that became ELT.
That's a pretty mature part of the stack
where I would say the products
are almost like commoditized right now, right?
And there are some parts of the stack
that are very mature.
Same also with data warehousing.
But there's also a huge, huge space for new products and a lot of opportunities.
And we saw that with, for example,
if you remember one of the first interviews that we had
that was with Meroxa, with DeVaris, about CDC, right?
CDC is very hot.
I mean, it's something that we see many companies right now
trying to come up with solutions and products to address CDC.
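The core idea behind CDC can be sketched very compactly: instead of re-copying a source table on a schedule, downstream systems consume a stream of row-level change events and replay them. The event shape below is invented for illustration; real CDC tools read the database's own transaction log rather than a Python list.

```python
# A toy replica table kept in sync by replaying change events.
replica = {}  # primary key -> latest row image

# Hypothetical change events, in the order they occurred at the source.
change_log = [
    {"op": "insert", "key": 1, "row": {"name": "Ada", "plan": "free"}},
    {"op": "insert", "key": 2, "row": {"name": "Grace", "plan": "free"}},
    {"op": "update", "key": 1, "row": {"name": "Ada", "plan": "pro"}},
    {"op": "delete", "key": 2},
]

def apply_change(state, event):
    """Replay one change event onto the replica state."""
    if event["op"] == "delete":
        state.pop(event["key"], None)
    else:
        # Inserts and updates both upsert the latest row image.
        state[event["key"]] = event["row"]

for event in change_log:
    apply_change(replica, event)

print(replica)  # -> {1: {'name': 'Ada', 'plan': 'pro'}}
```

Because each event carries the full change, the replica converges to the source's current state without ever running a bulk extract, which is what makes CDC attractive for keeping warehouses fresh.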
Then something that is super hot is anything that has to do
with data governance, which, by the way, is funny
because in the past, at least,
you would hear data governance and be like, okay, that's something super boring that only Fortune 500 companies care about.
And suddenly it becomes something that is relevant for everyone, right?
And you see many companies trying to address it. We had Immuta, right, which addresses just one part of data governance, access control. Then we have
Iteratively, our last episode, where the guys over there are working on figuring out how to
improve and control the quality of data. So that's super hot, and I'm pretty sure
like in the next couple of months we are going to see more and more companies appearing trying to
solve these problems. And of course, just to make the connection with the architectural patterns that we were
talking about earlier, there's a lot of space for products around anything that has to do with
what is today called MLOps, which is all the operations around how to productize machine
learning and data science.
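The data quality work mentioned above can start as simply as validating that incoming events match an expected schema before they land in the warehouse. A minimal sketch, where the field names and rules are assumptions made for illustration, not any vendor's actual product:

```python
# Hypothetical expected schema for one kind of tracked event.
EXPECTED = {"user_id": str, "event": str, "revenue": (int, float)}

def quality_issues(record):
    """Return a list of human-readable problems with one event record."""
    issues = []
    for field, expected_type in EXPECTED.items():
        if field not in record:
            issues.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            issues.append(f"bad type for {field}: {type(record[field]).__name__}")
    return issues

good = {"user_id": "u1", "event": "signup", "revenue": 0.0}
bad = {"user_id": "u2", "revenue": "12.50"}  # missing event, revenue is a string

print(quality_issues(good))  # -> []
print(quality_issues(bad))   # -> ['missing field: event', 'bad type for revenue: str']
```

Catching these problems at ingestion time, rather than after an analyst notices a broken dashboard, is a big part of why teams come to trust their data.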
And as I said, more about this on season two,
but I think it's going to be very fascinating. And just to get to something that I think is
one of your favorite outcomes, it's about the importance of boring technology, right?
What do you think about this? Yeah, that came up on multiple episodes. When we were talking with people from Bookshop, they brought up the boring stack. Sometimes I'll ask for a stack breakdown just because it's interesting to see different ways that people shape their data stacks around the needs of different businesses. And it was interesting to talk with Mason. And his response was,
we actually have a pretty boring stack, but it works. And I know that we have an episode coming
out in season two with a company called LeafLink, and we have a similar conversation there.
And I think in that episode, I won't give too much away, but we talked about the balance between building something that is scalable and reliable that works for your users and your company now
with an eye to the future, as opposed to just adopting new technology, because
it's new, right, or it has some promise, like that's not always the best decision. And so we
talked to a lot of companies who, you know, are certainly doing really interesting
things. But in many cases, they say we have a really boring, but very stable and very efficient
stack. And I just found that
fascinating. Pushing the limits is really cool, but I think we saw several very mature
developers and data engineering and data science practitioners who have seen a lot and have
implemented a lot of things and really build for something that's going to be scalable,
maintainable, and provide a great experience. But that's from my perspective. I'm interested
in your thoughts around that because you have the actual experience of building all sorts of
different architectures. So how did that hit you that we saw that repeated over and over across
episodes? Yeah, I think at the end, I mean, from an engineering perspective at least,
okay, it's always, you know, people who are in technology, they always love to play with new
toys, right? And in some cases, it's really hard to control the excitement of using like the latest
shiny thing that came out there and promises to solve like another problem or an existing problem in a much better
way. But if you approach this from a more mature engineering approach point of view, at the end,
you cannot really build something new without having a very stable foundation, right? I mean,
you can, but probably you're going to end up having some really bad nights where you won't be able to sleep
because things will go really, really wrong.
And it will be super hard to find resources
to solve your problems.
So if you want my opinion,
the best way when you are trying to build a new product
and solve a problem that wasn't being solved before
or in a new way, one of
the best choices you can make is to base the foundations of your solution on proven technology.
So what we actually see with these companies that have this approach is that there are
some really mature and experienced engineering teams behind them who know: I shouldn't focus on trying to debug
and understand how this new shiny thing works.
But instead, I should free my mind and let it focus on the problem
that I'm trying to solve with my own product.
So, yeah, I think it's a great sign of maturity
and good engineering practice at the end.
And as we approach and chat with more companies, I think this pattern will appear more and more.
By the way, just to add something here, we might think on the other hand that, yeah, but look at Google, look at Netflix, right?
They are using all these new shiny tools
and they are building their own shiny tools to do that.
But when we get into this thought process,
we forget something very important
that these companies,
they are addressing problems at a scale
that is completely new.
They are really pushing the frontiers of technology
and what is
available right now. And at the same time, they have the resources and the talent to build new
solutions that can address these unique challenges that they have, right? So from starting a company
to becoming a company that has to deal with the traffic, for example, all the reliability that Netflix has,
that's a huge, huge road ahead, right?
So that's another thing that we always need to keep in mind
when we see these amazing companies using
and building all these amazing products.
Speaking of Netflix, I mean, it was a real honor
to speak with Ioannis from Netflix
and hear about all of the various challenges they
have. But when we asked him about how they evaluate using or even building new technologies,
I think we can easily have the perception that, well, there's a bunch of engineers
at this company and they have tons of resources and so they can just try all these new things. That's not necessarily untrue, but the reality
is that they have a very balanced approach to making those decisions, especially around
building something new. And there are multiple people inside the company involved in those
decisions. You know, he talked about people from the business side, even, you know, sort of thinking through like, do these things
make sense for us to build, right? We have a problem, it's creating a customer experience,
it's creating, you know, some sort of friction in a business process. But,
you know, they're not reactive in the way that they adopt or build new technologies. And so,
yeah, it was really cool to hear about that from Netflix and hear, you know, it isn't just a free
for all of exploring new technology. They actually have a very principled approach to the way that
they use new stuff. Yeah, absolutely, Eric. And also, we need to keep in mind the unique nature of a company like Netflix, right?
I'm pretty sure that if we were talking with someone who is from the, I don't know, like the production teams that they have, right, that they produce the shows, for example, what we would hear would be a little bit different than what we heard from Ioannis. And there is a good reason for that,
because at the end, Netflix, their main business is in producing content and shows, right?
Technology is there to support that. And that also gives, let's say, a different freedom
and flexibility to the teams that work to build the technical backbone of the company. If you go, for example, to a company like Snowflake, which, by the way, is the company
where Ioannis works today.
So maybe in the future, we should also try to have another episode with him and see how
the two environments are different or the technologies are different and the products.
I'm pretty sure that at this point,
Snowflake has some very strict methodologies and processes
in terms of how to introduce new technologies
or how to introduce new practices
or even change the core product that they have.
So it's very easy to get excited
when we see just one announcement
or one blog post from a company,
especially at that scale. But we should always try to remember what the company does, why
it does that, and that at the end, each company is different. And the culture is also different. And
that all these things make the whole process of how to approach new technologies very different
from company to company. Without saying that one is better than the other, right?
At the end, what matters is the success of the product and the company.
And it seems that even companies with very different approaches, they might succeed.
So that's great.
Sure.
Well, one last thing before we close out the season with a season wrap-up show.
The last subject I wanted to touch on was
the subject of trust. And we talked about this with multiple people. I think one of the first
times that it came to the forefront of conversation, actually, and we didn't necessarily
use the word trust, but I think about our conversation with Axel from Pool. And aside from him telling
an amazing story about Paul Graham telling him his startup idea was horrible right to his face,
which is one of my favorite stories from the season, he had a really simple but really
powerful piece of advice for early stage companies where he said, you need to be
very diligent about collecting the data. But especially in the early stage,
you know, sort of pre being able to have statistical significance and make decisions
based on that, he said, I always use the data as a way to figure out which customers I should talk to directly in order to learn about how
they're using my product. And that sort of brought up this idea that data involves a significant
amount of trust, right? So trust in the data, as Axel pointed out. And then we also talked
with multiple other guests who talked about how that works inside of an organization.
So that showed up in terms of companies where we talked with the data engineer, I think it was Stephen Bailey from Immuta, who said that there was just a huge lack of trust in data.
And then that sort of colors the way that people think about data engineering and the operations around that.
And turning that around
is a really significant effort. And then we also talked with Iteratively about the impact of their
work around data governance for companies that adopt them. And they really pointed to trust as
well. It creates more harmony between teams because everyone is really confident around the
data. And I think another component of that
that came up was the people who are consuming the data are making decisions with it, right? And so
the more confidence and trust that they have, the faster they can move, the better that is for the
business. So that was a really powerful topic, I think, in terms of summarizing a lot of the topics that came up around data.
What did you think about trust as sort of a summary of a lot of the things we discussed
with our guests? Oh, yeah, absolutely. I think that, as you said, trust both in the data
itself and also in the teams involved, because keep in mind that data inside an organization is an interdisciplinary thing, right?
It's super important.
And that's why we see the emergence of data governance
and all these companies that are trying to tackle
the different aspects of data governance, right?
From access control to quality of data.
And this is something that actually I think,
I mean, I keep saying that,
but I'm really excited about
what's going to happen in the next season.
But especially as we get inside MLOps,
where, as we said,
machine learning is not just about
the impact that you have inside the organization
with the data,
but it's also how to deliver innovative products
to the customer. There, trusting the data, trusting the models, trusting the products that are built on top
of the data, it's going to be huge.
So we will see a lot of work that's going to be done on that.
And it can be a total disaster if, for example, your models or your data are exposed to your customers in a way that might be perceived as something wrong.
Like, I'll give you an example.
We all experienced what happened at the Capitol building, right? And there was the recommendation engine on YouTube, I think it was on YouTube, where there was a video
of all these very sad things that were happening there. And the recommendation engine was
recommending products related to survival and guns and stuff like that, which,
I mean, if you think of that just from the perspective of the data itself,
it makes sense, right?
They are two related concepts,
but they are two very wrongly related concepts at that specific time.
So the more data is exposed and becomes an integral part
of the products that we build,
trust in both the data that we are using
and what we build on top of it is going to be super, super important.
And that brings me to my last highlight of all these discussions that we had these past weeks, which is around open source.
And I think open source is another way of building more trust in the technologies that we are
using.
And it's a very foundational component of building technologies that we can use
with our data in the best possible way.
Absolutely.
Well, I am extremely excited about doing another season.
We already have several episodes recorded that I'm really excited about.
And I'm excited.
We'll cover other topics.
Please feel free to reach out to us with your feedback.
We'd love to hear what we're doing well, what we're not doing well,
what types of guests you would like to hear,
what types of topics you would like us to cover.
So feel free to ping us on that.
Kostas's email is kostas at rudderstack.com.
I'm eric at rudderstack.com, and we will catch you in the next season.