Software Huddle - The Real Work of Data Engineering with Joe Reis
Episode Date: February 27, 2024
Today, we have Joe Reis on the show. Joe is the co-author of the book Fundamentals of Data Engineering, probably the best and most comprehensive book on data engineering you could think to read. We talk about the culture of data engineering, its relationship with data science, the downside of chasing bleeding-edge technology, and approaches to data modeling. Joe's got lots to say, lots of opinions, and is super knowledgeable. So even if data engineering or data science isn't your thing, we think you're still going to really enjoy listening to the interview.
Transcript
How do you define a data engineer?
And I guess, like, how is that different than a software engineer for those that are maybe less intimately familiar with this world?
We define a data engineer as somebody who manages the data lifecycle as it pertains to the role as a data engineer, right? While embracing the undercurrents of security, orchestration, data management, architecture, software engineering, and so forth.
Do you think that data engineers are not always given as much credit or respect within an organization as maybe other forms of engineering?
I think that's definitely been historically the case for sure.
But I would actually extend that to data in general.
Data is typically misunderstood.
Yeah.
Do you think the data lake, as we historically know it, is just going to go away?
Hey folks, it's Sean from Software Huddle. And boy, do I have a great one today, because Joe Reis is on the show. Joe is the co-author of Fundamentals of Data Engineering, probably the best and most comprehensive book on data engineering you could think to read. One of the things I really liked about the book is that it's really focused on fundamentals, hence the name. It's not about specific technologies, but more about the core principles that are widely applicable to any data stack. And beyond the book, we talk about the culture of data engineering, its relationship with data science, the downside of chasing bleeding-edge technology, and approaches to data modeling. Joe's got lots to say, lots of opinions, and is super knowledgeable. So even if data engineering or data science isn't your thing, I think you're still going to really enjoy listening to the interview.
All right, last thing before I get you over to the show, Alex and I will be in Miami in April at the Shift Developer Conference, speaking and also doing interviews. If you're in the area,
you should come on by, see some great talks, say hi to us. You can learn more about the conference at shift.infobip.com slash US.
And as always, feel free to reach out to Alex or myself on LinkedIn or X
if you want to connect or share feedback about the show.
All right, let's head over to my interview with Joe Reis.
Joe, welcome to Software Huddle.
Hey, what's going on?
Not too much.
As I was mentioning, I'm here recording, not my usual spot today.
I'm recording from my day job's office, so don't have my normal backdrop.
Got this rather boring backdrop.
But hopefully the sound, the office crowd doesn't come in until much later in the day
and it doesn't get too loud in here.
Okay, perfect.
Sounds good for now.
Yeah.
Anyway, I'm excited to talk to you.
I've listened to a lot of hours of your podcast. I know you're not going to hold back. You're going to bring the heat, which I like. So I thought to start off, we could talk a bit about your journey to where you are today. I know you studied math in university. How did you go from there to machine learning, data science, data engineering, to where you are now?
I mean, back then, I would say it's a pretty natural progression career-wise.
Back when I was studying math, there weren't a lot of
career options available.
Data wasn't a cool thing either.
So yeah, I think that the path,
this is in the late 90s, early 2000s,
your path of a math degree was teach,
go work for the government, become an actuary. You could probably go be a bartender. I was actually a DJ at clubs, so that's how I made my money.
Yeah, then I got a job actually as an analyst, but more doing data science-y type work. So a lot of predictive work, a lot of optimization-type work as well. So that was, I guess, a good application of my degree. But fast forward to the 2010s, and that's when machine learning, I think, was taking off. I think I started getting interested for real in it back in 2009. It was like, I've got to probably transition to this, because GPUs were starting to become available to the public for machine learning purposes. I think the cloud was also starting to facilitate machine learning. I remember computers back then were pretty crappy, so you needed all the horsepower you could get. I think we were playing with and using, like, an Xbox back then for machine learning purposes.
But yeah, anyway, that's how I got into it.
But yeah, I haven't looked back since.
And that was a self-taught journey at that point in terms of learning?
Yeah, self-taught.
I mean, there weren't a lot of resources out there at the time.
I mean, I think popular packages might have been like Weka or something like that back in the day.
Remember that one?
Yeah.
But it was just, you know, that kind of got me into data engineering because I was working at a machine learning startup. We were doing automated machine learning.
This was in the early 2010s.
And, you know, back then there wasn't
a playbook for, I guess, whatever you call
MLOps now, right? You'd have to
kind of roll your own
model hosting and
all this stuff. And I came up
with the automated feature engineering system
way back in the day.
And there wasn't really a rule book for any of that.
So you just kind of had to make it up as you went along.
But you got to do what you got to do.
Right.
Yeah, so I read your book recently, Fundamentals of Data Engineering.
Condolences and thank you.
Well, I had a couple observations.
So first thing, I've interviewed a number of people who've written books.
And I'm not going to name names or anything, but let's just say not all business books are created equal.
There's a large number of business books that are essentially like a 1,500-word blog post stretched into 140 pages of filler.
And I would say your book is absolutely not this.
It's basically a pretty meaty textbook.
And I was actually reading it a few weeks ago on a Saturday night after my wife and I got the kids down. And my wife came into the living room, saw me reading a textbook on the couch, and she's like, are you reading a textbook? Are you some kind of sociopath?
And the other thing: there are, I think, a lot of tactic-based books in the space of data engineering, you know, Spark for data engineers, Hadoop, or Databricks for dummies or something like that. But your book is essentially, for the most part, technology agnostic. So why did you think it was important to focus on fundamentals and start there?
I mean, there really wasn't any coverage of the fundamentals of data engineering back when Matt Housley and I decided to write the book. As you point out, there's no shortage of blog posts, there's no shortage of books, really, on data engineering with technology X, Y, or Z, or various cloud platforms. And those are great books. But up to that point, I don't think anyone had really taken the time to describe data engineering from first principles. And so that was actually the hard part of this book: if you peel back the field, what is it exactly? So that was the task. Interestingly enough, our acquisitions editor at O'Reilly told us not to do this book, because she said it was going to be really hard for two first-time authors to try and define an entire field from the ground up. But, I don't know, we told her we're kind of dumb, kind of crazy, so why not? Then she eventually came around to our way of viewing the world. But I can't say it was easy. When you try to define something from first principles and you don't want to have the crutch of a technology, I would say it's not easy.
Yeah, I mean, you have to kind of take a step back each time that you're describing these things. If you think about a particular technology, it's like, well, what does that technology actually enable people to do? And then how do you define that as a larger thing, and where does it fit into this world that we call data engineering?
Exactly. And the crux is
definitely understanding what's not going to change as quickly in a field that changes very quickly, right? And we distilled that down to a few things. I don't see any situation where you're not going to obtain data from some sort of a source system. There's no situation where you're probably not going to have some storage mechanism for that data, ingestion patterns and queries, transformation, modeling, and then serving it. I think all those are fairly rugged. They'll be around for a bit. Then all the undercurrents, I think, are what support it all. It was an interesting thought experiment. I don't know that I would change anything on it. I'd probably add a couple of things, but that's about it.
You can come up with a new version in a few years.
Matt and I are thinking already about what we'd add in the new version.
I think orchestration probably is going to be a standalone chapter, but I'm not going to give away too much yet because I haven't formulated it.
But, you know, you start nitpicking things about your book after it's written.
And the idea was for it to be relevant at least five years from now. And I think it will be.
Yeah. I mean, if you look at classic first-principles books, like Knuth's books, for example, The Art of Computer Programming, he made up a programming language to describe these things, so that you're not even locked into the technology of the programming language. Languages fall in and out of popularity. He basically said, well, I'm just going to ignore that and essentially make up my own language to describe these different things. And those started, I don't know, like 40 years ago, and they're still as fundamental today. So I think that with the first-principles approach, if you have that, you create these kind of classic works that people can go back to 10, 20 years down the road.
Oh, for sure. Yeah, all these books were great. Then also Martin Kleppmann's Designing Data-Intensive Applications. I thought that was definitely the de facto data engineering book at the time, even though it was actually written for software engineers and not data people, which I thought was interesting. But that was the one that everyone gravitated towards. It's one of the greatest books of all time, I think. At the same time, if you look at how data engineering changed from the time Martin published that book in, what, 2016, 2017, to today, or to when we started writing our book in 2020, I think a lot changed, right? The tools became a lot more abstract. As for the need to know a lot of the underlying gory details that he outlines in the book, I would argue most data engineers at some point should be familiar with them, but you don't need to operate at that level in your day-to-day job typically anymore, right? Things have just gotten a lot simpler. So that allows you to kind of step back. I would say our book is a prequel to his book. I'm currently reviewing the next version of his book, and it's good. It's coming along.
Okay. Yeah. In your book, you mentioned that, at least at the time of writing, there were 91,000 unique definitions of the data engineer.
So how do you define a data engineer? And I guess like, how is that different than a software engineer for those that are maybe less intimately familiar with this world?
Yeah, yeah.
The number came from just searching "what is a data engineer?" I did a unique query search in Google on that, and it came up with 91,000 results.
And so that was interesting. But we define a data engineer as somebody who manages the data lifecycle as it pertains to
the role as a data engineer, right?
While
embracing the undercurrents of security, orchestration,
data management, architecture, software
engineering, and so forth.
And the TL;DR is that a data engineer gets data from somewhere, does something useful to it, and serves it for downstream stakeholders like analysts and machine learning and AI use cases, and maybe reverse ETL and similar use cases. So they're sort of the bridge between software engineers and downstream data use cases, so to speak.
Yeah, so then if you think about software engineering as potentially building, I don't know, an application that's going to be used by some end user in a B2B or B2C sense, the end user for the data engineer is the analyst, the data scientist, the AI engineer.
Yeah, exactly.
And I think that's as it stands today.
I mean, there's definitely an argument that these roles
will be kind of melding together at some point
and perhaps evolving, especially as data use cases traditionally, which are kind of maybe more internal facing, become more external facing and application based.
And then at that point, I think there's more of a feedback loop between data and whatever software engineers are doing.
Right.
And so I think that's sort of the next progression.
We talk about this in the last chapter of the book. We call it the, quote, live data stack, kind of a tongue-in-cheek play on the modern data stack. But the notion really is that feedback loops become tighter, streaming becomes a first-class citizen, and then event-based architectures mean that there's really no definitive line between applications and, quote, data use cases at the end of the day. It's all sort of the same thing.
But we'll see if that happens.
I think it will.
Do you think that data engineers
are not always given as much credit or respect
within an organization
as maybe other forms of engineering?
I think that's definitely been historically the case for sure.
But I would actually extend that to data in general.
Data is typically misunderstood,
probably misapplied and underutilized.
I think part of that is there's a sense of FOMO around data
where if I'm not doing data,
then I'm obviously doing something wrong.
So you'll hire a bunch of data people
but have really no idea how to properly utilize them.
Whereas with software engineers, for example, I guess the impact of what a software engineer does is very immediate, right? I make a feature and the feature's out there. I make tweaks to the feature and those tweaks are available for use and so forth. Whereas data is a bit more silent in some ways, right? It's like air. It's everywhere, but you don't see it.
But it impacts a lot of things and it powers a lot of things.
But I would say a lot of it's just due to the immaturity of the data field.
I always like to say that data feels like it's about, you know, 5, 10, 15 years behind software.
So, yeah, where would you say,
so if you have software engineering,
maybe that's sort of the most developed discipline
in the engineering space.
And then you have data engineering.
And then I guess now you have AI and ML engineering. Where would you put data engineering between those on a spectrum of maturity?
I think you kind of nailed it, really.
Data engineering is sort of in the middle. ML engineering, because it's just newer, is going to be the least mature. But ultimately, we do have a guidebook, though, and a good rubric, and that's just paying attention to what software engineers have already done for a while, and I think modeling what works and tweaking it for our specific use cases. Software's done a lot of great things. They've done a lot of things correctly. I think they've kind of paved the way for the field. It's funny, because if you look at it, there's DataOps, there's MLOps, and these are all just borrowing practices from software engineering.
Yeah, I mean, it's a sign of maturity of any discipline where you get more specialization. If you look at being a medical doctor from 100 plus years ago, one doctor was delivering a baby, pulling a splinter and performing surgery. But now you have people who just specialize in
surgery of one specific organ or one specific type, and they're specialists in that thing. And I think we're maybe not quite there in engineering, but there's more and more specialization because the systems are much bigger and more complicated. It's hard for any one person to be an expert in everything and keep all that stuff in their head.
I mean, yeah, software engineering, it definitely has progressed.
I think the medical analogy is a really interesting one, right?
Because you used to be kind of the general doctor
that would go deliver babies in a barn and, you know...
Apply the leeches.
Yeah, leeching and bloodletting and all that stuff back in the day. And you realize things mature, and there's a lot to borrow. But when I say that data feels about 5, 10, 15 years behind software, it's not always going to be a linear kind of look back, if you know what I mean. There are definitely things to borrow, so hopefully that means that the data field can catch up a lot more quickly. You already have a good rubric to go off of. Stuff does work. I guess the more that these fields blend together, with the use cases of data powering more applications and so forth, and AI models becoming more front and center, I think you're going to see a pretty interesting intersection over the next few years that is going to change stuff. And it's already changing the workflows of software engineers too. I mean, if you look at what's going on, a lot of people are using Copilot now and things like that to generate code. That wasn't a thing a few years ago. Now it's kind of the default.
Yeah, and I would say the other thing too that helps mature an area
as well as bring new talent
where they might have gone
to more traditional software engineering
is how important or what's the hotness of it.
And data is really the love language of AI.
And of course, there's a lot of stuff going on
in the AI world right now. So people are trying to figure out, how do I fit in here? I want to go to the companies that are at the forefront of technology and work with the best people. So you're going to, I think, see a natural gravitation towards people seeking more of a data role than probably before as well.
Oh yeah, yeah. And it's interesting seeing how the popularity of data has changed. Like you said, when I got into it, it was probably the least cool job you could think of. I mean, at that time everyone was getting their MBA, right? I wanted to be a quant, I wanted to go into finance. Maybe with the path I was on, I was like, I'm going to be an actuary, because that's super exciting too. Joking. But it got popular, right? Data science became the cool job and everyone became a data scientist. And then they realized that it's kind of hard to do data science without data, so that's when data engineering came into being. And now it's all about AI. And so who knows?
It'll be pretty interesting.
With respect to data engineering,
how important do you think it is
sort of understanding the fundamentals
of computer science,
like algorithms, data structures?
You know, understanding that a DAG is used for query planning and optimization, or maybe a B-tree for some forms of indexing.
Like, does that matter to a data engineer's day-to-day?
I think it does.
I wouldn't say that a good junior data engineer needs to know all that.
I would say what we cover in the book is probably what I
would say from a
beginner to intermediate level data
engineer. But yeah, as you become more senior,
I would expect that you're going to know
a lot of this stuff.
How would you read an explain plan for a query in a database, for example, if you don't understand the various things that an explain plan will give you, like B-trees and so forth, right? And various indexes. So yeah, definitely algorithmic complexity and big O notation and that kind of stuff is super necessary, especially as you're starting to custom build.
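To make the explain-plan point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `orders` table, the `idx_orders_customer` index, and the `plan_for` helper are hypothetical names invented for illustration, and the exact plan wording differs between SQLite versions and between database engines, so treat the printed text as indicative only.

```python
import sqlite3

def plan_for(conn, sql):
    """Return the human-readable detail strings of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The fourth column of each row holds the plan description text.
    return [row[3] for row in rows]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)

query = "SELECT * FROM orders WHERE customer_id = 42"

# Without an index on customer_id, the plan reports a scan of the whole table.
plan_before = plan_for(conn, query)

# SQLite stores indexes as B-trees, so the same query can now seek by key.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = plan_for(conn, query)

print(plan_before)
print(plan_after)
```

Reading the two plans side by side is the skill being described here: the first says every row is visited, the second says the engine walks the index to the matching rows.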
I think early on when you're starting out, if you're just using a data ingestion tool like a Fivetran or an Airbyte or Estuary or Portable or something like that and just moving data to a Snowflake, you probably don't need to know a ton of stuff, right? You would want to know how to write performant queries and hopefully understand ingestion patterns. But a lot of the tooling has abstracted away a lot of that stuff. That goes back to the whole notion of our book versus, say, Martin Kleppmann's book, which is more about the internals of how all this stuff works, right? But I think at some point, for you to graduate towards being very competent professionally, obviously you need to start knowing this stuff. And having a computer science background, I would say, gives you a huge advantage, because you've already learned this stuff. I mean, you'll have to learn it one way or the other. But there are great resources out there. Database Internals is a terrific book that I think everyone should read. It's a bit dense, but that's kind of what you've got to do.
Yeah, you mentioned this earlier as well,
but essentially the tooling's gotten much better.
Things are easier in many ways.
So where are sort of the hard problems in data engineering today?
I've been thinking a lot about this. I actually don't think that there's much of a tooling gap at this point for solving classical data problems. And by that, I mean classical analytical problems where a data warehouse or data lakehouse is needed, right? What I do think is happening is that there's actually a skills gap and a knowledge gap and a competency gap between the people using the tools and the potential that the tools provide. I actually think this is the biggest gap in the data industry right now: not really understanding the fundamental practices. Things like data modeling, for example, right? Things like how databases work. I think a lot of the tooling abstracts away a lot of functionality, which is great. That's what technology should do. But at the same time, it helps to understand what's going on under the hood and understand, again, for data modeling, for example, correct ways to think about your data at a conceptual level and how it relates back to the business or the organization you're in, and then translating that down to something that's performant from a storage and query standpoint, the physical layer of data modeling, right? And knowing all the techniques. If I talk to data professionals these days, a lot of them actually don't understand or haven't heard of the classical data modeling techniques. If you mention relational modeling to people, they kind of look at you with a blank stare. They may have heard of it once but couldn't really explain it. But understanding why it's important to use relational modeling and when you'd want to use it, I think, matters. And that's just data people; I would say software engineers do. I think a lot of practices are just kind of thrown by the wayside or forgotten about. But then for consultancies like mine or companies like mine, when we come into situations, we're certainly glad that there's a lack of knowledge and best practices.
Because you wouldn't have a business otherwise.
I mean, at the end of the day, consulting is just knowledge arbitrage. It's literally all it is, right? But at the same time, I'm writing books on this kind of stuff. I don't see myself moving away from consulting, but I think my biggest focus right now is just education. I feel like that is single-handedly the biggest gap that we have as an industry. Like I said, there's no shortage of technologies at this point. I mean, new technologies always come about to solve new problems. That's the nature of technology. But I think there's just a huge gap between what we're capable of doing and where we need to go.
Where do you think that skills gap is coming from? Are we moving away from teaching some of these fundamentals at the university level? Or is it that people are taking different paths to a career in data, where they might be skipping over the fundamentals and be focused more on tooling? It's kind of like, if you go to a boot camp, 80% of graduates from boot camps are usually focused on front end. They're learning React, but they might not be learning the fundamentals of actual computer science and how a programming language works. They're kind of just focused purely on the technology.
I do agree with this observation, 100%. If you look at data boot camps, for example, right? It's like, okay, so what's the first thing we're going to learn? Probably Python and SQL, right? Because those are widely used tools. I mean, the whole intent is to get somebody a job, right? So you can check the boxes on a resume and say, okay, this person gets Python and SQL and so forth. But if I were to ask you, okay, given this setup, let's say a company, for example, and this is what the company looks like, this is what they do, how would you think about their data needs? Right? Like, what are we trying to do in the first place? I think the ability to see and observe a situation is lacking, along with the techniques to assess it.
And then obviously, as you point out with computer science, understanding big O notation, for example, right?
Like, oh, am I going to write like four, like three, four loops nested together?
Is that a good idea?
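The nested-loop question can be made concrete with a small sketch. The function names and the pair-counting task are hypothetical, chosen only to show how nesting multiplies work: two nested loops over n items take on the order of n² steps, while a single pass with a hash map takes on the order of n.

```python
from collections import Counter

def count_pairs_nested(values, target):
    """Count index pairs (i, j), i < j, whose values sum to target: O(n^2)."""
    count = 0
    steps = 0
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            steps += 1
            if values[i] + values[j] == target:
                count += 1
    return count, steps

def count_pairs_single_pass(values, target):
    """Same answer with one pass and a tally of values seen so far: O(n)."""
    seen = Counter()
    count = 0
    steps = 0
    for v in values:
        steps += 1
        # Every earlier value equal to (target - v) forms a pair with v.
        count += seen[target - v]
        seen[v] += 1
    return count, steps

data = list(range(1000))
nested_count, nested_steps = count_pairs_nested(data, 999)
fast_count, fast_steps = count_pairs_single_pass(data, 999)
print(nested_count, nested_steps)
print(fast_count, fast_steps)
```

Both functions return the same count, but the step counters differ by a factor of several hundred at only 1,000 items. Adding a third nested loop over the same data would push the work toward n³.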
You know, did I just create, like, cubic complexity in what I just did there? But if all you know are for loops, for example, you don't understand the impact of nesting these things. I see this all the time, right? Or the difference between stateless and stateful programming, right? So again, if you're using a distributed system, you want to write stateless code, right? And you don't want to write things that are stateful, for very obvious reasons. But this, again, isn't really taught, as far as I can tell. So I see a lot of things. And I think there are a few reasons for this, right? Obviously, people come into it from different angles and are trying to, I think, as quickly as possible get the skills that they need to check the boxes on a job description so they can get a job. And I can't blame them. That's what you probably want to do if you want to get a job. And it just takes a lot of time to master the fundamentals, and it's just not one of the things I think people are incentivized to do, especially at their jobs.
It's like, you know, and I'm going to blame a couple of things. I'm going to blame maybe the cult of agile for people that are working at jobs. Agile, the manifesto, started out as basically a manifesto that describes how we would continuously deliver software, right? In an iterative fashion. But what this also means is that people took that and started thinking that, okay, two-week sprints is the same as being agile. But it's really hard, I think, to sit down and master the fundamentals and really think if you're only operating in two-week sprints, for example, right? And I think this is one of the dichotomies in data: we're trying to apply a software engineering framework to data, which is fundamentally different in a lot of ways, right? With data, you're trying to compile data, in a lot of cases, to give context to the entire enterprise. This is not the same thing as delivering features as you would in software engineering. Data is very much a thinking person's sport, as I say in a lot of my talks. And this is a fundamentally different thinking exercise than plowing away at two-week sprints. So I think that's also part of it, where you just don't have the time to sit back and really assess, what are we trying to do from first principles? And what should I learn from a best-practice standpoint? So yeah, I'm sure I could blame a lot of other things, but I'll blame those for now.
Well, I think when you blindly apply a methodology, you might not realize there are also consequences to choosing that path. It's just like, if you think about certain KPIs, we tend to optimize the metrics that we measure. So if you measure the wrong thing, what does that end up doing? It might actually steer you in the wrong direction. Like, I know during my time at Google, everybody's focused on performance reviews that happen twice a year. So if a project takes longer than six months, people are more reluctant to do it, even if it potentially has more impact and it's the right thing to do, because what do I show on my next performance review? And that impacts the bonus I get at the end of the year.
It impacts my ability to get promoted
and all this sort of stuff.
So you become sort of hyper-focused
on these short-term wins
and that leads to short-sighted, I think, choices
when it comes to building a product.
That's really interesting, Sean. Yeah, you're absolutely right. I think it was Charlie Munger, Warren Buffett's business partner, who said it best: if you show me the incentive, I'll show you the outcome. So what you described very much fits that. I mean, if you have a performance review, if that's what you're measured on, that's what you're going to improve upon. Simple as that, right? So yeah, it definitely does come down to exactly what you described. And so that's for good and for bad, right?
So I want to talk a little bit about, you know, what's kind of going on in the world of data.
So you started with like data warehouses.
How has that area sort of changed over the past decade or so?
And where do you see it going?
I mean, I think the biggest shift is really the
movement to the cloud and the modern data stack.
So if you go back 10 years ago,
or even further than that, I would say
15, for argument's
sake, right? I mean, your options at the time,
if you're a company and you want data warehousing
capability, you could obviously roll your own
relational database, and that'll get you some
of the way there. It's fine.
A data warehouse is meant to be an architectural paradigm. It's not meant to be a specific technology, as
Bill Inmon would point out. But with respect to the data warehousing technologies, I
think what's changed is the modern data stack, which I think really democratized
the use of data warehousing.
And by that I mean November of 2012 is,
at least in my opinion, when the modern data stack started.
And that was with the release of AWS Redshift.
So before you'd have to get these expensive contracts for data warehouses,
usually on-prem.
If you're getting Teradata, for example,
you might be charged by the amount of cores that you're using,
and that can be very expensive.
There's a lot of onerous details in that.
Redshift came out and they're like, fine, it's 25 cents per hour for a core.
Digital processing unit or data processing unit at the time, DPU.
That was pretty cool.
So for pretty cheap, you get a data warehouse and it runs in the cloud,
and that's pretty cool.
And that ushered in a lot of new technologies, right?
Cloud-based data ingestion tools, ELT became a thing for better or for worse.
Snowflake, I think they started working on that around 2012 as well.
And so you kind of couple this with the rise of data science too.
At the beginning of the 2010s, I think, the data science and data warehousing
workloads were quite different. And I think there was actually a lot of animosity between the data
warehousing crowd and the data science crowd, where data science was sort of this hot new thing
and data warehousing was this old, stodgy, blue-shirt-and-khakis type of thing. And
then I remember data scientists were claiming, oh yeah, data warehousing is going to
die, SQL is going to die. I've heard these pronouncements countless times, right? But that
was for years. People were saying this kind of stuff, you know, like, oh, we're all just
going to be writing in notebooks at some point, right? And so I think Spark blew the lid off of a lot of things too when
that came out, right? Because for the longest time you had to write MapReduce on
Hadoop, and that's painful for everybody. But Spark, I think, made it a lot
simpler and a lot faster. And so that's probably around 2014, I think, that open source Spark came out.
And then what was really interesting is Databricks, for example. They
were a data-science-first company back in the day, and I remember using a lot of their
stuff and thinking it was pretty awesome. Then you start seeing a convergence, though, right? I would
say the late 2010s is when you started seeing the lakehouse come onto the scene. And, um, I forgot to mention data lakes.
That was a big thing as well, right? I mean,
the notion was, oh, we'll just collect all this data and maybe we'll use it
later. Right. But I think what happened was,
do you ever watch those hoarder shows on cable?
I know what you're talking about. Yeah.
Yeah. Pretty awesome. I love watching these shows
for some reason. I'm a pretty sick person.
But that's what
a lot of people's, a lot of companies' data
systems
ended up looking like. Yeah, we don't know what
we're going to do with this, but we know
there's some value in there somewhere, so we're going to
hold on to it. Right, yeah.
It's like having value in some
smelly pizza box that's like five years old that you just want to keep around for some reason. And so that happened. I think
people quickly realized with the data lake that there are probably other ways to do this, right?
Because curating data sets and discovering data sets becomes almost
impossible at some point. It's like, how the hell do you find anything? You can't, right?
And so then GDPR comes along in 2018.
And, well, now you have to be able to find your data,
and delete it, for example,
because you can be fined pretty heavily if you can't.
So I think that's when people started taking
at least governance a bit more seriously.
Because before that, it's like it was a free-for-all.
You know, it's...
Well, even now there's a ton of companies
that are just sitting on a mountain of unstructured data that's, you know, encrypted in a bucket somewhere, and they're
like, we can't touch it because it's got PII or PHI or something in it, but
someday we'll be able to do something with this. Someday.
So it's interesting. I mean,
that's what's changed, I think. You just saw the combination and
sort of the convergence of data science and traditional data warehousing analytical workloads
to the point now where these systems are very much converging.
I mean, you have, you know, there's data fabric in Azure, you know, there's Databricks,
Snowflake, I mean, are basically on track to be the same product, in my opinion, feature
for feature, at least.
Yeah. I mean, they basically want to own all the data at this point.
Yeah, BigQuery is dope and Redshift is, I think, catching up.
So, yeah, I mean, the lakehouse paradigm is sort of where everything is going.
Right. And that was a big evolution.
Yeah. Do you think the data lake, as we know it, is just going to go away?
I think so.
I don't see much of a need for it.
I mean, because you can get the best of both worlds with a lakehouse,
and you can combine your structured or semi-structured data sets along with your unstructured data.
With a management and a governance layer on top of that, I think that's a key distinction,
because otherwise you would literally just have a data lake as we called it in the past. And nothing wrong with data lakes. I just think that, you know,
the world's kind of outgrown that chaos
that it provides. I mean, you'd have to be a very disciplined individual to do a data
lake, you know, or a disciplined organization.
I mean, it certainly is done. A lot of companies have done it successfully, but I don't think it's
everyone's cup of tea.
So this convergence is going on.
We now also have the emergence
of all these vector database companies.
Do you see that also moving into essentially a single platform?
Yeah, I think inevitably it will.
Yeah.
I mean, you're starting to see this kind of workload already, right?
Yeah, I mean, I think like MongoDB now is supporting vectors, for example.
Yeah, it makes a lot of sense.
And so, yeah, I think you're going to see a convergence of all this stuff,
especially with the rise of unstructured data sets.
Because I think for the longest time, there were definitely people,
definitely companies with a great use case for
unstructured data, but it was sort of this tale of two worlds, right? You had the structured data
people over here and the unstructured people over here, and these worlds didn't really talk.
But now, with the rise of generative AI everywhere, right, I mean,
these worlds will collide. They have to. But it's going to pose some very
interesting dilemmas. I mean, because you can use generative AI with
text, images, all of the above in a multimodal setting, then it's like, why the hell not? So
you're absolutely right. Vector databases will, I think, become first-class citizens in pretty much
every infrastructure, because, I mean, you're going to need that similarity search capability, right?
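As a rough illustration of the similarity search capability mentioned here, the sketch below ranks a few made-up vectors by cosine similarity to a query vector. The item names, the three-dimensional vectors, and the helper functions are all hypothetical and chosen only for illustration. A real vector database would typically work with learned embeddings and an approximate index rather than this brute-force scan.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction, so treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Brute-force search: score every stored vector, return the k best matches."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical "embeddings" for three items (assumed values, not real model output).
INDEX = {
    "review_positive": [0.9, 0.1, 0.0],
    "review_negative": [0.1, 0.9, 0.0],
    "product_image": [0.0, 0.2, 0.9],
}

if __name__ == "__main__":
    for name, score in top_k([0.8, 0.2, 0.1], INDEX):
        print(f"{name}: {score:.3f}")
```

The point of the sketch is only that "find the things most like this one" reduces to comparing vectors, which is why the capability can sit alongside structured queries in the same platform.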
Yeah, I mean, it's a way to essentially take action on the unstructured data.
Do you think now that we have more technologies
that allow us to actually leverage the unstructured data, that's changing in any way
the type of work that data engineers are doing and responsible for?
It's a good question, and it's something that I was thinking about last week quite heavily.
I did a podcast episode on this and it's,
I think it was titled things I didn't expect to see or something like that.
I don't know.
I never remember what I record.
But it was at Matillion's demo last week, or their event.
I was hosting it with Mark Baccanetti.
And we had a bunch of AI announcements.
Every company is announcing some sort of AI feature.
What I thought was really interesting with this is
okay, so they're letting you do prompts now
to create data pipelines.
I thought that was interesting.
I'm still on the fence
of whether that's good or not.
Because you're dealing with a
stochastic system where, if I give you a prompt, I don't know what the output's actually going to be
100% of the time.
Yeah, reproducibility is a problem.
Yeah, it's like, so am I going to do
prompt reviews with my team now? You know, what would that look like if it were in a
CI/CD pipeline, for example? Right? So I think that's an interesting one. But hey, it's happening.
Yeah, I mean, ideally, when I think about
something like CI/CD or, I don't know, orchestration or something like that,
I know reliability and reproducibility are really important. And those are two areas that are not necessarily
the core value of, you know, Gen AI right now.
No, it's exactly the opposite.
It's just like, so, you know,
how are you going to unit test these things, for example?
Right, I have a prompt.
I don't think anybody knows the answer to that.
A pipeline, right, it's exactly,
it's like, so is this a good idea to do?
I don't know, but we're doing it anyway.
So then, you know, the other thing they demoed,
which I thought was pretty cool,
was the ability to use large language models
to comb through a bunch of text data
in your data pipeline.
I think that was pretty cool
in the way that it's just super easy to do.
Because this is traditionally a tool
that was more for structured data sets
and just querying databases.
But now I can come up with sentiment analysis
or some sort of analysis on my, say,
customer review text data.
And so these worlds will collide.
And it can do the same thing with images and stuff.
So I think it's going to unlock a lot of capabilities
that so far probably data engineers
weren't even thinking of, right?
Because the workflow is typically, okay,
I'll just get data, put it somewhere,
serve it for downstream use cases.
But those downstream use cases are,
well, they're growing, right?
So I think that's pretty exciting.
And it will change the workflow
of data engineers for sure.
Vector databases, again, that's another big one, right?
So, you know, I think for data engineers now,
you're going to be intersecting the worlds
of ML engineering and MLOps in ways that, at least this time last year, you probably weren't even thinking about.
Yeah, this time last year most people didn't know what a vector was.
No. I mean, the hype from
2021 to now was feature stores, right? You don't hear much about those
right now.
Yeah, because you don't need features. Oh, well, Gen AI will find the features for you, I guess.
Sure.
So, you know, you were talking a little bit about sort of the MapReduce Hadoop era. I think it's hard for people entering the world of data systems now to really understand the impact that those systems had
on large-scale data problems back in the day.
But a lot of those modeling approaches,
even things like, I don't know, using one large table,
were meant to overcome limitations of the technology
we had at our disposal at a given time.
Eventually those things got simpler and the underlying framework was taken
care of in terms of the complexity. So I guess
looking at today, what sort of hacks or workarounds are we doing
today that you think will go away and eventually just be something that kind of magically
happens behind the scenes for us because we have a proper abstraction layer?
Good question. I've been thinking about this a lot from the context of generative AI, actually,
and what it can do for data modeling. I think the hacks we have right now,
and I talk about data modeling a lot because that's what I'm writing a book on. That's pretty
much the only thing on my mind at this point. But it has huge impacts, right? Hacky data models mean that you can spend more time than you need to on getting really bad answers, for one.
So double whammy, right?
What do I mean by this?
Well, if you have an inefficiently created data set, say a giant wide table, right?
The impacts can be quite severe.
For example, you might have lots of duplicates.
You might have lots of redundancies. You probably don't even know what the hell is in
there, right? Query patterns are chained together, right? So if you're using certain
transformation tools and you don't understand modeling practices well, you're going to basically create
probably just a bunch of tables, or one big table, or something in between, or just a bunch of
queries. I think those are the hacks right now. It's like we're super reactive, trying to answer various questions, and
what that means is you have just an enormous amount of sprawl, right? So if you thought the
data lake was bad, I would challenge people to at least observe
the workings of their own data practices, for example, and ask, okay, so how many
queries do I have that are
sitting out there right now? How consistent are they in arriving at some sort of base level of
truth in the questions I'm being asked? Right? Can I consistently answer questions from the
data sets I have? Often, probably not, right? You have very much a divergence of, you know, truth.
So a couple of things I've been thinking of. Okay, so
generative AI could certainly help this in some ways, I think, especially when coupled with knowledge
graphs. I think you're going to see the rise of graphs to provide more context in terms of the data
that people have. And with generative AI, I think obviously being able
to search through data sets is great, and that'll get better. I think
also there's a capability of it actually being able to go back and reformat data sets
into a better form, right? I've been experimenting with this myself, and I think there's actually
quite a bit of promise in this. But yeah, it's still super early days. We'll see. But those are
the hacks I see right now. It's just, I think, back to the nature of the type of work that people do.
You're reacting to questions and needs of the business and constantly firefighting.
But it's ironic because we're supposed to be data professionals and knowledge workers and stuff.
But we're constantly in firefighting mode.
Do you think that's
due to a lack of resourcing traditionally in the space?
So you
don't have time to sort of take a step back and do the proper planning.
You're more just like you're taking orders at a restaurant, essentially, and
reacting to those in the moment.
Oh yeah. I think that's a very good analogy, Sean. Yeah, absolutely correct.
Yeah. You're understaffed, under-resourced. Especially
now, you know, data teams have been kind of cut to the bone, along with other
teams. It's like you're going to do what you can to get by in the day, and that's about it. Or, you
know, what you can to get by in the sprint.
Yeah, I feel like all the teams that aren't directly
revenue generating have suffered over the last year in terms of where the cuts
are being made.
Yeah, I mean, that's the reality of a business, right? That's going to happen. And
so if you're on those teams that are still around, then yeah, you're going
to be expected to do a lot more with a lot less, and that's just how it is. And so yeah, you're not
going to have time. You're just going to do what you can to get your job done, right? Because again, it's about hitting whatever KPI you're supposed to hit,
whether that's real or imagined, right?
And most companies don't have KPIs.
That's the crazy thing.
Most teams I've seen don't even have a sense of a KPI.
So you're just going to try to strive towards whatever you think is the right thing to do
or whatever you think your boss thinks is the right thing to do.
It's all about self-preservation. So of course you're going to, you know... And the restaurant's a
good example. I mean, I've worked in restaurants before, maybe you have as well. It's a very
stressful environment, right? I can't think of more of a pressure cooker than a
place like that. But that's a lot of teams these days too.
So you're absolutely right.
Yeah, one of the things you mentioned there was this challenge around data duplication.
We were talking about the lake and the challenges with the data lake.
It's become a management nightmare because you have all this data just sitting there.
You have to be very disciplined about actually cataloging it to be able to do anything with it.
But I think even in the structured world, one of the major problems that companies have is just duplication of data, especially when we're talking about PII,
and that leads to this huge sprawl problem. And you create the same problems where you don't know
what you're storing or where it's stored. And in the world of GDPR, as well as the other 100-plus
regulations that exist in the world, it becomes very, very difficult to be compliant as well as secure the data.
So I guess, what are your thoughts on that?
Do you see this as a big challenge for companies that you work with as well?
Huge problem.
Right?
And especially with the popularity of SaaS applications, right?
Because now you have the ability to have no control of your data model.
And then you have every ability to duplicate data to your heart's content.
And all kinds of systems that don't talk to each other typically.
Right?
Yeah, I mean, the temptation's there.
And in a lot of cases, data duplication occurs,
or triplication or quadruplication, whatever.
How many replications do you want to talk about?
I mean, you can do this to your heart's content
in all these different systems.
No way of reconciling it.
So you might have different versions of a customer
or different versions of products.
Cool, awesome.
That's the world we're in right now, though.
So there's this quote from your book where you said,
it's easy to get caught up in chasing bleeding-edge technology
while losing sight of the core purpose of data engineering:
designing robust and reliable systems
to carry data through the full life cycle
and serve it according to the needs of the end users.
So there's this ever-changing landscape of tools,
some of the things that we touched on in this
that data engineers can use and learn.
Do you think that we get too fixated and in love with tools
and these beautifully complex modern data stacks when perhaps something
simple could just do the job? Oh yeah, all the time.
Well, I mean, you've got to consider it from the angle of an engineer too.
I mean, you get paid to engineer stuff and you're always trying to find the
cool new thing, right?
Because that's, you just want to tinker with stuff, right?
And I think there's an element of resume-driven development too in this where, you know,
constantly looking at what's the hot new technology, what's going to help me get my next job, right?
I'm not going to be a COBOL programmer.
It's not that cool.
I could probably make a ton of money doing that.
You know what I mean?
But it's like, that's just always a temptation. I think it's just human nature, though.
The grass is always greener on the other side. So, you know, there's always a
new open source technology you should try out. There's always a new vendor with a cool product
that they tell you you should go try out. And so there's a lot of noise, right? But
you know, you've got to spend your time also focusing on what the business needs, and,
um, you know, that's important.
And the fundamentals are hard to do too, right? I mean, because again,
I always tell people, if you master the fundamentals,
it makes it a lot easier to assess all these new technologies.
Because rarely is there anything that's completely novel and new
out there. I would say very, very rarely does that happen.
Often what happens is that there are just variations and permutations and, you know, some sort of combinatorial stuff on existing ideas and technologies, and that's how
things morph. Very rarely do you see something that comes out of left field that's completely new,
right? It doesn't really happen in our field.
Yeah. Do you think that a lot of this, I don't know,
getting better at
making these decisions in terms of technology choices, or knowing when, hey, I don't need to
apply this whole stack when I could do this pretty simply with, I don't know, a spreadsheet
or something like that, comes from, you know, just maturity in the space
and experience?
Yeah, I think you've got to spend your cycles
sort of getting your butt kicked around a bit.
You know, and I think that it brings you back to reality.
You know, like I do a lot of stuff in spreadsheets.
Why?
Because it's really easy to do.
And it costs me nothing, right?
And they're super efficient.
They're unreasonably effective for a lot of stuff.
You know, but I think what you realize is
it really comes down to you as the individual and how effective you are at solving problems. The tools are just there to be tools, right? But I think when you start out, right, you want
to compensate for your lack of knowledge and lack of skills with tools. That's a temptation,
because it's like, well, I know how to use these tools, but I probably don't know what I'm doing, right? And
that's kind of the fresher mentality that I've seen, at least. But that changes. I
think ultimately you end up, I mean, my favorite tool is really just a pad and paper these days,
and drawing out what I think the solution should be, and going on long walks and thinking about the problem.
That's my secret weapon. I don't need technologies to do that up front. I certainly need them to help
implement things, but by then you know what you need to do, right? But, you know, if you're
coming out of college, you're not going to have that ability. Why would you? You don't have the
experience. You wouldn't know how to solve it from, you know, but that just comes with time and comes with getting a lot of bruises.
Yeah.
You've got to over-engineer a few systems before you realize that maybe you
should spend a little bit more time planning before executing.
Oh yeah.
And I think the big question you need to ask is why,
why are we doing this at all?
Like what, what is the objective?
Like, you know, I think if you can
treat
things as a journalist and approach it from that perspective, you can just ask really good questions.
You'll have better context. Sometimes the answer is you don't need to design this, you don't need
to build this at all, actually. So that is an answer as well. Not everything has to be built,
right?
Yeah, I mean, you could potentially even, you know,
pay for a service.
Exactly, yes. I was reading an article on Hacker
News a few weeks ago. Somebody, I think it was at Uber, they wrote their own spreadsheet.
Like, they created a spreadsheet because Excel or whatever didn't do what they needed it to do.
And so they over-engineered the spreadsheet. And then I think something happened and it was never really used at all.
Yeah.
Right.
Yeah.
I feel like, so like, you know, Google had built up a culture of sort of engineering
everything.
And I think in the early days it made sense because one of their core assets was engineering
and they were doing things at a scale well beyond, essentially, what a lot of the services that existed could handle.
But now that's not really the case, but there's still sort of this historic culture around, like, hey, we're
not going to use Salesforce, we're going to, you know, write our own CRM. Or we're not going to use
HubSpot, we'll use our own marketing automation. But then you have these internal
tools that are subpar to what the industry standards are in some fashion, because, you know,
internal tools are never going
to get the same resourcing that, you know, Google Search or Ads gets.
Yeah, that's exactly it,
right? I think last time I checked, as a Google Cloud partner, they use Salesforce for the Google
Cloud CRM now, right? So it's like, you know, you can't escape the inevitability that there are
better tools out there sometimes, and maybe you don't have the best tool. But you're absolutely right, that's a temptation. I mean, I know people who have written
their own databases when MySQL or Postgres would have done just great. They're like, oh, I have
to build this. And I guess if that's what your boss lets you do... I don't know why you would do this.
But yeah, I mean, even a lot of database companies, you know, start with Postgres and then use
the extensions to do whatever they need. That's why there's so many with Postgres at the core. I mean, it's kind of
like an operating system. Like, let's start with the Unix kernel and then go from there.
We don't need to reinvent that piece.
Oh, exactly. And Postgres is awesome. Like, you know, Linux is
awesome. I mean, use these. As we say in the book, you know, we write in, I think, chapter four,
about choosing technologies, build versus buy. It's like, you should build when it's a competitive advantage to you and it's uniquely yours.
Like what you mentioned with Google, like they were operating at a scale, solving problems in a way that nobody on Earth is doing.
Of course, you're going to have to build this on your own.
Like, you know, I mean, I don't think it's for lack of trying, you know, off-the-shelf stuff and breaking it.
Right? I mean, they did that.
And so, you know, this is what they had to do.
Like, you know, I talked to Jordan Tigani
about, you know, the work they did with BigQuery, right?
He was a founding engineer for that.
And it's like, yeah, you're building that
because it's a system that you need
and doesn't exist right now.
You have to run analytics on tons of data.
It's like, you can't really do that.
You're going to have to build it, right?
But I think you've got to understand where you are as a company and as a team, right? Like, most companies aren't Google, and you don't need to do this. And so, you know, I think
there's a temptation for engineers, software, data, or whatever, to read Google's blog, Uber's blog,
Netflix's blog, and say, okay, I'm going to go do that at my company. And it's like, maybe it'll work, but
do you have the same problem in that same way?
Also, do you have 100 engineers to throw at a problem that's
non-core to whatever it is your business model is?
Right. Exactly.
What do you think are the big unsolved problems in data engineering?
Yeah, it's a good question.
I think it's really about integrating data into, you know, like I said, more application workflows.
And that whole feedback loop, I think, is like one of the big sort of unsolved problems.
I would say, again, like the capability of solving the classical data problems.
I'm talking about analytics, for example,
I think that we have the technology to do that right now.
We've had it for a long time.
So I think it's a combination, again, of skills and practices.
I think that that's one of the big problems for data engineering
is just, I think, leveling up on the concepts,
I think, to be most effective at your job.
But we already touched on that,
and I think there's a lot of reasons for this. But the big
unsolved problem, I would say, is
that feedback loop
in the data lifecycle and bringing it full
circle. That, I think, we're going to
continue solving.
I'll throw out a trigger word
for the audience and people will have an aversion
to this or like it, but data mesh, I think
has a capability of helping
solve this problem.
But, you know, we're not there yet.
Have you seen anybody actually implement
some version of data mesh?
I've seen people, I've talked to people
who have said that they've implemented
some version of it, right?
But if you were to talk to people like Zhamak Dehghani,
who's a
really good friend of mine, I mean, I think she maybe has a different opinion of that,
right? So I think it's sort of in the eye of the beholder. But I think the notion
of data sharing in a decentralized way like that, I think, has been done to some degree. But I think
from her perspective, maybe there's some work to do on it. But we'll get there. I think. I hope so.
But what it means, though, and one of the conclusions that she draws, which I don't think
gets discussed a lot, is that it actually changes the shape of the roles people have.
So if you're a data product developer, as she calls it,
you're bringing together software engineering and data practices
all into one. So the notion of a data engineer, software engineer, ML engineer, this all kind of goes away. And it's just now you're delivering data products. And I think that's, that is a kind of the fundamental shift, which if you were to take, you know, what she proposes to a logical conclusion is exactly what would happen. So whether we get there or not, I don't know.
That's a debate for another podcast.
Yeah, that's a whole topic.
So as we start to wrap up,
is there anything else you'd like to share?
And how can people reach out to you,
get in contact with you?
Yeah, LinkedIn's good.
Send me a message.
I usually respond.
If you send me a sales pitch, I will not respond.
So you'll actually be moved to the other box
where that's purgatory for messages.
Yeah, LinkedIn's good.
Yeah, I'm taking a break from speaking.
I think that we're recording this in kind of late November.
But I'm taking a break from international travel
for several months.
I'm working on a...
I've got to finish my book,
so that's coming out first half of next year.
Then I've got a course I'm working on with DeepLearning.AI on data engineering.
So that's going to be pretty dope.
Really looking forward to that.
So that specialization, you can keep an eye out for that too.
Can't commit to a date on when that'll be out,
but I'm heads down on those two projects right now,
as well as starting a new company.
So it's a new content and publishing company that will be announced early next year.
Yeah, got a lot going on.
Yeah, it sounds like a lot.
Yeah.
Awesome.
Yeah, and CrossFit. We forgot to talk about that. So
I'm going to get back into shape doing that stuff. I think we actually have a mutual
friend who we're talking about on the show. So yeah, she's going to be
doing some programming for me. That's Colleen Fotsch, if she's listening. Shout out.
Yeah, Colleen's an amazing athlete. Far superior. I can't comment on, I don't know what your athletic
ability is, Joe, but I'm going to venture a guess that Colleen's is above yours and
certainly above mine.
Oh, it is. It is. Yeah, even though she's, quote, retired from CrossFit, she still will
completely mutilate anybody she competes with. But it's cool, I think,
hanging out with people like that, because I like to unplug from the data stuff as well.
It's fun, but it's nice to, you know, hang out and do other stuff. But she's a data person too,
so it's kind of funny. So anyway. Do you CrossFit much?
Yeah,
five times a week.
Jesus. Okay, that's a lot.
Yeah, that's how I reset my brain. You know, you're a
little bit tired. If you do something physically hard, there's no way I can be, you know,
thinking too deeply about work and other things, which I spend most of
my time thinking about.
Oh man, yeah. Just do Fran every day or something.
Yeah, exactly.
There you go. It's gross.
Well, thanks so much, Joe, for coming. You know, for those
listening, I highly recommend Fundamentals of Data Engineering, a fantastic book. And hopefully,
you know, once some of these other projects land, if you want to come back and
chat about them, I'm happy to have you back.
Yeah, I'd love to.
Love to.
Or we can do it in person.
We can do a CrossFit workout and do a podcast.
Yeah, there we go.
All out of breath and sweaty.
Probably before because we'll be really winded after.
Yeah.
Awesome.
Thank you so much.
Cheers.
Yeah, thanks, dude.
All right, take care.